[jira] [Commented] (DRILL-5674) Drill should support .zip compression

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957573#comment-16957573
 ] 

ASF GitHub Bot commented on DRILL-5674:
---

paul-rogers commented on issue #1879: DRILL-5674: Support ZIP compression
URL: https://github.com/apache/drill/pull/1879#issuecomment-545275685
 
 
   Thanks for the answers to my questions. LGTM.
   +1
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Drill should support .zip compression
> -
>
> Key: DRILL-5674
> URL: https://issues.apache.org/jira/browse/DRILL-5674
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Text & CSV
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.17.0
>
>
> Zip is a very common compression format. Create a compressed CSV file with 
> column headers: data.csv.zip.
> Define a storage plugin config for the file, call it "dfs.myws", set 
> delimiter = ",", extract header = true, skip header = false.
> Run a simple query:
> SELECT * FROM dfs.myws.`data.csv.zip`
> The result is garbage as the CSV reader is trying to parse Zipped data as if 
> it were text.
> DRILL-5506 asks how to do this; the responder said to add a library to the 
> path. Better would be to simply support zip out-of-the-box as a default 
> format.
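
For reference, a sketch of the text format configuration the reporter describes, 
spelled with the standard dfs text format options used elsewhere in this thread 
({{fieldDelimiter}}, {{extractHeader}}, {{skipFirstLine}}); the values are the ones 
listed above, the rest is illustrative:
{noformat}
"formats": {
  "csv": {
    "type": "text",
    "extensions": ["csv"],
    "fieldDelimiter": ",",
    "extractHeader": true,
    "skipFirstLine": false
  }
}
{noformat}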



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7417) Add user logged in/out event in info level logs

2019-10-22 Thread Sorabh Hamirwasia (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sorabh Hamirwasia updated DRILL-7417:
-
Component/s: Security

> Add user logged in/out event in info level logs
> ---
>
> Key: DRILL-7417
> URL: https://issues.apache.org/jira/browse/DRILL-7417
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Security
>Reporter: Sorabh Hamirwasia
>Assignee: Sorabh Hamirwasia
>Priority: Major
>
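
The issue has no description; purely as an illustration, a minimal sketch of the kind 
of info-level events the summary proposes (class, hook names, and message wording are 
assumptions, not the actual Drill change):
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class UserAuthEventLogger {
  private static final Logger logger = LoggerFactory.getLogger(UserAuthEventLogger.class);

  // Hypothetical hook: called once a user session is authenticated.
  void onLogin(String userName, String remoteAddress) {
    logger.info("User {} logged in from {}", userName, remoteAddress);
  }

  // Hypothetical hook: called when the session is closed.
  void onLogout(String userName) {
    logger.info("User {} logged out", userName);
  }
}
{code}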




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7417) Add user logged in/out event in info level logs

2019-10-22 Thread Sorabh Hamirwasia (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sorabh Hamirwasia updated DRILL-7417:
-
Summary: Add user logged in/out event in info level logs  (was: Test Task)

> Add user logged in/out event in info level logs
> ---
>
> Key: DRILL-7417
> URL: https://issues.apache.org/jira/browse/DRILL-7417
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Sorabh Hamirwasia
>Assignee: Sorabh Hamirwasia
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (DRILL-7417) Test Task

2019-10-22 Thread Sorabh Hamirwasia (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sorabh Hamirwasia reopened DRILL-7417:
--
  Assignee: Sorabh Hamirwasia

> Test Task
> -
>
> Key: DRILL-7417
> URL: https://issues.apache.org/jira/browse/DRILL-7417
> Project: Apache Drill
>  Issue Type: Task
>Reporter: Sorabh Hamirwasia
>Assignee: Sorabh Hamirwasia
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7417) Test Task

2019-10-22 Thread Sorabh Hamirwasia (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sorabh Hamirwasia updated DRILL-7417:
-
Issue Type: Improvement  (was: Task)

> Test Task
> -
>
> Key: DRILL-7417
> URL: https://issues.apache.org/jira/browse/DRILL-7417
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Sorabh Hamirwasia
>Assignee: Sorabh Hamirwasia
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7418) MetadataDirectGroupScan improvements

2019-10-22 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-7418:

Description: 
When a count query is converted to a direct scan (the case when statistics or table 
metadata are available and there is no need to perform the count operation), 
{{MetadataDirectGroupScan}} is used. Proposed {{MetadataDirectGroupScan}} 
enhancements:
1. Show the table selection root instead of listing all table files. If a table has 
lots of files, the query plan gets polluted with the full file enumeration. Since the 
files themselves are not used for the calculation (only their metadata), they are not 
relevant and can be excluded from the plan.

Before:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[files = 
[/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_0.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_5.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_4.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_9.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_3.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_6.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_7.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_10.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_2.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_1.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_8.parquet], 
numFiles = 11, usedMetadataSummaryFile = false, DynamicPojoRecordReader{records 
= [[1560060, 2880404, 2880404, 0]]}])
{noformat}


After:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[selectionRoot = 
/drill/testdata/metadata_cache/store_sales_null_blocks_all, numFiles = 11, 
usedMetadataSummaryFile = false, DynamicPojoRecordReader{records = [[1560060, 
2880404, 2880404, 0]]}])
{noformat}

For Hive tables which are scanned directly, the selection root is not available and 
thus will be omitted.

2. Submission of a physical plan which contains {{MetadataDirectGroupScan}} fails 
with deserialization errors; proper ser / de should be implemented.

  was:
When count is converted to direct scan (case when statistics or table metadata 
are available and there is no need to perform count operation), 
{{MetadataDirectGroupScan}} is used. Proposed {{MetadataDirectGroupScan}} 
enhancements:
1. Show table selection root instead of listing all table files. If a table has lots 
of files, query plan gets polluted with all files enumeration. Since files are 
not used for calculation (only metadata), they are not relevant and can be 
excluded from the plan.

Before:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[files = 
[/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_0.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_5.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_4.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_9.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_3.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_6.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_7.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_10.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_2.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_1.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_8.parquet], 
numFiles = 11, usedMetadataSummaryFile = false, DynamicPojoRecordReader{records 
= [[1560060, 2880404, 2880404, 0]]}])
{noformat}


After:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[selectionRoot = 
/drill/testdata/metadata_cache/store_sales_null_blocks_all, numFiles = 11, 
usedMetadataSummaryFile = false, DynamicPojoRecordReader{records = [[1560060, 
2880404, 2880404, 0]]}])
{noformat}

2. Submission of physical plan which contains {{MetadataDirectGroupScan}} fails 
with deserialization errors, proper ser / de should be implemented.


> MetadataDirectGroupScan improvements
> 
>
> Key: DRILL-7418
> URL: https://issues.apache.org/jira/browse/DRILL-7418
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.16.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Minor
> Fix For: 1.17.0
>
>
> When count is conve

[jira] [Updated] (DRILL-7418) MetadataDirectGroupScan improvements

2019-10-22 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-7418:

Description: 
When count is converted to direct scan (case when statistics or table metadata 
are available and there is no need to perform count operation), 
{{MetadataDirectGroupScan}} is used. Proposed {{MetadataDirectGroupScan}} 
enhancements:
1. Show table selection root instead of listing all table files. If a table has lots 
of files, query plan gets polluted with all files enumeration. Since files are 
not used for calculation (only metadata), they are not relevant and can be 
excluded from the plan.

Before:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[files = 
[/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_0.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_5.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_4.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_9.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_3.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_6.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_7.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_10.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_2.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_1.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_8.parquet], 
numFiles = 11, usedMetadataSummaryFile = false, DynamicPojoRecordReader{records 
= [[1560060, 2880404, 2880404, 0]]}])
{noformat}


After:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[selectionRoot = 
/drill/testdata/metadata_cache/store_sales_null_blocks_all, numFiles = 11, 
usedMetadataSummaryFile = false, DynamicPojoRecordReader{records = [[1560060, 
2880404, 2880404, 0]]}])
{noformat}

2. Submission of physical plan which contains {{MetadataDirectGroupScan}} fails 
with deserialization errors, proper ser / de should be implemented.

  was:
When count is converted to direct scan (case when statistics or table metadata 
are available and there is no need to perform count operation), 
{{MetadataDirectGroupScan}} is used. Proposed {{MetadataDirectGroupScan}} 
enhancements:
1. Show table selection root instead of listing all table files. If a table has 
lots of files, query plan gets polluted with files enumeration. Since files are 
not used for calculation (only metadata), they are not relevant and can be 
excluded from plan.

Before:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[files = 
[/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_0.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_5.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_4.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_9.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_3.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_6.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_7.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_10.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_2.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_1.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_8.parquet], 
numFiles = 11, usedMetadataSummaryFile = false, DynamicPojoRecordReader{records 
= [[1560060, 2880404, 2880404, 0]]}])
{noformat}


After:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[selectionRoot = 
/drill/testdata/metadata_cache/store_sales_null_blocks_all, numFiles = 11, 
usedMetadataSummaryFile = false, DynamicPojoRecordReader{records = [[1560060, 
2880404, 2880404, 0]]}])
{noformat}

2. Submission of physical plan which contains {{MetadataDirectGroupScan}} fails 
with deserialization errors, proper ser / de should be implemented.


> MetadataDirectGroupScan improvements
> 
>
> Key: DRILL-7418
> URL: https://issues.apache.org/jira/browse/DRILL-7418
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.16.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Minor
> Fix For: 1.17.0
>
>
> When count is converted to direct scan (case when statistics or table 
> metadata are available and there is no need to perform

[jira] [Updated] (DRILL-7418) MetadataDirectGroupScan improvements

2019-10-22 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-7418:

Description: 
When count is converted to direct scan (case when statistics or table metadata 
are available and there is no need to perform count operation), 
{{MetadataDirectGroupScan}} is used. Proposed {{MetadataDirectGroupScan}} 
enhancements:
1. Show table selection root instead of listing all table files. If a table has 
lots of files, query plan gets polluted with files enumeration. Since files are 
not used for calculation (only metadata), they are not relevant and can be 
excluded from plan.

Before:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[files = 
[/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_0.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_5.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_4.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_9.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_3.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_6.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_7.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_10.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_2.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_1.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_8.parquet], 
numFiles = 11, usedMetadataSummaryFile = false, DynamicPojoRecordReader{records 
= [[1560060, 2880404, 2880404, 0]]}])
{noformat}


After:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[selectionRoot = 
/drill/testdata/metadata_cache/store_sales_null_blocks_all, numFiles = 11, 
usedMetadataSummaryFile = false, DynamicPojoRecordReader{records = [[1560060, 
2880404, 2880404, 0]]}])
{noformat}

2. Submission of physical plan which contains {{MetadataDirectGroupScan}} fails 
with deserialization errors, proper ser / de should be implemented.

  was:
When count is converted to direct scan (case when statistics or table metadata 
are available and there is no need to perform count operation), 
{{MetadataDirectGroupScan}} is used. Proposed {{MetadataDirectGroupScan}} 
enhancements:
1. show table root instead of listing all table files. If a table has lots of 
files, query plan gets polluted with files enumeration. Since files are not 
used for calculation (only metadata), they are not relevant and can be excluded 
from plan.

Before:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[files = 
[/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_0.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_5.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_4.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_9.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_3.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_6.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_7.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_10.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_2.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_1.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_8.parquet], 
numFiles = 11, usedMetadataSummaryFile = false, DynamicPojoRecordReader{records 
= [[1560060, 2880404, 2880404, 0]]}])
{noformat}


After:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[selectionRoot = 
/drill/testdata/metadata_cache/store_sales_null_blocks_all, numFiles = 11, 
usedMetadataSummaryFile = false, DynamicPojoRecordReader{records = [[1560060, 
2880404, 2880404, 0]]}])
{noformat}

2. Submission of physical plan which contains {{MetadataDirectGroupScan}} fails 
with deserialization errors, proper ser / de should be implemented.


> MetadataDirectGroupScan improvements
> 
>
> Key: DRILL-7418
> URL: https://issues.apache.org/jira/browse/DRILL-7418
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.16.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Minor
> Fix For: 1.17.0
>
>
> When count is converted to direct scan (case when statistics or table 
> metadata are available and there is no need to perform count operation)

[jira] [Updated] (DRILL-7418) MetadataDirectGroupScan improvements

2019-10-22 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-7418:

Description: 
When count is converted to direct scan (case when statistics or table metadata 
are available and there is no need to perform count operation), 
{{MetadataDirectGroupScan}} is used. Proposed {{MetadataDirectGroupScan}} 
enhancements:
1. show table root instead of listing all table files. If a table has lots of 
files, query plan gets polluted with files enumeration. Since files are not 
used for calculation (only metadata), they are not relevant and can be excluded 
from plan.

Before:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[files = 
[/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_0.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_5.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_4.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_9.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_3.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_6.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_7.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_10.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_2.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_1.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_8.parquet], 
numFiles = 11, usedMetadataSummaryFile = false, DynamicPojoRecordReader{records 
= [[1560060, 2880404, 2880404, 0]]}])
{noformat}


After:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[selectionRoot = 
/drill/testdata/metadata_cache/store_sales_null_blocks_all, numFiles = 11, 
usedMetadataSummaryFile = false, DynamicPojoRecordReader{records = [[1560060, 
2880404, 2880404, 0]]}])
{noformat}

2. Submission of physical plan which contains {{MetadataDirectGroupScan}} fails 
with deserialization errors, proper ser / de should be implemented.

  was:
When count is converted to direct scan (case when statistics and table metadata 
are available and there is no need to perform count operation), 
{{MetadataDirectGroupScan}} is used. Proposed {{MetadataDirectGroupScan}} 
enhancements:
1. show table root instead of listing all table files. If a table has lots of 
files, query plan gets polluted with files enumeration. Since files are not 
used for calculation (only metadata), they are not relevant and can be excluded 
from plan.

Before:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[files = 
[/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_0.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_5.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_4.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_9.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_3.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_6.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_7.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_10.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_2.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_1.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_8.parquet], 
numFiles = 11, usedMetadataSummaryFile = false, DynamicPojoRecordReader{records 
= [[1560060, 2880404, 2880404, 0]]}])
{noformat}


After:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[selectionRoot = 
/drill/testdata/metadata_cache/store_sales_null_blocks_all, numFiles = 11, 
usedMetadataSummaryFile = false, DynamicPojoRecordReader{records = [[1560060, 
2880404, 2880404, 0]]}])
{noformat}

2. Submission of physical plan which contains {{MetadataDirectGroupScan}} fails 
with deserialization errors, proper ser / de should be implemented.


> MetadataDirectGroupScan improvements
> 
>
> Key: DRILL-7418
> URL: https://issues.apache.org/jira/browse/DRILL-7418
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.16.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Minor
> Fix For: 1.17.0
>
>
> When count is converted to direct scan (case when statistics or table 
> metadata are available and there is no need to perform count operation), 
> {{Me

[jira] [Created] (DRILL-7418) MetadataDirectGroupScan improvements

2019-10-22 Thread Arina Ielchiieva (Jira)
Arina Ielchiieva created DRILL-7418:
---

 Summary: MetadataDirectGroupScan improvements
 Key: DRILL-7418
 URL: https://issues.apache.org/jira/browse/DRILL-7418
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.16.0
Reporter: Arina Ielchiieva
Assignee: Arina Ielchiieva
 Fix For: 1.17.0


When count is converted to direct scan (case when statistics and table metadata 
are available and there is no need to perform count operation), 
{{MetadataDirectGroupScan}} is used. Proposed {{MetadataDirectGroupScan}} 
enhancements:
1. show table root instead of listing all table files. If a table has lots of 
files, query plan gets polluted with files enumeration. Since files are not 
used for calculation (only metadata), they are not relevant and can be excluded 
from plan.

Before:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[files = 
[/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_0.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_5.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_4.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_9.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_3.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_6.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_7.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_10.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_2.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_1.parquet, 
/drill/testdata/metadata_cache/store_sales_null_blocks_all/0_0_8.parquet], 
numFiles = 11, usedMetadataSummaryFile = false, DynamicPojoRecordReader{records 
= [[1560060, 2880404, 2880404, 0]]}])
{noformat}


After:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
00-02DirectScan(groupscan=[selectionRoot = 
/drill/testdata/metadata_cache/store_sales_null_blocks_all, numFiles = 11, 
usedMetadataSummaryFile = false, DynamicPojoRecordReader{records = [[1560060, 
2880404, 2880404, 0]]}])
{noformat}

2. Submission of physical plan which contains {{MetadataDirectGroupScan}} fails 
with deserialization errors, proper ser / de should be implemented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7017) lz4 codec for (un)compression

2019-10-22 Thread benj (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957093#comment-16957093
 ] 

benj commented on DRILL-7017:
-

I'm not sure I understand, because lz4 is already present by default in Apache Drill 
(jars/3rdparty/lz4-1.3.0.jar) and it doesn't work.
Even after adding "org.apache.hadoop.io.compress.Lz4Codec" to io.compression.codecs 
in core-site.xml and setting -Djava.library.path=/usr/hdp/.../lib/native/:
{code:sql}
SELECT * FROM dfs.test.`a.csvh.lz4`;
Error: EXECUTION_ERROR ERROR: native lz4 library not available
{code}
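
For reference, a minimal standalone probe of the Hadoop codec lookup the reporter 
describes (a sketch assuming the Hadoop libraries are on the classpath; the file name 
is the reporter's example):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class Lz4CodecProbe {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Same registration the reporter added to core-site.xml
    conf.set("io.compression.codecs", "org.apache.hadoop.io.compress.Lz4Codec");
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    // The factory resolves codecs by file extension
    CompressionCodec codec = factory.getCodec(new Path("a.csvh.lz4"));
    System.out.println(codec == null ? "no codec for .lz4" : codec.getClass().getName());
  }
}
{code}
Note that Hadoop's {{Lz4Codec}} delegates to the native Hadoop library, which is 
consistent with the "native lz4 library not available" error above; the lz4-1.3.0.jar 
bundled with Drill is the separate lz4-java artifact, which Hadoop's codec does not use.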


> lz4 codec for (un)compression
> -
>
> Key: DRILL-7017
> URL: https://issues.apache.org/jira/browse/DRILL-7017
> Project: Apache Drill
>  Issue Type: Wish
>  Components: Storage - Text & CSV
>Affects Versions: 1.15.0
>Reporter: benj
>Priority: Major
>
> I didn't find in the documentation which compression formats are supported. 
> But since it's possible to use Drill on compressed files, like
> {code:java}
> SELECT * FROM tmp.`myfile.csv.gz`;
> {code}
> It would be useful to have the possibility to use this functionality for lz4 
> files ([https://github.com/lz4/lz4])
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-6096) Provide mechanisms to specify field delimiters and quoted text for TextRecordWriter

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956989#comment-16956989
 ] 

ASF GitHub Bot commented on DRILL-6096:
---

arina-ielchiieva commented on issue #1873: DRILL-6096: Provide mechanism to 
configure text writer configuration
URL: https://github.com/apache/drill/pull/1873#issuecomment-544929316
 
 
   Sounds interesting; basically, you are suggesting to create an implicit schema 
file when writing text files (i.e. creating a table), so format options will be 
picked up directly from the schema file when reading the table back.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Provide mechanisms to specify field delimiters and quoted text for 
> TextRecordWriter
> ---
>
> Key: DRILL-6096
> URL: https://issues.apache.org/jira/browse/DRILL-6096
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Text & CSV
>Affects Versions: 1.12.0
>Reporter: Kunal Khatua
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.17.0
>
>
> Currently, there is no way for a user to specify the field delimiter when 
> writing records as text output. Furthermore, if the fields contain the 
> delimiter, we have no mechanism for specifying quotes.
> By default, quotes should be used to enclose non-numeric fields being written.
> *Description of the implemented changes:*
> Two options are added to control text writer output:
> {{store.text.writer.add_header}} - indicates whether a header should be added to 
> the created text file. Default is true.
> {{store.text.writer.force_quotes}} - indicates whether all values should be quoted. 
> Default is false, meaning only values that contain special characters (line 
> / field separators) will be quoted.
> Line / field separators and quote / escape characters can be configured in the 
> text format configuration via the Web UI. A user can create a dedicated format just 
> for writing data and then use it when creating files, though such a format can 
> always be used to read back the written data.
> {noformat}
>   "formats": {
> "write_text": {
>   "type": "text",
>   "extensions": [
> "txt"
>   ],
>   "lineDelimiter": "\n",
>   "fieldDelimiter": "!",
>   "quote": "^",
>   "escape": "^",
> }
>},
> ...
> {noformat}
> Next, set the specified format and create a text file:
> {noformat}
> alter session set `store.format` = 'write_text';
> create table dfs.tmp.t as select 1 as id from (values(1));
> {noformat}
> Notes:
> 1. To write data, univocity-parsers are used; they limit the line separator length 
> to at most 2 characters. Drill allows setting a line separator of more than 2 
> characters, since Drill can read data split by a line separator of any length, 
> but during data write an exception will be thrown.
> 2. {{extractHeader}} in the text format configuration does not affect whether a 
> header will be written to the text file; only {{store.text.writer.add_header}} 
> controls this. {{extractHeader}} is used only when reading the data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (DRILL-6215) Use prepared statement instead of Statement in JdbcRecordReader class

2019-10-22 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva reassigned DRILL-6215:
---

Assignee: Igor Guzenko  (was: Khurram Faraaz)

> Use prepared statement instead of Statement in JdbcRecordReader class
> -
>
> Key: DRILL-6215
> URL: https://issues.apache.org/jira/browse/DRILL-6215
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - JDBC
>Affects Versions: 1.12.0
>Reporter: Khurram Faraaz
>Assignee: Igor Guzenko
>Priority: Major
> Fix For: Future
>
>
> Use a prepared statement instead of Statement in the JdbcRecordReader class; it 
> is more efficient and less vulnerable to SQL injection attacks.
> Apache Drill 1.13.0-SNAPSHOT, commit : 
> 9073aed67d89e8b2188870d6c812706085c9c41b
> Findbugs reports the below bug and suggests that we use prepared statement 
> instead of Statement.
> {noformat}
> In class org.apache.drill.exec.store.jdbc.JdbcRecordReader
> In method 
> org.apache.drill.exec.store.jdbc.JdbcRecordReader.setup(OperatorContext, 
> OutputMutator)
> At JdbcRecordReader.java:[line 170]
> org.apache.drill.exec.store.jdbc.JdbcRecordReader.setup(OperatorContext, 
> OutputMutator) passes a nonconstant String to an execute method on an SQL 
> statement
> The method invokes the execute method on an SQL statement with a String that 
> seems to be dynamically generated. 
> Consider using a prepared statement instead. 
> It is more efficient and less vulnerable to SQL injection attacks.
> {noformat}
> LOC - 
> https://github.com/apache/drill/blob/a9ea4ec1c5645ddab4b7aef9ac060ff5f109b696/contrib/storage-jdbc/src/main/java/org/apache/drill/exec/store/jdbc/JdbcRecordReader.java#L170
> {noformat}
> To run with findbugs:
> mvn clean install -Pfindbugs -DskipTests
> Findbugs will write the output to findbugsXml.html in the target directory of 
> each module. 
> For example the java-exec module report is located at: 
> ./exec/java-exec/target/findbugs/findbugsXml.html
> Use 
> find . -name "findbugsXml.html"
> to locate the files.
> {noformat}
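
For illustration, a minimal sketch of the change FindBugs suggests (table name, 
column, and value are hypothetical; the actual JdbcRecordReader code differs):
{code:java}
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class QueryStyles {
  // Flagged pattern: a dynamically built string passed to an execute method.
  ResultSet flagged(Connection con, String id) throws SQLException {
    Statement stmt = con.createStatement();
    return stmt.executeQuery("SELECT * FROM my_table WHERE id = " + id);
  }

  // Suggested pattern: bind values through PreparedStatement parameters.
  ResultSet suggested(Connection con, int id) throws SQLException {
    PreparedStatement ps = con.prepareStatement("SELECT * FROM my_table WHERE id = ?");
    ps.setInt(1, id);
    return ps.executeQuery();
  }
}
{code}
Only value positions can be bound this way; identifiers such as table names still 
have to be validated separately.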



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-5674) Drill should support .zip compression

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956976#comment-16956976
 ] 

ASF GitHub Bot commented on DRILL-5674:
---

arina-ielchiieva commented on issue #1879: DRILL-5674: Support ZIP compression
URL: https://github.com/apache/drill/pull/1879#issuecomment-544925048
 
 
   @paul-rogers addressed code review comments.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Drill should support .zip compression
> -
>
> Key: DRILL-5674
> URL: https://issues.apache.org/jira/browse/DRILL-5674
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Text & CSV
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.17.0
>
>
> Zip is a very common compression format. Create a compressed CSV file with 
> column headers: data.csv.zip.
> Define a storage plugin config for the file, call it "dfs.myws", set 
> delimiter = ",", extract header = true, skip header = false.
> Run a simple query:
> SELECT * FROM dfs.myws.`data.csv.zip`
> The result is garbage as the CSV reader is trying to parse Zipped data as if 
> it were text.
> DRILL-5506 asks how to do this; the responder said to add a library to the 
> path. Better would be to simply support zip out-of-the-box as a default 
> format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-5674) Drill should support .zip compression

2019-10-22 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-5674:

Reviewer: Paul Rogers  (was: Vova Vysotskyi)

> Drill should support .zip compression
> -
>
> Key: DRILL-5674
> URL: https://issues.apache.org/jira/browse/DRILL-5674
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Text & CSV
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.17.0
>
>
> Zip is a very common compression format. Create a compressed CSV file with 
> column headers: data.csv.zip.
> Define a storage plugin config for the file, call it "dfs.myws", set 
> delimiter = ",", extract header = true, skip header = false.
> Run a simple query:
> SELECT * FROM dfs.myws.`data.csv.zip`
> The result is garbage as the CSV reader is trying to parse Zipped data as if 
> it were text.
> DRILL-5506 asks how to do this; the responder said to add a library to the 
> path. Better would be to simply support zip out-of-the-box as a default 
> format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-5674) Drill should support .zip compression

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956975#comment-16956975
 ] 

ASF GitHub Bot commented on DRILL-5674:
---

arina-ielchiieva commented on pull request #1879: DRILL-5674: Support ZIP 
compression
URL: https://github.com/apache/drill/pull/1879#discussion_r337452079
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/ZipCodec.java
 ##
 @@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.dfs;
+
+import org.apache.hadoop.io.compress.CompressionInputStream;
+import org.apache.hadoop.io.compress.CompressionOutputStream;
+import org.apache.hadoop.io.compress.DefaultCodec;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+import java.util.zip.ZipEntry;
+import java.util.zip.ZipInputStream;
+import java.util.zip.ZipOutputStream;
+
+/**
+ * ZIP codec implementation which can read or create a single entry.
+ * 
+ * Note: Do not rename this class. Class naming must be 'ZipCodec' so it can 
be mapped by
+ * {@link org.apache.hadoop.io.compress.CompressionCodecFactory} to the 'zip' 
extension.
+ */
+public class ZipCodec extends DefaultCodec {
+
+  private static final String EXTENSION = ".zip";
 
 Review comment:
   `org.apache.hadoop.io.compress` supports gzip and bzip2 out of the box. We are 
adding only the zip codec implementation since it is missing from the 
`org.apache.hadoop.io.compress` library.
   
   Regarding `.tar.gz`, I am not sure how to support it, since it is basically two 
compressions applied one over another and it is mostly used for folders. Since 
Drill only allows reading compressed files, not folders, I think we don't need it 
for now.
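
For context, a minimal sketch of how Hadoop's codec factory resolves this codec by 
extension, as the javadoc in the diff describes (the file path and printing are 
illustrative):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ZipCodecLookup {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Register the codec; the factory maps it to its '.zip' default extension.
    conf.set("io.compression.codecs", "org.apache.drill.exec.store.dfs.ZipCodec");
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(new Path("/data/data.csv.zip"));
    // With the PR applied, codec.createInputStream(...) reads the first ZIP entry.
    System.out.println(codec == null ? "no codec" : codec.getClass().getName());
  }
}
{code}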
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Drill should support .zip compression
> -
>
> Key: DRILL-5674
> URL: https://issues.apache.org/jira/browse/DRILL-5674
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Text & CSV
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.17.0
>
>
> Zip is a very common compression format. Create a compressed CSV file with 
> column headers: data.csv.zip.
> Define a storage plugin config for the file, call it "dfs.myws", set 
> delimiter = ",", extract header = true, skip header = false.
> Run a simple query:
> SELECT * FROM dfs.myws.`data.csv.zip`
> The result is garbage as the CSV reader is trying to parse Zipped data as if 
> it were text.
> DRILL-5506 asks how to do this; the responder said to add a library to the 
> path. Better would be to simply support zip out-of-the-box as a default 
> format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-5674) Drill should support .zip compression

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956973#comment-16956973
 ] 

ASF GitHub Bot commented on DRILL-5674:
---

arina-ielchiieva commented on pull request #1879: DRILL-5674: Support ZIP 
compression
URL: https://github.com/apache/drill/pull/1879#discussion_r337444662
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
 ##
 @@ -59,33 +51,44 @@
 import org.apache.drill.exec.store.dfs.FormatSelection;
 import org.apache.drill.exec.store.dfs.MagicString;
 import org.apache.drill.exec.store.dfs.MetadataContext;
-import org.apache.drill.exec.store.mock.MockStorageEngine;
 import org.apache.drill.exec.store.parquet.metadata.Metadata;
 import org.apache.drill.exec.store.parquet.metadata.ParquetTableMetadataDirs;
 import org.apache.drill.exec.util.DrillFileSystemUtil;
 import org.apache.drill.shaded.guava.com.google.common.base.Stopwatch;
 import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableSet;
-import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FSDataInputStream;
 import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.parquet.format.converter.ParquetMetadataConverter;
 import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.TimeUnit;
+import java.util.regex.Pattern;
 
 Review comment:
   Agreed, I guess we definitely should decide on an import order. For now I have 
reverted the change.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Drill should support .zip compression
> -
>
> Key: DRILL-5674
> URL: https://issues.apache.org/jira/browse/DRILL-5674
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Text & CSV
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.17.0
>
>
> Zip is a very common compression format. Create a compressed CSV file with 
> column headers: data.csv.zip.
> Define a storage plugin config for the file, call it "dfs.myws", set 
> delimiter = ",", extract header = true, skip header = false.
> Run a simple query:
> SELECT * FROM dfs.myws.`data.csv.zip`
> The result is garbage as the CSV reader is trying to parse Zipped data as if 
> it were text.
> DRILL-5506 asks how to do this; the responder said to add a library to the 
> path. Better would be to simply support zip out-of-the-box as a default 
> format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-5674) Drill should support .zip compression

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956970#comment-16956970
 ] 

ASF GitHub Bot commented on DRILL-5674:
---

arina-ielchiieva commented on pull request #1879: DRILL-5674: Support ZIP 
compression
URL: https://github.com/apache/drill/pull/1879#discussion_r337453494
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSystemPlugin.java
 ##
 @@ -57,7 +61,9 @@
  */
 public class FileSystemPlugin extends AbstractStoragePlugin {
 
-  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(FileSystemPlugin.class);
+  private static final Logger logger = 
LoggerFactory.getLogger(FileSystemPlugin.class);
+
+  private static final List BUILT_IN_CODECS = 
Collections.singletonList(ZipCodec.class.getCanonicalName());
 
 Review comment:
   The `org.apache.hadoop.io.compress` library supports gzip / bzip2 out of the box. 
Here we only need to add codecs that are missing from this library. I have 
updated the parameter name and added a comment to avoid confusion. 
`TestCompressedFiles` contains tests for all supported formats.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Drill should support .zip compression
> -
>
> Key: DRILL-5674
> URL: https://issues.apache.org/jira/browse/DRILL-5674
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Text & CSV
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.17.0
>
>
> Zip is a very common compression format. Create a compressed CSV file with 
> column headers: data.csv.zip.
> Define a storage plugin config for the file, call it "dfs.myws", set 
> delimiter = ",", extract header = true, skip header = false.
> Run a simple query:
> SELECT * FROM dfs.myws.`data.csv.zip`
> The result is garbage as the CSV reader is trying to parse Zipped data as if 
> it were text.
> DRILL-5506 asks how to do this; the responder said to add a library to the 
> path. Better would be to simply support zip out-of-the-box as a default 
> format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-5674) Drill should support .zip compression

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956968#comment-16956968
 ] 

ASF GitHub Bot commented on DRILL-5674:
---

arina-ielchiieva commented on pull request #1879: DRILL-5674: Support ZIP 
compression
URL: https://github.com/apache/drill/pull/1879#discussion_r337442794
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java
 ##
 @@ -386,17 +387,16 @@ public static void checkBackPaths(String parent, String 
combinedPath, String sub
 Preconditions.checkArgument(!combinedPath.isEmpty(), "Empty path (" + 
combinedPath + "( in file selection path.");
 
 if (!combinedPath.startsWith(parent)) {
-  StringBuilder msg = new StringBuilder();
-  msg.append("Invalid path : ").append(subpath).append(" takes you outside 
the workspace.");
-  throw new IllegalArgumentException(msg.toString());
+  throw new IllegalArgumentException(
+String.format("Invalid path [%s] takes you outside the workspace.", 
subpath));
 }
   }
 
   public List getFileStatuses() {
 return statuses;
   }
 
-  public boolean supportDirPrunig() {
+  public boolean supportDirPruning() {
 
 Review comment:
   Agree, renamed.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Drill should support .zip compression
> -
>
> Key: DRILL-5674
> URL: https://issues.apache.org/jira/browse/DRILL-5674
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Text & CSV
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.17.0
>
>
> Zip is a very common compression format. Create a compressed CSV file with 
> column headers: data.csv.zip.
> Define a storage plugin config for the file, call it "dfs.myws", set 
> delimiter = ",", extract header = true, skip header = false.
> Run a simple query:
> SELECT * FROM dfs.myws.`data.csv.zip`
> The result is garbage as the CSV reader is trying to parse Zipped data as if 
> it were text.
> DRILL-5506 asks how to do this; the responder said to add a library to the 
> path. Better would be to simply support zip out-of-the-box as a default 
> format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-5674) Drill should support .zip compression

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956971#comment-16956971
 ] 

ASF GitHub Bot commented on DRILL-5674:
---

arina-ielchiieva commented on pull request #1879: DRILL-5674: Support ZIP 
compression
URL: https://github.com/apache/drill/pull/1879#discussion_r337457219
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcapng/PcapngFormatPlugin.java
 ##
 @@ -47,7 +47,7 @@ public PcapngFormatPlugin(String name, DrillbitContext 
context, Configuration fs
 
   public PcapngFormatPlugin(String name, DrillbitContext context, 
Configuration fsConf, StoragePluginConfig config, PcapngFormatConfig 
formatPluginConfig) {
 super(name, context, fsConf, config, formatPluginConfig, true,
-false, true, false,
+false, true, true,
 
 Review comment:
   1. Drill uses `BlockMapBuilder` to split a file into blocks if possible. 
According to its code, it tries to split the file if `blockSplittable` is set 
to true and the file IS NOT compressed. So even if a format is block splittable, 
a file that arrives compressed won't be split.
   
   
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/schedule/BlockMapBuilder.java#L115
   
   It looks like most compressed formats are not splittable; that's why Drill 
considers any compressed file not splittable: 
https://i.stack.imgur.com/jpprr.jpg
   
   2. Regarding blockSplittable for the Pcapng format, you are right that it is not 
splittable, as is Pcap; I have updated the value of `blockSplittable` to `false` 
for both formats.
   
   https://blog.marouni.fr/pcap2seq/
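
   A paraphrased sketch of the condition described in point 1 above (illustrative 
names, not the actual `BlockMapBuilder` code):
{code:java}
import org.apache.hadoop.io.compress.CompressionCodec;

class SplitCheck {
  // A file is split into blocks only when the format is block splittable
  // AND no compression codec matched the file's extension.
  static boolean isSplittable(boolean blockSplittable, CompressionCodec codec) {
    return blockSplittable && codec == null;
  }
}
{code}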
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Drill should support .zip compression
> -
>
> Key: DRILL-5674
> URL: https://issues.apache.org/jira/browse/DRILL-5674
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Text & CSV
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.17.0
>
>
> Zip is a very common compression format. Create a compressed CSV file with 
> column headers: data.csv.zip.
> Define a storage plugin config for the file, call it "dfs.myws", set 
> delimiter = ",", extract header = true, skip header = false.
> Run a simple query:
> SELECT * FROM dfs.myws.`data.csv.zip`
> The result is garbage as the CSV reader is trying to parse Zipped data as if 
> it were text.
> DRILL-5506 asks how to do this; the responder said to add a library to the 
> path. Better would be to simply support zip out-of-the-box as a default 
> format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-5674) Drill should support .zip compression

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956969#comment-16956969
 ] 

ASF GitHub Bot commented on DRILL-5674:
---

arina-ielchiieva commented on pull request #1879: DRILL-5674: Support ZIP 
compression
URL: https://github.com/apache/drill/pull/1879#discussion_r337450484
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/ZipCodec.java
 ##
 @@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.dfs;
+
+import org.apache.hadoop.io.compress.CompressionInputStream;
+import org.apache.hadoop.io.compress.CompressionOutputStream;
+import org.apache.hadoop.io.compress.DefaultCodec;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+import java.util.zip.ZipEntry;
+import java.util.zip.ZipInputStream;
+import java.util.zip.ZipOutputStream;
+
+/**
+ * ZIP codec implementation which can read or create a single entry.
+ * 
+ * Note: Do not rename this class. Class naming must be 'ZipCodec' so it can 
be mapped by
+ * {@link org.apache.hadoop.io.compress.CompressionCodecFactory} to the 'zip' 
extension.
+ */
+public class ZipCodec extends DefaultCodec {
+
+  private static final String EXTENSION = ".zip";
+
+  @Override
+  public CompressionOutputStream createOutputStream(OutputStream out) throws 
IOException {
+return new ZipCompressionOutputStream(new ResetableZipOutputStream(out));
+  }
+
+  @Override
+  public CompressionInputStream createInputStream(InputStream in) throws 
IOException {
+return new ZipCompressionInputStream(new ZipInputStream(in));
+  }
+
+  @Override
+  public String getDefaultExtension() {
+return EXTENSION;
+  }
+
+  /**
+   * Reads only first entry from {@link ZipInputStream},
+   * other entries if present will be ignored.
+   */
+  private static class ZipCompressionInputStream extends 
CompressionInputStream {
+
+ZipCompressionInputStream(ZipInputStream in) throws IOException {
+  super(in);
+  // positions stream at the beginning of the first entry data
+  in.getNextEntry();
+}
+
+@Override
+public int read() throws IOException {
+  return in.read();
+}
+
+@Override
+public int read(byte[] b, int off, int len) throws IOException {
+  return in.read(b, off, len);
+}
+
+@Override
+public void resetState() throws IOException {
+  in.reset();
+}
+
+@Override
+public void close() throws IOException {
+  try {
+((ZipInputStream) in).closeEntry();
+  } finally {
+super.close();
+  }
+}
+  }
+
+  /**
+   * Extends {@link ZipOutputStream} to allow resetting compressor stream,
+   * required by {@link CompressionOutputStream} implementation.
+   */
+  private static class ResetableZipOutputStream extends ZipOutputStream {
+
+ResetableZipOutputStream(OutputStream out) {
+  super(out);
+}
+
+void resetState() {
+  def.reset();
+}
+  }
+
+  /**
+   * Writes given data into ZIP archive by placing all data in one entry with 
default naming.
+   */
+  private static class ZipCompressionOutputStream extends 
CompressionOutputStream {
+
+private static final String DEFAULT_ENTRY_NAME = "entry.out";
 
 Review comment:
   This stream is called by the compression codec and we have no way to pass a file 
name, etc.
   Anyway, Drill does not use the output stream, only the input stream; I have added 
the output implementation only for testing purposes.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Drill should support .zip compression
> -
>
> Key: DRILL-5674
> URL: https://issues.apache.org/jira/browse/DRILL-5674
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Text & CS

[jira] [Commented] (DRILL-5674) Drill should support .zip compression

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956974#comment-16956974
 ] 

ASF GitHub Bot commented on DRILL-5674:
---

arina-ielchiieva commented on pull request #1879: DRILL-5674: Support ZIP 
compression
URL: https://github.com/apache/drill/pull/1879#discussion_r337443431
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pcapng/package-info.java
 ##
 @@ -16,7 +16,7 @@
  * limitations under the License.
  */
 /**
- * For comments on realization of this format plugin look at :
+ * For comments on implementation of this format plugin look at:
 
 Review comment:
   Updated.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Drill should support .zip compression
> -
>
> Key: DRILL-5674
> URL: https://issues.apache.org/jira/browse/DRILL-5674
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Text & CSV
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.17.0
>
>
> Zip is a very common compression format. Create a compressed CSV file with 
> column headers: data.csv.zip.
> Define a storage plugin config for the file, call it "dfs.myws", set 
> delimiter = ",", extract header = true, skip header = false.
> Run a simple query:
> SELECT * FROM dfs.myws.`data.csv.zip`
> The result is garbage as the CSV reader is trying to parse Zipped data as if 
> it were text.
> DRILL-5506 asks how to do this; the responder said to add a library to the 
> path. Better would be to simply support zip out-of-the-box as a default 
> format.
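
A format configuration along the lines the description suggests might look like the
following sketch. The attribute names ("fieldDelimiter", "extractHeader",
"skipFirstLine") follow Drill's text format plugin, but treat the exact JSON as
illustrative rather than a verbatim plugin config:

{noformat}
"formats": {
  "csv": {
    "type": "text",
    "extensions": ["csv"],
    "fieldDelimiter": ",",
    "extractHeader": true,
    "skipFirstLine": false
  }
}
{noformat}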



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-5674) Drill should support .zip compression

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956972#comment-16956972
 ] 

ASF GitHub Bot commented on DRILL-5674:
---

arina-ielchiieva commented on pull request #1879: DRILL-5674: Support ZIP 
compression
URL: https://github.com/apache/drill/pull/1879#discussion_r337452121
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FormatSelection.java
 ##
 @@ -63,6 +60,6 @@ public FileSelection getSelection(){
 
   @JsonIgnore
   public boolean supportDirPruning() {
 
 Review comment:
   Done.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Drill should support .zip compression
> -
>
> Key: DRILL-5674
> URL: https://issues.apache.org/jira/browse/DRILL-5674
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Text & CSV
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.17.0
>
>
> Zip is a very common compression format. Create a compressed CSV file with 
> column headers: data.csv.zip.
> Define a storage plugin config for the file, call it "dfs.myws", set 
> delimiter = ",", extract header = true, skip header = false.
> Run a simple query:
> SELECT * FROM dfs.myws.`data.csv.zip`
> The result is garbage as the CSV reader is trying to parse Zipped data as if 
> it were text.
> DRILL-5506 asks how to do this; the responder said to add a library to the 
> path. Better would be to simply support zip out-of-the-box as a default 
> format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956913#comment-16956913
 ] 

ASF GitHub Bot commented on DRILL-7414:
---

arina-ielchiieva commented on issue #1878: DRILL-7414: EVF incorrectly sets 
buffer writer index after rollover
URL: https://github.com/apache/drill/pull/1878#issuecomment-544896904
 
 
   Cherry-picked the last two commits and squashed them; since there were no 
conflicts, the changes were merged into master.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> EVF incorrectly sets buffer writer index after rollover
> ---
>
> Key: DRILL-7414
> URL: https://issues.apache.org/jira/browse/DRILL-7414
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> A full test run, with vector validation enabled and with the "new" scan 
> enabled,  revealed the following in {{TestMockPlugin.testSizeLimit()}}:
> {noformat}
> comments_s2 - VarCharVector: Row count = 838, but value count = 839
> {noformat}
> Adding vector validation to the result set loader overflow tests reveals that 
> the problem is in overflow. In 
> {{TestResultSetLoaderOverflow.testOverflowWithNullables()}}:
> {noformat}
> a - RepeatedIntVector: Row count = 2952, but value count = 2953
> b - RepeatedVarCharVector: Row count = 2952, but value count = 2953
> b - RepeatedVarCharVector: Vector has 2953 values, but offset vector labels 
> 32472 values
> c - RepeatedIntVector: Row count = 2952, but value count = 2953
> d - RepeatedIntVector: Row count = 2952, but value count = 2953
> {noformat}
> The problem is that EVF incorrectly sets the offset buffer writer index after 
> a rollover.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956912#comment-16956912
 ] 

ASF GitHub Bot commented on DRILL-7414:
---

arina-ielchiieva commented on issue #1878: DRILL-7414: EVF incorrectly sets 
buffer writer index after rollover
URL: https://github.com/apache/drill/pull/1878#issuecomment-544896904
 
 
   Cherry-picked the last two commits and squashed them; since there were no 
conflicts, the changes were merged into master.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> EVF incorrectly sets buffer writer index after rollover
> ---
>
> Key: DRILL-7414
> URL: https://issues.apache.org/jira/browse/DRILL-7414
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> A full test run, with vector validation enabled and with the "new" scan 
> enabled,  revealed the following in {{TestMockPlugin.testSizeLimit()}}:
> {noformat}
> comments_s2 - VarCharVector: Row count = 838, but value count = 839
> {noformat}
> Adding vector validation to the result set loader overflow tests reveals that 
> the problem is in overflow. In 
> {{TestResultSetLoaderOverflow.testOverflowWithNullables()}}:
> {noformat}
> a - RepeatedIntVector: Row count = 2952, but value count = 2953
> b - RepeatedVarCharVector: Row count = 2952, but value count = 2953
> b - RepeatedVarCharVector: Vector has 2953 values, but offset vector labels 
> 32472 values
> c - RepeatedIntVector: Row count = 2952, but value count = 2953
> d - RepeatedIntVector: Row count = 2952, but value count = 2953
> {noformat}
> The problem is that EVF incorrectly sets the offset buffer writer index after 
> a rollover.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7413) Scan operator does not set the container record count

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956911#comment-16956911
 ] 

ASF GitHub Bot commented on DRILL-7413:
---

arina-ielchiieva commented on issue #1877: DRILL-7413: Test and fix scan 
operator vectors
URL: https://github.com/apache/drill/pull/1877#issuecomment-544896612
 
 
   @paul-rogers please resolve the conflicts.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Scan operator does not set the container record count
> -
>
> Key: DRILL-7413
> URL: https://issues.apache.org/jira/browse/DRILL-7413
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> Enable the vector checking provided in DRILL-7403. Enable just for the JSON 
> reader. You will get the following error:
> {noformat}
> 12:36:57.399 [22549a3d-a937-df51-2e13-4b032ba143f9:frag:0:0] ERROR 
> o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
> ScanBatch
> ScanBatch: Container record count not set
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-6958) CTAS csv with option

2019-10-22 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6958:

Fix Version/s: 1.17.0

> CTAS csv with option
> 
>
> Key: DRILL-6958
> URL: https://issues.apache.org/jira/browse/DRILL-6958
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Text & CSV
>Affects Versions: 1.15.0, 1.16.0
>Reporter: benj
>Priority: Major
> Fix For: 1.17.0
>
>
> Currently, it may be difficult to produce well-formed CSV with CTAS (see 
> comment below).
> It appears necessary to have some additional/configurable options to write 
> CSV files with CTAS:
>  * possibility to change/define the separator,
>  * possibility to write or not write the header,
>  * possibility to force the write of only 1 file instead of lots of parts,
>  * possibility to force quoting,
>  * possibility to use/change the escape char,
>  * ...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-6842) Export to CSV using CREATE TABLE AS (CTAS) wrong parsed

2019-10-22 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva resolved DRILL-6842.
-
Resolution: Fixed

Fixed in the scope of DRILL-6096.

> Export to CSV using CREATE TABLE AS (CTAS) wrong parsed
> ---
>
> Key: DRILL-6842
> URL: https://issues.apache.org/jira/browse/DRILL-6842
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Text & CSV, Storage - Writer
>Affects Versions: 1.14.0
> Environment: - Tested with the latest version of *Apache Drill* (1.14.0), and 
> with the latest version built from master (GitHub repo), commit 
> ad61c6bc1dd24994e50fe7dfed043d5e57dba8f9 on _Nov 5, 2018_.
> - *Linux* x64, Ubuntu 16.04
> - *OpenJDK* Runtime Environment (build 
> 1.8.0_171-8u171-b11-0ubuntu0.17.10.1-b11)
> - Apache *Maven* 3.5.0
>Reporter: Mariano Ruiz
>Priority: Minor
>  Labels: csv, export
> Fix For: 1.17.0
>
> Attachments: Screenshot from 2018-11-09 14-18-43.png
>
>
> When you export the result of a query to CSV using CTAS, most of the time 
> the generated file is OK; but if the results contain text columns with 
> "," characters, the resulting CSV file is broken, because Drill does not 
> enclose the cells that contain commas in " characters.
> Steps to reproduce the bug:
> Let's say you have the following table in some source of data, maybe a CSV 
> file too:
> {code:title=/tmp/input.csv}
> product_ean,product_name,product_brand
> 12345678900,IPhone X,Apple
> 9911100,"Samsung S9, Black",Samsung
> 1223456,Smartwatch XY,Some Brand
> {code}
> Note that the second row of data, in the column "product_name", has a 
> value with a comma inside (_Samsung S9, Black_), so the whole cell value is 
> enclosed in " characters, while the rest of the column cells aren't, 
> though they could be enclosed too.
> So if you query this file, Drill will correctly interpret the file and does 
> not treat that comma inside the cell as a separator like the rest of the 
> commas in the file:
> {code}
> 0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/input.csv`;
> +--+++
> | product_ean  |product_name| product_brand  |
> +--+++
> | 12345678900  | IPhone X   | Apple  |
> | 9911100  | Samsung S9, Black  | Samsung|
> | 1223456  | Smartwatch XY  | Some Brand |
> +--+++
> 3 rows selected (1.874 seconds)
> {code}
> But now, if you want to query the file and export the result as CSV using the 
> CTAS feature, using the following steps:
> {code}
> 0: jdbc:drill:zk=local> USE dfs.tmp;
> +---+--+
> |  ok   |   summary|
> +---+--+
> | true  | Default schema changed to [dfs.tmp]  |
> +---+--+
> 1 row selected (0.13 seconds)
> 0: jdbc:drill:zk=local> ALTER SESSION SET `store.format`='csv';
> +---++
> |  ok   |summary |
> +---++
> | true  | store.format updated.  |
> +---++
> 1 row selected (0.094 seconds)
> 0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.my_output AS SELECT * FROM 
> dfs.`/tmp/input.csv`;
> +---++
> | Fragment  | Number of records written  |
> +---++
> | 0_0   | 3  |
> +---++
> 1 row selected (0.453 seconds)
> {code}
> The output file is this:
> {code:title=/tmp/my_output/0_0_0.csv}
> product_ean,product_name,product_brand
> 12345678900,IPhone X,Apple
> 9911100,Samsung S9, Black,Samsung
> 1223456,Smartwatch XY,Some Brand
> {code}
> The text _Samsung S9, Black_ in the cell is not quoted, so any CSV 
> interpreter, like an office tool or a Java/Python/... library, will interpret it 
> as two cells instead of one. Even Apache Drill will interpret it incorrectly:
> {code}
> 0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/my_output/0_0_0.csv`;
> +--+++
> | product_ean  |  product_name  | product_brand  |
> +--+++
> | 12345678900  | IPhone X   | Apple  |
> | 9911100  | Samsung S9 |  Black |
> | 1223456  | Smartwatch XY  | Some Brand |
> +--+++
> 3 rows selected (0.175 seconds)
> {code}
> Note that the ending part _ Black_ was interpreted as a following cell, and 
> the real following cell is not shown; but it's not an error in the Drill 
> interpreter, it's an error of how Dri

[jira] [Resolved] (DRILL-6958) CTAS csv with option

2019-10-22 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva resolved DRILL-6958.
-
Resolution: Fixed

Fixed in the scope of DRILL-6096.

> CTAS csv with option
> 
>
> Key: DRILL-6958
> URL: https://issues.apache.org/jira/browse/DRILL-6958
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Text & CSV
>Affects Versions: 1.15.0, 1.16.0
>Reporter: benj
>Priority: Major
> Fix For: 1.17.0
>
>
> Currently, it may be difficult to produce well-formed CSV with CTAS (see 
> comment below).
> It appears necessary to have some additional/configurable options to write 
> CSV files with CTAS:
>  * possibility to change/define the separator,
>  * possibility to write or not write the header,
>  * possibility to force the write of only 1 file instead of lots of parts,
>  * possibility to force quoting,
>  * possibility to use/change the escape char,
>  * ...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-4788) Exporting from Parquet to CSV - commas in strings are not escaped

2019-10-22 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva resolved DRILL-4788.
-
Resolution: Fixed

Fixed in the scope of DRILL-6096.

> Exporting from Parquet to CSV - commas in strings are not escaped
> -
>
> Key: DRILL-4788
> URL: https://issues.apache.org/jira/browse/DRILL-4788
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Richard Patching
>Priority: Major
>  Labels: csv, csvparser, export
> Fix For: 1.17.0
>
>
> When exporting data from Parquet to CSV, if there is a column which contains 
> a comma, the text after the comma gets put into the next column instead of 
> being escaped.
> The only workaround is to do REGEXP_REPLACE(COLUMN[0], ',', ' '), which 
> replaces the comma in the string with a blank space. This is not ideal in 
> terms of keeping a true, accurate record of the data we receive.
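
The behavior both of these CSV issues ask for is standard RFC 4180 quoting: a field
containing the delimiter, a quote, or a line break is enclosed in double quotes, with
embedded quotes doubled. A minimal sketch of that rule (illustrative only, not Drill's
actual writer code):

{code}
/** Illustrative RFC 4180-style field quoting; not Drill's writer implementation. */
public class CsvQuoting {

  static String quoteCsvField(String field) {
    boolean needsQuotes = field.contains(",") || field.contains("\"")
        || field.contains("\n") || field.contains("\r");
    if (!needsQuotes) {
      return field;
    }
    // Double any embedded quotes, then wrap the whole field in quotes.
    return "\"" + field.replace("\"", "\"\"") + "\"";
  }

  public static void main(String[] args) {
    // Prints "Samsung S9, Black" wrapped in quotes, so CSV readers see one cell.
    System.out.println(quoteCsvField("Samsung S9, Black"));
  }
}
{code}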



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-6842) Export to CSV using CREATE TABLE AS (CTAS) wrong parsed

2019-10-22 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6842:

Fix Version/s: 1.17.0

> Export to CSV using CREATE TABLE AS (CTAS) wrong parsed
> ---
>
> Key: DRILL-6842
> URL: https://issues.apache.org/jira/browse/DRILL-6842
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Text & CSV, Storage - Writer
>Affects Versions: 1.14.0
> Environment: - Tested with the latest version of *Apache Drill* (1.14.0), and 
> with the latest version built from master (GitHub repo), commit 
> ad61c6bc1dd24994e50fe7dfed043d5e57dba8f9 on _Nov 5, 2018_.
> - *Linux* x64, Ubuntu 16.04
> - *OpenJDK* Runtime Environment (build 
> 1.8.0_171-8u171-b11-0ubuntu0.17.10.1-b11)
> - Apache *Maven* 3.5.0
>Reporter: Mariano Ruiz
>Priority: Minor
>  Labels: csv, export
> Fix For: 1.17.0
>
> Attachments: Screenshot from 2018-11-09 14-18-43.png
>
>
> When you export the result of a query to CSV using CTAS, most of the time 
> the generated file is OK; but if the results contain text columns with 
> "," characters, the resulting CSV file is broken, because Drill does not 
> enclose the cells that contain commas in " characters.
> Steps to reproduce the bug:
> Let's say you have the following table in some source of data, maybe a CSV 
> file too:
> {code:title=/tmp/input.csv}
> product_ean,product_name,product_brand
> 12345678900,IPhone X,Apple
> 9911100,"Samsung S9, Black",Samsung
> 1223456,Smartwatch XY,Some Brand
> {code}
> Note that the second row of data, in the column "product_name", has a 
> value with a comma inside (_Samsung S9, Black_), so the whole cell value is 
> enclosed in " characters, while the rest of the column cells aren't, 
> though they could be enclosed too.
> So if you query this file, Drill will correctly interpret the file and does 
> not treat that comma inside the cell as a separator like the rest of the 
> commas in the file:
> {code}
> 0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/input.csv`;
> +--+++
> | product_ean  |product_name| product_brand  |
> +--+++
> | 12345678900  | IPhone X   | Apple  |
> | 9911100  | Samsung S9, Black  | Samsung|
> | 1223456  | Smartwatch XY  | Some Brand |
> +--+++
> 3 rows selected (1.874 seconds)
> {code}
> But now, if you want to query the file and export the result as CSV using the 
> CTAS feature, using the following steps:
> {code}
> 0: jdbc:drill:zk=local> USE dfs.tmp;
> +---+--+
> |  ok   |   summary|
> +---+--+
> | true  | Default schema changed to [dfs.tmp]  |
> +---+--+
> 1 row selected (0.13 seconds)
> 0: jdbc:drill:zk=local> ALTER SESSION SET `store.format`='csv';
> +---++
> |  ok   |summary |
> +---++
> | true  | store.format updated.  |
> +---++
> 1 row selected (0.094 seconds)
> 0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.my_output AS SELECT * FROM 
> dfs.`/tmp/input.csv`;
> +---++
> | Fragment  | Number of records written  |
> +---++
> | 0_0   | 3  |
> +---++
> 1 row selected (0.453 seconds)
> {code}
> The output file is this:
> {code:title=/tmp/my_output/0_0_0.csv}
> product_ean,product_name,product_brand
> 12345678900,IPhone X,Apple
> 9911100,Samsung S9, Black,Samsung
> 1223456,Smartwatch XY,Some Brand
> {code}
> The text _Samsung S9, Black_ in the cell is not quoted, so any CSV 
> interpreter, like an office tool or a Java/Python/... library, will interpret it 
> as two cells instead of one. Even Apache Drill will interpret it incorrectly:
> {code}
> 0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/my_output/0_0_0.csv`;
> +--+++
> | product_ean  |  product_name  | product_brand  |
> +--+++
> | 12345678900  | IPhone X   | Apple  |
> | 9911100  | Samsung S9 |  Black |
> | 1223456  | Smartwatch XY  | Some Brand |
> +--+++
> 3 rows selected (0.175 seconds)
> {code}
> Note that the ending part _ Black_ was interpreted as a following cell, and 
> the real following cell is not shown; but it's not an error in the Drill 
> interpreter, it's an error of how Drill exported the result that now i

[jira] [Updated] (DRILL-4788) Exporting from Parquet to CSV - commas in strings are not escaped

2019-10-22 Thread Arina Ielchiieva (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-4788:

Fix Version/s: 1.17.0

> Exporting from Parquet to CSV - commas in strings are not escaped
> -
>
> Key: DRILL-4788
> URL: https://issues.apache.org/jira/browse/DRILL-4788
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Richard Patching
>Priority: Major
>  Labels: csv, csvparser, export
> Fix For: 1.17.0
>
>
> When exporting data from Parquet to CSV, if there is a column which contains 
> a comma, the text after the comma gets put into the next column instead of 
> being escaped.
> The only workaround is to do REGEXP_REPLACE(COLUMN[0], ',', ' '), which 
> replaces the comma in the string with a blank space. This is not ideal in 
> terms of keeping a true, accurate record of the data we receive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-6096) Provide mechanisms to specify field delimiters and quoted text for TextRecordWriter

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956899#comment-16956899
 ] 

ASF GitHub Bot commented on DRILL-6096:
---

asfgit commented on pull request #1873: DRILL-6096: Provide mechanism to 
configure text writer configuration
URL: https://github.com/apache/drill/pull/1873
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Provide mechanisms to specify field delimiters and quoted text for 
> TextRecordWriter
> ---
>
> Key: DRILL-6096
> URL: https://issues.apache.org/jira/browse/DRILL-6096
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Text & CSV
>Affects Versions: 1.12.0
>Reporter: Kunal Khatua
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.17.0
>
>
> Currently, there is no way for a user to specify the field delimiter when 
> writing records as text output. Furthermore, if the fields contain the 
> delimiter, we have no mechanism for specifying quotes.
> By default, quotes should be used to enclose non-numeric fields being written.
> *Description of the implemented changes:*
> 2 options are added to control text writer output:
> {{store.text.writer.add_header}} - indicates whether a header should be added to 
> the created text file. Default is true.
> {{store.text.writer.force_quotes}} - indicates whether all values should be quoted. 
> Default is false, meaning only values that contain special characters (line 
> / field separators) will be quoted.
> Line / field separators and quote / escape characters can be configured in the 
> text format configuration using the Web UI. A user can create a special format only 
> for writing data and then use it when creating files, though such a format can 
> always be used to read back the written data.
> {noformat}
>   "formats": {
> "write_text": {
>   "type": "text",
>   "extensions": [
> "txt"
>   ],
>   "lineDelimiter": "\n",
>   "fieldDelimiter": "!",
>   "quote": "^",
>   "escape": "^",
> }
>},
> ...
> {noformat}
> Next set specified format and create text file:
> {noformat}
> alter session set `store.format` = 'write_text';
> create table dfs.tmp.t as select 1 as id from (values(1));
> {noformat}
> Notes:
> 1. To write data, univocity-parsers are used; they limit the line separator 
> length to no more than 2 characters. Though Drill allows setting more than 2 chars as 
> the line separator (since Drill can read data split by a line separator of any 
> length), during data write an exception will be thrown.
> 2. {{extractHeader}} in the text format configuration does not affect whether the header 
> will be written to the text file; only {{store.text.writer.add_header}} controls 
> this action. {{extractHeader}} is used only when reading the data.
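
A short usage sketch of the two options described above; the option names are taken
from the description, the table and file names are illustrative:

{noformat}
alter session set `store.text.writer.add_header` = false;
alter session set `store.text.writer.force_quotes` = true;
create table dfs.tmp.quoted_out as select * from dfs.`/tmp/input.csv`;
{noformat}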



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7403) Validate batch checks, vector integretity in unit tests

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956901#comment-16956901
 ] 

ASF GitHub Bot commented on DRILL-7403:
---

asfgit commented on pull request #1871: DRILL-7403: Validate batch checks, 
vector integretity in unit tests
URL: https://github.com/apache/drill/pull/1871
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Validate batch checks, vector integretity in unit tests
> ---
>
> Key: DRILL-7403
> URL: https://issues.apache.org/jira/browse/DRILL-7403
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> Drill provides a {{BatchValidator}} that checks vectors. It is disabled by 
> default. This enhancement adds more checks, including checks for row counts 
> (of which there are surprisingly many).
> Since most operators will fail if the checks are enabled, this enhancement also 
> adds a table to keep track of which operators pass the checks (and for which 
> the checks should be enabled) and which still need work. This allows the 
> checks to exist in the code, and to be enabled incrementally as we fix the 
> various problems.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956900#comment-16956900
 ] 

ASF GitHub Bot commented on DRILL-7414:
---

asfgit commented on pull request #1878: DRILL-7414: EVF incorrectly sets buffer 
writer index after rollover
URL: https://github.com/apache/drill/pull/1878
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> EVF incorrectly sets buffer writer index after rollover
> ---
>
> Key: DRILL-7414
> URL: https://issues.apache.org/jira/browse/DRILL-7414
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> A full test run, with vector validation enabled and with the "new" scan 
> enabled,  revealed the following in {{TestMockPlugin.testSizeLimit()}}:
> {noformat}
> comments_s2 - VarCharVector: Row count = 838, but value count = 839
> {noformat}
> Adding vector validation to the result set loader overflow tests reveals that 
> the problem is in overflow. In 
> {{TestResultSetLoaderOverflow.testOverflowWithNullables()}}:
> {noformat}
> a - RepeatedIntVector: Row count = 2952, but value count = 2953
> b - RepeatedVarCharVector: Row count = 2952, but value count = 2953
> b - RepeatedVarCharVector: Vector has 2953 values, but offset vector labels 
> 32472 values
> c - RepeatedIntVector: Row count = 2952, but value count = 2953
> d - RepeatedIntVector: Row count = 2952, but value count = 2953
> {noformat}
> The problem is that EVF incorrectly sets the offset buffer writer index after 
> a rollover.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7403) Validate batch checks, vector integretity in unit tests

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956784#comment-16956784
 ] 

ASF GitHub Bot commented on DRILL-7403:
---

arina-ielchiieva commented on issue #1871: DRILL-7403: Validate batch checks, 
vector integretity in unit tests
URL: https://github.com/apache/drill/pull/1871#issuecomment-544844982
 
 
   +1
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Validate batch checks, vector integretity in unit tests
> ---
>
> Key: DRILL-7403
> URL: https://issues.apache.org/jira/browse/DRILL-7403
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> Drill provides a {{BatchValidator}} that checks vectors. It is disabled by 
> default. This enhancement adds more checks, including checks for row counts 
> (of which there are surprisingly many).
> Since most operators will fail if the checks are enabled, this enhancement also 
> adds a table to keep track of which operators pass the checks (and for which 
> the checks should be enabled) and which still need work. This allows the 
> checks to exist in the code, and to be enabled incrementally as we fix the 
> various problems.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover

2019-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956783#comment-16956783
 ] 

ASF GitHub Bot commented on DRILL-7414:
---

arina-ielchiieva commented on issue #1878: DRILL-7414: EVF incorrectly sets 
buffer writer index after rollover
URL: https://github.com/apache/drill/pull/1878#issuecomment-544844844
 
 
   +1
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> EVF incorrectly sets buffer writer index after rollover
> ---
>
> Key: DRILL-7414
> URL: https://issues.apache.org/jira/browse/DRILL-7414
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> A full test run, with vector validation enabled and with the "new" scan 
> enabled,  revealed the following in {{TestMockPlugin.testSizeLimit()}}:
> {noformat}
> comments_s2 - VarCharVector: Row count = 838, but value count = 839
> {noformat}
> Adding vector validation to the result set loader overflow tests reveals that 
> the problem is in overflow. In 
> {{TestResultSetLoaderOverflow.testOverflowWithNullables()}}:
> {noformat}
> a - RepeatedIntVector: Row count = 2952, but value count = 2953
> b - RepeatedVarCharVector: Row count = 2952, but value count = 2953
> b - RepeatedVarCharVector: Vector has 2953 values, but offset vector labels 
> 32472 values
> c - RepeatedIntVector: Row count = 2952, but value count = 2953
> d - RepeatedIntVector: Row count = 2952, but value count = 2953
> {noformat}
> The problem is that EVF incorrectly sets the offset buffer writer index after 
> a rollover.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)