[jira] [Commented] (HIVE-9664) Hive "add jar" command should be able to download and add jars from a repository

2017-01-15 Thread anishek (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823584#comment-15823584
 ] 

anishek commented on HIVE-9664:
---

So is the http scheme not supported?

{code}
add [FILE|JAR|ARCHIVE] <value> <value>*
{code}
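
For reference, a sketch of the usage this issue describes (assuming the ivy:// 
scheme added by this patch; the exact coordinates and query parameters below are 
illustrative only):

{code}
-- repository-based form (assumed ivy:// scheme from this patch)
ADD JAR ivy://org.apache.commons:commons-lang3:3.4;

-- backward-compatible local file-system form
ADD JAR /tmp/my-udf.jar;
{code}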



> Hive "add jar" command should be able to download and add jars from a 
> repository
> 
>
> Key: HIVE-9664
> URL: https://issues.apache.org/jira/browse/HIVE-9664
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 0.14.0
>Reporter: Anant Nag
>Assignee: Anant Nag
>  Labels: hive, patch
> Fix For: 1.2.0
>
> Attachments: HIVE-9664.4.patch, HIVE-9664.5.patch, HIVE-9664.patch, 
> HIVE-9664.patch, HIVE-9664.patch
>
>
> Currently, Hive's "add jar" command takes a local path to the dependency jar. 
> This clutters the local file-system, as users may forget to remove the jar 
> later.
> It would be nice if Hive supported a Gradle-like notation to download the jar 
> from a repository.
> Example:  add jar org:module:version
> 
> It should also be backward compatible and should take a jar from the local 
> file-system as well. 
> RB:  https://reviews.apache.org/r/31628/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15147) LLAP: use LLAP cache for non-columnar formats in a somewhat general way

2017-01-15 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-15147:
---
Assignee: Sergey Shelukhin  (was: Gopal V)

> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> ---
>
> Key: HIVE-15147
> URL: https://issues.apache.org/jira/browse/HIVE-15147
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-15147.01.patch, HIVE-15147.patch, 
> HIVE-15147.WIP.noout.patch, perf-top-cache.png, pre-cache.svg, 
> writerimpl-addrow.png
>
>
> The primary goal for the first pass is caching text files. Nothing would 
> prevent other formats from using the same path, in principle, although, as 
> was originally done with ORC, it may be better to have native caching support 
> optimized for each particular format.
> Given that caching pure text is not smart, and we already have ORC-encoded 
> cache that is columnar due to ORC file structure, we will transform data into 
> columnar ORC.
> The general idea is to treat all the data in the world as merely ORC that was 
> compressed with some poor compression codec, such as csv. Using the original 
> IF and serde, as well as an ORC writer (with some heavyweight optimizations 
> disabled, potentially), we can "uncompress" the csv/whatever data into its 
> "original" ORC representation, then cache it efficiently, by column, and also 
> reuse a lot of the existing code.
> Various other points:
> 1) Caching granularity will have to be somehow determined (i.e. how do we 
> slice the file horizontally, to avoid caching entire columns). As with ORC 
> uncompressed files, the specific offsets don't really matter as long as they 
> are consistent between reads. The problem is that the file offsets will 
> actually need to be propagated to the new reader from the original 
> inputformat. Row counts are easier to use but there's a problem of how to 
> actually map them to missing ranges to read from disk.
> 2) Obviously, for row-based formats, if any one column that is to be read has 
> been evicted or is otherwise missing, "all the columns" have to be read for 
> the corresponding slice to cache and read that one column. The vague plan is 
> to handle this implicitly, similarly to how ORC reader handles CB-RG overlaps 
> - it will just so happen that a missing column in disk range list to retrieve 
> will expand the disk-range-to-read into the whole horizontal slice of the 
> file.
> 3) Granularity/etc. won't work for gzipped text. If anything at all is 
> evicted, the entire file has to be re-read. Gzipped text is a ridiculous 
> feature, so this is by design.
> 4) In the future, it would be possible to also build some form of 
> metadata/indexes for this cached data to do PPD, etc. This is out of 
> scope for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15147) LLAP: use LLAP cache for non-columnar formats in a somewhat general way

2017-01-15 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-15147:
---
Attachment: pre-cache.svg
writerimpl-addrow.png
perf-top-cache.png

> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> ---
>
> Key: HIVE-15147
> URL: https://issues.apache.org/jira/browse/HIVE-15147
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sergey Shelukhin
>Assignee: Gopal V
> Attachments: HIVE-15147.01.patch, HIVE-15147.patch, 
> HIVE-15147.WIP.noout.patch, perf-top-cache.png, pre-cache.svg, 
> writerimpl-addrow.png
>
>
> The primary goal for the first pass is caching text files. Nothing would 
> prevent other formats from using the same path, in principle, although, as 
> was originally done with ORC, it may be better to have native caching support 
> optimized for each particular format.
> Given that caching pure text is not smart, and we already have ORC-encoded 
> cache that is columnar due to ORC file structure, we will transform data into 
> columnar ORC.
> The general idea is to treat all the data in the world as merely ORC that was 
> compressed with some poor compression codec, such as csv. Using the original 
> IF and serde, as well as an ORC writer (with some heavyweight optimizations 
> disabled, potentially), we can "uncompress" the csv/whatever data into its 
> "original" ORC representation, then cache it efficiently, by column, and also 
> reuse a lot of the existing code.
> Various other points:
> 1) Caching granularity will have to be somehow determined (i.e. how do we 
> slice the file horizontally, to avoid caching entire columns). As with ORC 
> uncompressed files, the specific offsets don't really matter as long as they 
> are consistent between reads. The problem is that the file offsets will 
> actually need to be propagated to the new reader from the original 
> inputformat. Row counts are easier to use but there's a problem of how to 
> actually map them to missing ranges to read from disk.
> 2) Obviously, for row-based formats, if any one column that is to be read has 
> been evicted or is otherwise missing, "all the columns" have to be read for 
> the corresponding slice to cache and read that one column. The vague plan is 
> to handle this implicitly, similarly to how ORC reader handles CB-RG overlaps 
> - it will just so happen that a missing column in disk range list to retrieve 
> will expand the disk-range-to-read into the whole horizontal slice of the 
> file.
> 3) Granularity/etc. won't work for gzipped text. If anything at all is 
> evicted, the entire file has to be re-read. Gzipped text is a ridiculous 
> feature, so this is by design.
> 4) In the future, it would be possible to also build some form of 
> metadata/indexes for this cached data to do PPD, etc. This is out of 
> scope for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-15147) LLAP: use LLAP cache for non-columnar formats in a somewhat general way

2017-01-15 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V reassigned HIVE-15147:
--

Assignee: Gopal V  (was: Sergey Shelukhin)

> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> ---
>
> Key: HIVE-15147
> URL: https://issues.apache.org/jira/browse/HIVE-15147
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sergey Shelukhin
>Assignee: Gopal V
> Attachments: HIVE-15147.01.patch, HIVE-15147.patch, 
> HIVE-15147.WIP.noout.patch, perf-top-cache.png, pre-cache.svg, 
> writerimpl-addrow.png
>
>
> The primary goal for the first pass is caching text files. Nothing would 
> prevent other formats from using the same path, in principle, although, as 
> was originally done with ORC, it may be better to have native caching support 
> optimized for each particular format.
> Given that caching pure text is not smart, and we already have ORC-encoded 
> cache that is columnar due to ORC file structure, we will transform data into 
> columnar ORC.
> The general idea is to treat all the data in the world as merely ORC that was 
> compressed with some poor compression codec, such as csv. Using the original 
> IF and serde, as well as an ORC writer (with some heavyweight optimizations 
> disabled, potentially), we can "uncompress" the csv/whatever data into its 
> "original" ORC representation, then cache it efficiently, by column, and also 
> reuse a lot of the existing code.
> Various other points:
> 1) Caching granularity will have to be somehow determined (i.e. how do we 
> slice the file horizontally, to avoid caching entire columns). As with ORC 
> uncompressed files, the specific offsets don't really matter as long as they 
> are consistent between reads. The problem is that the file offsets will 
> actually need to be propagated to the new reader from the original 
> inputformat. Row counts are easier to use but there's a problem of how to 
> actually map them to missing ranges to read from disk.
> 2) Obviously, for row-based formats, if any one column that is to be read has 
> been evicted or is otherwise missing, "all the columns" have to be read for 
> the corresponding slice to cache and read that one column. The vague plan is 
> to handle this implicitly, similarly to how ORC reader handles CB-RG overlaps 
> - it will just so happen that a missing column in disk range list to retrieve 
> will expand the disk-range-to-read into the whole horizontal slice of the 
> file.
> 3) Granularity/etc. won't work for gzipped text. If anything at all is 
> evicted, the entire file has to be re-read. Gzipped text is a ridiculous 
> feature, so this is by design.
> 4) In the future, it would be possible to also build some form of 
> metadata/indexes for this cached data to do PPD, etc. This is out of 
> scope for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15147) LLAP: use LLAP cache for non-columnar formats in a somewhat general way

2017-01-15 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823555#comment-15823555
 ] 

Gopal V commented on HIVE-15147:


LGTM - +1.

The cache hit rate makes a dramatic improvement to performance, but there is a 
performance cliff whenever data gets evicted (or on the initial load).

Running TPC-H Q1 on 10 GB of data on 1 node shows ~50x gains between the 1st and 
2nd run.

{code}
1st run : Time taken: 102.598 seconds, Fetched: 1 row(s)
2nd run: Time taken: 2.674 seconds, Fetched: 1 row(s)
{code}

Further improvements to target in later work.

Most of the time in the 1st run is spent compressing incompressible string 
columns, followed by a class inheritance check inside WriterImpl::addRow() (the 
repnz hotspot), and then LazyStruct::parse() cache misses 
(LazySimpleDeserializeRead should be used instead).

!perf-top-cache.png!

!writerimpl-addrow.png!

I'm also attaching [^pre-cache.svg] call-tree with weights.

> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> ---
>
> Key: HIVE-15147
> URL: https://issues.apache.org/jira/browse/HIVE-15147
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-15147.01.patch, HIVE-15147.patch, 
> HIVE-15147.WIP.noout.patch
>
>
> The primary goal for the first pass is caching text files. Nothing would 
> prevent other formats from using the same path, in principle, although, as 
> was originally done with ORC, it may be better to have native caching support 
> optimized for each particular format.
> Given that caching pure text is not smart, and we already have ORC-encoded 
> cache that is columnar due to ORC file structure, we will transform data into 
> columnar ORC.
> The general idea is to treat all the data in the world as merely ORC that was 
> compressed with some poor compression codec, such as csv. Using the original 
> IF and serde, as well as an ORC writer (with some heavyweight optimizations 
> disabled, potentially), we can "uncompress" the csv/whatever data into its 
> "original" ORC representation, then cache it efficiently, by column, and also 
> reuse a lot of the existing code.
> Various other points:
> 1) Caching granularity will have to be somehow determined (i.e. how do we 
> slice the file horizontally, to avoid caching entire columns). As with ORC 
> uncompressed files, the specific offsets don't really matter as long as they 
> are consistent between reads. The problem is that the file offsets will 
> actually need to be propagated to the new reader from the original 
> inputformat. Row counts are easier to use but there's a problem of how to 
> actually map them to missing ranges to read from disk.
> 2) Obviously, for row-based formats, if any one column that is to be read has 
> been evicted or is otherwise missing, "all the columns" have to be read for 
> the corresponding slice to cache and read that one column. The vague plan is 
> to handle this implicitly, similarly to how ORC reader handles CB-RG overlaps 
> - it will just so happen that a missing column in disk range list to retrieve 
> will expand the disk-range-to-read into the whole horizontal slice of the 
> file.
> 3) Granularity/etc. won't work for gzipped text. If anything at all is 
> evicted, the entire file has to be re-read. Gzipped text is a ridiculous 
> feature, so this is by design.
> 4) In the future, it would be possible to also build some form of 
> metadata/indexes for this cached data to do PPD, etc. This is out of 
> scope for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HIVE-15147) LLAP: use LLAP cache for non-columnar formats in a somewhat general way

2017-01-15 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823555#comment-15823555
 ] 

Gopal V edited comment on HIVE-15147 at 1/16/17 7:12 AM:
-

LGTM - +1 tests pending.

The cache hit rate makes a dramatic improvement to performance, but there is a 
performance cliff whenever data gets evicted (or on the initial load).

Running TPC-H Q1 on 10 GB of data on 1 node shows ~50x gains between the 1st and 
2nd run.

{code}
1st run : Time taken: 102.598 seconds, Fetched: 1 row(s)
2nd run: Time taken: 2.674 seconds, Fetched: 1 row(s)
{code}

Further improvements to target in later work.

Most of the time in the 1st run is spent compressing incompressible string 
columns, followed by a class inheritance check inside WriterImpl::addRow() (the 
repnz hotspot), and then LazyStruct::parse() cache misses 
(LazySimpleDeserializeRead should be used instead).

!perf-top-cache.png!

!writerimpl-addrow.png!

I'm also attaching [^pre-cache.svg] call-tree with weights.


was (Author: gopalv):
LGTM - +1.

The cache hit rate makes a dramatic improvement to performance, but there is a 
performance cliff whenever data gets evicted (or on the initial load).

Running TPC-H Q1 on 10 GB of data on 1 node shows ~50x gains between the 1st and 
2nd run.

{code}
1st run : Time taken: 102.598 seconds, Fetched: 1 row(s)
2nd run: Time taken: 2.674 seconds, Fetched: 1 row(s)
{code}

Further improvements to target in later work.

Most of the time in the 1st run is spent compressing incompressible string 
columns, followed by a class inheritance check inside WriterImpl::addRow() (the 
repnz hotspot), and then LazyStruct::parse() cache misses 
(LazySimpleDeserializeRead should be used instead).

!perf-top-cache.png!

!writerimpl-addrow.png!

I'm also attaching [^pre-cache.svg] call-tree with weights.

> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> ---
>
> Key: HIVE-15147
> URL: https://issues.apache.org/jira/browse/HIVE-15147
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-15147.01.patch, HIVE-15147.patch, 
> HIVE-15147.WIP.noout.patch
>
>
> The primary goal for the first pass is caching text files. Nothing would 
> prevent other formats from using the same path, in principle, although, as 
> was originally done with ORC, it may be better to have native caching support 
> optimized for each particular format.
> Given that caching pure text is not smart, and we already have ORC-encoded 
> cache that is columnar due to ORC file structure, we will transform data into 
> columnar ORC.
> The general idea is to treat all the data in the world as merely ORC that was 
> compressed with some poor compression codec, such as csv. Using the original 
> IF and serde, as well as an ORC writer (with some heavyweight optimizations 
> disabled, potentially), we can "uncompress" the csv/whatever data into its 
> "original" ORC representation, then cache it efficiently, by column, and also 
> reuse a lot of the existing code.
> Various other points:
> 1) Caching granularity will have to be somehow determined (i.e. how do we 
> slice the file horizontally, to avoid caching entire columns). As with ORC 
> uncompressed files, the specific offsets don't really matter as long as they 
> are consistent between reads. The problem is that the file offsets will 
> actually need to be propagated to the new reader from the original 
> inputformat. Row counts are easier to use but there's a problem of how to 
> actually map them to missing ranges to read from disk.
> 2) Obviously, for row-based formats, if any one column that is to be read has 
> been evicted or is otherwise missing, "all the columns" have to be read for 
> the corresponding slice to cache and read that one column. The vague plan is 
> to handle this implicitly, similarly to how ORC reader handles CB-RG overlaps 
> - it will just so happen that a missing column in disk range list to retrieve 
> will expand the disk-range-to-read into the whole horizontal slice of the 
> file.
> 3) Granularity/etc. won't work for gzipped text. If anything at all is 
> evicted, the entire file has to be re-read. Gzipped text is a ridiculous 
> feature, so this is by design.
> 4) In the future, it would be possible to also build some form of 
> metadata/indexes for this cached data to do PPD, etc. This is out of 
> scope for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15519) BitSet not computed properly for ColumnBuffer subset

2017-01-15 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-15519:
--
Attachment: (was: HIVE-15519.6.patch)

> BitSet not computed properly for ColumnBuffer subset
> 
>
> Key: HIVE-15519
> URL: https://issues.apache.org/jira/browse/HIVE-15519
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, JDBC
>Reporter: Bharat Viswanadham
>Assignee: Rui Li
>Priority: Critical
> Attachments: data_type_test(1).txt, HIVE-15519.1.patch, 
> HIVE-15519.2.patch, HIVE-15519.3.patch, HIVE-15519.4.patch, 
> HIVE-15519.5-branch-1.patch, HIVE-15519.6.patch
>
>
> Hive decimal-type column precision is returned as zero, even though the column 
> has precision set.
> Example: for col67 decimal(18,2), the scale is returned as zero.
> Tried with the program below.
> {code}
>System.out.println("Opening connection");   
> Class.forName("org.apache.hive.jdbc.HiveDriver");
>Connection con = 
> DriverManager.getConnection("jdbc:hive2://x.x.x.x:1/default");
>   DatabaseMetaData dbMeta = con.getMetaData();
>ResultSet rs = dbMeta.getColumns(null, "DEFAULT", "data_type_test",null);
>  while (rs.next()) {
> if (rs.getString("COLUMN_NAME").equalsIgnoreCase("col48") || 
> rs.getString("COLUMN_NAME").equalsIgnoreCase("col67") || 
> rs.getString("COLUMN_NAME").equalsIgnoreCase("col68") || 
> rs.getString("COLUMN_NAME").equalsIgnoreCase("col122")){
>  System.out.println(rs.getString("COLUMN_NAME") + "\t" + 
> rs.getString("COLUMN_SIZE") + "\t" + rs.getInt("DECIMAL_DIGITS"));
> }
>}
>rs.close();
>con.close();
>   } catch (Exception e) {
>e.printStackTrace();
>;
>   }
> {code}
> The default fetch size is 50. If a decimal column's position is under 50, the 
> precision is returned properly; when the column position is greater than 50, 
> the scale is returned as zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15519) BitSet not computed properly for ColumnBuffer subset

2017-01-15 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-15519:
--
Attachment: HIVE-15519.6.patch

> BitSet not computed properly for ColumnBuffer subset
> 
>
> Key: HIVE-15519
> URL: https://issues.apache.org/jira/browse/HIVE-15519
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, JDBC
>Reporter: Bharat Viswanadham
>Assignee: Rui Li
>Priority: Critical
> Attachments: data_type_test(1).txt, HIVE-15519.1.patch, 
> HIVE-15519.2.patch, HIVE-15519.3.patch, HIVE-15519.4.patch, 
> HIVE-15519.5-branch-1.patch, HIVE-15519.6.patch
>
>
> Hive decimal-type column precision is returned as zero, even though the column 
> has precision set.
> Example: for col67 decimal(18,2), the scale is returned as zero.
> Tried with the program below.
> {code}
>System.out.println("Opening connection");   
> Class.forName("org.apache.hive.jdbc.HiveDriver");
>Connection con = 
> DriverManager.getConnection("jdbc:hive2://x.x.x.x:1/default");
>   DatabaseMetaData dbMeta = con.getMetaData();
>ResultSet rs = dbMeta.getColumns(null, "DEFAULT", "data_type_test",null);
>  while (rs.next()) {
> if (rs.getString("COLUMN_NAME").equalsIgnoreCase("col48") || 
> rs.getString("COLUMN_NAME").equalsIgnoreCase("col67") || 
> rs.getString("COLUMN_NAME").equalsIgnoreCase("col68") || 
> rs.getString("COLUMN_NAME").equalsIgnoreCase("col122")){
>  System.out.println(rs.getString("COLUMN_NAME") + "\t" + 
> rs.getString("COLUMN_SIZE") + "\t" + rs.getInt("DECIMAL_DIGITS"));
> }
>}
>rs.close();
>con.close();
>   } catch (Exception e) {
>e.printStackTrace();
>;
>   }
> {code}
> The default fetch size is 50. If a decimal column's position is under 50, the 
> precision is returned properly; when the column position is greater than 50, 
> the scale is returned as zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15166) Provide beeline option to set the jline history max size

2017-01-15 Thread Eric Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823492#comment-15823492
 ] 

Eric Lin commented on HIVE-15166:
-

Hi [~aihuaxu],

Should I create a JIRA review for you?

> Provide beeline option to set the jline history max size
> 
>
> Key: HIVE-15166
> URL: https://issues.apache.org/jira/browse/HIVE-15166
> Project: Hive
>  Issue Type: Improvement
>  Components: Beeline
>Affects Versions: 2.1.0
>Reporter: Eric Lin
>Assignee: Eric Lin
>Priority: Minor
> Attachments: HIVE-15166.patch
>
>
> Currently Beeline does not provide an option to limit the max size of the 
> Beeline history file. When each query is very big, it floods the history file 
> and slows down Beeline on startup and shutdown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15166) Provide beeline option to set the jline history max size

2017-01-15 Thread Eric Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823486#comment-15823486
 ] 

Eric Lin commented on HIVE-15166:
-

[~aihuaxu],

Thanks for the comment. Please give me some time to review it. It has been a 
while since I submitted the patch. I will provide a new patch soon.

Thanks

> Provide beeline option to set the jline history max size
> 
>
> Key: HIVE-15166
> URL: https://issues.apache.org/jira/browse/HIVE-15166
> Project: Hive
>  Issue Type: Improvement
>  Components: Beeline
>Affects Versions: 2.1.0
>Reporter: Eric Lin
>Assignee: Eric Lin
>Priority: Minor
> Attachments: HIVE-15166.patch
>
>
> Currently Beeline does not provide an option to limit the max size of the 
> Beeline history file. When each query is very big, it floods the history file 
> and slows down Beeline on startup and shutdown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15544) Support scalar subqueries

2017-01-15 Thread Vineet Garg (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823382#comment-15823382
 ] 

Vineet Garg commented on HIVE-15544:


RB is at https://reviews.apache.org/r/3/. The current patch has a few 
outstanding issues, as described in the RB.

> Support scalar subqueries
> -
>
> Key: HIVE-15544
> URL: https://issues.apache.org/jira/browse/HIVE-15544
> Project: Hive
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>  Labels: sub-query
> Attachments: HIVE-15544.1.patch, HIVE-15544.2.patch, 
> HIVE-15544.3.patch
>
>
> Currently HIVE only supports IN/EXISTS/NOT IN/NOT EXISTS subqueries. HIVE 
> doesn't allow sub-queries such as:
> {code}
> explain select  a.ca_state state, count(*) cnt
>  from customer_address a
>  ,customer c
>  ,store_sales s
>  ,date_dim d
>  ,item i
>  where   a.ca_address_sk = c.c_current_addr_sk
>   and c.c_customer_sk = s.ss_customer_sk
>   and s.ss_sold_date_sk = d.d_date_sk
>   and s.ss_item_sk = i.i_item_sk
>   and d.d_month_seq = 
>(select distinct (d_month_seq)
> from date_dim
>where d_year = 2000
>   and d_moy = 2 )
>   and i.i_current_price > 1.2 * 
>  (select avg(j.i_current_price) 
>from item j 
>where j.i_category = i.i_category)
>  group by a.ca_state
>  having count(*) >= 10
>  order by cnt 
>  limit 100;
> {code}
> We initially plan to support such scalar subqueries in filters, i.e. WHERE and 
> HAVING.
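> A minimal sketch of the kind of scalar subqueries this targets (table and 
> column names are illustrative only):
> {code}
> -- scalar subquery in WHERE
> select c_name
> from customer
> where c_acctbal > (select avg(c_acctbal) from customer);
> -- scalar subquery in HAVING
> select ss_store_sk, sum(ss_net_paid) as total_paid
> from store_sales
> group by ss_store_sk
> having sum(ss_net_paid) > (select avg(ss_net_paid) from store_sales);
> {code}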



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15544) Support scalar subqueries

2017-01-15 Thread Vineet Garg (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-15544:
---
Status: Patch Available  (was: Open)

> Support scalar subqueries
> -
>
> Key: HIVE-15544
> URL: https://issues.apache.org/jira/browse/HIVE-15544
> Project: Hive
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>  Labels: sub-query
> Attachments: HIVE-15544.1.patch, HIVE-15544.2.patch, 
> HIVE-15544.3.patch
>
>
> Currently HIVE only supports IN/EXISTS/NOT IN/NOT EXISTS subqueries. HIVE 
> doesn't allow sub-queries such as:
> {code}
> explain select  a.ca_state state, count(*) cnt
>  from customer_address a
>  ,customer c
>  ,store_sales s
>  ,date_dim d
>  ,item i
>  where   a.ca_address_sk = c.c_current_addr_sk
>   and c.c_customer_sk = s.ss_customer_sk
>   and s.ss_sold_date_sk = d.d_date_sk
>   and s.ss_item_sk = i.i_item_sk
>   and d.d_month_seq = 
>(select distinct (d_month_seq)
> from date_dim
>where d_year = 2000
>   and d_moy = 2 )
>   and i.i_current_price > 1.2 * 
>  (select avg(j.i_current_price) 
>from item j 
>where j.i_category = i.i_category)
>  group by a.ca_state
>  having count(*) >= 10
>  order by cnt 
>  limit 100;
> {code}
> We initially plan to support such scalar subqueries in filters, i.e. WHERE and 
> HAVING.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15544) Support scalar subqueries

2017-01-15 Thread Vineet Garg (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-15544:
---
Attachment: HIVE-15544.3.patch

> Support scalar subqueries
> -
>
> Key: HIVE-15544
> URL: https://issues.apache.org/jira/browse/HIVE-15544
> Project: Hive
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>  Labels: sub-query
> Attachments: HIVE-15544.1.patch, HIVE-15544.2.patch, 
> HIVE-15544.3.patch
>
>
> Currently HIVE only supports IN/EXISTS/NOT IN/NOT EXISTS subqueries. HIVE 
> doesn't allow sub-queries such as:
> {code}
> explain select  a.ca_state state, count(*) cnt
>  from customer_address a
>  ,customer c
>  ,store_sales s
>  ,date_dim d
>  ,item i
>  where   a.ca_address_sk = c.c_current_addr_sk
>   and c.c_customer_sk = s.ss_customer_sk
>   and s.ss_sold_date_sk = d.d_date_sk
>   and s.ss_item_sk = i.i_item_sk
>   and d.d_month_seq = 
>(select distinct (d_month_seq)
> from date_dim
>where d_year = 2000
>   and d_moy = 2 )
>   and i.i_current_price > 1.2 * 
>  (select avg(j.i_current_price) 
>from item j 
>where j.i_category = i.i_category)
>  group by a.ca_state
>  having count(*) >= 10
>  order by cnt 
>  limit 100;
> {code}
> We initially plan to support such scalar subqueries in filters, i.e. WHERE and 
> HAVING.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15544) Support scalar subqueries

2017-01-15 Thread Vineet Garg (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-15544:
---
Status: Open  (was: Patch Available)

> Support scalar subqueries
> -
>
> Key: HIVE-15544
> URL: https://issues.apache.org/jira/browse/HIVE-15544
> Project: Hive
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>  Labels: sub-query
> Attachments: HIVE-15544.1.patch, HIVE-15544.2.patch, 
> HIVE-15544.3.patch
>
>
> Currently HIVE only supports IN/EXISTS/NOT IN/NOT EXISTS subqueries. HIVE 
> doesn't allow sub-queries such as:
> {code}
> explain select  a.ca_state state, count(*) cnt
>  from customer_address a
>  ,customer c
>  ,store_sales s
>  ,date_dim d
>  ,item i
>  where   a.ca_address_sk = c.c_current_addr_sk
>   and c.c_customer_sk = s.ss_customer_sk
>   and s.ss_sold_date_sk = d.d_date_sk
>   and s.ss_item_sk = i.i_item_sk
>   and d.d_month_seq = 
>(select distinct (d_month_seq)
> from date_dim
>where d_year = 2000
>   and d_moy = 2 )
>   and i.i_current_price > 1.2 * 
>  (select avg(j.i_current_price) 
>from item j 
>where j.i_category = i.i_category)
>  group by a.ca_state
>  having count(*) >= 10
>  order by cnt 
>  limit 100;
> {code}
> We initially plan to support such scalar subqueries in filters, i.e. WHERE and 
> HAVING.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15269) Dynamic Min-Max runtime-filtering for Tez

2017-01-15 Thread Deepak Jaiswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Jaiswal updated HIVE-15269:
--
Attachment: HIVE-15269.12.patch

Some more fixes as part of the semi-join reduction effort.

> Dynamic Min-Max runtime-filtering for Tez
> -
>
> Key: HIVE-15269
> URL: https://issues.apache.org/jira/browse/HIVE-15269
> Project: Hive
>  Issue Type: New Feature
>Reporter: Jason Dere
>Assignee: Deepak Jaiswal
> Attachments: HIVE-15269.10.patch, HIVE-15269.11.patch, 
> HIVE-15269.12.patch, HIVE-15269.1.patch, HIVE-15269.2.patch, 
> HIVE-15269.3.patch, HIVE-15269.4.patch, HIVE-15269.5.patch, 
> HIVE-15269.6.patch, HIVE-15269.7.patch, HIVE-15269.8.patch, HIVE-15269.9.patch
>
>
> If a dimension table and fact table are joined:
> {noformat}
> select *
> from store join store_sales on (store.id = store_sales.store_id)
> where store.s_store_name = 'My Store'
> {noformat}
> One optimization that can be done is to get the min/max store id values that 
> come out of the scan/filter of the store table, and send this min/max value 
> (via Tez edge) to the task which is scanning the store_sales table.
> We can add a BETWEEN(min, max) predicate to the store_sales TableScan, where 
> this predicate can be pushed down to the storage handler (for example for ORC 
> formats). Pushing a min/max predicate to the ORC reader would allow us to 
> avoid having to read entire row groups during the table scan.
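> A conceptual sketch of the effect (min_store_id and max_store_id are 
> placeholders for values computed at runtime from the dimension-side scan, not 
> real syntax):
> {code}
> -- effective fact-side scan once the runtime min/max filter is injected
> select *
> from store_sales
> where store_sales.store_id between ${min_store_id} and ${max_store_id};
> {code}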



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-15581) Unable to use advanced aggregation with multiple inserts clause

2017-01-15 Thread James Ball (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Ball resolved HIVE-15581.
---
Resolution: Won't Fix

Closing as an issue specific to MapReduce.

> Unable to use advanced aggregation with multiple inserts clause
> ---
>
> Key: HIVE-15581
> URL: https://issues.apache.org/jira/browse/HIVE-15581
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 1.2.1
>Reporter: James Ball
>  Labels: newbie
>
> ■Use Cases
> - Use multiple insert clauses within a single query to insert multiple static 
> (user-defined) partitions into a single table.
> - Use advanced aggregation (cube) features within each insert clause to 
> include subtotals of columns for each partition
> ■Expected Behaviour
> - Subtotals are inserted for all combinations of the set of columns
> ■Observed Behaviour
> - No subtotals are inserted for any combination of the set of columns
> ■Sample Queries
> {code:sql}
> // Create test tables
> create table if not exists
>   table1
>   (
>   column1 string,
>   column2 string,
>   column3 int
>   )
>   stored as orc
>   tblproperties
>   (
>   "orc.compress" = "SNAPPY"
>   );
> create table if not exists
>   table2
>   (
>   column1 string,
>   column2 string,
>   column3 int
>   )
>   partitioned by
>   (
>   partition1 string
>   )
>   stored as orc
>   tblproperties
>   (
>   "orc.compress" = "SNAPPY"
>   );
> create table if not exists
>   table3
>   (
>   column1 string,
>   column2 string,
>   column3 int
>   )
>   partitioned by
>   (
>   partition1 string
>   )
>   stored as orc
>   tblproperties
>   (
>   "orc.compress" = "SNAPPY"
>   );
> {code}
> {code:sql}
> // Insert test values
> insert overwrite table
>   table1
>   values
>   ('value1', 'value1', 1),
>   ('value2', 'value2', 1),
>   ('value3', 'value3', 1);
> {code}
> {code:sql}
> // Single insert clause with multiple inserts syntax
> // Subtotals are inserted into target table
> from
>   table1
> insert overwrite table
>   table2
>   partition
>   (
>   partition1 = 'value1'
>   )
>   select
>   column1,
>   column2,
>   sum(column3) as column3
>   group by
>   column1,
>   column2
>   with cube;
> {code}
> {code:sql}
> // Multiple insert clauses with multiple inserts syntax
> // Subtotals are not inserted into target table
> from
>   table1
> insert overwrite table
>   table3
>   partition
>   (
>   partition1 = 'value1'
>   )
>   select
>   column1,
>   column2,
>   sum(column3) as column3
>   group by
>   column1,
>   column2
>   with cube
> insert overwrite table
>   table3
>   partition
>   (
>   partition1 = 'value2'
>   )
>   select
>   column1,
>   column2,
>   sum(column3) as column3
>   group by
>   column1,
>   column2
>   with cube;
> {code}
> ■Executions Plans
> - Single insert clause with multiple inserts syntax
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
>   Stage-2 depends on stages: Stage-0
> STAGE PLANS:
>   Stage: Stage-1
> Map Reduce
>   Map Operator Tree:
>   TableScan
> alias: table1
> Statistics: Num rows: 3 Data size: 552 Basic stats: COMPLETE 
> Column stats: NONE
> Select Operator
>   expressions: column1 (type: string), column2 (type: string), 
> column3 (type: int)
>   outputColumnNames: column1, column2, column3
>   Statistics: Num rows: 3 Data size: 552 Basic stats: COMPLETE 
> Column stats: NONE
>   Group By Operator
> aggregations: sum(column3)
> keys: column1 (type: string), column2 (type: string), '0' 
> (type: string)
> mode: hash
> outputColumnNames: _col0, _col1, _col2, _col3
> Statistics: Num rows: 12 Data size: 2208 Basic stats: 
> COMPLETE Column stats: NONE
> Reduce Output Operator
>   key expressions: _col0 (type: string), _col1 (type: 
> string), _col2 (type: string)
>   sort order: +++
>   Map-reduce partition columns: 

[jira] [Commented] (HIVE-15581) Unable to use advanced aggregation with multiple inserts clause

2017-01-15 Thread James Ball (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823343#comment-15823343
 ] 

James Ball commented on HIVE-15581:
---

[~cartershanklin]
I tried Tez and the query was executed correctly, so it may well be a problem 
specific to MapReduce.
Thank you for the suggestion.

Results w/ Tez:
{noformat}
table3.column1  table3.column2  table3.column3  table3.partition1
NULL    NULL    3   value1
NULL    value1  1   value1
NULL    value2  1   value1
NULL    value3  1   value1
value1  NULL    1   value1
value1  value1  1   value1
value2  NULL    1   value1
value2  value2  1   value1
value3  NULL    1   value1
value3  value3  1   value1
NULL    NULL    3   value2
NULL    value1  1   value2
NULL    value2  1   value2
NULL    value3  1   value2
value1  NULL    1   value2
value1  value1  1   value2
value2  NULL    1   value2
value2  value2  1   value2
value3  NULL    1   value2
value3  value3  1   value2
{noformat}

> Unable to use advanced aggregation with multiple inserts clause
> ---
>
> Key: HIVE-15581
> URL: https://issues.apache.org/jira/browse/HIVE-15581
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 1.2.1
>Reporter: James Ball
>  Labels: newbie
>
> ■Use Cases
> - Use multiple insert clauses within a single query to insert multiple static 
> (user-defined) partitions into a single table.
> - Use advanced aggregation (cube) features within each insert clause to 
> include subtotals of columns for each partition
> ■Expected Behaviour
> - Subtotals are inserted for all combinations of the set of columns
> ■Observed Behaviour
> - No subtotals are inserted for any combination of the set of columns
> ■Sample Queries
> {code:sql}
> // Create test tables
> create table if not exists
>   table1
>   (
>   column1 string,
>   column2 string,
>   column3 int
>   )
>   stored as orc
>   tblproperties
>   (
>   "orc.compress" = "SNAPPY"
>   );
> create table if not exists
>   table2
>   (
>   column1 string,
>   column2 string,
>   column3 int
>   )
>   partitioned by
>   (
>   partition1 string
>   )
>   stored as orc
>   tblproperties
>   (
>   "orc.compress" = "SNAPPY"
>   );
> create table if not exists
>   table3
>   (
>   column1 string,
>   column2 string,
>   column3 int
>   )
>   partitioned by
>   (
>   partition1 string
>   )
>   stored as orc
>   tblproperties
>   (
>   "orc.compress" = "SNAPPY"
>   );
> {code}
> {code:sql}
> // Insert test values
> insert overwrite table
>   table1
>   values
>   ('value1', 'value1', 1),
>   ('value2', 'value2', 1),
>   ('value3', 'value3', 1);
> {code}
> {code:sql}
> // Single insert clause with multiple inserts syntax
> // Subtotals are inserted into target table
> from
>   table1
> insert overwrite table
>   table2
>   partition
>   (
>   partition1 = 'value1'
>   )
>   select
>   column1,
>   column2,
>   sum(column3) as column3
>   group by
>   column1,
>   column2
>   with cube;
> {code}
> {code:sql}
> // Multiple insert clauses with multiple inserts syntax
> // Subtotals are not inserted into target table
> from
>   table1
> insert overwrite table
>   table3
>   partition
>   (
>   partition1 = 'value1'
>   )
>   select
>   column1,
>   column2,
>   sum(column3) as column3
>   group by
>   column1,
>   column2
>   with cube
> insert overwrite table
>   table3
>   partition
>   (
>   partition1 = 'value2'
>   )
>   select
>   column1,
>   column2,
>   sum(column3) as column3
>   group by
>   column1,
>   column2
>   with cube;
> {code}
> ■Executions Plans
> - Single insert clause with multiple inserts syntax
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
>   Stage-2 depends on stages: Stage-0
> STAGE PLANS:
>   Stage: Stage-1
> Map Reduce
>   Map Operator Tree:
>   TableScan

[jira] [Commented] (HIVE-15478) Add file + checksum list for create table/partition during notification creation (whenever relevant)

2017-01-15 Thread Sushanth Sowmyan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823332#comment-15823332
 ] 

Sushanth Sowmyan commented on HIVE-15478:
-

(Note: I commented on the review board with those open issues - let's discuss 
them before creating a follow-up JIRA.)

> Add file + checksum list for create table/partition during notification 
> creation (whenever relevant)
> 
>
> Key: HIVE-15478
> URL: https://issues.apache.org/jira/browse/HIVE-15478
> Project: Hive
>  Issue Type: Sub-task
>  Components: repl
>Reporter: Vaibhav Gumashta
>Assignee: Daniel Dai
> Attachments: HIVE-15478.1.patch, HIVE-15478.2.patch
>
>
> Currently, the file list is generated during REPL DUMP, which can result in 
> inconsistent data being captured. This ticket covers event dumping; the 
> bootstrap dump checksum will be handled in a different JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15478) Add file + checksum list for create table/partition during notification creation (whenever relevant)

2017-01-15 Thread Sushanth Sowmyan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823331#comment-15823331
 ] 

Sushanth Sowmyan commented on HIVE-15478:
-

LGTM, +1 pending tests.

I do have some open issues regarding API changes introduced with this patch, 
but they can be taken up in a follow-up JIRA.

> Add file + checksum list for create table/partition during notification 
> creation (whenever relevant)
> 
>
> Key: HIVE-15478
> URL: https://issues.apache.org/jira/browse/HIVE-15478
> Project: Hive
>  Issue Type: Sub-task
>  Components: repl
>Reporter: Vaibhav Gumashta
>Assignee: Daniel Dai
> Attachments: HIVE-15478.1.patch, HIVE-15478.2.patch
>
>
> Currently, the file list is generated during REPL DUMP, which can result in 
> inconsistent data being captured. This ticket covers event dumping; the 
> bootstrap dump checksum will be handled in a different JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15554) Add task information to LLAP AM heartbeat

2017-01-15 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823320#comment-15823320
 ] 

Siddharth Seth commented on HIVE-15554:
---

I don't see a new patch / comment since my last comment... ?

> Add task information to LLAP AM heartbeat
> -
>
> Key: HIVE-15554
> URL: https://issues.apache.org/jira/browse/HIVE-15554
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-15554.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15623) Use customized version of netty for llap

2017-01-15 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823319#comment-15823319
 ] 

Siddharth Seth commented on HIVE-15623:
---

[~wzheng] - your best bet is to deploy a cluster, and look at the classpath 
generated for an LLAP cluster.

The patch may need additional fixes to skip getting netty from Tez, or a patch 
on Tez to not include netty as part of its tar. Last I checked, there were 2 
copies of netty in the classpath (3.6 and 4).

> Use customized version of netty for llap
> 
>
> Key: HIVE-15623
> URL: https://issues.apache.org/jira/browse/HIVE-15623
> Project: Hive
>  Issue Type: Task
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Wei Zheng
>Assignee: Wei Zheng
> Attachments: HIVE-15623.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15621) Remove use of JvmPauseMonitor in LLAP

2017-01-15 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823318#comment-15823318
 ] 

Siddharth Seth commented on HIVE-15621:
---

We should start the PauseMonitor (the Hive copy) in LLAP. Also, look at 
JvmMetrics, which is what publishes metrics from the Hadoop JvmPauseMonitor. It 
should be possible to do the same in LLAP.

> Remove use of JvmPauseMonitor in LLAP
> -
>
> Key: HIVE-15621
> URL: https://issues.apache.org/jira/browse/HIVE-15621
> Project: Hive
>  Issue Type: Task
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Wei Zheng
>Assignee: Wei Zheng
> Attachments: HIVE-15621.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15269) Dynamic Min-Max runtime-filtering for Tez

2017-01-15 Thread Deepak Jaiswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Jaiswal updated HIVE-15269:
--
Attachment: (was: HIVE-15269.12.patch)

> Dynamic Min-Max runtime-filtering for Tez
> -
>
> Key: HIVE-15269
> URL: https://issues.apache.org/jira/browse/HIVE-15269
> Project: Hive
>  Issue Type: New Feature
>Reporter: Jason Dere
>Assignee: Deepak Jaiswal
> Attachments: HIVE-15269.10.patch, HIVE-15269.11.patch, 
> HIVE-15269.1.patch, HIVE-15269.2.patch, HIVE-15269.3.patch, 
> HIVE-15269.4.patch, HIVE-15269.5.patch, HIVE-15269.6.patch, 
> HIVE-15269.7.patch, HIVE-15269.8.patch, HIVE-15269.9.patch
>
>
> If a dimension table and fact table are joined:
> {noformat}
> select *
> from store join store_sales on (store.id = store_sales.store_id)
> where store.s_store_name = 'My Store'
> {noformat}
> One optimization that can be done is to get the min/max store id values that 
> come out of the scan/filter of the store table, and send this min/max value 
> (via Tez edge) to the task which is scanning the store_sales table.
> We can add a BETWEEN(min, max) predicate to the store_sales TableScan, where 
> this predicate can be pushed down to the storage handler (for example for ORC 
> formats). Pushing a min/max predicate to the ORC reader would allow us to 
> avoid having to read entire row groups during the table scan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (HIVE-15269) Dynamic Min-Max runtime-filtering for Tez

2017-01-15 Thread Deepak Jaiswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Jaiswal updated HIVE-15269:
--
Comment: was deleted

(was: Removed frivolous logic that removed semi-join optimizations around map 
joins.
Handle cycles due to the map-side join and semi-join combo.)

> Dynamic Min-Max runtime-filtering for Tez
> -
>
> Key: HIVE-15269
> URL: https://issues.apache.org/jira/browse/HIVE-15269
> Project: Hive
>  Issue Type: New Feature
>Reporter: Jason Dere
>Assignee: Deepak Jaiswal
> Attachments: HIVE-15269.10.patch, HIVE-15269.11.patch, 
> HIVE-15269.1.patch, HIVE-15269.2.patch, HIVE-15269.3.patch, 
> HIVE-15269.4.patch, HIVE-15269.5.patch, HIVE-15269.6.patch, 
> HIVE-15269.7.patch, HIVE-15269.8.patch, HIVE-15269.9.patch
>
>
> If a dimension table and fact table are joined:
> {noformat}
> select *
> from store join store_sales on (store.id = store_sales.store_id)
> where store.s_store_name = 'My Store'
> {noformat}
> One optimization that can be done is to get the min/max store id values that 
> come out of the scan/filter of the store table, and send this min/max value 
> (via Tez edge) to the task which is scanning the store_sales table.
> We can add a BETWEEN(min, max) predicate to the store_sales TableScan, where 
> this predicate can be pushed down to the storage handler (for example for ORC 
> formats). Pushing a min/max predicate to the ORC reader would allow us to 
> avoid having to read entire row groups during the table scan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15160) Can't order by an unselected column

2017-01-15 Thread Pengcheng Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengcheng Xiong updated HIVE-15160:
---
Status: Patch Available  (was: Open)

> Can't order by an unselected column
> ---
>
> Key: HIVE-15160
> URL: https://issues.apache.org/jira/browse/HIVE-15160
> Project: Hive
>  Issue Type: Bug
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-15160.01.patch
>
>
> If a grouping key hasn't been selected, Hive complains. For comparison, 
> Postgres does not.
> Example. Notice i_item_id is not selected:
> {code}
> select  i_item_desc
>,i_category
>,i_class
>,i_current_price
>,sum(cs_ext_sales_price) as itemrevenue
>,sum(cs_ext_sales_price)*100/sum(sum(cs_ext_sales_price)) over
>(partition by i_class) as revenueratio
>  from catalog_sales
>  ,item
>  ,date_dim
>  where cs_item_sk = i_item_sk
>and i_category in ('Jewelry', 'Sports', 'Books')
>and cs_sold_date_sk = d_date_sk
>  and d_date between cast('2001-01-12' as date)
>   and (cast('2001-01-12' as date) + 30 days)
>  group by i_item_id
>  ,i_item_desc
>  ,i_category
>  ,i_class
>  ,i_current_price
>  order by i_category
>  ,i_class
>  ,i_item_id
>  ,i_item_desc
>  ,revenueratio
> limit 100;
> {code}
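> A smaller illustration of the same behaviour (hypothetical table t with 
> columns col1 and col2):
> {code}
> -- col2 is a grouping key but is not selected; Hive rejects the order by,
> -- while Postgres accepts it
> select col1
> from t
> group by col1, col2
> order by col2;
> {code}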



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15160) Can't order by an unselected column

2017-01-15 Thread Pengcheng Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengcheng Xiong updated HIVE-15160:
---
Status: Open  (was: Patch Available)

> Can't order by an unselected column
> ---
>
> Key: HIVE-15160
> URL: https://issues.apache.org/jira/browse/HIVE-15160
> Project: Hive
>  Issue Type: Bug
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-15160.01.patch
>
>
> If a grouping key hasn't been selected, Hive complains. For comparison, 
> Postgres does not.
> Example. Notice i_item_id is not selected:
> {code}
> select  i_item_desc
>,i_category
>,i_class
>,i_current_price
>,sum(cs_ext_sales_price) as itemrevenue
>,sum(cs_ext_sales_price)*100/sum(sum(cs_ext_sales_price)) over
>(partition by i_class) as revenueratio
>  from catalog_sales
>  ,item
>  ,date_dim
>  where cs_item_sk = i_item_sk
>and i_category in ('Jewelry', 'Sports', 'Books')
>and cs_sold_date_sk = d_date_sk
>  and d_date between cast('2001-01-12' as date)
>   and (cast('2001-01-12' as date) + 30 days)
>  group by i_item_id
>  ,i_item_desc
>  ,i_category
>  ,i_class
>  ,i_current_price
>  order by i_category
>  ,i_class
>  ,i_item_id
>  ,i_item_desc
>  ,revenueratio
> limit 100;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15581) Unable to use advanced aggregation with multiple inserts clause

2017-01-15 Thread Carter Shanklin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823178#comment-15823178
 ] 

Carter Shanklin commented on HIVE-15581:


Have you tried it with Hive on Tez? Not saying it will help, but if the problem 
is specific to Hive on MapReduce, it's not likely it will ever be fixed.

> Unable to use advanced aggregation with multiple inserts clause
> ---
>
> Key: HIVE-15581
> URL: https://issues.apache.org/jira/browse/HIVE-15581
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 1.2.1
>Reporter: James Ball
>  Labels: newbie
>
> ■Use Cases
> - Use multiple insert clauses within a single query to insert multiple static 
> (user-defined) partitions into a single table.
> - Use advanced aggregation (cube) features within each insert clause to 
> include subtotals of columns for each partition
> ■Expected Behaviour
> - Subtotals are inserted for all combinations of the set of columns
> ■Observed Behaviour
> - No subtotals are inserted for any combination of the set of columns
> ■Sample Queries
> {code:sql}
> // Create test tables
> create table if not exists
>   table1
>   (
>   column1 string,
>   column2 string,
>   column3 int
>   )
>   stored as orc
>   tblproperties
>   (
>   "orc.compress" = "SNAPPY"
>   );
> create table if not exists
>   table2
>   (
>   column1 string,
>   column2 string,
>   column3 int
>   )
>   partitioned by
>   (
>   partition1 string
>   )
>   stored as orc
>   tblproperties
>   (
>   "orc.compress" = "SNAPPY"
>   );
> create table if not exists
>   table3
>   (
>   column1 string,
>   column2 string,
>   column3 int
>   )
>   partitioned by
>   (
>   partition1 string
>   )
>   stored as orc
>   tblproperties
>   (
>   "orc.compress" = "SNAPPY"
>   );
> {code}
> {code:sql}
> // Insert test values
> insert overwrite table
>   table1
>   values
>   ('value1', 'value1', 1),
>   ('value2', 'value2', 1),
>   ('value3', 'value3', 1);
> {code}
> {code:sql}
> // Single insert clause with multiple inserts syntax
> // Subtotals are inserted into target table
> from
>   table1
> insert overwrite table
>   table2
>   partition
>   (
>   partition1 = 'value1'
>   )
>   select
>   column1,
>   column2,
>   sum(column3) as column3
>   group by
>   column1,
>   column2
>   with cube;
> {code}
> {code:sql}
> // Multiple insert clauses with multiple inserts syntax
> // Subtotals are not inserted into target table
> from
>   table1
> insert overwrite table
>   table3
>   partition
>   (
>   partition1 = 'value1'
>   )
>   select
>   column1,
>   column2,
>   sum(column3) as column3
>   group by
>   column1,
>   column2
>   with cube
> insert overwrite table
>   table3
>   partition
>   (
>   partition1 = 'value2'
>   )
>   select
>   column1,
>   column2,
>   sum(column3) as column3
>   group by
>   column1,
>   column2
>   with cube;
> {code}
> ■Executions Plans
> - Single insert clause with multiple inserts syntax
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
>   Stage-2 depends on stages: Stage-0
> STAGE PLANS:
>   Stage: Stage-1
> Map Reduce
>   Map Operator Tree:
>   TableScan
> alias: table1
> Statistics: Num rows: 3 Data size: 552 Basic stats: COMPLETE 
> Column stats: NONE
> Select Operator
>   expressions: column1 (type: string), column2 (type: string), 
> column3 (type: int)
>   outputColumnNames: column1, column2, column3
>   Statistics: Num rows: 3 Data size: 552 Basic stats: COMPLETE 
> Column stats: NONE
>   Group By Operator
> aggregations: sum(column3)
> keys: column1 (type: string), column2 (type: string), '0' 
> (type: string)
> mode: hash
> outputColumnNames: _col0, _col1, _col2, _col3
> Statistics: Num rows: 12 Data size: 2208 Basic stats: 
> COMPLETE Column stats: NONE
> Reduce Output Operator
>   key expressions: _col0 (type: 

[jira] [Commented] (HIVE-15631) Optimize for hive client logs , you can filter the log for each session itself.

2017-01-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823090#comment-15823090
 ] 

ASF GitHub Bot commented on HIVE-15631:
---

GitHub user Tartarus0zm opened a pull request:

https://github.com/apache/hive/pull/132

#HIVE-15631

If hive.log.reload.variable.enable is set to true and hive.session.id is used 
in hive-log4j2.properties, then hive.session.id will be printed in the logs.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Tartarus0zm/hive reload_log4j_variable

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/hive/pull/132.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #132


commit 5cb250b803ce36ee02b0d6c637edfe4b0a9376f3
Author: Tartarus 
Date:   2017-01-15T09:54:08Z

#HIVE-15631
If hive.log.reload.variable.enable is set to true and hive.session.id is used 
in hive-log4j2.properties, then hive.session.id will be printed in the logs.




> Optimize for hive client logs , you can filter the log for each session 
> itself.
> ---
>
> Key: HIVE-15631
> URL: https://issues.apache.org/jira/browse/HIVE-15631
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Clients, Hive
>Reporter: tartarus
>Assignee: tartarus
> Attachments: HIVE_15631.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We have several Hadoop clusters, about 15 thousand nodes in total. Every day 
> we use Hive to submit more than 100 thousand jobs. 
> So a large Hive log file accumulates on every client host each day, but I 
> cannot tell which lines belong to the session I submitted. 
> So I would like to print hive.session.id on every line of the logs, so that 
> I can grep for the logs of my own session. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15631) Optimize for hive client logs , you can filter the log for each session itself.

2017-01-15 Thread tartarus (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tartarus updated HIVE-15631:

Attachment: HIVE_15631.patch

> Optimize for hive client logs , you can filter the log for each session 
> itself.
> ---
>
> Key: HIVE-15631
> URL: https://issues.apache.org/jira/browse/HIVE-15631
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Clients, Hive
>Reporter: tartarus
>Assignee: tartarus
> Attachments: HIVE_15631.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We have several Hadoop clusters, about 15 thousand nodes in total. Every day 
> we use Hive to submit more than 100 thousand jobs. 
> So a large Hive log file accumulates on every client host each day, but I 
> cannot tell which lines belong to the session I submitted. 
> So I would like to print hive.session.id on every line of the logs, so that 
> I can grep for the logs of my own session. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15631) Optimize for hive client logs , you can filter the log for each session itself.

2017-01-15 Thread tartarus (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tartarus updated HIVE-15631:

Status: Patch Available  (was: Open)

> Optimize for hive client logs , you can filter the log for each session 
> itself.
> ---
>
> Key: HIVE-15631
> URL: https://issues.apache.org/jira/browse/HIVE-15631
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Clients, Hive
>Reporter: tartarus
>Assignee: tartarus
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We have several Hadoop clusters, about 15 thousand nodes in total. Every day 
> we use Hive to submit more than 100 thousand jobs. 
> So a large Hive log file accumulates on every client host each day, but I 
> cannot tell which lines belong to the session I submitted. 
> So I would like to print hive.session.id on every line of the logs, so that 
> I can grep for the logs of my own session. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15627) Make hive.vectorized.adaptor.usage.mode=all vectorize all UDFs not just those in supportedGenericUDFs

2017-01-15 Thread Matt McCline (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt McCline updated HIVE-15627:

Attachment: HIVE-15627.04.patch

> Make hive.vectorized.adaptor.usage.mode=all vectorize all UDFs not just those 
> in supportedGenericUDFs
> -
>
> Key: HIVE-15627
> URL: https://issues.apache.org/jira/browse/HIVE-15627
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Matt McCline
>Assignee: Matt McCline
>Priority: Critical
> Attachments: HIVE-15627.01.patch, HIVE-15627.02.patch, 
> HIVE-15627.03.patch, HIVE-15627.04.patch
>
>
> Missed this when doing HIVE-14336.
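
As a rough usage sketch (not taken from the patch), the behavior under 
discussion would be toggled per session roughly as below; my_custom_udf is a 
placeholder for any UDF that is not on the built-in supportedGenericUDFs list.

{code:sql}
-- Hypothetical session; my_custom_udf stands in for a UDF outside the
-- built-in supported list.
set hive.vectorized.execution.enabled=true;
-- With usage mode "all", the vectorized UDF adaptor is applied to every UDF,
-- not only those in supportedGenericUDFs.
set hive.vectorized.adaptor.usage.mode=all;
select my_custom_udf(column1) from table1;
{code}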



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15627) Make hive.vectorized.adaptor.usage.mode=all vectorize all UDFs not just those in supportedGenericUDFs

2017-01-15 Thread Matt McCline (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt McCline updated HIVE-15627:

Status: Patch Available  (was: In Progress)

> Make hive.vectorized.adaptor.usage.mode=all vectorize all UDFs not just those 
> in supportedGenericUDFs
> -
>
> Key: HIVE-15627
> URL: https://issues.apache.org/jira/browse/HIVE-15627
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Matt McCline
>Assignee: Matt McCline
>Priority: Critical
> Attachments: HIVE-15627.01.patch, HIVE-15627.02.patch, 
> HIVE-15627.03.patch, HIVE-15627.04.patch
>
>
> Missed this when doing HIVE-14336.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15627) Make hive.vectorized.adaptor.usage.mode=all vectorize all UDFs not just those in supportedGenericUDFs

2017-01-15 Thread Matt McCline (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt McCline updated HIVE-15627:

Status: In Progress  (was: Patch Available)

> Make hive.vectorized.adaptor.usage.mode=all vectorize all UDFs not just those 
> in supportedGenericUDFs
> -
>
> Key: HIVE-15627
> URL: https://issues.apache.org/jira/browse/HIVE-15627
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Matt McCline
>Assignee: Matt McCline
>Priority: Critical
> Attachments: HIVE-15627.01.patch, HIVE-15627.02.patch, 
> HIVE-15627.03.patch, HIVE-15627.04.patch
>
>
> Missed this when doing HIVE-14336.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15588) Vectorization: Fix deallocation of scratch columns in VectorUDFCoalesce, etc to prevent wrong reuse

2017-01-15 Thread Matt McCline (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt McCline updated HIVE-15588:

Status: In Progress  (was: Patch Available)

> Vectorization: Fix deallocation of scratch columns in VectorUDFCoalesce, etc 
> to prevent wrong reuse
> ---
>
> Key: HIVE-15588
> URL: https://issues.apache.org/jira/browse/HIVE-15588
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Matt McCline
>Assignee: Matt McCline
>Priority: Critical
> Attachments: HIVE-15588.01.patch, HIVE-15588.02.patch, 
> HIVE-15588.03.patch, HIVE-15588.04.patch
>
>
> Make sure we don't deallocate a scratch column too quickly and cause result 
> corruption due to scratch column reuse.
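
Purely as an illustration of the code path involved (not a reproducer taken 
from this issue), a vectorized query that exercises the coalesce expression, 
and therefore allocates scratch columns, might look like the following; the 
table and columns reuse the schema from the reproduction case quoted earlier 
in this digest.

{code:sql}
-- Hypothetical example; table1, column1 and column2 come from the earlier
-- reproduction case and are not tied to this issue's own tests.
set hive.vectorized.execution.enabled=true;
select coalesce(column1, column2, 'default') as merged
from table1;
{code}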



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15588) Vectorization: Fix deallocation of scratch columns in VectorUDFCoalesce, etc to prevent wrong reuse

2017-01-15 Thread Matt McCline (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt McCline updated HIVE-15588:

Attachment: HIVE-15588.04.patch

> Vectorization: Fix deallocation of scratch columns in VectorUDFCoalesce, etc 
> to prevent wrong reuse
> ---
>
> Key: HIVE-15588
> URL: https://issues.apache.org/jira/browse/HIVE-15588
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Matt McCline
>Assignee: Matt McCline
>Priority: Critical
> Attachments: HIVE-15588.01.patch, HIVE-15588.02.patch, 
> HIVE-15588.03.patch, HIVE-15588.04.patch
>
>
> Make sure we don't deallocate a scratch column too quickly and cause result 
> corruption due to scratch column reuse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15588) Vectorization: Fix deallocation of scratch columns in VectorUDFCoalesce, etc to prevent wrong reuse

2017-01-15 Thread Matt McCline (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt McCline updated HIVE-15588:

Status: Patch Available  (was: In Progress)

> Vectorization: Fix deallocation of scratch columns in VectorUDFCoalesce, etc 
> to prevent wrong reuse
> ---
>
> Key: HIVE-15588
> URL: https://issues.apache.org/jira/browse/HIVE-15588
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Matt McCline
>Assignee: Matt McCline
>Priority: Critical
> Attachments: HIVE-15588.01.patch, HIVE-15588.02.patch, 
> HIVE-15588.03.patch, HIVE-15588.04.patch
>
>
> Make sure we don't deallocate a scratch column too quickly and cause result 
> corruption due to scratch column reuse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15631) Optimize for hive client logs , you can filter the log for each session itself.

2017-01-15 Thread tartarus (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tartarus updated HIVE-15631:

Description: 
We have several Hadoop clusters, about 15 thousand nodes in total. Every day we 
use Hive to submit more than 100 thousand jobs. 
So a large Hive log file accumulates on every client host each day, but I 
cannot tell which lines belong to the session I submitted. 
So I would like to print hive.session.id on every line of the logs, so that I 
can grep for the logs of my own session. 

  was:We have several Hadoop clusters, about 15 thousand nodes in total. Every 
day we use Hive to submit more than 100 thousand jobs. So a large Hive log file 
accumulates on every client host each day, but I cannot tell which lines belong 
to the session I submitted. So I would like to print hive.session.id on every 
line of the logs, so that I can grep for the logs of my own session. 


> Optimize for hive client logs , you can filter the log for each session 
> itself.
> ---
>
> Key: HIVE-15631
> URL: https://issues.apache.org/jira/browse/HIVE-15631
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Clients, Hive
>Reporter: tartarus
>Assignee: tartarus
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We have several Hadoop clusters, about 15 thousand nodes in total. Every day 
> we use Hive to submit more than 100 thousand jobs. 
> So a large Hive log file accumulates on every client host each day, but I 
> cannot tell which lines belong to the session I submitted. 
> So I would like to print hive.session.id on every line of the logs, so that 
> I can grep for the logs of my own session. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)