[jira] [Commented] (IMPALA-5073) Considering bypassing TCMalloc by default for buffer pool

2019-05-02 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831722#comment-16831722
 ] 

Ruslan Dautkhanov commented on IMPALA-5073:
---

Sorry for the noise on some of these JIRAs; it's just feedback from working 
with Red Hat support on some performance issues we're seeing. I can share 
more details from Red Hat if you're interested. Thanks for any input. 

> Considering bypassing TCMalloc by default for buffer pool
> -
>
> Key: IMPALA-5073
> URL: https://issues.apache.org/jira/browse/IMPALA-5073
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.9.0
>Reporter: Tim Armstrong
>Priority: Minor
>  Labels: resource-management
>
> There would be some advantages to switching from allocating buffers via 
> TCMalloc to using mmap directly - e.g. less contention for the page 
> heap lock.
> There are also downsides - virtual memory consumption could increase and we 
> may end up mapping and unmapping memory more frequently.
> We would also need to wire up the MemTrackers so they include this memory in 
> the process estimate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-5073) Considering bypassing TCMalloc by default for buffer pool

2019-05-02 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831718#comment-16831718
 ] 

Ruslan Dautkhanov commented on IMPALA-5073:
---

https://github.com/cloudera/Impala/blob/cdh5-2.11.0_5.14.4/be/src/runtime/bufferpool/system-allocator.cc#L81

Shouldn't the mmap() call pass MAP_HUGETLB in its flags argument for huge 
pages to be used? The call currently doesn't include it, so Impala's mmap()ed 
memory never uses huge pages?

As described, for example, in 
https://d3s.mff.cuni.cz/legacy/teaching/advanced_operating_systems/slides/10_huge_pages.pdf

Thanks! 
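A minimal sketch of the suggested change, assuming Linux with a configured huge page pool (`map_buffer` is a hypothetical helper for illustration, not Impala's actual code):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative sketch: request explicit huge pages by adding MAP_HUGETLB
 * to the flags argument of mmap(), falling back to regular 4 KB pages
 * when the kernel's huge page pool is empty or unconfigured. */
void *map_buffer(size_t len) {
  void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {
    /* No huge pages available; fall back to normal pages. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  }
  return p == MAP_FAILED ? NULL : p;
}
```

Note that MAP_HUGETLB only succeeds when huge pages have been reserved via vm.nr_hugepages, hence the fallback path.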

 




[jira] [Commented] (IMPALA-5073) Considering bypassing TCMalloc by default for buffer pool

2019-05-02 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831716#comment-16831716
 ] 

Ruslan Dautkhanov commented on IMPALA-5073:
---

Also somewhat related - it's interesting how the JVM uses pre-touch to move 
allocation cost to the startup phase:

[https://shipilev.net/jvm/anatomy-quarks/2-transparent-huge-pages/]

{quote}
 To shift these costs to the JVM startup that will avoid surprising latency 
hiccups when application is running, you may instruct JVM to touch every single 
page in Java heap with -XX:+AlwaysPreTouch during initialization. It is a good 
idea to enable pre-touch for larger heaps anyway.

And there comes the funny part: enabling -XX:+UseTransparentHugePages actually 
makes -XX:+AlwaysPreTouch faster, because JVM now knows it has to touch the 
heap in larger quanta (say, a byte every 2M), rather than in smaller ones (say, 
a byte every 4K). 
{quote}
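The same pre-touch idea can be sketched outside the JVM (a hypothetical illustration, not Impala or JVM code): touch one byte per page right after allocating, so the page faults are taken at startup rather than on the query path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative pre-touch: write one byte per page so the kernel faults
 * the memory in now, at startup, instead of later during execution.
 * With 2 MB transparent huge pages the stride could be 2 MB instead of
 * 4 KB, which is why pre-touch gets faster when THP is enabled. */
void pretouch(char *mem, size_t len, size_t page_size) {
  for (size_t off = 0; off < len; off += page_size)
    mem[off] = 0;
}
```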





[jira] [Commented] (IMPALA-5073) Considering bypassing TCMalloc by default for buffer pool

2019-05-02 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831702#comment-16831702
 ] 

Ruslan Dautkhanov commented on IMPALA-5073:
---

[~tarmstrong]

My 2 cents: would it be possible to preallocate a fixed amount of memory for 
impalad at startup?

This is similar to what Oracle does with `SGA_TARGET` (global shared buffers, 
cached query plans, etc.) and `PGA_AGGREGATE_TARGET` (per-session memory 
demands such as sorting temp results).

It increases startup time, which is okay, but then there is no mmap() 
overhead at the per-session / per-query level.

For Oracle the SGA always dominates (global db block buffer caches, etc.), 
but for Impala a PGA-like memory structure would dominate, I guess (per-query 
shuffle results, etc.).

Another advantage of preallocating memory is that you could optionally use 
hugetlbfs, since the amount of preallocated memory is static and known ahead 
of time. 
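The preallocation idea could be sketched roughly like this (hypothetical names, not Impala code): one big mmap() at startup, then a trivial bump allocator so no per-query mmap() is needed. A real pool would also need freeing, alignment classes, and thread safety.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative sketch: reserve one large region once at startup, then
 * hand out buffers from it so queries never call mmap() themselves. */
static char  *g_pool;
static size_t g_pool_size;
static size_t g_pool_used;

int pool_init(size_t bytes) {
  void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return -1;   /* preallocation failed at startup */
  g_pool = p;
  g_pool_size = bytes;
  g_pool_used = 0;
  return 0;
}

void *pool_alloc(size_t bytes) {
  if (g_pool_used + bytes > g_pool_size) return NULL;  /* pool exhausted */
  void *p = g_pool + g_pool_used;
  g_pool_used += bytes;
  return p;
}
```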

 




[jira] [Commented] (IMPALA-5229) Try using TCMalloc + Huge Pages for buffers

2019-05-01 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-5229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831112#comment-16831112
 ] 

Ruslan Dautkhanov commented on IMPALA-5229:
---

Thank you for prompt response [~tarmstrong]

I agree it's hard to manage for Impala that's colocated with YARN and other 
workloads, for example.

It makes sense for some dedicated scenarios only. We do use hugetlbfs on 
dedicated Oracle servers, for example, where we can tell exactly how much 
memory Oracle will use for its caches, etc. 

 

 

> Try using TCMalloc + Huge Pages for buffers
> ---
>
> Key: IMPALA-5229
> URL: https://issues.apache.org/jira/browse/IMPALA-5229
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.9.0
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
>  Labels: perf, resource-management
> Fix For: Impala 2.9.0
>
>
> As well as mmap() + huge pages, we could support TCMalloc + huge pages. I 
> believe TCMalloc will support this well because:
> * We can allocate huge-page-aligned memory via posix_memalign()
> * TCMalloc, by default, always decommits large allocations upon freeing them. 
> So if we undo the HUGEPAGE madvise() before handing the pages back to 
> TCMalloc, then TCMalloc will decommit the huge pages backing the allocation, 
> and we won't get TCMalloc's page heap into a weird state.
> This could give us the perf benefits of huge pages without some of the 
> headaches associated with using mmap() directly.
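The scheme in the description might be sketched as follows (hypothetical helpers, not Impala's actual implementation; the madvise() calls are best-effort and can fail when transparent huge pages are disabled):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)  /* common x86-64 huge page size */

/* Illustrative sketch: get huge-page-aligned memory from the regular
 * allocator, advise the kernel to back it with transparent huge pages,
 * and undo the advice before freeing so the allocator sees ordinary
 * pages again. */
void *alloc_huge(size_t len) {
  void *p = NULL;
  if (posix_memalign(&p, HUGE_PAGE_SIZE, len) != 0) return NULL;
  madvise(p, len, MADV_HUGEPAGE);    /* best effort; needs THP enabled */
  return p;
}

void free_huge(void *p, size_t len) {
  if (p == NULL) return;
  madvise(p, len, MADV_NOHUGEPAGE);  /* undo before handing pages back */
  free(p);
}
```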






[jira] [Commented] (IMPALA-5229) Try using TCMalloc + Huge Pages for buffers

2019-05-01 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-5229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831068#comment-16831068
 ] 

Ruslan Dautkhanov commented on IMPALA-5229:
---

We have some performance issues and are working with Red Hat Support:

{noformat}

I'm surprised that a system with so much RAM and Java isn't already using 
HugeTLB.

However the busiest cores might not be the problem and I could be looking in 
the wrong place, so we'll await your reply on that to target our investigation 
better. 

Also, you might want to ask Cloudera if you should be using the classic HugeTLB 
sort of pre-allocated HugePages? That will massively reduce your pagetable 
sizes and lead to less memory fragmentation too.

{noformat}

How do we enable Huge pages for Impala? 

Thank you!





[jira] [Commented] (IMPALA-6088) Rack aware broadcast operator

2019-04-26 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-6088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827317#comment-16827317
 ] 

Ruslan Dautkhanov commented on IMPALA-6088:
---

It would also be interesting to look at whether something like a torrent 
protocol helps here too; a torrent protocol can take rack location into 
account, if available.

Spark uses this to keep the Spark Driver (its analogue of the query 
coordinator) from being a network bottleneck:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala

{quote}

* The driver divides the serialized object into small chunks and
 * stores those chunks in the BlockManager of the driver.

 * On each executor, the executor first attempts to fetch the object from its 
BlockManager. If
 * it does not exist, it then uses remote fetches to fetch the small chunks 
from the driver and/or
 * other executors if available. Once it gets the chunks, it puts the chunks in 
its own
 * BlockManager, ready for other executors to fetch from.

 * This prevents the driver from being the bottleneck in sending out multiple 
copies of the
 * broadcast data (one per executor).

{quote}

> Rack aware broadcast operator
> -
>
> Key: IMPALA-6088
> URL: https://issues.apache.org/jira/browse/IMPALA-6088
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Distributed Exec
>Reporter: Mostafa Mokhtar
>Priority: Major
>
> When conducting large-scale experiments on a 6-rack cluster with an 
> aggregator core network topology, overall cluster bandwidth utilization was 
> limited. 
> With aggregator core networks, nodes and racks are not equidistant, which 
> means a broadcast operation can be inefficient, as the broadcasting node 
> needs to send the same data N times to each node on a remote rack. 
> Ideally, row batches should be sent once per remote rack, then a node on 
> each remote rack would broadcast within its rack. 
> The table below represents rack-to-rack latency for 90% of operations; the 
> ratio between best and worst case is 7.3x.
> | ||  va||vc||vd1||   vd3||   ve|
> |va|  4,238|  4,290|  9,692|  8,897|  8,208|
> |vc|  9,290|  4,396|  30,952| 13,529| 14,578|
> |vd1| 9,131|  29,066| 4,346|  17,265| 16,849|
> |vd3| 7,409|  15,517| 17,265| 4,370|  4,687|
> |ve|  4,914|  16,894| 16,430| 4,713|  4,472|






[jira] [Commented] (IMPALA-7204) Add support for GROUP BY ROLLUP

2019-04-04 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810112#comment-16810112
 ] 

Ruslan Dautkhanov commented on IMPALA-7204:
---

cc [~grahn] - can you please let us know if you're planning to add support 
for this feature?

Our account team said you might be the right person to ask :)

Thanks for any ideas!


> Add support for GROUP BY ROLLUP
> ---
>
> Key: IMPALA-7204
> URL: https://issues.apache.org/jira/browse/IMPALA-7204
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend
>Affects Versions: Impala 3.0, Impala 2.12.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: GROUP_BY, sql
>
> Now suppose that we'd like to analyze our sales data, to study the amount of 
> sales that is occurring for different products, in different states and 
> regions. Using the ROLLUP feature of SQL 2003, we could issue the query:
> {code:sql}
> select region, state, product, sum(sales) total_sales
> from sales_history 
> group by rollup (region, state, product)
> {code}
> Semantically, the above query is equivalent to
>  
> {code:sql}
> select region, state, product, sum(sales) total_sales
> from sales_history 
> group by region, state, product
> union
> select region, state, null, sum(sales) total_sales
> from sales_history 
> group by region, state
> union
> select region, null, null, sum(sales) total_sales
> from sales_history 
> group by region
> union
> select null, null, null, sum(sales) total_sales
> from sales_history
>  
> {code}
> The query might produce results that looked something like:
> {noformat}
> REGION STATE PRODUCT TOTAL_SALES
> -- - --- ---
> null null null 6200
> EAST MA BOATS 100
> EAST MA CARS 1500
> EAST MA null 1600
> EAST NY BOATS 150
> EAST NY CARS 1000
> EAST NY null 1150
> EAST null null 2750
> WEST CA BOATS 750
> WEST CA CARS 500
> WEST CA null 1250
> WEST AZ BOATS 2000
> WEST AZ CARS 200
> WEST AZ null 2200
> WEST null null 3450
> {noformat}
> We have a lot of production queries that work around this missing Impala 
> functionality with three UNION ALLs. The physical execution plan shows 
> Impala actually reads the full fact table three times, so this could be a 
> 3x improvement (or more, depending on the number of columns being rolled 
> up).
> I can't find another SQL-on-Hadoop engine that doesn't support this 
> feature. *Spark, Hive, Pig, Flink and some other engines were checked - 
> they all support this basic SQL feature*.
> Would be great to have a matching feature in Impala too.






[jira] [Commented] (IMPALA-6042) Allow Impala shell to also use a global impalarc configuration

2018-10-29 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667434#comment-16667434
 ] 

Ruslan Dautkhanov commented on IMPALA-6042:
---

Another use case is updating the PATH variable, for example to point to a 
Python 2 home, since impala-shell is not Python 3-compatible (IMPALA-3343) 
and we're migrating all our applications to Python 3.

It would be great to have a way to globally update certain config variables 
and environment variables like PATH, as a workaround for this and some other 
issues.

Thank you.

> Allow Impala shell to also use a global impalarc configuration
> --
>
> Key: IMPALA-6042
> URL: https://issues.apache.org/jira/browse/IMPALA-6042
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Clients
>Reporter: Balazs Jeszenszky
>Priority: Minor
>  Labels: newbie, shell, usability
>
> Currently, impalarc files can be specified on a per-user basis (stored in 
> ~/.impalarc), and they aren't created by default. 
> The Impala shell should pick up /etc/impalarc as well, in addition to the 
> user-specific configurations.
> The intent here is to allow a "global" configuration of the shell by a system 
> administrator with common options like:
> {code}
> --ssl
> -l
> -k
> -u 
> -i 
> {code}






[jira] [Commented] (IMPALA-3446) Materialized views

2018-10-26 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16665602#comment-16665602
 ] 

Ruslan Dautkhanov commented on IMPALA-3446:
---

Any updates on this case?

Both HIVE-14484 and HIVE-14249 are resolved, so materialized views are now 
technically supported in upstream Hive. 

It would be great to have the same in Impala, since Impala's use cases center 
on fast interactive response times even more than Hive's.

Thank you.

 

> Materialized views
> --
>
> Key: IMPALA-3446
> URL: https://issues.apache.org/jira/browse/IMPALA-3446
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Frontend
>Affects Versions: Impala 2.5.0
>Reporter: Marcell Szabo
>Priority: Minor
>
> This JIRA is a placeholder for this big topic.
> Some user stories I can imagine under the epic:
> # materialized view is a CTAS where the SELECT is saved in the HMS and can be 
> rerun by a simple command
> # materialized view detects that the source data has changed and falls back 
> to be a view instead of SELECT *
> # materialized view detects that the source data has changed and reruns the 
> CTAS automatically
> # ... reruns only the necessary changes (e.g. if only some of the partitions 
> change)
> # Impala and Hive to have common semantics of materialized views 
> (https://issues.apache.org/jira/browse/HIVE-10459)
> # materialized view stores extra statistics that help the optimizer, storing 
> the full resultset is optional
> # query optimizer checks for every query whether part of the query plan can 
> be covered by an existing materialized view






[jira] [Comment Edited] (IMPALA-110) Add support for multiple distinct operators in the same query block

2018-10-24 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662704#comment-16662704
 ] 

Ruslan Dautkhanov edited comment on IMPALA-110 at 10/24/18 7:17 PM:


[~grahn]

Yes, we would like to keep this session-level parameter too, so that when 
it's on, count(distinct) essentially runs as ndv.

Oracle Database 12.2 just introduced a similar session-level setting too:

[https://oracle-base.com/articles/12c/approximate-query-processing-12cr2] 

"You could refactor your code, or you could ask Oracle to convert your exact 
calls to approximate calls instead"

Please don't remove this great feature!

Thank you.

 



> Add support for multiple distinct operators in the same query block
> ---
>
> Key: IMPALA-110
> URL: https://issues.apache.org/jira/browse/IMPALA-110
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend, Frontend
>Affects Versions: Impala 0.5, Impala 1.4, Impala 2.0, Impala 2.2, Impala 
> 2.3.0
>Reporter: Greg Rahn
>Assignee: Thomas Tauber-Marshall
>Priority: Major
>  Labels: sql-language
> Fix For: Impala 3.1.0
>
>
> Impala only allows a single (DISTINCT columns) expression in each query.
> {color:red}Note:
> If you do not need precise accuracy, you can produce an estimate of the 
> distinct values for a column by specifying NDV(column); a query can contain 
> multiple instances of NDV(column). To make Impala automatically rewrite 
> COUNT(DISTINCT) expressions to NDV(), enable the APPX_COUNT_DISTINCT query 
> option.
> {color}
> {code}
> [impala:21000] > select count(distinct i_class_id) from item;
> Query: select count(distinct i_class_id) from item
> Query finished, fetching results ...
> 16
> Returned 1 row(s) in 1.51s
> {code}
> {code}
> [impala:21000] > select count(distinct i_class_id), count(distinct 
> i_brand_id) from item;
> Query: select count(distinct i_class_id), count(distinct i_brand_id) from item
> ERROR: com.cloudera.impala.common.AnalysisException: Analysis exception (in 
> select count(distinct i_class_id), count(distinct i_brand_id) from item)
>   at 
> com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:133)
>   at 
> com.cloudera.impala.service.Frontend.createExecRequest(Frontend.java:221)
>   at 
> com.cloudera.impala.service.JniFrontend.createExecRequest(JniFrontend.java:89)
> Caused by: com.cloudera.impala.common.AnalysisException: all DISTINCT 
> aggregate functions need to have the same set of parameters as COUNT(DISTINCT 
> i_class_id); deviating function: COUNT(DISTINCT i_brand_id)
>   at 
> com.cloudera.impala.analysis.AggregateInfo.createDistinctAggInfo(AggregateInfo.java:196)
>   at 
> com.cloudera.impala.analysis.AggregateInfo.create(AggregateInfo.java:143)
>   at 
> com.cloudera.impala.analysis.SelectStmt.createAggInfo(SelectStmt.java:466)
>   at 
> com.cloudera.impala.analysis.SelectStmt.analyzeAggregation(SelectStmt.java:347)
>   at com.cloudera.impala.analysis.SelectStmt.analyze(SelectStmt.java:155)
>   at 
> com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:130)
>   ... 2 more
> {code}
> Hive supports this:
> {code}
> $ hive -e "select count(distinct i_class_id), count(distinct i_brand_id) from 
> item;"
> Logging initialized using configuration in 
> file:/etc/hive/conf.dist/hive-log4j.properties
> Hive history file=/tmp/grahn/hive_job_log_grahn_201303052234_1625576708.txt
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=
> Starting Job = job_201302081514_0073, Tracking URL = 
> http://impala:50030/jobdetails.jsp?jobid=job_201302081514_0073
> Kill Command = /usr/lib/hadoop/bin/hadoop job  
> -Dmapred.job.tracker=m0525.mtv.cloudera.com:8021 -kill job_201302081514_0073
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 
> 1
> 2013-03-05 22:34:43,255 Stage-1 map = 0%,  reduce = 0%
> 2013-03-05 22:34:49,323 Stage-1 map = 


[jira] [Commented] (IMPALA-7204) Add support for GROUP BY ROLLUP

2018-10-23 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661346#comment-16661346
 ] 

Ruslan Dautkhanov commented on IMPALA-7204:
---

Thank you [~tarmstrong].

Now that IMPALA-110 is resolved, would it be possible to put this on the 
Impala roadmap? Thanks!

> Add support for GROUP BY ROLLUP
> ---
>
> Key: IMPALA-7204
> URL: https://issues.apache.org/jira/browse/IMPALA-7204
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend
>Affects Versions: Impala 3.0, Impala 2.12.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: GROUP_BY, sql
>
> Now suppose that we'd like to analyze our sales data, to study the amount of 
> sales that is occurring for different products, in different states and 
> regions. Using the ROLLUP feature of SQL 2003, we could issue the query:
> {code:sql}
> select region, state, product, sum(sales) total_sales
> from sales_history 
> group by rollup (region, state, product)
> {code}
> Semantically, the above query is equivalent to
>  
> {code:sql}
> select region, state, product, sum(sales) total_sales
> from sales_history 
> group by region, state, product
> union
> select region, state, null, sum(sales) total_sales
> from sales_history 
> group by region, state
> union
> select region, null, null, sum(sales) total_sales
> from sales_history 
> group by region
> union
> select null, null, null, sum(sales) total_sales
> from sales_history
>  
> {code}
> The query might produce results that looked something like:
> {noformat}
> REGION STATE PRODUCT TOTAL_SALES
> -- - --- ---
> null null null 6200
> EAST MA BOATS 100
> EAST MA CARS 1500
> EAST MA null 1600
> EAST NY BOATS 150
> EAST NY CARS 1000
> EAST NY null 1150
> EAST null null 2750
> WEST CA BOATS 750
> WEST CA CARS 500
> WEST CA null 1250
> WEST AZ BOATS 2000
> WEST AZ CARS 200
> WEST AZ null 2200
> WEST null null 3450
> {noformat}
> We have a lot of production queries that work around this missing Impala 
> functionality by having three UNION ALLs. Physical execution plan shows 
> Impala actually reads full fact table three times. So it could be a three 
> times improvement (or more, depending on number of columns that are being 
> rolled up).
> I can't find another SQL on Hadoop engine that doesn't support this feature. 
>  *Checked Spark, Hive, PIG, Flink and some other engines - they all do 
> support this basic SQL feature*.
> Would be great to have a matching feature in Impala too.
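The UNION rewrite above can also be sketched outside SQL. Below is a minimal, illustrative Python sketch (not Impala code) showing that GROUP BY ROLLUP(a, b, c) aggregates each row into every prefix grouping, with the rolled-up columns reported as NULL (None here):

```python
# Minimal sketch: ROLLUP(a, b, c) == union of GROUP BYs over each prefix
# of the rollup columns, with missing columns reported as NULL (None).
from collections import defaultdict

def group_by_rollup(rows, keys, value):
    """rows: list of dicts; keys: rollup column names; value: column to sum."""
    totals = defaultdict(int)
    for row in rows:
        # Each row contributes to len(keys)+1 groupings: (a,b,c), (a,b), (a), ().
        for prefix_len in range(len(keys) + 1):
            group = tuple(row[k] for k in keys[:prefix_len])
            # Pad the grouping key with None for the rolled-up columns.
            group += (None,) * (len(keys) - prefix_len)
            totals[group] += row[value]
    return dict(totals)

sales = [
    {"region": "EAST", "state": "MA", "product": "CARS",  "sales": 1500},
    {"region": "EAST", "state": "MA", "product": "BOATS", "sales": 100},
    {"region": "WEST", "state": "CA", "product": "CARS",  "sales": 500},
]
result = group_by_rollup(sales, ["region", "state", "product"], "sales")
print(result[("EAST", "MA", None)])   # 1600
print(result[(None, None, None)])     # 2100
```

Note this makes a single pass over the data, which is exactly why native ROLLUP support can beat the three-UNION-ALL workaround that rescans the fact table once per grouping.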



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-110) Add support for multiple distinct operators in the same query block

2018-10-23 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661238#comment-16661238
 ] 

Ruslan Dautkhanov commented on IMPALA-110:
--

[~twmarshall], we have BI tools that have `count(distinct)` hardcoded, 
although we can still change session-level settings.

It would be great to keep APPX_COUNT_DISTINCT supported for such use cases.
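For context on why the approximate path stays attractive on high-cardinality columns: a sketch-based distinct count keeps O(k) state instead of one entry per distinct value. Here is a K-Minimum-Values sketch in Python, illustrative only - this is not the algorithm behind Impala's NDV(), just a simple estimator in the same spirit:

```python
# K-Minimum-Values sketch: hash values into [0, 1) and keep the k smallest
# hashes. The k-th smallest is ~ k / (NDV + 1), so NDV ~ (k - 1) / h_k.
# Memory is O(k) regardless of input size, unlike an exact distinct count.
import hashlib

def ndv_estimate(values, k=1024):
    hashes = set()
    for v in values:
        digest = hashlib.blake2b(str(v).encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big") / 2**64)  # map to [0, 1)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        return len(smallest)  # fewer than k distinct values: count is exact
    return int((k - 1) / smallest[-1])

vals = [i % 10_000 for i in range(100_000)]  # 10,000 distinct values
est = ndv_estimate(vals)
print(est)  # roughly 10000 (an estimate, not exact)
```

The relative error shrinks as k grows, so the memory/accuracy trade-off is tunable - which is the practical argument for keeping the approximate option around.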

> Add support for multiple distinct operators in the same query block
> ---
>
> Key: IMPALA-110
> URL: https://issues.apache.org/jira/browse/IMPALA-110
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend, Frontend
>Affects Versions: Impala 0.5, Impala 1.4, Impala 2.0, Impala 2.2, Impala 
> 2.3.0
>Reporter: Greg Rahn
>Assignee: Thomas Tauber-Marshall
>Priority: Major
>  Labels: sql-language
> Fix For: Impala 3.1.0
>
>
> Impala only allows a single (DISTINCT columns) expression in each query.
> {color:red}Note:
> If you do not need precise accuracy, you can produce an estimate of the 
> distinct values for a column by specifying NDV(column); a query can contain 
> multiple instances of NDV(column). To make Impala automatically rewrite 
> COUNT(DISTINCT) expressions to NDV(), enable the APPX_COUNT_DISTINCT query 
> option.
> {color}
> {code}
> [impala:21000] > select count(distinct i_class_id) from item;
> Query: select count(distinct i_class_id) from item
> Query finished, fetching results ...
> 16
> Returned 1 row(s) in 1.51s
> {code}
> {code}
> [impala:21000] > select count(distinct i_class_id), count(distinct 
> i_brand_id) from item;
> Query: select count(distinct i_class_id), count(distinct i_brand_id) from item
> ERROR: com.cloudera.impala.common.AnalysisException: Analysis exception (in 
> select count(distinct i_class_id), count(distinct i_brand_id) from item)
>   at 
> com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:133)
>   at 
> com.cloudera.impala.service.Frontend.createExecRequest(Frontend.java:221)
>   at 
> com.cloudera.impala.service.JniFrontend.createExecRequest(JniFrontend.java:89)
> Caused by: com.cloudera.impala.common.AnalysisException: all DISTINCT 
> aggregate functions need to have the same set of parameters as COUNT(DISTINCT 
> i_class_id); deviating function: COUNT(DISTINCT i_brand_id)
>   at 
> com.cloudera.impala.analysis.AggregateInfo.createDistinctAggInfo(AggregateInfo.java:196)
>   at 
> com.cloudera.impala.analysis.AggregateInfo.create(AggregateInfo.java:143)
>   at 
> com.cloudera.impala.analysis.SelectStmt.createAggInfo(SelectStmt.java:466)
>   at 
> com.cloudera.impala.analysis.SelectStmt.analyzeAggregation(SelectStmt.java:347)
>   at com.cloudera.impala.analysis.SelectStmt.analyze(SelectStmt.java:155)
>   at 
> com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:130)
>   ... 2 more
> {code}
> Hive supports this:
> {code}
> $ hive -e "select count(distinct i_class_id), count(distinct i_brand_id) from 
> item;"
> Logging initialized using configuration in 
> file:/etc/hive/conf.dist/hive-log4j.properties
> Hive history file=/tmp/grahn/hive_job_log_grahn_201303052234_1625576708.txt
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=
> Starting Job = job_201302081514_0073, Tracking URL = 
> http://impala:50030/jobdetails.jsp?jobid=job_201302081514_0073
> Kill Command = /usr/lib/hadoop/bin/hadoop job  
> -Dmapred.job.tracker=m0525.mtv.cloudera.com:8021 -kill job_201302081514_0073
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 
> 1
> 2013-03-05 22:34:43,255 Stage-1 map = 0%,  reduce = 0%
> 2013-03-05 22:34:49,323 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 
> sec
> 2013-03-05 22:34:50,337 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 
> sec
> 2013-03-05 22:34:51,351 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 
> sec
> 2013-03-05 22:34:52,360 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 
> sec
> 2013-03-05 22:34:53,370 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 
> sec
> 2013-03-05 22:34:54,379 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 
> sec
> 2013-03-05 22:34:55,389 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 
> 8.58 sec
> 2013-03-05 22:34:56,402 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 
> 8.58 sec
> 2013-03-05 22:34:57,413 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 
> 8.58 sec
> 2013-03-05 22:34:58,424 Stage-1 

[jira] [Commented] (IMPALA-110) Add support for multiple distinct operators in the same query block

2018-10-23 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661192#comment-16661192
 ] 

Ruslan Dautkhanov commented on IMPALA-110:
--

APPX_COUNT_DISTINCT wouldn't go away. You would still have the option to run 
an approximate count distinct; those are much cheaper and faster on 
high-cardinality columns, and still a good option if you're not looking for 
an exact answer.

And no, it wouldn't go back to the one-count-distinct-per-query limitation.

That's my understanding.
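One way engines evaluate several COUNT(DISTINCT ...) expressions in a single scan is to tag each row once per distinct expression and deduplicate the tagged pairs. The sketch below is a hedged illustration of that idea in Python - hypothetical, not Impala's actual plan shape:

```python
# Sketch: compute several COUNT(DISTINCT col) in ONE pass by "expanding"
# each row into (expression_tag, value) pairs and deduplicating them,
# instead of scanning the table once per distinct expression.
items = [
    {"i_class_id": 1, "i_brand_id": 10},
    {"i_class_id": 1, "i_brand_id": 20},
    {"i_class_id": 2, "i_brand_id": 10},
]

def multi_count_distinct(rows, columns):
    seen = set()
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:            # expand: one tagged tuple per expression
            key = (c, row[c])
            if key not in seen:      # deduplicate per (tag, value)
                seen.add(key)
                counts[c] += 1
    return counts

print(multi_count_distinct(items, ["i_class_id", "i_brand_id"]))
# {'i_class_id': 2, 'i_brand_id': 2}
```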

 


[jira] [Commented] (IMPALA-110) Add support for multiple distinct operators in the same query block

2018-09-26 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628963#comment-16628963
 ] 

Ruslan Dautkhanov commented on IMPALA-110:
--

Hooray! Congrats, and great work to everyone involved.
Will this be part of the Impala 2.13 release / CDH 5.16?
Thank you!


[jira] [Commented] (IMPALA-7204) Add support for GROUP BY ROLLUP

2018-07-02 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530549#comment-16530549
 ] 

Ruslan Dautkhanov commented on IMPALA-7204:
---

Thanks [~tarmstrong]!

It would be great if this also adds the infrastructure to run these heaviest 
operations with intra-node parallelism.

Most of our query execution time comes from count(distinct), as we mostly run 
count(distinct) on high-cardinality values.

cc [~twmarshall]







[jira] [Updated] (IMPALA-7204) Add support for GROUP BY ROLLUP

2018-06-23 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated IMPALA-7204:
--
Description: 
Now suppose that we'd like to analyze our sales data, to study the amount of 
sales that is occurring for different products, in different states and 
regions. Using the ROLLUP feature of SQL 2003, we could issue the query:
{code:sql}
select region, state, product, sum(sales) total_sales
from sales_history 
group by rollup (region, state, product)
{code}
Semantically, the above query is equivalent to

 
{code:sql}
select region, state, product, sum(sales) total_sales
from sales_history 
group by region, state, product
union
select region, state, null, sum(sales) total_sales
from sales_history 
group by region, state
union
select region, null, null, sum(sales) total_sales
from sales_history 
group by region
union
select null, null, null, sum(sales) total_sales
from sales_history
 
{code}
The query might produce results that looked something like:
{noformat}
REGION STATE PRODUCT TOTAL_SALES
-- - --- ---
null null null 6200
EAST MA BOATS 100
EAST MA CARS 1500
EAST MA null 1600
EAST NY BOATS 150
EAST NY CARS 1000
EAST NY null 1150
EAST null null 2750
WEST CA BOATS 750
WEST CA CARS 500
WEST CA null 1250
WEST AZ BOATS 2000
WEST AZ CARS 200
WEST AZ null 2200
WEST null null 3450

{noformat}
We have a lot of production queries that work around this missing Impala 
functionality by having three UNION ALLs. Physical execution plan shows Impala 
actually reads full fact table three times. So it could be a three times 
improvement (or more, depending on number of columns that are being rolled up).

I can't find another SQL on Hadoop engine that doesn't support this feature. 
 *Checked Spark, Hive, PIG, Flink and some other engines - they all do support 
this basic SQL feature*.

Would be great to have a matching feature in Impala too.

  was:
Now suppose that we'd like to analyze our sales data, to study the amount of 
sales that is occurring for different products, in different states and 
regions. Using the ROLLUP feature of SQL 2003, we could issue the query:

{code:sql}
select region, state, product, sum(sales) total_sales
from sales_history 
group by rollup (region, state, product)
{code}
Semantically, the above query is equivalent to

 
{code:sql}
select region, state, product, sum(sales) total_sales
from sales_history 
group by region, state, product
union
select region, state, null, sum(sales) total_sales
from sales_history 
group by region, state
union
select region, null, null, sum(sales) total_sales
from sales_history 
group by region
union
select null, null, null, sum(sales) total_sales
from sales_history
 
{code}

The query might produce results that looked something like:

{noformat}

REGION STATE PRODUCT TOTAL_SALES
-- - --- ---
null null null 6200
EAST MA BOATS 100
EAST MA CARS 1500
EAST MA null 1600
EAST NY BOATS 150
EAST NY CARS 1000
EAST NY null 1150
EAST null null 2750
WEST CA BOATS 750
WEST CA CARS 500
WEST CA null 1250
WEST AZ BOATS 2000
WEST AZ CARS 200
WEST AZ null 2200
WEST null null 3450

{noformat}

We have a lot of production queries that work around this missing Impala 
functionality by having three UNION ALLs. Physical execution plan shows Impala 
actually reads full fact table three times. So it could be a three times 
improvement (or more, depending on number of columns that are being rolled up).

I can't find another SQL-on-Hadoop engine that doesn't support this feature. 
Checked Spark, Hive, Pig, Flink and some other engines - they all support 
this basic SQL feature. Would be great to have a matching feature in Impala 
too.




[jira] [Created] (IMPALA-7204) Add support for GROUP BY ROLLUP

2018-06-23 Thread Ruslan Dautkhanov (JIRA)
Ruslan Dautkhanov created IMPALA-7204:
-

 Summary: Add support for GROUP BY ROLLUP
 Key: IMPALA-7204
 URL: https://issues.apache.org/jira/browse/IMPALA-7204
 Project: IMPALA
  Issue Type: Improvement
  Components: Backend
Affects Versions: Impala 2.12.0, Impala 3.0
Reporter: Ruslan Dautkhanov


Now suppose that we'd like to analyze our sales data, to study the amount of 
sales that is occurring for different products, in different states and 
regions. Using the ROLLUP feature of SQL 2003, we could issue the query:

{code:sql}
select region, state, product, sum(sales) total_sales
from sales_history 
group by rollup (region, state, product)
{code}
Semantically, the above query is equivalent to

 
{code:sql}
select region, state, product, sum(sales) total_sales
from sales_history 
group by region, state, product
union
select region, state, null, sum(sales) total_sales
from sales_history 
group by region, state
union
select region, null, null, sum(sales) total_sales
from sales_history 
group by region
union
select null, null, null, sum(sales) total_sales
from sales_history
 
{code}

The query might produce results that looked something like:

{noformat}

REGION STATE PRODUCT TOTAL_SALES
-- - --- ---
null null null 6200
EAST MA BOATS 100
EAST MA CARS 1500
EAST MA null 1600
EAST NY BOATS 150
EAST NY CARS 1000
EAST NY null 1150
EAST null null 2750
WEST CA BOATS 750
WEST CA CARS 500
WEST CA null 1250
WEST AZ BOATS 2000
WEST AZ CARS 200
WEST AZ null 2200
WEST null null 3450

{noformat}

We have a lot of production queries that work around this missing Impala 
functionality by having three UNION ALLs. Physical execution plan shows Impala 
actually reads full fact table three times. So it could be a three times 
improvement (or more, depending on number of columns that are being rolled up).

I can't find another SQL-on-Hadoop engine that doesn't support this feature. 
Checked Spark, Hive, Pig, Flink and some other engines - they all support 
this basic SQL feature. Would be great to have a matching feature in Impala 
too.







[jira] [Commented] (IMPALA-110) Add support for multiple distinct operators in the same query block

2018-06-01 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498545#comment-16498545
 ] 

Ruslan Dautkhanov commented on IMPALA-110:
--

{quote}[~alex.behm] added a comment - 27/Oct/17 21:53
 The feature requires fundamental architectural changes from a tree-execution 
to a dag-execution model.

[~alex.behm] added a comment - 30/Oct/17 10:04
 Ruslan Dautkhanov, this is it. As far as I'm aware of, there currently is no 
other feature that requires the switch.
{quote}
I thought of another use case for DAG execution: a MATERIALIZED subquery 
(IMPALA-7114).

It would also be great to implement a MATERIALIZE hint, as in Oracle:

[https://oracle-base.com/articles/misc/with-clause#materialize-hint]

Basically, in Oracle we can run WITH subqueries that are persisted for the 
duration of the query, so if a subquery is used multiple times in the outer 
query, a heavy join or any other operation has to run only once. If the 
query is executed as a DAG, all subqueries that depend on the materialized 
subquery can run in parallel.

Hopefully that would tip Impala toward switching to a DAG execution model.
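A toy sketch of the idea (hypothetical names, not Impala code): under DAG execution, the WITH subquery is materialized exactly once, and every consumer that depends on it can then fan out in parallel:

```python
# Sketch: materialize an expensive subquery once, then run its dependent
# consumers concurrently - the shape a DAG executor enables.
from concurrent.futures import ThreadPoolExecutor

def heavy_cte():
    # Stands in for an expensive join in a WITH subquery; runs exactly once.
    return [("EAST", 2750), ("WEST", 3450)]

def consumer_total(cte):
    return sum(v for _, v in cte)

def consumer_max(cte):
    return max(v for _, v in cte)

materialized = heavy_cte()              # DAG node with no dependencies
with ThreadPoolExecutor() as pool:      # its dependents run in parallel
    total_f = pool.submit(consumer_total, materialized)
    max_f = pool.submit(consumer_max, materialized)
print(total_f.result(), max_f.result())  # 6200 3450
```

In a tree-execution model, by contrast, each consumer would re-evaluate the subquery independently, which is exactly the repeated heavy join the hint is meant to avoid.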


[jira] [Created] (IMPALA-7114) MATERIALIZE a subquery hint

2018-06-01 Thread Ruslan Dautkhanov (JIRA)
Ruslan Dautkhanov created IMPALA-7114:
-

 Summary: MATERIALIZE a subquery hint
 Key: IMPALA-7114
 URL: https://issues.apache.org/jira/browse/IMPALA-7114
 Project: IMPALA
  Issue Type: New Feature
Affects Versions: Impala 2.12.0, Impala 3.0
Reporter: Ruslan Dautkhanov


It would be great to implement a MATERIALIZE hint, as in Oracle:

[https://oracle-base.com/articles/misc/with-clause#materialize-hint]

Basically, in Oracle we can run WITH subqueries that are persisted for the 
duration of the query, so if a subquery is used multiple times in the outer 
query, a heavy join or any other operation has to run only once. 







[jira] [Created] (IMPALA-7112) Change from tree-execution to a dag-execution model

2018-06-01 Thread Ruslan Dautkhanov (JIRA)
Ruslan Dautkhanov created IMPALA-7112:
-

 Summary: Change from tree-execution to a dag-execution model
 Key: IMPALA-7112
 URL: https://issues.apache.org/jira/browse/IMPALA-7112
 Project: IMPALA
  Issue Type: Improvement
Affects Versions: Impala 2.12.0, Impala 3.0
Reporter: Ruslan Dautkhanov


Quoting [~alex.behm] from IMPALA-110 comment :

{quote}
 * The feature requires fundamental architectural changes from a tree-execution 
to a dag-execution model.
{quote}
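As a toy illustration of the difference (plain Python, not Impala internals; the counter and `heavy_scan` are invented for the sketch): in a tree plan every operator has exactly one parent, so a subtree consumed twice must be duplicated and executed twice, while a DAG plan lets one operator's output fan out to several consumers.

```python
executions = {"count": 0}

def heavy_scan():
    # Stand-in for an expensive subtree (scan + join, say).
    executions["count"] += 1
    return [1, 2, 3]

# Tree-execution: each consumer owns its own copy of the subtree.
tree_result = (sum(heavy_scan()), max(heavy_scan()))
assert executions["count"] == 2  # the subtree ran twice

# DAG-execution: the subtree runs once and its rows feed both consumers.
executions["count"] = 0
rows = heavy_scan()
dag_result = (sum(rows), max(rows))
assert executions["count"] == 1  # the subtree ran once
```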








[jira] [Commented] (IMPALA-3446) Materialized views

2018-06-01 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498527#comment-16498527
 ] 

Ruslan Dautkhanov commented on IMPALA-3446:
---

It would also be great to implement a MATERIALIZE hint, as Oracle does.

[https://oracle-base.com/articles/misc/with-clause#materialize-hint]

Basically, in Oracle we can run WITH subqueries that are persisted for the duration of the query, so if that subquery is referenced multiple times in the outer query, a heavy join or any other operation has to run only once. 

> Materialized views
> --
>
> Key: IMPALA-3446
> URL: https://issues.apache.org/jira/browse/IMPALA-3446
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Frontend
>Affects Versions: Impala 2.5.0
>Reporter: Marcell Szabo
>Priority: Minor
>
> This JIRA is a placeholder for this big topic.
> Some user stories I can imagine under the epic:
> # a materialized view is a CTAS where the SELECT is saved in the HMS and can be rerun by a simple command
> # a materialized view detects that the source data has changed and falls back to being a view instead of SELECT *
> # a materialized view detects that the source data has changed and reruns the CTAS automatically
> # ... reruns only the necessary changes (e.g. if only some of the partitions change)
> # Impala and Hive to have common semantics for materialized views 
> (https://issues.apache.org/jira/browse/HIVE-10459)
> # a materialized view stores extra statistics that help the optimizer; storing the full resultset is optional
> # the query optimizer checks for every query whether part of the query plan can be covered by an existing materialized view
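A minimal sketch of user stories 1 and 3 above (SQLite via Python, not Impala; `mv_defs` and `mv_refresh` are invented names, not a proposed API): the view's SELECT text is saved, its result is persisted as a table, and a simple command reruns the CTAS after the source changes.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (x INT);
INSERT INTO t VALUES (1), (2), (3);
""")

# The saved definitions play the role the HMS would play in user story 1.
mv_defs = {"t_sum_mv": "SELECT SUM(x) AS s FROM t"}

def mv_refresh(name):
    # "Rerun the CTAS": drop the stored result and recompute it.
    con.execute(f"DROP TABLE IF EXISTS {name}")
    con.execute(f"CREATE TABLE {name} AS {mv_defs[name]}")

mv_refresh("t_sum_mv")
print(con.execute("SELECT s FROM t_sum_mv").fetchone()[0])  # 6

con.execute("INSERT INTO t VALUES (4)")   # source changed; stored result is stale
mv_refresh("t_sum_mv")                    # user story 3, triggered manually here
print(con.execute("SELECT s FROM t_sum_mv").fetchone()[0])  # 10
```

The remaining user stories (staleness detection, partial rebuild, optimizer rewrites) are exactly the parts a real engine would have to add on top of this trivial loop.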


