[jira] [Commented] (IMPALA-5073) Considering bypassing TCMalloc by default for buffer pool
[ https://issues.apache.org/jira/browse/IMPALA-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831722#comment-16831722 ] Ruslan Dautkhanov commented on IMPALA-5073:
---
Sorry for the noise on some of these JIRAs; it's just feedback from working with Red Hat support on some performance issues we're seeing. I can share more details from Red Hat if you're interested. Thanks for any input.

> Considering bypassing TCMalloc by default for buffer pool
> -
>
> Key: IMPALA-5073
> URL: https://issues.apache.org/jira/browse/IMPALA-5073
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 2.9.0
> Reporter: Tim Armstrong
> Priority: Minor
> Labels: resource-management
>
> There would be some advantages to switching from allocating buffers via TCMalloc to using mmap directly - e.g. less contention for the page heap lock.
> There are also downsides - virtual memory consumption could increase, and we may end up mapping and unmapping memory more frequently.
> We would also need to wire up the MemTrackers so they include this memory in the process estimate.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-5073) Considering bypassing TCMalloc by default for buffer pool
[ https://issues.apache.org/jira/browse/IMPALA-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831718#comment-16831718 ] Ruslan Dautkhanov commented on IMPALA-5073:
---
Shouldn't the mmap() call at https://github.com/cloudera/Impala/blob/cdh5-2.11.0_5.14.4/be/src/runtime/bufferpool/system-allocator.cc#L81 pass MAP_HUGETLB in its flags argument for huge pages to be used? The call currently doesn't, so does Impala's mmap()ed memory never use huge pages? This is described in https://d3s.mff.cuni.cz/legacy/teaching/advanced_operating_systems/slides/10_huge_pages.pdf, for example. Thanks!
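For reference, MAP_HUGETLB belongs in mmap()'s flags argument (the fourth parameter). Below is a minimal sketch of what such a call looks like - not Impala's actual SystemAllocator code, and `AllocBuffer` is a hypothetical name. It falls back to regular 4K pages because MAP_HUGETLB fails unless huge pages were pre-reserved via vm.nr_hugepages (which is why the no-reservation case is the common default):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Hypothetical sketch (not Impala's SystemAllocator): request pre-reserved
// huge pages via MAP_HUGETLB in mmap()'s flags argument, falling back to
// regular pages when no huge pages are reserved on the host.
void* AllocBuffer(size_t len) {
  void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {
    // vm.nr_hugepages may be 0, or len may not be a multiple of the
    // huge page size; fall back to normal 4K pages.
    p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  }
  return p == MAP_FAILED ? nullptr : p;
}
```

The alternative that needs no pre-reservation is transparent huge pages via madvise(MADV_HUGEPAGE), which is what the related IMPALA-5229 discussion below is about.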
[jira] [Commented] (IMPALA-5073) Considering bypassing TCMalloc by default for buffer pool
[ https://issues.apache.org/jira/browse/IMPALA-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831716#comment-16831716 ] Ruslan Dautkhanov commented on IMPALA-5073:
---
Also somewhat related - it's interesting how this is handled in the JVM, which uses pre-touch to move allocation cost to the startup phase: [https://shipilev.net/jvm/anatomy-quarks/2-transparent-huge-pages/]
{quote}
To shift these costs to the JVM startup that will avoid surprising latency hiccups when application is running, you may instruct JVM to touch every single page in Java heap with -XX:+AlwaysPreTouch during initialization. It is a good idea to enable pre-touch for larger heaps anyway. And there comes the funny part: enabling -XX:+UseTransparentHugePages actually makes -XX:+AlwaysPreTouch faster, because JVM now knows it has to touch the heap in larger quanta (say, a byte every 2M), rather than in smaller ones (say, a byte every 4K).
{quote}
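The pre-touch trick from the quoted article is not JVM-specific; the same idea can be sketched in C++ (a sketch, not Impala code; `PreTouch` is a hypothetical name). After reserving a region, writing one byte per page makes the kernel commit physical memory during startup rather than on first use:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of pre-touching: commit physical pages up front by writing one byte
// per page. step would be 4096 for normal pages and 2 * 1024 * 1024 when the
// region is backed by (transparent) huge pages - roughly 512x fewer touches
// for the same region, which is why huge pages speed pre-touch up.
void PreTouch(void* base, size_t len, size_t step) {
  volatile uint8_t* p = static_cast<uint8_t*>(base);
  for (size_t off = 0; off < len; off += step) p[off] = 0;
}
```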
[jira] [Commented] (IMPALA-5073) Considering bypassing TCMalloc by default for buffer pool
[ https://issues.apache.org/jira/browse/IMPALA-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831702#comment-16831702 ] Ruslan Dautkhanov commented on IMPALA-5073:
---
[~tarmstrong] My 2 cents: would it be possible to preallocate a fixed amount of memory for impalad at startup? Like it's done in Oracle through `SGA_TARGET` (global shared buffers, cached query plans, etc.) and `PGA_AGGREGATE_TARGET` (per-session memory demand, e.g. for sorting temp results). This increases startup time, which is okay, but there is then no mmap() overhead at the per-session / per-query level. For Oracle the SGA always dominates (global db block buffer caches, etc.), but for Impala a PGA-like memory structure would dominate, I guess (per-query shuffling results, etc.). Another advantage of preallocating memory is that you could optionally use hugetlbfs, since the amount of preallocated memory is static and known ahead of time.
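A minimal sketch of the preallocation idea (hypothetical, not Impala code; `FixedPool` is an invented name): reserve one fixed region at startup and carve per-query buffers out of it, so no mmap() happens on the query path. A real pool would need a free list and MemTracker integration; a bump allocator is enough to show the shape:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Hypothetical sketch: a fixed pool reserved once at startup (analogous to a
// preallocated PGA/SGA), served out with a simple bump pointer. Allocate()
// never calls mmap(); it returns nullptr once the pool is exhausted.
class FixedPool {
 public:
  explicit FixedPool(size_t capacity) : capacity_(capacity), used_(0) {
    base_ = static_cast<char*>(mmap(nullptr, capacity, PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
  }
  ~FixedPool() {
    if (base_ != MAP_FAILED) munmap(base_, capacity_);
  }
  void* Allocate(size_t len) {
    if (base_ == MAP_FAILED || used_ + len > capacity_) return nullptr;
    void* p = base_ + used_;
    used_ += len;
    return p;
  }
 private:
  char* base_;
  size_t capacity_;
  size_t used_;
};
```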
[jira] [Commented] (IMPALA-5229) Try using TCMalloc + Huge Pages for buffers
[ https://issues.apache.org/jira/browse/IMPALA-5229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831112#comment-16831112 ] Ruslan Dautkhanov commented on IMPALA-5229:
---
Thank you for the prompt response, [~tarmstrong]. I agree it's hard to manage for Impala colocated with YARN and other workloads, for example; it makes sense for some dedicated scenarios only. We do use hugetlbfs on dedicated Oracle servers, for example, where we can tell exactly how much memory Oracle will use for its caches.

> Try using TCMalloc + Huge Pages for buffers
> ---
>
> Key: IMPALA-5229
> URL: https://issues.apache.org/jira/browse/IMPALA-5229
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 2.9.0
> Reporter: Tim Armstrong
> Assignee: Tim Armstrong
> Priority: Major
> Labels: perf, resource-management
> Fix For: Impala 2.9.0
>
> As well as mmap() + huge pages, we could support TCMalloc + huge pages. I believe TCMalloc will support this well because:
> * We can allocate huge-page-aligned memory via posix_memalign()
> * TCMalloc, by default, always decommits large allocations upon freeing them. So if we undo the HUGEPAGE madvise() before handing the pages back to TCMalloc, then TCMalloc will decommit the huge pages backing the allocation, and we won't get TCMalloc's page heap into a weird state.
> This could give us the perf benefits of huge pages without some of the headaches associated with using mmap() directly.
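The scheme described in the issue above can be sketched as follows (a sketch of the idea as described, not Impala's actual implementation; `AllocHugeBacked`/`FreeHugeBacked` are invented names): posix_memalign() gives a huge-page-aligned block, madvise(MADV_HUGEPAGE) asks the kernel to back it with transparent huge pages, and MADV_NOHUGEPAGE undoes the advice before the pages go back to the allocator.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr size_t kHugePageSize = 2 * 1024 * 1024;  // common x86-64 huge page

// Sketch only: huge-page-aligned allocation plus THP advice. madvise() is
// advisory - the kernel may never coalesce to huge pages, and MADV_HUGEPAGE
// fails harmlessly when THP is disabled, so return values are ignored here.
void* AllocHugeBacked(size_t len) {
  void* p = nullptr;
  if (posix_memalign(&p, kHugePageSize, len) != 0) return nullptr;
  madvise(p, len, MADV_HUGEPAGE);
  return p;
}

void FreeHugeBacked(void* p, size_t len) {
  if (p == nullptr) return;
  madvise(p, len, MADV_NOHUGEPAGE);  // undo the advice before freeing
  free(p);
}
```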
[jira] [Commented] (IMPALA-5229) Try using TCMalloc + Huge Pages for buffers
[ https://issues.apache.org/jira/browse/IMPALA-5229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831068#comment-16831068 ] Ruslan Dautkhanov commented on IMPALA-5229:
---
We have some performance issues and are working with Red Hat Support:
{noformat}
I'm surprised that a system with so much RAM and Java isn't already using HugeTLB. However the busiest cores might not be the problem and I could be looking in the wrong place, so we'll await your reply on that to target our investigation better. Also, you might want to ask Cloudera if you should be using the classic HugeTLB sort of pre-allocated HugePages? That will massively reduce your pagetable sizes and lead to less memory fragmentation too.
{noformat}
How do we enable huge pages for Impala? Thank you!
[jira] [Commented] (IMPALA-6088) Rack aware broadcast operator
[ https://issues.apache.org/jira/browse/IMPALA-6088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827317#comment-16827317 ] Ruslan Dautkhanov commented on IMPALA-6088:
---
It would also be interesting to look at whether something like the BitTorrent protocol helps here too; it can take rack location into account, if available. Spark uses this to keep the -query coordinator- Spark Driver from becoming a network bottleneck: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala
{quote}
The driver divides the serialized object into small chunks and stores those chunks in the BlockManager of the driver.
On each executor, the executor first attempts to fetch the object from its BlockManager. If it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or other executors if available. Once it gets the chunks, it puts the chunks in its own BlockManager, ready for other executors to fetch from.
This prevents the driver from being the bottleneck in sending out multiple copies of the broadcast data (one per executor).
{quote}

> Rack aware broadcast operator
> -
>
> Key: IMPALA-6088
> URL: https://issues.apache.org/jira/browse/IMPALA-6088
> Project: IMPALA
> Issue Type: Sub-task
> Components: Distributed Exec
> Reporter: Mostafa Mokhtar
> Priority: Major
>
> When conducting large-scale experiments on a 6-rack cluster with an aggregator-core network topology, overall cluster bandwidth utilization was limited.
> With aggregator-core networks, nodes and racks are not equidistant, which means a broadcast operation can be inefficient, as the broadcasting node needs to send the same data N times to each node on a remote rack.
> Ideally, RowBatches should be sent once per remote rack, and then a node on each remote rack would broadcast within its rack.
> The table below represents rack-to-rack latency for 90% of operations; the ratio between the best and worst case is 7.3x.
>
> || || va || vc || vd1 || vd3 || ve ||
> | va | 4,238 | 4,290 | 9,692 | 8,897 | 8,208 |
> | vc | 9,290 | 4,396 | 30,952 | 13,529 | 14,578 |
> | vd1 | 9,131 | 29,066 | 4,346 | 17,265 | 16,849 |
> | vd3 | 7,409 | 15,517 | 17,265 | 4,370 | 4,687 |
> | ve | 4,914 | 16,894 | 16,430 | 4,713 | 4,472 |
[jira] [Commented] (IMPALA-7204) Add support for GROUP BY ROLLUP
[ https://issues.apache.org/jira/browse/IMPALA-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810112#comment-16810112 ] Ruslan Dautkhanov commented on IMPALA-7204:
---
cc [~grahn] - can you please let us know if you are planning to add support for this feature? Our account team said you might be the right person to ask. Thanks for any ideas!

> Add support for GROUP BY ROLLUP
> ---
>
> Key: IMPALA-7204
> URL: https://issues.apache.org/jira/browse/IMPALA-7204
> Project: IMPALA
> Issue Type: New Feature
> Components: Backend
> Affects Versions: Impala 3.0, Impala 2.12.0
> Reporter: Ruslan Dautkhanov
> Priority: Major
> Labels: GROUP_BY, sql
>
> Now suppose that we'd like to analyze our sales data, to study the amount of sales occurring for different products, in different states and regions. Using the ROLLUP feature of SQL 2003, we could issue the query:
> {code:sql}
> select region, state, product, sum(sales) total_sales
> from sales_history
> group by rollup (region, state, product)
> {code}
> Semantically, the above query is equivalent to:
> {code:sql}
> select region, state, product, sum(sales) total_sales
> from sales_history
> group by region, state, product
> union
> select region, state, null, sum(sales) total_sales
> from sales_history
> group by region, state
> union
> select region, null, null, sum(sales) total_sales
> from sales_history
> group by region
> union
> select null, null, null, sum(sales) total_sales
> from sales_history
> {code}
> The query might produce results that look something like:
> {noformat}
> REGION STATE PRODUCT TOTAL_SALES
> ------ ----- ------- -----------
> null   null  null           6200
> EAST   MA    BOATS           100
> EAST   MA    CARS           1500
> EAST   MA    null           1600
> EAST   NY    BOATS           150
> EAST   NY    CARS           1000
> EAST   NY    null           1150
> EAST   null  null           2750
> WEST   CA    BOATS           750
> WEST   CA    CARS            500
> WEST   CA    null           1250
> WEST   AZ    BOATS          2000
> WEST   AZ    CARS            200
> WEST   AZ    null           2200
> WEST   null  null           3450
> {noformat}
> We have a lot of production queries that work around this missing Impala functionality by having three UNION ALLs. The physical execution plan shows Impala actually reads the full fact table three times, so this could be a three-times improvement (or more, depending on the number of columns being rolled up).
> I can't find another SQL-on-Hadoop engine that doesn't support this feature. *Checked Spark, Hive, PIG, Flink and some other engines - they all do support this basic SQL feature*.
> Would be great to have a matching feature in Impala too.
[jira] [Commented] (IMPALA-6042) Allow Impala shell to also use a global impalarc configuration
[ https://issues.apache.org/jira/browse/IMPALA-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667434#comment-16667434 ] Ruslan Dautkhanov commented on IMPALA-6042:
---
Another use case is updating the PATH variable, to point to a Python 2 home for example, since impala-shell is not Python3-compatible (IMPALA-3343) and we're migrating all our applications to Python 3. It would be great to have a way to globally set certain config variables and environment variables like PATH as a workaround for this and some other issues. Thank you.

> Allow Impala shell to also use a global impalarc configuration
> --
>
> Key: IMPALA-6042
> URL: https://issues.apache.org/jira/browse/IMPALA-6042
> Project: IMPALA
> Issue Type: Improvement
> Components: Clients
> Reporter: Balazs Jeszenszky
> Priority: Minor
> Labels: newbie, shell, usability
>
> Currently, impalarc files can be specified on a per-user basis (stored in ~/.impalarc), and they aren't created by default.
> The Impala shell should pick up /etc/impalarc as well, in addition to the user-specific configurations.
> The intent here is to allow a "global" configuration of the shell by a system administrator with common options like:
> {code}
> --ssl
> -l
> -k
> -u
> -i
> {code}
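For illustration, a global /etc/impalarc with such admin defaults might look like the following. The [impala] section header follows the existing ~/.impalarc convention, and keys generally mirror the long forms of the shell's command-line options, but the specific option names and values shown here are illustrative assumptions, not confirmed against a particular impala-shell version:

{code}
[impala]
# hypothetical site-wide defaults an admin might set
impalad=impala-proxy.example.com:21000
ssl=True
{code}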
[jira] [Commented] (IMPALA-3446) Materialized views
[ https://issues.apache.org/jira/browse/IMPALA-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16665602#comment-16665602 ] Ruslan Dautkhanov commented on IMPALA-3446:
---
Any updates on this case? Both HIVE-14484 and HIVE-14249 are resolved, so materialized views are now technically supported in upstream Hive. Would be great to have the same in Impala, since Impala's use cases center on fast interactive response times even more than Hive's. Thank you.

> Materialized views
> --
>
> Key: IMPALA-3446
> URL: https://issues.apache.org/jira/browse/IMPALA-3446
> Project: IMPALA
> Issue Type: New Feature
> Components: Frontend
> Affects Versions: Impala 2.5.0
> Reporter: Marcell Szabo
> Priority: Minor
>
> This JIRA is a placeholder for this big topic.
> Some user stories I can imagine under the epic:
> # materialized view is a CTAS where the SELECT is saved in the HMS and can be rerun by a simple command
> # materialized view detects that the source data has changed and falls back to being a view instead of SELECT *
> # materialized view detects that the source data has changed and reruns the CTAS automatically
> # ... reruns only the necessary changes (e.g. if only some of the partitions change)
> # Impala and Hive to have common semantics for materialized views (https://issues.apache.org/jira/browse/HIVE-10459)
> # materialized view stores extra statistics that help the optimizer; storing the full resultset is optional
> # query optimizer checks for every query whether part of the query plan can be covered by an existing materialized view
[jira] [Comment Edited] (IMPALA-110) Add support for multiple distinct operators in the same query block
[ https://issues.apache.org/jira/browse/IMPALA-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662704#comment-16662704 ] Ruslan Dautkhanov edited comment on IMPALA-110 at 10/24/18 7:17 PM:
---
[~grahn] Yes, we would like to keep this session-level parameter too, so that if it's on, count(distinct) essentially runs as ndv(). Oracle Database 12.2 just introduced a similar session-level setting: [https://oracle-base.com/articles/12c/approximate-query-processing-12cr2] - "You could refactor your code, or you could ask Oracle to convert your exact calls to approximate calls instead". Please don't remove this great feature! Thank you.

> Add support for multiple distinct operators in the same query block
> ---
>
> Key: IMPALA-110
> URL: https://issues.apache.org/jira/browse/IMPALA-110
> Project: IMPALA
> Issue Type: New Feature
> Components: Backend, Frontend
> Affects Versions: Impala 0.5, Impala 1.4, Impala 2.0, Impala 2.2, Impala 2.3.0
> Reporter: Greg Rahn
> Assignee: Thomas Tauber-Marshall
> Priority: Major
> Labels: sql-language
> Fix For: Impala 3.1.0
>
> Impala only allows a single (DISTINCT columns) expression in each query.
> {color:red}Note:
> If you do not need precise accuracy, you can produce an estimate of the distinct values for a column by specifying NDV(column); a query can contain multiple instances of NDV(column). To make Impala automatically rewrite COUNT(DISTINCT) expressions to NDV(), enable the APPX_COUNT_DISTINCT query option.
> {color}
> {code}
> [impala:21000] > select count(distinct i_class_id) from item;
> Query: select count(distinct i_class_id) from item
> Query finished, fetching results ...
> 16
> Returned 1 row(s) in 1.51s
> {code}
> {code}
> [impala:21000] > select count(distinct i_class_id), count(distinct i_brand_id) from item;
> Query: select count(distinct i_class_id), count(distinct i_brand_id) from item
> ERROR: com.cloudera.impala.common.AnalysisException: Analysis exception (in select count(distinct i_class_id), count(distinct i_brand_id) from item)
> at com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:133)
> at com.cloudera.impala.service.Frontend.createExecRequest(Frontend.java:221)
> at com.cloudera.impala.service.JniFrontend.createExecRequest(JniFrontend.java:89)
> Caused by: com.cloudera.impala.common.AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as COUNT(DISTINCT i_class_id); deviating function: COUNT(DISTINCT i_brand_id)
> at com.cloudera.impala.analysis.AggregateInfo.createDistinctAggInfo(AggregateInfo.java:196)
> at com.cloudera.impala.analysis.AggregateInfo.create(AggregateInfo.java:143)
> at com.cloudera.impala.analysis.SelectStmt.createAggInfo(SelectStmt.java:466)
> at com.cloudera.impala.analysis.SelectStmt.analyzeAggregation(SelectStmt.java:347)
> at com.cloudera.impala.analysis.SelectStmt.analyze(SelectStmt.java:155)
> at com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:130)
> ... 2 more
> {code}
> Hive supports this:
> {code}
> $ hive -e "select count(distinct i_class_id), count(distinct i_brand_id) from item;"
> Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
> Hive history file=/tmp/grahn/hive_job_log_grahn_201303052234_1625576708.txt
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
> set hive.exec.reducers.bytes.per.reducer=
> In order to limit the maximum number of reducers:
> set hive.exec.reducers.max=
> In order to set a constant number of reducers:
> set mapred.reduce.tasks=
> Starting Job = job_201302081514_0073, Tracking URL = http://impala:50030/jobdetails.jsp?jobid=job_201302081514_0073
> Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=m0525.mtv.cloudera.com:8021 -kill job_201302081514_0073
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
> 2013-03-05 22:34:43,255 Stage-1 map = 0%, reduce = 0%
> 2013-03-05 22:34:49,323 Stage-1 map =
[jira] [Commented] (IMPALA-7204) Add support for GROUP BY ROLLUP
[ https://issues.apache.org/jira/browse/IMPALA-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661346#comment-16661346 ] Ruslan Dautkhanov commented on IMPALA-7204:
---
Thank you, [~tarmstrong]. Now that IMPALA-110 is resolved, would it be possible to put this on the Impala roadmap? Thanks!
[jira] [Commented] (IMPALA-110) Add support for multiple distinct operators in the same query block
[ https://issues.apache.org/jira/browse/IMPALA-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661238#comment-16661238 ] Ruslan Dautkhanov commented on IMPALA-110: -- [~twmarshall] , We have BI tools that have `count(distinct)` hardcoded, although we can still change session level settings. It would be great to keep APPX_COUNT_DISTINCT supported for such use cases. > Add support for multiple distinct operators in the same query block > --- > > Key: IMPALA-110 > URL: https://issues.apache.org/jira/browse/IMPALA-110 > Project: IMPALA > Issue Type: New Feature > Components: Backend, Frontend >Affects Versions: Impala 0.5, Impala 1.4, Impala 2.0, Impala 2.2, Impala > 2.3.0 >Reporter: Greg Rahn >Assignee: Thomas Tauber-Marshall >Priority: Major > Labels: sql-language > Fix For: Impala 3.1.0 > > > Impala only allows a single (DISTINCT columns) expression in each query. > {color:red}Note: > If you do not need precise accuracy, you can produce an estimate of the > distinct values for a column by specifying NDV(column); a query can contain > multiple instances of NDV(column). To make Impala automatically rewrite > COUNT(DISTINCT) expressions to NDV(), enable the APPX_COUNT_DISTINCT query > option. > {color} > {code} > [impala:21000] > select count(distinct i_class_id) from item; > Query: select count(distinct i_class_id) from item > Query finished, fetching results ... 
> 16 > Returned 1 row(s) in 1.51s > {code} > {code} > [impala:21000] > select count(distinct i_class_id), count(distinct > i_brand_id) from item; > Query: select count(distinct i_class_id), count(distinct i_brand_id) from item > ERROR: com.cloudera.impala.common.AnalysisException: Analysis exception (in > select count(distinct i_class_id), count(distinct i_brand_id) from item) > at > com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:133) > at > com.cloudera.impala.service.Frontend.createExecRequest(Frontend.java:221) > at > com.cloudera.impala.service.JniFrontend.createExecRequest(JniFrontend.java:89) > Caused by: com.cloudera.impala.common.AnalysisException: all DISTINCT > aggregate functions need to have the same set of parameters as COUNT(DISTINCT > i_class_id); deviating function: COUNT(DISTINCT i_brand_id) > at > com.cloudera.impala.analysis.AggregateInfo.createDistinctAggInfo(AggregateInfo.java:196) > at > com.cloudera.impala.analysis.AggregateInfo.create(AggregateInfo.java:143) > at > com.cloudera.impala.analysis.SelectStmt.createAggInfo(SelectStmt.java:466) > at > com.cloudera.impala.analysis.SelectStmt.analyzeAggregation(SelectStmt.java:347) > at com.cloudera.impala.analysis.SelectStmt.analyze(SelectStmt.java:155) > at > com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:130) > ... 
2 more > {code} > Hive supports this: > {code} > $ hive -e "select count(distinct i_class_id), count(distinct i_brand_id) from > item;" > Logging initialized using configuration in > file:/etc/hive/conf.dist/hive-log4j.properties > Hive history file=/tmp/grahn/hive_job_log_grahn_201303052234_1625576708.txt > Total MapReduce jobs = 1 > Launching Job 1 out of 1 > Number of reduce tasks determined at compile time: 1 > In order to change the average load for a reducer (in bytes): > set hive.exec.reducers.bytes.per.reducer= > In order to limit the maximum number of reducers: > set hive.exec.reducers.max= > In order to set a constant number of reducers: > set mapred.reduce.tasks= > Starting Job = job_201302081514_0073, Tracking URL = > http://impala:50030/jobdetails.jsp?jobid=job_201302081514_0073 > Kill Command = /usr/lib/hadoop/bin/hadoop job > -Dmapred.job.tracker=m0525.mtv.cloudera.com:8021 -kill job_201302081514_0073 > Hadoop job information for Stage-1: number of mappers: 1; number of reducers: > 1 > 2013-03-05 22:34:43,255 Stage-1 map = 0%, reduce = 0% > 2013-03-05 22:34:49,323 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.81 > sec > 2013-03-05 22:34:50,337 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.81 > sec > 2013-03-05 22:34:51,351 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.81 > sec > 2013-03-05 22:34:52,360 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.81 > sec > 2013-03-05 22:34:53,370 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.81 > sec > 2013-03-05 22:34:54,379 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.81 > sec > 2013-03-05 22:34:55,389 Stage-1 map = 100%, reduce = 100%, Cumulative CPU > 8.58 sec > 2013-03-05 22:34:56,402 Stage-1 map = 100%, reduce = 100%, Cumulative CPU > 8.58 sec > 2013-03-05 22:34:57,413 Stage-1 map = 100%, reduce = 100%, Cumulative CPU > 8.58 sec > 2013-03-05 22:34:58,424 Stage-1
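The APPX_COUNT_DISTINCT rewrite the comment above relies on can be sketched as follows (a hypothetical session; the `item` table and its columns come from the example in the issue description):

{code:sql}
-- With this query option enabled, Impala rewrites COUNT(DISTINCT col)
-- to the approximate NDV(col), which lifts the single-DISTINCT-expression
-- limitation at the cost of exactness:
SET APPX_COUNT_DISTINCT=true;

-- Both distinct counts can now appear in one query block, since each is
-- internally evaluated as NDV():
SELECT COUNT(DISTINCT i_class_id),
       COUNT(DISTINCT i_brand_id)
FROM item;
{code}

This is why BI tools with hardcoded count(distinct) SQL still work once the option is set at the session level.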
[jira] [Commented] (IMPALA-110) Add support for multiple distinct operators in the same query block
[ https://issues.apache.org/jira/browse/IMPALA-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661192#comment-16661192 ] Ruslan Dautkhanov commented on IMPALA-110: -- APPX_COUNT_DISTINCT wouldn't go away. You would still have the option to run an approximate count distinct, as those are much cheaper and faster on high-cardinality columns, and still a good option if you're not looking for an exact answer. And no, it wouldn't go back to the one-count-distinct-per-query limitation. That's my understanding.
[jira] [Commented] (IMPALA-110) Add support for multiple distinct operators in the same query block
[ https://issues.apache.org/jira/browse/IMPALA-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628963#comment-16628963 ] Ruslan Dautkhanov commented on IMPALA-110: -- Hooray!! Congrats - great work everyone involved. Will this be part of Impala 2.13 release / CDH 5.16? Thank you!
[jira] [Commented] (IMPALA-7204) Add support for GROUP BY ROLLUP
[ https://issues.apache.org/jira/browse/IMPALA-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530549#comment-16530549 ] Ruslan Dautkhanov commented on IMPALA-7204: --- Thanks [~tarmstrong] ! Would be great if it also adds infrastructure to run these heaviest operations with intra-node parallelism. Most of our query execution time comes from count(distinct), as we mostly do count(distinct) on high-cardinality columns. cc [~twmarshall] > Add support for GROUP BY ROLLUP > --- > > Key: IMPALA-7204 > URL: https://issues.apache.org/jira/browse/IMPALA-7204 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Affects Versions: Impala 3.0, Impala 2.12.0 >Reporter: Ruslan Dautkhanov >Priority: Critical > Labels: GROUP_BY, sql > > Now suppose that we'd like to analyze our sales data, to study the amount of > sales that is occurring for different products, in different states and > regions. Using the ROLLUP feature of SQL 2003, we could issue the query: > {code:sql} > select region, state, product, sum(sales) total_sales > from sales_history > group by rollup (region, state, product) > {code} > Semantically, the above query is equivalent to > > {code:sql} > select region, state, product, sum(sales) total_sales > from sales_history > group by region, state, product > union > select region, state, null, sum(sales) total_sales > from sales_history > group by region, state > union > select region, null, null, sum(sales) total_sales > from sales_history > group by region > union > select null, null, null, sum(sales) total_sales > from sales_history > > {code} > The query might produce results that looked something like: > {noformat} > REGION STATE PRODUCT TOTAL_SALES > -- - --- --- > null null null 6200 > EAST MA BOATS 100 > EAST MA CARS 1500 > EAST MA null 1600 > EAST NY BOATS 150 > EAST NY CARS 1000 > EAST NY null 1150 > EAST null null 2750 > WEST CA BOATS 750 > WEST CA CARS 500 > WEST CA null 1250 > WEST AZ BOATS 2000 > WEST AZ CARS 
200 > WEST AZ null 2200 > WEST null null 3450 > {noformat} > We have a lot of production queries that work around this missing Impala > functionality by having three UNION ALLs. Physical execution plan shows > Impala actually reads full fact table three times. So it could be a three > times improvement (or more, depending on number of columns that are being > rolled up). > I can't find another SQL on Hadoop engine that doesn't support this feature. > *Checked Spark, Hive, PIG, Flink and some other engines - they all do > support this basic SQL feature*. > Would be great to have a matching feature in Impala too.
[jira] [Updated] (IMPALA-7204) Add support for GROUP BY ROLLUP
[ https://issues.apache.org/jira/browse/IMPALA-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated IMPALA-7204: -- Description: Now suppose that we'd like to analyze our sales data, to study the amount of sales that is occurring for different products, in different states and regions. Using the ROLLUP feature of SQL 2003, we could issue the query: {code:sql} select region, state, product, sum(sales) total_sales from sales_history group by rollup (region, state, product) {code} Semantically, the above query is equivalent to {code:sql} select region, state, product, sum(sales) total_sales from sales_history group by region, state, product union select region, state, null, sum(sales) total_sales from sales_history group by region, state union select region, null, null, sum(sales) total_sales from sales_history group by region union select null, null, null, sum(sales) total_sales from sales_history {code} The query might produce results that looked something like: {noformat} REGION STATE PRODUCT TOTAL_SALES -- - --- --- null null null 6200 EAST MA BOATS 100 EAST MA CARS 1500 EAST MA null 1600 EAST NY BOATS 150 EAST NY CARS 1000 EAST NY null 1150 EAST null null 2750 WEST CA BOATS 750 WEST CA CARS 500 WEST CA null 1250 WEST AZ BOATS 2000 WEST AZ CARS 200 WEST AZ null 2200 WEST null null 3450 {noformat} We have a lot of production queries that work around this missing Impala functionality by having three UNION ALLs. Physical execution plan shows Impala actually reads full fact table three times. So it could be a three times improvement (or more, depending on number of columns that are being rolled up). I can't find another SQL on Hadoop engine that doesn't support this feature. *Checked Spark, Hive, PIG, Flink and some other engines - they all do support this basic SQL feature*. Would be great to have a matching feature in Impala too. 
was: Now suppose that we'd like to analyze our sales data, to study the amount of sales that is occurring for different products, in different states and regions. Using the ROLLUP feature of SQL 2003, we could issue the query: {code:sql} select region, state, product, sum(sales) total_sales from sales_history group by rollup (region, state, product) {code} Semantically, the above query is equivalent to {code:sql} select region, state, product, sum(sales) total_sales from sales_history group by region, state, product union select region, state, null, sum(sales) total_sales from sales_history group by region, state union select region, null, null, sum(sales) total_sales from sales_history group by region union select null, null, null, sum(sales) total_sales from sales_history {code} The query might produce results that looked something like: {noformat} REGION STATE PRODUCT TOTAL_SALES -- - --- --- null null null 6200 EAST MA BOATS 100 EAST MA CARS 1500 EAST MA null 1600 EAST NY BOATS 150 EAST NY CARS 1000 EAST NY null 1150 EAST null null 2750 WEST CA BOATS 750 WEST CA CARS 500 WEST CA null 1250 WEST AZ BOATS 2000 WEST AZ CARS 200 WEST AZ null 2200 WEST null null 3450 {noformat} We have a lot of production queries that work around this missing Impala functionality by having three UNION ALLs. Physical execution plan shows Impala actually reads full fact table three times. So it could be a three times improvement (or more, depending on number of columns that are being rolled up). I can't find another SQL on Hadoop engine that doesn't support this feature. Checked Spark, Hive, PIG, Flink and some other engines - they all do support this basic SQL features. Would be great to have a matching feature in Impala too. 
[jira] [Created] (IMPALA-7204) Add support for GROUP BY ROLLUP
Ruslan Dautkhanov created IMPALA-7204: - Summary: Add support for GROUP BY ROLLUP Key: IMPALA-7204 URL: https://issues.apache.org/jira/browse/IMPALA-7204 Project: IMPALA Issue Type: Improvement Components: Backend Affects Versions: Impala 2.12.0, Impala 3.0 Reporter: Ruslan Dautkhanov Now suppose that we'd like to analyze our sales data, to study the amount of sales that is occurring for different products, in different states and regions. Using the ROLLUP feature of SQL 2003, we could issue the query: {code:sql} select region, state, product, sum(sales) total_sales from sales_history group by rollup (region, state, product) {code} Semantically, the above query is equivalent to {code:sql} select region, state, product, sum(sales) total_sales from sales_history group by region, state, product union select region, state, null, sum(sales) total_sales from sales_history group by region, state union select region, null, null, sum(sales) total_sales from sales_history group by region union select null, null, null, sum(sales) total_sales from sales_history {code} The query might produce results that looked something like: {noformat} REGION STATE PRODUCT TOTAL_SALES -- - --- --- null null null 6200 EAST MA BOATS 100 EAST MA CARS 1500 EAST MA null 1600 EAST NY BOATS 150 EAST NY CARS 1000 EAST NY null 1150 EAST null null 2750 WEST CA BOATS 750 WEST CA CARS 500 WEST CA null 1250 WEST AZ BOATS 2000 WEST AZ CARS 200 WEST AZ null 2200 WEST null null 3450 {noformat} We have a lot of production queries that work around this missing Impala functionality by having three UNION ALLs. Physical execution plan shows Impala actually reads full fact table three times. So it could be a three times improvement (or more, depending on number of columns that are being rolled up). I can't find another SQL on Hadoop engine that doesn't support this feature. Checked Spark, Hive, PIG, Flink and some other engines - they all do support this basic SQL features. 
Would be great to have a matching feature in Impala too.
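For reference, SQL:2003 also defines the GROUPING() function, which lets a consumer of the rollup result tell the NULLs that mark subtotal rows apart from real NULLs in the data. A sketch against the sales_history example above (Impala did not support this at the time of the report, so this is illustrative standard SQL, not Impala syntax):

{code:sql}
-- GROUPING(col) returns 1 when the NULL in a result row was produced
-- by rolling that column up into a subtotal, and 0 otherwise:
select region, state, product,
       grouping(region)  as region_rolled_up,
       grouping(state)   as state_rolled_up,
       grouping(product) as product_rolled_up,
       sum(sales)        as total_sales
from sales_history
group by rollup (region, state, product);
{code}

The row where all three GROUPING() flags are 1 is the grand total, which the UNION-based workaround has to produce with a separate un-grouped SELECT.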
[jira] [Commented] (IMPALA-110) Add support for multiple distinct operators in the same query block
[ https://issues.apache.org/jira/browse/IMPALA-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498545#comment-16498545 ] Ruslan Dautkhanov commented on IMPALA-110: -- {quote}[~alex.behm] added a comment - 27/Oct/17 21:53 The feature requires fundamental architectural changes from a tree-execution to a dag-execution model. [~alex.behm] added a comment - 30/Oct/17 10:04 Ruslan Dautkhanov, this is it. As far as I'm aware, there currently is no other feature that requires the switch. {quote} I thought of another use case for DAG execution: a MATERIALIZED subquery, IMPALA-7114. Would also be great to implement a MATERIALIZE hint as in Oracle. [https://oracle-base.com/articles/misc/with-clause#materialize-hint] Basically, in Oracle we can run WITH subqueries that are persisted for the duration of the query, so if that subquery is used multiple times in the outer query, a heavy join or any other operation has to run only once. If it's executed as a DAG, then all subqueries dependent on this materialized subquery can run in parallel. Hopefully that would tip Impala toward switching to the DAG execution model.
[jira] [Created] (IMPALA-7114) MATERIALIZE a subquery hint
Ruslan Dautkhanov created IMPALA-7114: - Summary: MATERIALIZE a subquery hint Key: IMPALA-7114 URL: https://issues.apache.org/jira/browse/IMPALA-7114 Project: IMPALA Issue Type: New Feature Affects Versions: Impala 2.12.0, Impala 3.0 Reporter: Ruslan Dautkhanov Would also be great to implement MATERIALIZE hint as in Oracle too. [https://oracle-base.com/articles/misc/with-clause#materialize-hint] Basically in Oracle we could run WITH subqueries that will be persisted for the duration of the query, so if that subquery is used multiple times in outer query, a heavy join or any other operation has to run only once.
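The Oracle behavior the request refers to looks roughly like this (table and column names are made up for illustration; the hint placement follows the oracle-base article linked above):

{code:sql}
-- The MATERIALIZE hint forces the WITH subquery to be evaluated once
-- into a temporary result set, instead of being inlined into each of
-- its two uses below:
WITH heavy AS (
  SELECT /*+ MATERIALIZE */ customer_id, SUM(amount) AS total
  FROM   big_fact
  GROUP  BY customer_id
)
SELECT customer_id, total FROM heavy WHERE total > 1000
UNION ALL
SELECT customer_id, total FROM heavy WHERE total < 0;
{code}

Without materialization, the aggregation over big_fact would run twice, once per branch of the UNION ALL.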
[jira] [Created] (IMPALA-7112) Change from tree-execution to a dag-execution model
Ruslan Dautkhanov created IMPALA-7112: - Summary: Change from tree-execution to a dag-execution model Key: IMPALA-7112 URL: https://issues.apache.org/jira/browse/IMPALA-7112 Project: IMPALA Issue Type: Improvement Affects Versions: Impala 2.12.0, Impala 3.0 Reporter: Ruslan Dautkhanov Quoting [~alex.behm] from IMPALA-110 comment : {quote} * The feature requires fundamental architectural changes from a tree-execution to a dag-execution model. {quote}
[jira] [Commented] (IMPALA-3446) Materialized views
[ https://issues.apache.org/jira/browse/IMPALA-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498527#comment-16498527 ] Ruslan Dautkhanov commented on IMPALA-3446: --- Would also be great to implement MATERIALIZE hint as in Oracle too. [https://oracle-base.com/articles/misc/with-clause#materialize-hint] Basically in Oracle we could run WITH subqueries that will be persisted for the duration of the query, so if that subquery is used multiple times in outer query, a heavy join or any other operation has to run only once. > Materialized views > -- > > Key: IMPALA-3446 > URL: https://issues.apache.org/jira/browse/IMPALA-3446 > Project: IMPALA > Issue Type: New Feature > Components: Frontend >Affects Versions: Impala 2.5.0 >Reporter: Marcell Szabo >Priority: Minor > > This JIRA is a placeholder for this big topic. > Some user stories I can imagine under the epic: > # materialized view is a CTAS where the SELECT is saved in the HMS and can be > rerun by a simple command > # materialized view detects that the source data has changed and falls back > to be a view instead of SELECT * > # materialized view detects that the source data has changed and reruns the > CTAS automatically > # ... reruns only the necessary changes (e.g. if only some of the partitions > change) > # Impala and Hive to have common semantics of materialized views > (https://issues.apache.org/jira/browse/HIVE-10459) > # materialized view stores extra statistics that help the optimizer, storing > the full resultset is optional > # query optimizer checks for every query whether part of the query plan can > be covered by an existing materialized view