Hi Eugene, Answering your questions: 1) Grouping is performed on both mapper and reducer (coordinator). If you group be affinity key, you may try setting "SqlFieldsQuery.colocated=true" to bypass grouping on reducer 2) For this specific query H2 will store (customer_id, count(*), sum(views)) for every customer_id. It is hard to guess how much space it would take in heap, but I think it would be ~50-100 bytes per customer_id. So if you have N customers, it would be (100 * N) bytes 3) Please see https://apacheignite-sql.readme.io/docs/performance-and-debugging
Vladimir. On Thu, Aug 30, 2018 at 5:57 PM eugene miretsky <eugene.miret...@gmail.com> wrote: > Thanks against for the detailed response! > > Our main use case is preforming large SQL queries over tables with 200M+ > rows - wanted to give you a bit more details and context you can pass along > > A simple example would be: > > - Table: customer_id, date, category, views, clicks ( pkey = > "customer_id, date", affinity key = date ) > - Query: SELECT count(*) where date < X AND categroy in (C1, C2, C3) > GROUP BY customer_id HAVING SUM(views) > 20 > > My main concernse are > 1) How is the group by performed. You mentioend that it is performend on > the coordinator, I was coping that singe we are grouping using an colomn > that is an affintiy key, each node will be able to do it's own group by > 2) How much heap should I allocate for the group by stage > 3) General performance tips > > Cheers, > Eugene > > > On Thu, Aug 30, 2018 at 1:32 AM Denis Magda <dma...@apache.org> wrote: > >> Eugene, >> >> Just want to be sure you know about the existence of the following pages >> which elaborate on Ignite memory architecture in details: >> >> - >> >> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Entriesandpagesindurablememory >> - >> >> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood >> >> >> >>> 1) Are indexs loaded into heap (when used)? >>> >> >> Something might be copied to disk but in most of the cases we perform >> comparisons and other operations directly off-heap. >> See org.apache.ignite.internal.processors.query.h2.database.InlineIndexHelper >> and related classes. >> >> 2) Are full pages loaded into heap, or only the matching records? >>> >> >> Matching records (result set) are presently loaded. The pages are not. >> >> >>> 3) When the query needs more processing than the exisiting index >>> (non-indexed columns, groupBy, aggreag) where/how does it happen? >>> >> >> We will be doing a full scan. Grouping and aggregations are finalized on >> the query coordinator which needs to get a full result set. >> >> 4) How is the query coordinator chosen? Is it the client node? How about >>> when using the web console? >>> >> >> That's your application. Web Console uses Ignite SQL APIs as well. >> >> >>> 5) What paralalism settings would your recomend, we were thinking to set >>> parallelJobsNumber to 1 and task parallelism to number of cores * 2 - >>> this way we can make sure that each job gets al the heap memory instead of >>> all jobs fighting each other. Not sure if it makes sense, and it will also >>> prevent us from making real time transactional transactional queries.(we >>> are hoping to use ignite for both olap and simple real time queries) >> >> >> I would start a separate discussion for this bringing this question to >> the attention of our SQL experts. I'm not the one of them. >> >> -- >> Denis >> >> On Mon, Aug 27, 2018 at 8:54 PM eugene miretsky < >> eugene.miret...@gmail.com> wrote: >> >>> Denis, thanks for the detailed response. >>> >>> A few more follow up questions >>> 1) Are indexs loaded into heap (when used)? >>> 2) Are full pages loaded into heap, or only the matching records? >>> 3) When the query needs more processing than the exisiting index >>> (non-indexed columns, groupBy, aggreag) where/how does it happen? >>> 4) How is the query coordinator chosen? Is it the client node? How about >>> when using the web console? >>> 5) What paralalism settings would your recomend, we were thinking to set >>> parallelJobsNumber to 1 and task parallelism to number of cores * 2 - >>> this way we can make sure that each job gets al the heap memory instead of >>> all jobs fighting each other. Not sure if it makes sense, and it will also >>> prevent us from making real time transactional transactional queries.(we >>> are hoping to use ignite for both olap and simple real time queries) >>> >>> Cheers, >>> Eugene >>> >>> >>> On Sat, Aug 25, 2018 at 3:25 AM Denis Magda <dma...@apache.org> wrote: >>> >>>> Hello Eugene, >>>> >>>> 1) In what format is data stored off heap? >>>> >>>> >>>> Data is always stored in the binary format let it be on-heap, off-heap >>>> or Ignite persistence. >>>> https://apacheignite.readme.io/docs/binary-marshaller >>>> >>>> 2) What happens when a SQL query is executed, in particular >>>> >>>>> >>>>> - How is H2 used? How is data loaded in H2? What if some of the >>>>> data is on disk? >>>>> >>>>> H2 is used to build execution plans for SELECTs. H2 calls Ignite's >>>> B+Tree based indexing implementation to see which indexes are set. All the >>>> data and indexes are always stored in Ignite (off-heap + disk). >>>> >>>>> >>>>> - When is data loaded into heap, and how much? Is only the output >>>>> of H2 loaded, or everything? >>>>> >>>>> Queries results are stored in Java heap temporarily. Once the result >>>> set is read by your application, it will be garbage collected. >>>> >>>>> >>>>> - How is the reduce stage performed? Is it performed only on one >>>>> node (hence that node needs to load all the data into memory) >>>>> >>>>> Correct, the final result set is reduced on a query coordinator - your >>>> application that executed a SELECT. >>>> >>>> 3) What happens when Ingite runs out of memory during execution? Is >>>>> data evictied to disk (if persistence is enabled)? >>>> >>>> >>>> I guess you mean what happens if a result set doesn't fit in RAM during >>>> the execution, right? If so, then OOM will occur. We're working on an >>>> improvement that will offload the result set to disk to avoid OOM for all >>>> the scenarious: >>>> https://issues.apache.org/jira/browse/IGNITE-7526 >>>> >>>> >>>> >>>>> 4) Based on the code, it looks like I need to set my data region size >>>>> to at most 50% of available memory (to avoid the warning), this seems a >>>>> bit >>>>> wastefull. >>>> >>>> >>>> There is no such a requirement. I know many deployments use cases when >>>> one data region is given 20% of RAM, the other is given 40% and everything >>>> else is persisted to disk. >>>> >>>> 5) Do you have any general advice on benchmarking the memory >>>>> requirpement? So far I have not been able to find a way to check how much >>>>> memory each table takes on and off heap, and how much memory each query >>>>> takes. >>>> >>>> >>>> We use Yardstick for performance benchmarking: >>>> https://apacheignite.readme.io/docs/perfomance-benchmarking >>>> >>>> -- >>>> Denis >>>> >>>> On Fri, Aug 24, 2018 at 7:06 AM eugene miretsky < >>>> eugene.miret...@gmail.com> wrote: >>>> >>>>> Thanks! >>>>> >>>>> I am trying to understand when and how data is moved from off-heap to >>>>> on heap, particularly when using SQL. I took a look at the wiki >>>>> <https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood> >>>>> but >>>>> still have a few questions >>>>> >>>>> My understanding is that data is always store off-heap >>>>> >>>>> 1) In what format is data stored off heap? >>>>> 2) What happens when a SQL query is executed, in particular >>>>> >>>>> - How is H2 used? How is data loaded in H2? What if some of the >>>>> data is on disk? >>>>> - When is data loaded into heap, and how much? Is only the output >>>>> of H2 loaded, or everything? >>>>> - How is the reduce stage performed? Is it performed only on one >>>>> node (hence that node needs to load all the data into memory) >>>>> >>>>> 3) What happens when Ingite runs out of memory during execution? Is >>>>> data evictied to disk (if persistence is enabled)? >>>>> 4) Based on the code, it looks like I need to set my data region size >>>>> to at most 50% of available memory (to avoid the warning), this seems a >>>>> bit >>>>> wastefull. >>>>> 5) Do you have any general advice on benchmarking the memory >>>>> requirpement? So far I have not been able to find a way to check how much >>>>> memory each table takes on and off heap, and how much memory each query >>>>> takes. >>>>> >>>>> Cheers, >>>>> Eugene >>>>> >>>>> On Fri, Aug 24, 2018 at 8:06 AM, NSAmelchev <nsamelc...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hi Eugene, >>>>>> >>>>>> Yes, it's a misprint as Dmitry wrote. >>>>>> >>>>>> Ignite print this warning if nodes on local machine require more than >>>>>> 80% of >>>>>> physical RAM. >>>>>> >>>>>> From code, you can see that total heap/offheap memory summing >>>>>> from nodes having the same mac address. This way calculates total >>>>>> memory >>>>>> used >>>>>> by the local machine. >>>>>> >>>>>> -- >>>>>> Best wishes, >>>>>> Amelchev Nikita >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Sent from: http://apache-ignite-users.70518.x6.nabble.com/ >>>>>> >>>>> >>>>>