It’s the combinatorial explosion to get to those 38 tuples that’s the problem. What do the cardinalities of each of the “columns” (range indexes) look like? Is there a way you can reduce those?
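One way to look at those cardinalities, as an untested sketch: it assumes the element range indexes on Site, Department and LOB and the Audit_Date range used in the queries quoted below, and it counts the distinct values in each index within that date range using cts:values.

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumes element range indexes exist on Site, Department,
   LOB and Audit_Date, as in the queries quoted in this thread. :)
let $query :=
  cts:and-query((
    cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
    cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
  ))
for $qn in (xs:QName("Site"), xs:QName("Department"), xs:QName("LOB"))
return
  (: number of distinct values of this index among documents matching $query :)
  fn:concat(
    fn:local-name-from-QName($qn), ": ",
    fn:count(cts:values(cts:element-reference($qn), (), (), $query))
  )
```

If one of the counts turns out to be much larger than the others, that index is probably the first candidate to reduce.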
Justin

--
Justin Makeig
Director, Product Management
MarkLogic

> On Sep 23, 2016, at 12:53 PM, Mark Shanks <markshanks...@hotmail.com> wrote:
>
> I've already said it wasn't due to a high number of value-tuples. There are
> only 38 value-tuples returned in total. Hence, limiting the result to the
> first 100 [1 to 100] as you suggested is the same as the original query, and
> the execution time is the same. I ran the code with your modification to
> confirm this.
>
> From: general-boun...@developer.marklogic.com
> <general-boun...@developer.marklogic.com> on behalf of Rob Szkutak
> <rob.szku...@marklogic.com>
> Sent: Saturday, 24 September 2016 2:45:32 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates
>
> Hi,
>
> My assumption, as I've written previously, would be #3.
>
> A very simple way to check would be cts:value-tuples()[1 to 100]. Adding the
> [1 to 100] on the end limits you to returning no more than the first 100
> tuples of your result set. It wouldn't reduce the number of documents that
> are evaluated. (To prove that, you could also do [100 to 200].) If your
> theory about #2 is correct, then adding [1 to 100] shouldn't improve
> performance.
>
> Best,
> Rob
>
> Rob Szkutak
> Senior Consultant
> MarkLogic Corporation
> rob.szku...@marklogic.com
> www.marklogic.com
>
> From: general-boun...@developer.marklogic.com
> [general-boun...@developer.marklogic.com] on behalf of Mark Shanks
> [markshanks...@hotmail.com]
> Sent: Friday, September 23, 2016 11:36 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates
>
> Yes, many values were fine with a 10,000 document set but slowed down
> massively when run against several million. To be clear, there are at least 3
> counts we could be talking about. 1) The total number of documents in the
> database.
> 2) The number of documents that the query is restricted to (such as
> restricting to a certain date range). 3) The total number of value-tuples
> returned. My experience is that number 2 is driving the slowness: the
> total number of value-tuples returned may be the same, but when MarkLogic
> needs to determine this set over millions of documents rather than a small
> number, performance degrades more than would be expected based on the number
> alone, at least compared to the case of returning only 2 facets.
>
> I'm still unclear about what is going on under the hood in MarkLogic. The
> following link (https://docs.marklogic.com/guide/search-dev/lexicon) talks
> about value co-occurrence lexicons. If this is built, then 2 facets could
> just refer to this and would result in the extremely fast performance
> observed. On the other hand, 3 or more facets would not have a pre-prepared
> lexicon to query. The documentation isn't clear whether a co-occurrence
> lexicon is built whenever an index is built, or whether it needs to be
> specifically configured. The documentation about creating lexicons points you
> to the "Text Indexing" and "Element/Attribute Range Indexes and Lexicons"
> chapters of the Administrator's Guide, but these don't mention
> co-occurrence lexicons at all. So it isn't clear how you actually get a
> co-occurrence lexicon built.
>
> Thanks.
>
> From: general-boun...@developer.marklogic.com
> <general-boun...@developer.marklogic.com> on behalf of Rob Szkutak
> <rob.szku...@marklogic.com>
> Sent: Friday, 23 September 2016 10:13:41 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates
>
> Hi,
>
> I thought in your earlier email you implied that many values were fine with a
> 10,000 document set but slowed down when run against several million? This
> led me to believe the slowdown is caused by returning too many tuples.
>
> A simple test to confirm whether it's a problem with the size of the result
> set would be to limit the size of the result set and see if your performance
> improves.
>
> Best,
> Rob
>
> Rob Szkutak
> Senior Consultant
> MarkLogic Corporation
> rob.szku...@marklogic.com
> www.marklogic.com
>
> From: general-boun...@developer.marklogic.com
> [general-boun...@developer.marklogic.com] on behalf of Mark Shanks
> [markshanks...@hotmail.com]
> Sent: Thursday, September 22, 2016 7:02 PM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates
>
> Thanks. The point is that the execution time isn't increasing at an
> exponential rate. Note also that each of the facets had about the same number
> of entries, so it isn't as if the number of tuples increased from, e.g., 50
> to 4 million. I find it interesting that MarkLogic has a separate function,
> cts:value-co-occurrences, for looking at effectively 2 facets. It seems that
> 2 facets may be cached in some way, or that some shortcut is provided for
> their computation, whereas more than 2 needs to go a longer way that requires
> much more processing than either 1 or 2.
>
> From: general-boun...@developer.marklogic.com
> <general-boun...@developer.marklogic.com> on behalf of Rob Szkutak
> <rob.szku...@marklogic.com>
> Sent: Friday, 23 September 2016 4:58:01 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates
>
> Hi,
>
> As you add more values to the value-tuples call, you will typically
> exponentially increase the number of results you receive. The total number of
> results will be the total number of all possible unique combinations of all
> values. More values means more unique combinations of all values.
>
> Suppose your code had:
>
> for $each in $tuples
> return
>   fn:concat()
>
> If you have 4 million documents, you could be returning 4 million tuples at
> most or easily returning some other number of tuples in the millions.
>
> If you wrote code on any platform that did something like "for each tuple in
> a set of millions do something", then you would expect that processing to
> take some time.
>
> So, what are your options?
>
> 1) You could order your tuples by the most (or least) common ones and then
> paginate the results, returning a much smaller number for each page.
>
> 2) You could cache the information as data is ingested into a document and
> then pull that document instead of doing all the work to figure it out on the
> fly.
>
> 3) You could investigate upgrading your hardware and see if that helps the
> processing complete more quickly.
>
> I would personally recommend #1. If you're getting back a large number of
> results, you'll absolutely find #1 to be the most navigable alternative.
>
> Best,
> Rob
>
> Rob Szkutak
> Senior Consultant
> MarkLogic Corporation
> rob.szku...@marklogic.com
> www.marklogic.com
>
> From: general-boun...@developer.marklogic.com
> [general-boun...@developer.marklogic.com] on behalf of Mark Shanks
> [markshanks...@hotmail.com]
> Sent: Thursday, September 22, 2016 1:32 PM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates
>
> As a follow-up, we found that the query was super fast with a small dataset
> (e.g., 10,000 records). On the other hand, with a large dataset (40 million,
> and pulling around 1 million records), we found that the query would be
> super fast with 1 or 2 facets, e.g.:
>
> let $tuples :=
>   cts:value-tuples(
>     (
>       cts:element-reference(xs:QName("Site"))
>     ),
>     (),
>     cts:and-query((
>       cts:element-range-query(xs:QName("Audit_Date"), ">",
>         xs:date("2010-01-01")),
>       cts:element-range-query(xs:QName("Audit_Date"), "<",
>         xs:date("2011-01-01"))
>     ))
>   )
>
> or
>
> let $tuples :=
>   cts:value-tuples(
>     (
>       cts:element-reference(xs:QName("Site")),
>       cts:element-reference(xs:QName("Department"))
>     ),
>     (),
>     cts:and-query((
>       cts:element-range-query(xs:QName("Audit_Date"), ">",
>         xs:date("2010-01-01")),
>       cts:element-range-query(xs:QName("Audit_Date"), "<",
>         xs:date("2011-01-01"))
>     ))
>   )
>
> but would take a massive performance hit once the facets are increased to 3,
> and 4 was much slower again. E.g.:
>
> let $tuples :=
>   cts:value-tuples(
>     (
>       cts:element-reference(xs:QName("Site")),
>       cts:element-reference(xs:QName("Department")),
>       cts:element-reference(xs:QName("LOB"))
>     ),
>     (),
>     cts:and-query((
>       cts:element-range-query(xs:QName("Audit_Date"), ">",
>         xs:date("2010-01-01")),
>       cts:element-range-query(xs:QName("Audit_Date"), "<",
>         xs:date("2011-01-01"))
>     ))
>   )
>
> By performance hit, I mean the first two queries would take 1 second each.
> Pulling 3 facets would take 250 seconds, and pulling 4 facets would take 350
> seconds.
> Anyone have any idea of what is going on under the hood to lead to
> such a breaking point between 1-2 facets and more facets? Is there any better
> way to do the query in such circumstances to avoid the performance hit?
>
> Thanks.
>
> From: general-boun...@developer.marklogic.com
> <general-boun...@developer.marklogic.com> on behalf of Mark Shanks
> <markshanks...@hotmail.com>
> Sent: Wednesday, 21 September 2016 4:35:32 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates
>
> Hi Rob,
>
> Your suggestion worked very well! Super fast, at least with the relatively
> small dataset I'm using at present.
>
> Thanks.
>
> From: general-boun...@developer.marklogic.com
> <general-boun...@developer.marklogic.com> on behalf of Rob Szkutak
> <rob.szku...@marklogic.com>
> Sent: Saturday, 17 September 2016 7:28:01 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates
>
> Hi,
>
> The fastest way to do that I can think of would be to index Data/Site,
> Data/Department, Data/LOB, and Data/Audit_Date.
>
> Next, you could use cts:value-tuples() to build your result set directly out
> of the in-memory indexes without needing to pull document fragments.
> Finally, you would just need to return your concatenation.
>
> It would look something like this (not tested):
>
> let $tuples :=
>   cts:value-tuples(
>     (
>       cts:element-reference(xs:QName("Site")),
>       cts:element-reference(xs:QName("Department")),
>       cts:element-reference(xs:QName("LOB"))
>     ),
>     (),
>     cts:and-query((
>       cts:element-range-query(xs:QName("Audit_Date"), ">",
>         xs:date("2010-01-01")),
>       cts:element-range-query(xs:QName("Audit_Date"), "<",
>         xs:date("2011-01-01")),
>       cts:or-query((
>         cts:element-value-query(xs:QName("Classification"), "Finding"),
>         cts:element-value-query(xs:QName("Classification"), "Observation")
>       ))
>     ))
>   )
>
> for $each in $tuples
> return
>   fn:concat($each[1], "|", $each[2], "|", $each[3], "|", cts:frequency($each))
>
> Best,
> Rob
>
> Rob Szkutak
> Senior Consultant
> MarkLogic Corporation
> rob.szku...@marklogic.com
> www.marklogic.com
>
> From: general-boun...@developer.marklogic.com
> [general-boun...@developer.marklogic.com] on behalf of Mark Shanks
> [markshanks...@hotmail.com]
> Sent: Friday, September 16, 2016 3:55 PM
> To: 'General MarkLogic Developer Discussion'
> Subject: [MarkLogic Dev General] Speeding up xquery returning aggregates
>
> Hi,
>
> I'm trying to find the best way to return the results of what would be the
> following equivalent SQL statement:
>
> select count(*) from Data
> where Audit_Date > "2010-01-01" and Audit_Date < "2011-01-01" and
> (Classification = "Finding" or Classification = "Observation")
> group by Site, Department, LOB
>
> I didn't test this SQL statement, but it should give you the idea...
> Anyway, I came up with the following XQuery equivalent:
>
> for $s in distinct-values(/Data/Site)
> return
>   for $d in distinct-values(/Data/Department)
>   return
>     for $lob in distinct-values(/Data/LOB)
>     return concat($s,'|',$d,'|',$lob,'|',
>       count(
>         for $x in (/Data[Site=$s and Department=$d and LOB=$lob and
>           (Classification='Finding' or Classification='Observation')])
>         let $date as xs:dateTime := $x/Audit_Date
>         where $date gt xs:dateTime("2010-01-01T00:00:00")
>           and $date lt xs:dateTime("2011-01-01T00:00:00")
>         return ($x)
>       )
>     )
>
> It works fine and is not super-slow, but isn't particularly fast either. Is
> this the most efficient way to get this type of information out of MarkLogic?
> Assuming the fields are indexed, would some search command be faster? Or
> maybe subset the data better?
>
> Thanks,
>
> Mark
> _______________________________________________
> General mailing list
> General@developer.marklogic.com
> Manage your subscription at:
> http://developer.marklogic.com/mailman/listinfo/general