Re: [MarkLogic Dev General] Speeding up xquery returning aggregates

2016-09-23 Thread Justin Makeig
It’s the combinatorial explosion to get to those 38 tuples that’s the problem. 
What do the cardinalities of each of the “columns” (range indexes) look like? 
Is there a way you can reduce those?
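
Something like the following (untested) would show how many distinct values each of the references from your query contributes within the date window you're restricting to:

let $query := cts:and-query((
  cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
  cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
))
for $qname in (xs:QName("Site"), xs:QName("Department"), xs:QName("LOB"))
return fn:concat(
  xs:string($qname), ": ",
  (: number of distinct values the range index holds under this restriction :)
  fn:count(cts:values(cts:element-reference($qname), (), (), $query))
)

If one of those counts is much larger than the others, that's usually where the work is going.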

Justin

--
Justin Makeig
Director, Product Management
MarkLogic


> On Sep 23, 2016, at 12:53 PM, Mark Shanks <markshanks...@hotmail.com> wrote:
> 
> I've already said it wasn't due to a high number of value-tuples. There are 
> only 38 value-tuples returned in total. Hence, limiting the result to the 
> first 100 [1 to 100] as you suggested is the same as the original query, and 
> the execution time is the same. I ran the code with your modification to 
> confirm this.

Re: [MarkLogic Dev General] Speeding up xquery returning aggregates

2016-09-23 Thread Mark Shanks
I've already said it wasn't due to a high number of value-tuples. There are 
only 38 value-tuples returned in total. Hence, limiting the result to the first 
100 [1 to 100] as you suggested is the same as the original query, and the 
execution time is the same. I ran the code with your modification to confirm 
this.



Re: [MarkLogic Dev General] Speeding up xquery returning aggregates

2016-09-23 Thread Mary Holstege
On Fri, 23 Sep 2016 09:36:16 -0700, Mark Shanks wrote:
...
> I'm still unclear on what is going on under the hood in MarkLogic. The
> following link (https://docs.marklogic.com/guide/search-dev/lexicon)
> talks about value co-occurrence lexicons. If this is built, then 2
> facets could just refer to this and would result in the extremely fast
> performance observed. On the other hand, 3 or more facets would not have
> a pre-prepared lexicon to quiz. The documentation isn't clear whether a
> co-occurrence lexicon is built whenever an index is built, or whether it
> needs to be specifically configured. The documentation about creating
> lexicons points you to the 'Text Indexing' and 'Element/Attribute
> Range Indexes and Lexicons' chapters of the Administrator's Guide, but
> these then don't mention co-occurrence lexicons at all. So it isn't
> clear how you actually get a co-occurrence lexicon built.

There is no such thing as a co-occurrence lexicon, so it is never built:
there are co-occurrence lexicon calls. Co-occurrences are computed over
lexicons when you ask. The more lexicons involved in that call, the more
work it needs to do. The other big driver for performance in
cts:value-tuples calls is how many instances there are of each value. To find
co-occurrences of A, B, and C: for each value of A, for each document that
contains that value, for each value of B in that document, for each document
that contains that B value, get all the values of C. It isn't quite
exponential, because there is a certain amount of internal caching that
happens to avoid rework, but every additional lexicon added to the call makes
it harder. We don't cache the complete set of co-occurrences anywhere right now.
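
Conceptually (this is not the actual server code, just an illustration of the shape of the work for three lexicons, using the element names from this thread and ignoring frequencies):

let $base := cts:and-query((
  cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
  cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
))
for $site in cts:values(cts:element-reference(xs:QName("Site")), (), (), $base)
(: narrow to documents carrying this Site value :)
let $q1 := cts:and-query(($base, cts:element-range-query(xs:QName("Site"), "=", $site)))
for $dept in cts:values(cts:element-reference(xs:QName("Department")), (), (), $q1)
(: narrow again for this Department value :)
let $q2 := cts:and-query(($q1, cts:element-range-query(xs:QName("Department"), "=", $dept)))
for $lob in cts:values(cts:element-reference(xs:QName("LOB")), (), (), $q2)
return fn:concat($site, "|", $dept, "|", $lob)

Every additional reference adds another level of that nesting, which is why the cost grows quickly even when the final number of tuples stays small.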

//Mary


Re: [MarkLogic Dev General] Speeding up xquery returning aggregates

2016-09-23 Thread Rob Szkutak
Hi,

My assumption, as I've written previously, would be #3 (the total number of value-tuples returned).

A very simple way to check would be cts:value-tuples(...)[1 to 100]. Adding the
[1 to 100] predicate on the end limits the result to no more than the first 100
tuples of your result set; it wouldn't reduce the number of documents that are
evaluated. (To prove that, you could also try [100 to 200].) If your theory
about #2 (the number of documents the query is restricted to) is correct, then
adding [1 to 100] shouldn't improve performance.
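
Against the query from earlier in the thread, the test would look something like this (again, not tested):

cts:value-tuples(
  (
    cts:element-reference(xs:QName("Site")),
    cts:element-reference(xs:QName("Department")),
    cts:element-reference(xs:QName("LOB"))
  ),
  (),
  cts:and-query((
    cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
    cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
  ))
)[1 to 100]   (: caps what is returned; the tuples still have to be computed :)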

Best,
Rob

Rob Szkutak
Senior Consultant
MarkLogic Corporation
rob.szku...@marklogic.com
www.marklogic.com



Re: [MarkLogic Dev General] Speeding up xquery returning aggregates

2016-09-23 Thread Mark Shanks
Yes, many values were fine with a 10,000-document set but slowed down massively
when run against several million. To be clear, there are at least three counts we
could be talking about: 1) the total number of documents in the database; 2)
the number of documents that the query is restricted to (such as restricting to
a certain date range); 3) the total number of value-tuples returned. My
experience is that number 2 is driving the slowness (i.e., the total number
of value-tuples returned may be the same, but when MarkLogic needs to determine
this set over millions of documents rather than a small number, performance
degrades more than would be expected based on the number alone, at least
compared to the case of returning only 2 facets).


I'm still unclear on what is going on under the hood in MarkLogic. The
following link (https://docs.marklogic.com/guide/search-dev/lexicon) talks
about value co-occurrence lexicons. If this is built, then 2 facets could just
refer to this and would result in the extremely fast performance observed. On
the other hand, 3 or more facets would not have a pre-prepared lexicon to quiz.
The documentation isn't clear whether a co-occurrence lexicon is built whenever
an index is built, or whether it needs to be specifically configured. The
documentation about creating lexicons points you to the 'Text Indexing' and
'Element/Attribute Range Indexes and Lexicons' chapters of the Administrator's
Guide, but these then don't mention co-occurrence lexicons at all. So it isn't
clear how you actually get a co-occurrence lexicon built.
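
Is it just a matter of having the element range indexes in place, e.g. something like the following Admin API call (untested sketch, with "Documents" standing in for our actual database name), or is there more that needs to be configured?

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: add a string range index on Site -- presumably this is what gives the value lexicon? :)
let $config := admin:get-configuration()
let $index := admin:database-range-element-index(
                "string", "", "Site", "http://marklogic.com/collation/", fn:false())
let $config := admin:database-add-range-element-index(
                $config, xdmp:database("Documents"), $index)
return admin:save-configuration($config)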


Thanks.





Re: [MarkLogic Dev General] Speeding up xquery returning aggregates

2016-09-22 Thread Rob Szkutak
Hi,

I thought in your earlier email you implied that many values were fine with a
10,000-document set but slowed down when run against several million? This led
me to believe the slowdown is caused by returning too many tuples.

A simple test to confirm whether it's a problem with the size of the result set
would be to limit the size of the result set and see if your performance improves.

Best,
Rob

Rob Szkutak
Senior Consultant
MarkLogic Corporation
rob.szku...@marklogic.com
www.marklogic.com



Re: [MarkLogic Dev General] Speeding up xquery returning aggregates

2016-09-22 Thread Mark Shanks
Thanks. The point is that the execution time isn't increasing at an exponential
rate. Note also that each of the facets had about the same number of entries,
so it isn't as if the number of tuples increased from, e.g., 50 to 4 million. I
find it interesting that MarkLogic has a separate statement,
cts:value-co-occurrences, for looking at effectively 2 facets. It seems that
maybe 2 facets are cached in some way, or some shortcut is provided for their
computation, whereas more than 2 has to go a longer way that requires much
more processing than either 1 or 2.
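
For example, for two of the facets something like this (untested) presumably takes that shortcut:

cts:value-co-occurrences(
  cts:element-reference(xs:QName("Site")),
  cts:element-reference(xs:QName("Department")),
  (),
  cts:and-query((
    cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
    cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
  ))
)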



Re: [MarkLogic Dev General] Speeding up xquery returning aggregates

2016-09-22 Thread Rob Szkutak
Hi,

As you add more values to the value-tuples call, you will typically increase
the number of results you receive exponentially. The total number of results
will be the total number of all possible unique combinations of all values;
more values means more unique combinations.

If in your code you had something like:

for $each in $tuples
return
  fn:concat(...)

If you have 4 million documents, you could be returning at most 4 million
tuples, or easily some other number of tuples in the millions.

If you wrote code on any platform that did something like "for each tuple in a
set of millions, do something", then you should expect that processing to take
some time.

So, what are your options?

1) You could order your tuples by the most (or least) common ones and then 
paginate the results, returning a much smaller number for each page.

2) You could cache the information as data is ingested into a document and then 
pull that document instead of doing all the work to figure it out on the fly.

3) You could investigate upgrading your hardware and see if that helps the 
processing complete more quickly.

I would personally recommend #1. If you're getting back a large number of
results, you'll find #1 to be the most navigable alternative; a quick sketch of
that is below.
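
For example, #1 against your three references might look something like this (not tested; "frequency-order"/"descending" sort the tuples by frequency, and the predicate takes one page):

let $page-size := 100
let $page := 1
let $tuples :=
  cts:value-tuples(
    (
      cts:element-reference(xs:QName("Site")),
      cts:element-reference(xs:QName("Department")),
      cts:element-reference(xs:QName("LOB"))
    ),
    ("frequency-order", "descending"),
    cts:and-query((
      cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
      cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
    ))
  )[(($page - 1) * $page-size + 1) to ($page * $page-size)]
for $each in $tuples
let $values := json:array-values($each)
return
  fn:concat($values[1], "|", $values[2], "|", $values[3], "|", cts:frequency($each))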

Best,
Rob

Rob Szkutak
Senior Consultant
MarkLogic Corporation
rob.szku...@marklogic.com
www.marklogic.com



Re: [MarkLogic Dev General] Speeding up xquery returning aggregates

2016-09-22 Thread Mark Shanks
As a follow-up, we found that the query was super fast with a small dataset
(e.g., 10,000 records). On the other hand, with a large dataset (40 million
records, with the query pulling around 1 million), we found that the query
would be super fast with 1 or 2 facets, e.g.:


let $tuples :=
  cts:value-tuples(
    (
      cts:element-reference(xs:QName("Site"))
    ),
    (),
    cts:and-query((
      cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
      cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
    ))
  )

or


let $tuples :=
  cts:value-tuples(
    (
      cts:element-reference(xs:QName("Site")),
      cts:element-reference(xs:QName("Department"))
    ),
    (),
    cts:and-query((
      cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
      cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
    ))
  )

but it would take a massive performance hit once the facets were increased to 3,
and 4 was much slower again. E.g.:


let $tuples :=
  cts:value-tuples(
    (
      cts:element-reference(xs:QName("Site")),
      cts:element-reference(xs:QName("Department")),
      cts:element-reference(xs:QName("LOB"))
    ),
    (),
    cts:and-query((
      cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
      cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
    ))
  )

By performance hit, I mean the first two queries would take 1 second each,
pulling 3 facets would take 250 seconds, and pulling 4 facets would take 350
seconds. Does anyone have any idea what is going on under the hood to cause
such a breaking point between 1-2 facets and 3 or more? Is there a better way
to do the query in such circumstances to avoid the performance hit?


Thanks.


Re: [MarkLogic Dev General] Speeding up xquery returning aggregates

2016-09-20 Thread Mark Shanks
Hi Rob,


Your suggestion worked very well! Super fast, at least with the relatively 
small dataset I'm using at present.


Thanks.




Re: [MarkLogic Dev General] Speeding up xquery returning aggregates

2016-09-16 Thread Rob Szkutak
Hi,

The fastest way to do that I can think of would be to put range indexes on
Data/Site, Data/Department, Data/LOB, and Data/Audit_Date.

Next, you could use cts:value-tuples() to build your result set directly out of
the in-memory range indexes without needing to pull document fragments. Finally,
you would just need to return your concatenation.

It would look something like this (not tested):

let $tuples :=
  cts:value-tuples(
    (
      cts:element-reference(xs:QName("Site")),
      cts:element-reference(xs:QName("Department")),
      cts:element-reference(xs:QName("LOB"))
    ),
    (),
    cts:and-query((
      cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
      cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01")),
      cts:or-query((
        cts:element-value-query(xs:QName("Classification"), "Finding"),
        cts:element-value-query(xs:QName("Classification"), "Observation")
      ))
    ))
  )

for $each in $tuples
let $values := json:array-values($each)
return
  fn:concat($values[1], "|", $values[2], "|", $values[3], "|", cts:frequency($each))

Best,
Rob

Rob Szkutak
Senior Consultant
MarkLogic Corporation
rob.szku...@marklogic.com
www.marklogic.com




[MarkLogic Dev General] Speeding up xquery returning aggregates

2016-09-16 Thread Mark Shanks
Hi,


I'm trying to find the best way to return the results of what would be the
equivalent of the following SQL statement:


select count(*) from Data

where Audit_Date > '2010-01-01' and Audit_Date < '2011-01-01' and
(Classification = 'Finding' or Classification = 'Observation')

group by Site, Department, LOB


I didn't test this SQL statement, but it should give you the idea... Anyway, I
came up with the following XQuery equivalent:


for $s in distinct-values(/Data/Site)
return
  for $d in distinct-values(/Data/Department)
  return
    for $lob in distinct-values(/Data/LOB)
    return concat($s, '|', $d, '|', $lob, '|',
      count(
        for $x in /Data[Site = $s and Department = $d and LOB = $lob and
                        (Classification = 'Finding' or Classification = 'Observation')]
        let $date as xs:dateTime := $x/Audit_Date
        where $date gt xs:dateTime("2010-01-01T00:00:00")
          and $date lt xs:dateTime("2011-01-01T00:00:00")
        return $x
      )
    )


It works fine and is not super slow, but it isn't particularly fast either. Is
this the most efficient way to get this type of information out of MarkLogic?
Assuming the fields are indexed, would some search command be faster? Or is
there a better way to subset the data?


Thanks,


Mark
___
General mailing list
General@developer.marklogic.com
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general