dsmiley commented on code in PR #4053:
URL: https://github.com/apache/solr/pull/4053#discussion_r2723043185


##########
solr/solr-ref-guide/modules/query-guide/pages/exporting-result-sets.adoc:
##########
@@ -23,6 +23,26 @@ This feature uses a stream sorting technique that begins to 
send records within
 
 The cases where this functionality may be useful include: session analysis, 
distributed merge joins, time series roll-ups, aggregations on high cardinality 
fields, fully distributed field collapsing, and sort-based stats.
 
+== Comparison with Cursors
+
+The `/export` handler offers several advantages over 
xref:pagination-of-results.adoc#fetching-a-large-number-of-sorted-results-cursors[cursor-based
 pagination] for streaming large result sets.
+
+With cursors, the query is re-executed for each page of results.
+In contrast, `/export` runs the filter query once and the resulting 
segment-level bitmasks are applied once per segment, after which the documents 
are simply iterated over.
+Additionally, the segments that existed when the stream was opened are held 
open for the duration of the export, eliminating the disappearing or duplicate 
document issues that can occur with cursors.
+The trade-off is that IndexReaders are kept around for longer periods of time.
+
+Another advantage of `/export` is significantly lower latency until the first 
document is returned, because the internal batch size is decoupled from the 
response message size.
+With cursors, you typically need to set the `rows` parameter to a high value 
(e.g., 100,000) to achieve decent throughput.
+However, this creates a "glugging" effect: when you request a large batch, 
Solr must build the entire payload and send it over the wire while your client 
waits.

Review Comment:
   I affirm the glugging, but your rationale/guessing is certainly false.  For 
SearchHandler, the payload is streamed/produced on the fly as it iterates 
documents.  The code isn't there; it's elsewhere in a ResponseWriter, if I 
recall.  Solr does have to do some up-front work: producing a list of 
document IDs that match the search, sorted as desired.  This is the 
"QTime".  Retrieving the data to return happens afterwards; it's not 
accumulated in memory but streamed, which lengthens the true elapsed time.
   
   Wouldn't `/export` have similar up-front costs to execute the query?
   
   Anyway, the broad strokes of your message look good.
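
To make the per-page round trip concrete, here is a minimal sketch of a cursor loop from the client's side. It only illustrates that each page is a separate request that the client must finish reading before asking for the next one; it says nothing about how Solr produces the response internally. The `cursorMark`, `nextCursorMark`, `rows` and `sort` names are Solr's; `fetch_page` and `make_fake_solr` are hypothetical helpers, and the fake's cursor encoding (a plain offset) is an assumption made only so the sketch runs without a server.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def iterate_with_cursor(
    fetch_page: Callable[[Dict[str, str]], dict],
    query: str,
    sort: str,
    rows: int,
) -> Iterator[dict]:
    """Yield every matching document, issuing one request per page."""
    cursor_mark = "*"  # Solr's documented starting value for a cursor
    while True:
        params = {
            "q": query,
            "sort": sort,  # must include the uniqueKey field as a tie-breaker
            "rows": str(rows),
            "cursorMark": cursor_mark,
        }
        response = fetch_page(params)  # the client waits here for each page
        yield from response["response"]["docs"]
        next_mark = response["nextCursorMark"]
        if next_mark == cursor_mark:  # unchanged mark means no more results
            return
        cursor_mark = next_mark


def make_fake_solr(docs: List[dict]) -> Tuple[Callable[[Dict[str, str]], dict], List[dict]]:
    """Hypothetical in-memory stand-in for Solr; records every request."""
    calls: List[dict] = []

    def fetch_page(params: Dict[str, str]) -> dict:
        calls.append(dict(params))
        # Assumption: the fake encodes the cursor as an offset. Real cursor
        # marks are opaque strings and must not be interpreted by clients.
        start = 0 if params["cursorMark"] == "*" else int(params["cursorMark"])
        page = docs[start:start + int(params["rows"])]
        return {
            "response": {"docs": page},
            "nextCursorMark": str(start + len(page)),
        }

    return fetch_page, calls
```

With a small `rows` value the number of round trips grows quickly, which is why clients tend to raise `rows` and then run into the glugging described above.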



##########
solr/solr-ref-guide/modules/query-guide/pages/exporting-result-sets.adoc:
##########
@@ -23,6 +23,26 @@ This feature uses a stream sorting technique that begins to 
send records within
 
 The cases where this functionality may be useful include: session analysis, 
distributed merge joins, time series roll-ups, aggregations on high cardinality 
fields, fully distributed field collapsing, and sort-based stats.
 
+== Comparison with Cursors
+
+The `/export` handler offers several advantages over 
xref:pagination-of-results.adoc#fetching-a-large-number-of-sorted-results-cursors[cursor-based
 pagination] for streaming large result sets.
+
+With cursors, the query is re-executed for each page of results.
+In contrast, `/export` runs the filter query once and the resulting 
segment-level bitmasks are applied once per segment, after which the documents 
are simply iterated over.
+Additionally, the segments that existed when the stream was opened are held 
open for the duration of the export, eliminating the disappearing or duplicate 
document issues that can occur with cursors.
+The trade-off is that IndexReaders are kept around for longer periods of time.
+
+Another advantage of `/export` is significantly lower latency until the first 
document is returned, because the internal batch size is decoupled from the 
response message size.
+With cursors, you typically need to set the `rows` parameter to a high value 
(e.g., 100,000) to achieve decent throughput.
+However, this creates a "glugging" effect: when you request a large batch, 
Solr must build the entire payload and send it over the wire while your client 
waits.
+Only after receiving and decoding this large payload can the client request 
the next batch, but in the interim Solr sits idle on this request.
+With the `/export` handler, these steps are decoupled - Solr can continue 
sorting and decoding/encoding documents while waiting for more demand from the 
client.
+
+The advantage of cursors is flexibility.
+A cursor mark can be persisted and resumed later, even across restarts, 
whereas an `/export` stream is entirely in-memory and must be consumed in a 
single session.

Review Comment:
   ```suggestion
   A `cursorMark` can be persisted and resumed later, even across restarts, or 
never continued if enough results were consumed to satisfy the use-case.  
   An `/export` stream must be consumed in a single session.
   ```
   I'm tempted to say that a stream should be completely consumed, but maybe 
`/export` can gracefully handle a client that doesn't want more data?  Do you 
know?
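
As an illustration of the "persisted and resumed later" point, a client could store the last `cursorMark` it received and pick it up again in a later run. This is only a sketch: the JSON file layout and the `save_cursor_mark` / `load_cursor_mark` helpers are hypothetical, not anything Solr provides.

```python
import json
from pathlib import Path


def save_cursor_mark(path: Path, cursor_mark: str) -> None:
    """Persist the most recent nextCursorMark so a later run can resume."""
    path.write_text(json.dumps({"cursorMark": cursor_mark}))


def load_cursor_mark(path: Path) -> str:
    """Return the saved mark, or "*" (start of the result set) if none exists."""
    if not path.exists():
        return "*"
    return json.loads(path.read_text())["cursorMark"]
```

The resumed request has to use the same `q`, `fq` and `sort` as the one that produced the mark. An `/export` stream has no equivalent token to save.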



##########
solr/solr-ref-guide/modules/query-guide/pages/exporting-result-sets.adoc:
##########
@@ -23,6 +23,26 @@ This feature uses a stream sorting technique that begins to 
send records within
 
 The cases where this functionality may be useful include: session analysis, 
distributed merge joins, time series roll-ups, aggregations on high cardinality 
fields, fully distributed field collapsing, and sort-based stats.
 
+== Comparison with Cursors

Review Comment:
   BTW I very much appreciate the extra effort here!
   
   Unless you are in the mood, don't go off and do performance experiments just 
because we're asking questions.  Say what you're comfortable claiming and not 
more and that's fine :-)



##########
solr/solr-ref-guide/modules/query-guide/pages/exporting-result-sets.adoc:
##########


Review Comment:
   I think we should at least cross-link between pagination-of-results.adoc 
and exporting-result-sets.adoc because they are obviously related.  Their 
embeddings ought to be similar ;-)



##########
solr/solr-ref-guide/modules/query-guide/pages/exporting-result-sets.adoc:
##########
@@ -78,6 +98,10 @@ The `fl` property defines the fields that will be exported 
with the result set.
 Any of the field types that can be sorted (i.e., int, long, float, double, 
string, date, boolean) can be used in the field list.
 The fields can be single or multi-valued.
 
+By default, fields in the field list must have docValues enabled.
+However, when the `includeStoredFields` parameter is set to `true`, fields 
with only stored values (no docValues) can also be included.
+Note that sort fields still require docValues regardless of this setting.

Review Comment:
   ```suggestion
   Note that sort fields still require docValues, regardless of this setting.
   ```



##########
solr/solr-ref-guide/modules/query-guide/pages/exporting-result-sets.adoc:
##########
@@ -49,7 +69,7 @@ The default value is `30000` but users may want to specify 
smaller values to lim
 An optional parameter `includeStoredFields` (default `false`) enables 
exporting fields that only have stored values (no docValues).
 When set to `true`, fields without docValues but with stored values can be 
included in the field list (`fl`).
 Note that retrieving stored fields may significantly impact export performance 
compared to docValues fields, as stored fields require additional I/O 
operations.
-Fields that have both docValues and stored values will always use docValues 
for optimal performance, regardless of this parameter setting.
+If all requested fields are `docValues=true` then the data will be read only 
from docValues. This behavior applies to fields that are also `stored=true` and 
does not depend on the value of the `includeStoredFields` parameter.

Review Comment:
   ```suggestion
   If all requested fields are `docValues=true` then the data will only be read 
from docValues.
   This behavior applies to fields that are also `stored=true` and does not 
depend on the value of the `includeStoredFields` parameter.
   ```
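
For readers who want to see the parameters together, here is a minimal sketch that assembles an `/export` request URL. The `q`, `sort`, `fl` and `includeStoredFields` parameter names come from the page under review; the `build_export_url` helper, the collection URL and the field names in the usage note are hypothetical.

```python
from typing import Sequence
from urllib.parse import urlencode


def build_export_url(
    collection_url: str,
    fl: Sequence[str],
    sort: str,
    query: str = "*:*",
    include_stored_fields: bool = False,
) -> str:
    """Build an /export URL; both the field list and the sort are required."""
    if not fl:
        raise ValueError("fl must name at least one field")
    if not sort.strip():
        raise ValueError("sort is required, and sort fields need docValues")
    params = {"q": query, "sort": sort, "fl": ",".join(fl)}
    if include_stored_fields:
        # Only needed when some fl field is stored but has no docValues.
        params["includeStoredFields"] = "true"
    return f"{collection_url.rstrip('/')}/export?{urlencode(params)}"
```

For example, `build_export_url("http://localhost:8983/solr/mycollection", ["id", "title_s"], "id asc", include_stored_fields=True)` yields a URL that opts in to stored-only fields, while leaving the flag off keeps the default docValues-only behavior.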



##########
solr/solr-ref-guide/modules/query-guide/pages/exporting-result-sets.adoc:
##########
@@ -23,6 +23,26 @@ This feature uses a stream sorting technique that begins to 
send records within
 
 The cases where this functionality may be useful include: session analysis, 
distributed merge joins, time series roll-ups, aggregations on high cardinality 
fields, fully distributed field collapsing, and sort-based stats.
 
+== Comparison with Cursors
+
+The `/export` handler offers several advantages over 
xref:pagination-of-results.adoc#fetching-a-large-number-of-sorted-results-cursors[cursor-based
 pagination] for streaming large result sets.
+
+With cursors, the query is re-executed for each page of results.
+In contrast, `/export` runs the filter query once and the resulting 
segment-level bitmasks are applied once per segment, after which the documents 
are simply iterated over.
+Additionally, the segments that existed when the stream was opened are held 
open for the duration of the export, eliminating the disappearing or duplicate 
document issues that can occur with cursors.
+The trade-off is that IndexReaders are kept around for longer periods of time.
+
+Another advantage of `/export` is significantly lower latency until the first 
document is returned, because the internal batch size is decoupled from the 
response message size.
+With cursors, you typically need to set the `rows` parameter to a high value 
(e.g., 100,000) to achieve decent throughput.
+However, this creates a "glugging" effect: when you request a large batch, 
Solr must build the entire payload and send it over the wire while your client 
waits.
+Only after receiving and decoding this large payload can the client request 
the next batch, but in the interim Solr sits idle on this request.
+With the `/export` handler, these steps are decoupled - Solr can continue 
sorting and decoding/encoding documents while waiting for more demand from the 
client.
+
+The advantage of cursors is flexibility.

Review Comment:
   ```suggestion
   The advantage of cursors is _flexibility_.
   Cursors impose no constraints on the sort criteria except that you must 
include a unique key, which isn't a real constraint.
   Cursors work as part of `SearchHandler` and thus can include most/all 
of its capabilities, such as highlighting.
   ```



##########
solr/solr-ref-guide/modules/query-guide/pages/exporting-result-sets.adoc:
##########
@@ -23,6 +23,24 @@ This feature uses a stream sorting technique that begins to 
send records within
 
 The cases where this functionality may be useful include: session analysis, 
distributed merge joins, time series roll-ups, aggregations on high cardinality 
fields, fully distributed field collapsing, and sort-based stats.
 
+== Comparison with Cursors
+
+The `/export` handler offers several advantages over 
xref:pagination-of-results.adoc#fetching-a-large-number-of-sorted-results-cursors[cursor-based
 pagination] for streaming large result sets.
+
+With cursors, the query is re-executed for each page of results.
+In contrast, `/export` runs the filter query once and the resulting 
segment-level bitmasks are applied once per segment, after which the documents 
are simply iterated over.
+Additionally, the segments that existed when the stream was opened are held 
open for the duration of the export, eliminating the disappearing or duplicate 
document issues that can occur with cursors.
+The trade-off is that IndexReaders are kept around for longer periods of time.

Review Comment:
   I feel we're potentially suggesting the contributor here put more work into 
this than he bargained for.  Any documentation he's comfortable writing is 
encouraged... and beyond that, well let's just get this merged and have real 
users kick the tires and we'll see.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

