[jira] [Comment Edited] (NIFI-3248) GetSolr can miss recently updated documents

Johannes Peter (JIRA) Mon, 04 Sep 2017 03:19:22 -0700

    [ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152432#comment-16152432
 ]


Johannes Peter edited comment on NIFI-3248 at 9/4/17 10:17 AM:
---------------------------------------------------------------

(1) Sorting by ID ensures that each document is retrieved only once, even if 
the document is updated. Sorting by \_version\_ asc ensures that each version 
of a document is retrieved once, as updated documents are "appended" at the 
end. I personally expect that someone who uses Solr as a source wants updated 
Solr documents to replace the old ones in the target system. However, we 
could make this configurable. 
(2) The parameter fq provides the same query capabilities as q and can be 
used in the same way. The essential difference is that q is mainly used to 
calculate relevancy, whereas fq is mainly used to filter and to improve 
performance. In this case, we don't need relevancy, as we sort by indexing 
time. Nevertheless, I see the point that users expect a property where they 
can configure the main query. 
(3) \_version\_ behaves like a timestamp, so there should be little chance 
that two documents within a collection have the same value (in a cluster). I 
know that there is a way to convert it into a timestamp, but I first have to 
figure out exactly how to do this. Sorting by "\_version\_ asc" and using 
cursor marks should make the retrieval reliable to a very high degree. 
(4) I want to emphasize again that the logic and purpose of GetSolr do not 
sufficiently cover the capabilities of Solr. There should be an additional 
processor to use Solr not only as a source, but also as a query layer. Features 
like faceting, grouping, pivoting (e.g. for analytical purposes), spellchecking 
(e.g. for OCR or NLP), etc. are not covered by GetSolr (and shouldn't be 
included, as its main focus should be reliable retrieval). 
However, there should be a more flexible option to query Solr within workflows.
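
To make points (1) to (3) concrete, here is a rough sketch of the request parameters such a retrieval could send. It is only an illustration, not the GetSolr implementation: the helper name build_getsolr_params, the filter "type:article", the page size, and "id" as the uniqueKey field are all assumptions. The uniqueKey is added to the sort because Solr cursors require it as a tie-breaker.

```python
from urllib.parse import urlencode


def build_getsolr_params(filter_query, cursor_mark="*", rows=100, unique_key="id"):
    """Build the parameters for one page of a cursor-based retrieval (hypothetical helper)."""
    return {
        "q": "*:*",            # match everything; relevancy is not needed here
        "fq": filter_query,    # user-supplied restriction, applied as a filter
        # Oldest versions first; the uniqueKey (assumed to be "id") breaks ties.
        "sort": f"_version_ asc, {unique_key} asc",
        "rows": rows,
        "cursorMark": cursor_mark,  # "*" starts a new cursor
    }


params = build_getsolr_params("type:article")
query_string = urlencode(params)  # ready to append to a /select URL
```

For the next page, the same call would be repeated with cursor_mark set to the nextCursorMark value returned by Solr.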


> GetSolr can miss recently updated documents
> -------------------------------------------
>
>                 Key: NIFI-3248
>                 URL: https://issues.apache.org/jira/browse/NIFI-3248
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 
> 1.0.1
>            Reporter: Koji Kawamura
>            Assignee: Johannes Peter
>         Attachments: nifi-flow.png, query-result-with-curly-bracket.png, 
> query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents 
> that have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once a 
> document's date field value becomes older than the last query timestamp, the 
> document can no longer be retrieved by GetSolr.
> This JIRA is for tracking the investigation of this behavior and the 
> discussion of it.
> Here are things that can cause this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be helpful|
> |3|Lag comes from Near Real Time nature of Solr|Should be documented at least, add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At first glance, mixing curly and square brackets looked strange 
> ([source 
> code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
>  But this difference is meaningful.
> The square bracket on the range query is inclusive and the curly bracket is 
> exclusive. If we use inclusive on both sides and a document has a timestamp 
> exactly on the boundary, then it could be returned in two consecutive 
> executions, and we only want it in one.
> This is intentional and should stay as it is.
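
The boundary argument above can be checked with a small sketch. It does not call Solr; the in_range helper and the integer timestamps are made up for illustration, and it assumes the exclusive bracket is on the lower bound and the inclusive one on the upper bound.

```python
def in_range(ts, lower, upper, lower_inclusive, upper_inclusive):
    """Mimic a range filter: square bracket = inclusive, curly bracket = exclusive."""
    above = ts >= lower if lower_inclusive else ts > lower
    below = ts <= upper if upper_inclusive else ts < upper
    return above and below


doc_ts = 20                      # document timestamp sits exactly on a boundary
windows = [(10, 20), (20, 30)]   # two consecutive executions

# Mixed brackets, assumed here as {lower TO upper]: matched by exactly one window.
mixed = [in_range(doc_ts, lo, hi, False, True) for lo, hi in windows]

# Square brackets on both sides, [lower TO upper]: matched by both windows.
both_inclusive = [in_range(doc_ts, lo, hi, True, True) for lo, hi in windows]
```

With mixed brackets the document matches only the first window, whereas with two inclusive bounds it matches both and would be fetched twice.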
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC 
> representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates].
>  If the date field string value of an updated document represents a time 
> without a timezone, and NiFi is running in an environment that uses a 
> timezone other than UTC, GetSolr can't perform the date range query as users 
> expect.
> Let's say NiFi is running with JST (UTC+9). A process added a document to 
> Solr at 15:00 JST, but the date field doesn't have a timezone, so Solr 
> indexed it as 15:00 UTC. Then GetSolr performs a range query at 15:10 JST, 
> targeting any documents updated from 15:00 to 15:10 JST. GetSolr formats 
> dates using UTC, i.e. 6:00 to 6:10 UTC. The updated document won't match the 
> date range filter.
> To avoid this, updated documents must have a proper timezone in the date 
> field's string representation.
> If one uses NiFi expression language to set the current timestamp in that 
> date field, the following NiFi expression can be used:
> {code}
> ${now():format("yyyy-MM-dd'T'HH:mm:ss.SSSZ")}
> {code}
> It will produce a result like:
> {code}
> 2016-12-27T15:30:04.895+0900
> {code}
> Then it will be indexed in Solr with UTC and will be queried by GetSolr as 
> expected.
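
The JST example above can be reproduced with plain datetime arithmetic. This sketch involves neither Solr nor NiFi; the variable names and the date are illustrative, and both window boundaries are treated as inclusive for simplicity.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))
UTC = timezone.utc

# The writer's wall clock says 15:00, but the string carries no timezone.
written_local = datetime(2016, 12, 27, 15, 0)

# Without a zone, the value is taken as 15:00 UTC.
indexed_without_zone = written_local.replace(tzinfo=UTC)
# With "+0900" in the string, it is converted to 06:00 UTC.
indexed_with_zone = written_local.replace(tzinfo=JST).astimezone(UTC)

# The query covers 15:00 to 15:10 JST, expressed in UTC.
window_start = datetime(2016, 12, 27, 15, 0, tzinfo=JST).astimezone(UTC)
window_end = datetime(2016, 12, 27, 15, 10, tzinfo=JST).astimezone(UTC)


def matches(ts):
    """True if ts falls inside the query window (inclusive on both ends here)."""
    return window_start <= ts <= window_end


missed = not matches(indexed_without_zone)
found = matches(indexed_with_zone)
```

The value indexed without a zone lands nine hours after the query window, so only the zone-aware value is matched.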
> h2. 3. Lag comes from Near Real Time nature of Solr
> Solr provides Near Real Time search capability, which means that recently 
> updated documents can be queried in near real time, but not in real time. 
> This latency can be controlled either on the client side, which requests the 
> update operation, by specifying the "commitWithin" parameter, or on the Solr 
> server side, via "autoCommit" and "autoSoftCommit" in 
> [solrconfig.xml|https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-Commits].
> Since committing and updating the index can be costly, it's recommended to 
> set this interval as long as the maximum tolerable latency allows.
> However, this can be problematic with GetSolr. For instance, as shown in the 
> simple NiFi flow below, GetSolr can miss updated documents:
> {code}
> t1: GetSolr queried
> t2: GenerateFlowFile set date = t2
> t3: PutSolrContentStream stored new doc
> t4: GetSolr queried again, from t1 to t4, but the new doc hasn't been indexed
> t5: Solr completed index
> t6: GetSolr queried again, from t4 to t6, the doc didn't match query
> {code}
> This behavior should at least be documented.
> In addition, it would be helpful to add a new configuration property to 
> GetSolr that specifies a commit lag-time, so that GetSolr queries an older 
> timestamp range.
> {code}
> // with commit lag-time
> t1: GetSolr queried
> t2: GenerateFlowFile set date = t2
> t3: PutSolrContentStream stored new doc
> t4: GetSolr queried again, from (t1 - lag) to (t4 - lag), but the new doc hasn't been indexed
> t5: Solr completed index
> t6: GetSolr queried again, from (t4 - lag) to (t6 - lag), the doc can match query
> {code}
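
Both timelines above can be replayed with a small simulation. It is only a sketch under simplifying assumptions: t1 to t6 are mapped to the integers 1 to 6, the window is exclusive at the lower bound and inclusive at the upper bound, and the function name run_queries is made up for illustration.

```python
def run_queries(query_times, doc_date, visible_at, lag=0):
    """Return the query times at which the document is fetched.

    query_times: times at which GetSolr runs; the first one only sets the baseline.
    doc_date:    value of the document's date field (t2 in the timeline).
    visible_at:  time at which Solr has finished indexing the document (t5).
    lag:         hypothetical 'commit lag-time' subtracted from both window bounds.
    """
    fetched_at = []
    last = query_times[0]
    for now in query_times[1:]:
        lower, upper = last - lag, now - lag
        in_window = lower < doc_date <= upper
        searchable = visible_at <= now
        if in_window and searchable:
            fetched_at.append(now)
        last = now
    return fetched_at


without_lag = run_queries([1, 4, 6], doc_date=2, visible_at=5)
with_lag = run_queries([1, 4, 6], doc_date=2, visible_at=5, lag=3)
```

Without a lag the document is never fetched; with a lag of 3 it is fetched by the query at t6. In this toy setup the lag has to be at least as large as the delay between the date value and the commit (t5 - t2).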



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
