[ https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211099#comment-16211099 ]
ASF GitHub Bot commented on NIFI-3248:
--------------------------------------

Github user JohannesDaniel commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2199#discussion_r145712892

    --- Diff: nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java ---
    @@ -66,42 +79,72 @@ import org.apache.solr.common.SolrDocument;
     import org.apache.solr.common.SolrDocumentList;
     import org.apache.solr.common.SolrInputDocument;
    +import org.apache.solr.common.params.CursorMarkParams;

    -@Tags({"Apache", "Solr", "Get", "Pull"})
    +@Tags({"Apache", "Solr", "Get", "Pull", "Records"})
     @InputRequirement(Requirement.INPUT_FORBIDDEN)
    -@CapabilityDescription("Queries Solr and outputs the results as a FlowFile")
    +@CapabilityDescription("Queries Solr and outputs the results as a FlowFile in the format of XML or using a Record Writer")
    +@Stateful(scopes = {Scope.CLUSTER}, description = "Stores latest date of Date Field so that the same data will not be fetched multiple times.")
     public class GetSolr extends SolrProcessor {

    -    public static final PropertyDescriptor SOLR_QUERY = new PropertyDescriptor
    -            .Builder().name("Solr Query")
    -            .description("A query to execute against Solr")
    +    public static final String STATE_MANAGER_FILTER = "stateManager_filter";
    +    public static final String STATE_MANAGER_CURSOR_MARK = "stateManager_cursorMark";
    +    public static final AllowableValue MODE_XML = new AllowableValue("XML");
    +    public static final AllowableValue MODE_REC = new AllowableValue("Records");
    --- End diff --

    In principle yes, by using the Schema API. But I don't expect this to be easy. I suggest that we create a separate ticket for this, as it requires some deeper consideration.

> GetSolr can miss recently updated documents
> -------------------------------------------
>
>                 Key: NIFI-3248
>                 URL: https://issues.apache.org/jira/browse/NIFI-3248
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 1.0.1
>            Reporter: Koji Kawamura
>            Assignee: Johannes Peter
>         Attachments: nifi-flow.png, query-result-with-curly-bracket.png, query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents
> that have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once a
> document's date field value becomes older than the last query timestamp, the
> document can no longer be queried by GetSolr.
> This JIRA is for tracking the investigation of this behavior and the
> discussion around it.
> Here are things that can cause this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be helpful|
> |3|Lag comes from NearRealTime nature of Solr|Should be documented at least, add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At first glance, mixing curly and square brackets looked strange
> ([source code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
> But this difference has a meaning.
> The square bracket on the range query is inclusive and the curly bracket is
> exclusive. If we use inclusive on both sides and a document has a timestamp
> exactly on the boundary, then it could be returned in two consecutive
> executions, and we only want it in one.
> This is intentional, and it should stay as it is.
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates].
> If the date field String value of an updated document represents a time
> without a timezone, and NiFi is running in an environment that uses a timezone
> other than UTC, GetSolr can't perform the date range query as users expect.
> Let's say NiFi is running with JST (UTC+9). A process added a document to Solr
> at 15:00 JST, but the date field doesn't have a timezone, so Solr indexed it
> as 15:00 UTC. Then GetSolr performs a range query at 15:10 JST, targeting any
> documents updated from 15:00 to 15:10 JST. GetSolr formats dates using UTC,
> i.e. 6:00 to 6:10 UTC. The updated document won't match the date range filter.
> To avoid this, updated documents must have a proper timezone in the date field
> string representation.
> If one uses NiFi expression language to set the current timestamp on that date
> field, the following NiFi expression can be used:
> {code}
> ${now():format("yyyy-MM-dd'T'HH:mm:ss.SSSZ")}
> {code}
> It will produce a result like:
> {code}
> 2016-12-27T15:30:04.895+0900
> {code}
> Then it will be indexed in Solr as UTC and will be queried by GetSolr as
> expected.
> h2. 3. Lag comes from NearRealTime nature of Solr
> Solr provides Near Real Time search capability, which means that recently
> updated documents can be queried in near real time, but not in real time.
> This latency can be controlled either on the client side, which requests the
> update operation, by specifying the "commitWithin" parameter, or on the Solr
> server side, with "autoCommit" and "autoSoftCommit" in
> [solrconfig.xml|https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-Commits].
> Since committing and updating the index can be costly, it's recommended to set
> this interval as long as the maximum tolerable latency allows.
> However, this can be problematic with GetSolr.
> For instance, as shown in the simple NiFi flow below, GetSolr can miss
> updated documents:
> {code}
> t1: GetSolr queried
> t2: GenerateFlowFile set date = t2
> t3: PutSolrContentStream stored new doc
> t4: GetSolr queried again, from t1 to t4, but the new doc hasn't been indexed
> t5: Solr completed index
> t6: GetSolr queried again, from t4 to t6, the doc didn't match query
> {code}
> This behavior should at least be documented.
> In addition, it would be helpful to add a new configuration property to
> GetSolr that specifies a commit lag-time, so that GetSolr queries an older
> timestamp range.
> {code}
> // with commit lag-time
> t1: GetSolr queried
> t2: GenerateFlowFile set date = t2
> t3: PutSolrContentStream stored new doc
> t4: GetSolr queried again, from (t1 - lag) to (t4 - lag), but the new doc hasn't been indexed
> t5: Solr completed index
> t6: GetSolr queried again, from (t4 - lag) to (t6 - lag), the doc can match query
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
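
The commit lag-time timeline in the issue description above can be sketched in Java as follows. This is only an illustration under stated assumptions, not the GetSolr implementation: the class LagWindowSketch, the method rangeFilter, and the field name last_modified are hypothetical, and the choice of an exclusive lower bound with an inclusive upper bound is assumed from the mixed-bracket discussion in section 1.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of NiFi: builds a Solr date-range filter
// whose window is shifted back by a configurable commit lag-time.
final class LagWindowSketch {

    // '{' excludes the lower bound and ']' includes the upper bound (assumed
    // orientation), so consecutive windows (a, b] and (b, c] cannot return a
    // document that sits exactly on the boundary b twice.
    static String rangeFilter(String dateField, Instant lastQuery, Instant now, Duration lag) {
        Instant from = lastQuery.minus(lag); // (t4 - lag) in the timeline
        Instant to = now.minus(lag);         // (t6 - lag) in the timeline
        // Instant.toString() emits ISO-8601 in UTC, the representation Solr uses for dates.
        return dateField + ":{" + from + " TO " + to + "]";
    }

    public static void main(String[] args) {
        Instant t4 = Instant.parse("2016-12-27T06:00:00Z");
        Instant t6 = Instant.parse("2016-12-27T06:10:00Z");
        // prints last_modified:{2016-12-27T05:55:00Z TO 2016-12-27T06:05:00Z]
        System.out.println(rangeFilter("last_modified", t4, t6, Duration.ofMinutes(5)));
    }
}
```

With a lag of five minutes, a document dated t2 that only becomes searchable at t5 still falls inside a later window, provided the lag is at least as long as the commit delay on the Solr side.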