[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214900#comment-16214900
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user asfgit closed the pull request at:

https://github.com/apache/nifi/pull/2199


> GetSolr can miss recently updated documents
> ---
>
> Key: NIFI-3248
> URL: https://issues.apache.org/jira/browse/NIFI-3248
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Extensions
>Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 
> 1.0.1
>Reporter: Koji Kawamura
>Assignee: Johannes Peter
> Fix For: 1.5.0
>
> Attachments: nifi-flow.png, query-result-with-curly-bracket.png, 
> query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents 
> that have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once a 
> document's date field value becomes older than the last query timestamp, 
> GetSolr can no longer fetch that document.
> This JIRA tracks the investigation of this behavior and the related 
> discussion.
> Here are the possible causes of this behavior:
> ||#||Short description||Should we address it?||
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be helpful|
> |3|Lag comes from the Near Real Time nature of Solr|Should be documented at least, add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At first glance, mixing curly and square brackets looked strange 
> ([source 
> code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
> But the difference is meaningful.
> The square bracket in a range query is inclusive and the curly bracket is 
> exclusive. If both bounds were inclusive and a document had a timestamp 
> exactly on the boundary, it could be returned in two consecutive 
> executions, and we only want it in one.
> This is intentional and should stay as it is.
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC 
> representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates].
> If the date field string of an updated document represents a time without a 
> timezone, and NiFi runs in an environment whose timezone is not UTC, 
> GetSolr can't perform the date range query as users expect.
> Let's say NiFi is running in JST (UTC+9). A process adds a document to Solr 
> at 15:00 JST, but the date field carries no timezone, so Solr indexes it 
> as 15:00 UTC. GetSolr then performs a range query at 15:10 JST, targeting any 
> documents updated from 15:00 to 15:10 JST. GetSolr formats dates in UTC, 
> i.e. 6:00 to 6:10 UTC, so the updated document doesn't match the date 
> range filter.
> To avoid this, updated documents must carry a proper timezone in the date 
> field string representation.
> If one uses NiFi Expression Language to set the current timestamp in that 
> date field, the following expression can be used:
> {code}
> ${now():format("yyyy-MM-dd'T'HH:mm:ss.SSSZ")}
> {code}
> It will produce a result like:
> {code}
> 2016-12-27T15:30:04.895+0900
> {code}
> Solr then indexes the value in UTC, and GetSolr can query it as 
> expected.
> h2. 3. Lag comes from the Near Real Time nature of Solr
> Solr provides Near Real Time search: recently updated documents become 
> searchable shortly after an update, but not immediately. 
> This latency can be controlled either on the client side, by specifying the 
> "commitWithin" parameter on the update request, or on the Solr 
> server side, with "autoCommit" and "autoSoftCommit" in 
> [solrconfig.xml|https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-Commits].
> Since committing and updating the index can be costly, it's recommended to set 
> this interval as long as the maximum tolerable latency allows.
> However, this can be problematic with GetSolr. For instance, in the 
> simple NiFi flow below, GetSolr can miss updated documents:
> {code}
> t1: GetSolr queried
> t2: GenerateFlowFile set date = t2
> t3: PutSolrContentStream stored new doc
> t4: GetSolr queried again, from t1 to t4, but the new doc hasn't been indexed
> t5: Solr completed index
> t6: GetSolr queried again, from t4 to t6, the doc didn't match query
> {code}
> This behavior should at least be documented.
> Plus, it would be helpful to add a new configuration property to GetSolr, to 
> specify commit lag-time so that GetSolr aims 
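The timezone scenario in section 2 of the quoted description can be replayed with a minimal Java sketch. It is only an illustration under stated assumptions: the class and method names are hypothetical, `readWithoutZone` is a stand-in for "a zone-less value is taken as UTC" rather than Solr's actual date parser, and the document time is set to 15:05 JST so that it sits inside the window instead of on the exclusive lower bound.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper class, not part of NiFi or Solr.
final class TimezoneRangeSketch {

    // Same pattern as the NiFi expression in the description.
    static final DateTimeFormatter WITH_OFFSET =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ", Locale.ROOT);

    // Formats a local wall-clock time together with its zone offset.
    static String formatWithOffset(LocalDateTime wallClock, ZoneId zone) {
        return wallClock.atZone(zone).format(WITH_OFFSET);
    }

    // Assumption: a value without a zone is interpreted as UTC.
    static Instant readWithoutZone(LocalDateTime wallClock) {
        return wallClock.toInstant(ZoneOffset.UTC);
    }

    // A value that carries an offset maps to one unambiguous instant.
    static Instant readWithOffset(String text) {
        return OffsetDateTime.parse(text, WITH_OFFSET).toInstant();
    }

    // True when lowerExclusive < t <= upperInclusive, i.e. a {lower TO upper] range.
    static boolean inWindow(Instant t, Instant lowerExclusive, Instant upperInclusive) {
        return t.isAfter(lowerExclusive) && !t.isAfter(upperInclusive);
    }

    public static void main(String[] args) {
        ZoneId jst = ZoneId.of("Asia/Tokyo");
        LocalDateTime wallClock = LocalDateTime.of(2016, 12, 27, 15, 5, 0);

        // The window GetSolr queries: 15:00 to 15:10 JST, i.e. 06:00 to 06:10 UTC.
        Instant lower = ZonedDateTime.of(2016, 12, 27, 15, 0, 0, 0, jst).toInstant();
        Instant upper = ZonedDateTime.of(2016, 12, 27, 15, 10, 0, 0, jst).toInstant();

        String text = formatWithOffset(wallClock, jst);
        System.out.println(text);                                              // 2016-12-27T15:05:00.000+0900
        System.out.println(inWindow(readWithoutZone(wallClock), lower, upper)); // false
        System.out.println(inWindow(readWithOffset(text), lower, upper));       // true
    }
}
```

The zone-less value lands at 15:05 UTC, nine hours outside the queried window, while the value written with `+0900` resolves to 06:05 UTC and matches.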

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-23 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214899#comment-16214899
 ] 

ASF subversion and git services commented on NIFI-3248:
---

Commit c06dee2321360b1eebce420d4edb36b629ee8b64 in nifi's branch 
refs/heads/master from [~jope]
[ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=c06dee2 ]

NIFI-3248: Improvement of GetSolr Processor

This closes #2199.

Signed-off-by: Koji Kawamura 


[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214896#comment-16214896
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r146215638
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -170,159 +210,213 @@ protected void init(final ProcessorInitializationContext context) {
 return this.descriptors;
 }
 
+final static Set<String> propertyNamesForActivatingClearState = new HashSet<String>();
+static {
+propertyNamesForActivatingClearState.add(SOLR_TYPE.getName());
+propertyNamesForActivatingClearState.add(SOLR_LOCATION.getName());
+propertyNamesForActivatingClearState.add(COLLECTION.getName());
+propertyNamesForActivatingClearState.add(SOLR_QUERY.getName());
+propertyNamesForActivatingClearState.add(DATE_FIELD.getName());
+propertyNamesForActivatingClearState.add(RETURN_FIELDS.getName());
+}
+
 @Override
 public void onPropertyModified(PropertyDescriptor descriptor, String oldValue, String newValue) {
-lastEndDatedRef.set(UNINITIALIZED_LAST_END_DATE_VALUE);
+if (propertyNamesForActivatingClearState.contains(descriptor.getName()))
+clearState.set(true);
 }
 
-@OnStopped
-public void onStopped() {
-writeLastEndDate();
-}
+@OnScheduled
+public void clearState(final ProcessContext context) throws IOException {
+if (clearState.getAndSet(false)) {
+context.getStateManager().clear(Scope.CLUSTER);
+final Map<String,String> newStateMap = new HashMap<String,String>();
 
-@OnRemoved
-public void onRemoved() {
-final File lastEndDateCache = new File(FILE_PREFIX + getIdentifier());
-if (lastEndDateCache.exists()) {
-lastEndDateCache.delete();
-}
-}
+newStateMap.put(STATE_MANAGER_CURSOR_MARK, "*");
 
-@Override
-public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
-final ComponentLog logger = getLogger();
-readLastEndDate();
-
-final SimpleDateFormat sdf = new SimpleDateFormat(LAST_END_DATE_PATTERN, Locale.US);
-sdf.setTimeZone(TimeZone.getTimeZone("GMT"));
-final String currDate = sdf.format(new Date());
-
-final boolean initialized = !UNINITIALIZED_LAST_END_DATE_VALUE.equals(lastEndDatedRef.get());
-
-final String query = context.getProperty(SOLR_QUERY).getValue();
-final SolrQuery solrQuery = new SolrQuery(query);
-solrQuery.setRows(context.getProperty(BATCH_SIZE).asInteger());
-
-// if initialized then apply a filter to restrict results from the last end time til now
-if (initialized) {
-StringBuilder filterQuery = new StringBuilder();
-filterQuery.append(context.getProperty(DATE_FIELD).getValue())
-.append(":{").append(lastEndDatedRef.get()).append(" TO ")
-.append(currDate).append("]");
-solrQuery.addFilterQuery(filterQuery.toString());
-logger.info("Applying filter query {}", new Object[]{filterQuery.toString()});
-}
+final String initialDate = context.getProperty(DATE_FILTER).getValue();
+if (StringUtils.isBlank(initialDate))
+newStateMap.put(STATE_MANAGER_FILTER, "*");
+else
+newStateMap.put(STATE_MANAGER_FILTER, initialDate);
 
-final String returnFields = context.getProperty(RETURN_FIELDS).getValue();
-if (returnFields != null && !returnFields.trim().isEmpty()) {
-for (String returnField : returnFields.trim().split("[,]")) {
-solrQuery.addField(returnField.trim());
-}
+context.getStateManager().setState(newStateMap, Scope.CLUSTER);
+
+id_field = null;
 }
+}
 
-final String fullSortClause = context.getProperty(SORT_CLAUSE).getValue();
-if (fullSortClause != null && !fullSortClause.trim().isEmpty()) {
-for (String sortClause : fullSortClause.split("[,]")) {
-String[] sortParts = sortClause.trim().split("[ ]");
-solrQuery.addSort(sortParts[0], SolrQuery.ORDER.valueOf(sortParts[1]));
-}
+@Override
+protected final Collection 

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214894#comment-16214894
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on the issue:

https://github.com/apache/nifi/pull/2199
  
@JohannesDaniel Thanks for the updates and additional documentation. 
Confirmed that commit time lag is no longer an issue. All LGTM, +1. I'm going 
to squash commits and merge to master. Thanks again for your contribution!


[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214885#comment-16214885
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r146214902
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -66,42 +79,72 @@
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrDocumentList;
 import org.apache.solr.common.SolrInputDocument;
+import org.apache.solr.common.params.CursorMarkParams;
 
-@Tags({"Apache", "Solr", "Get", "Pull"})
+@Tags({"Apache", "Solr", "Get", "Pull", "Records"})
 @InputRequirement(Requirement.INPUT_FORBIDDEN)
-@CapabilityDescription("Queries Solr and outputs the results as a FlowFile")
+@CapabilityDescription("Queries Solr and outputs the results as a FlowFile in the format of XML or using a Record Writer")
+@Stateful(scopes = {Scope.CLUSTER}, description = "Stores latest date of Date Field so that the same data will not be fetched multiple times.")
 public class GetSolr extends SolrProcessor {
 
-public static final PropertyDescriptor SOLR_QUERY = new PropertyDescriptor
-.Builder().name("Solr Query")
-.description("A query to execute against Solr")
+public static final String STATE_MANAGER_FILTER = "stateManager_filter";
+public static final String STATE_MANAGER_CURSOR_MARK = "stateManager_cursorMark";
+public static final AllowableValue MODE_XML = new AllowableValue("XML");
+public static final AllowableValue MODE_REC = new AllowableValue("Records");
--- End diff --

Thanks for your detailed considerations. I agree it's not an easy task. 
To preserve your informative comments for future work, I've created another 
JIRA: https://issues.apache.org/jira/browse/NIFI-4514


[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212533#comment-16212533
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user JohannesDaniel commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145946147
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -66,42 +79,72 @@
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrDocumentList;
 import org.apache.solr.common.SolrInputDocument;
+import org.apache.solr.common.params.CursorMarkParams;
 
-@Tags({"Apache", "Solr", "Get", "Pull"})
+@Tags({"Apache", "Solr", "Get", "Pull", "Records"})
 @InputRequirement(Requirement.INPUT_FORBIDDEN)
-@CapabilityDescription("Queries Solr and outputs the results as a FlowFile")
+@CapabilityDescription("Queries Solr and outputs the results as a FlowFile in the format of XML or using a Record Writer")
+@Stateful(scopes = {Scope.CLUSTER}, description = "Stores latest date of Date Field so that the same data will not be fetched multiple times.")
 public class GetSolr extends SolrProcessor {
 
-public static final PropertyDescriptor SOLR_QUERY = new PropertyDescriptor
-.Builder().name("Solr Query")
-.description("A query to execute against Solr")
+public static final String STATE_MANAGER_FILTER = "stateManager_filter";
+public static final String STATE_MANAGER_CURSOR_MARK = "stateManager_cursorMark";
+public static final AllowableValue MODE_XML = new AllowableValue("XML");
+public static final AllowableValue MODE_REC = new AllowableValue("Records");
--- End diff --

Hmm, and dynamic fields could become a problem... I think this is not 
possible.


[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211601#comment-16211601
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user JohannesDaniel commented on the issue:

https://github.com/apache/nifi/pull/2199
  
I don't think that the timezone and the commit issues are still important. 
GetSolr now takes the timestamp directly from the results. Commit delays won't 
be a problem because the state is only updated when new documents are 
retrieved, no matter when the query was executed. The same applies to the 
timezone issue.

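The effect described in this comment can be illustrated with a small simulation of the t1..t6 timeline from the issue description. This is a simplified model, not the code from the pull request: `Doc`, `query`, and both `fetchWith...` methods are hypothetical names, times are plain integers, and the open upper bound in the result-based variant is an assumption. It also ignores documents that become visible out of date order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the commit-lag scenario; it does not talk to Solr.
final class CommitLagSketch {

    // A document with its date field value and the time it becomes searchable.
    static final class Doc {
        final long date;
        final long visibleAt;

        Doc(long date, long visibleAt) {
            this.date = date;
            this.visibleAt = visibleAt;
        }
    }

    // Documents searchable at 'now' whose date lies in (lowerExclusive, upperInclusive].
    static List<Doc> query(List<Doc> index, long now, long lowerExclusive, long upperInclusive) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : index) {
            if (d.visibleAt <= now && d.date > lowerExclusive && d.date <= upperInclusive) {
                hits.add(d);
            }
        }
        return hits;
    }

    // Clock-based state: the lower bound always advances to the time of the query.
    static int fetchWithClockState(List<Doc> index, long[] queryTimes) {
        long last = Long.MIN_VALUE;
        int fetched = 0;
        for (long now : queryTimes) {
            fetched += query(index, now, last, now).size();
            last = now;
        }
        return fetched;
    }

    // Result-based state: the lower bound advances only to the latest date seen
    // in the results, so an empty result leaves it unchanged.
    static int fetchWithResultState(List<Doc> index, long[] queryTimes) {
        long last = Long.MIN_VALUE;
        int fetched = 0;
        for (long now : queryTimes) {
            List<Doc> hits = query(index, now, last, Long.MAX_VALUE);
            for (Doc d : hits) {
                last = Math.max(last, d.date);
            }
            fetched += hits.size();
        }
        return fetched;
    }

    public static void main(String[] args) {
        List<Doc> index = new ArrayList<>();
        index.add(new Doc(2, 5));       // date = t2, searchable once Solr finishes indexing at t5
        long[] queryTimes = {1, 4, 6};  // GetSolr runs at t1, t4 and t6

        System.out.println(fetchWithClockState(index, queryTimes));  // 0: the document is missed
        System.out.println(fetchWithResultState(index, queryTimes)); // 1: the document is fetched at t6
    }
}
```

With clock-based state the window has already moved past t2 by the time the document becomes searchable. With result-based state the lower bound has not moved, so the query at t6 still covers the document.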

> GetSolr can miss recently updated documents
> ---
>
> Key: NIFI-3248
> URL: https://issues.apache.org/jira/browse/NIFI-3248
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Extensions
>Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 
> 1.0.1
>Reporter: Koji Kawamura
>Assignee: Johannes Peter
> Attachments: nifi-flow.png, query-result-with-curly-bracket.png, 
> query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents 
> those have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once the 
> documents date field value becomes older than last query timestamp, the 
> document won't be able to be queried by GetSolr any more.
> This JIRA is for tracking the process of investigating this behavior, and 
> discussion on them.
> Here are things that can be a cause of this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be 
> helpful|
> |3|Lag comes from NearRealTIme nature of Solr|Should be documented at least, 
> add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At the first glance, using curly and square bracket in mix looked strange 
> ([source 
> code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
>  But these difference has a meaning.
> The square bracket on the range query is inclusive and the curly bracket is 
> exclusive. If we use inclusive on both sides and a document has a time stamp 
> exactly on the boundary then it could be returned in two consecutive 
> executions, and we only want it in one.
> This is intentional, and it should be as it is.
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC 
> representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates|].
>  If date field String value of an updated document represents time without 
> timezone, and NiFi is running on an environment using timezone other than 
> UTC, GetSolr can't perform date range query as users expect.
> Let's say NiFi is running with JST(UTC+9). A process added a document to Solr 
> at 15:00 JST. But the date field doesn't have timezone. So, Solr indexed it 
> as 15:00 UTC. Then GetSolr performs range query at 15:10 JST, targeting any 
> documents updated from 15:00 to 15:10 JST. GetSolr formatted dates using UTC, 
> i.e. 6:00 to 6:10 UTC. The updated document won't be matched with the date 
> range filter.
> To avoid this, updated documents must have proper timezone in date field 
> string representation.
> If one uses the NiFi Expression Language to set the current timestamp in 
> that date field, the following expression can be used:
> {code}
> ${now():format("yyyy-MM-dd'T'HH:mm:ss.SSSZ")}
> {code}
> It will produce a result like:
> {code}
> 2016-12-27T15:30:04.895+0900
> {code}
> It will then be indexed in Solr as UTC and can be queried by GetSolr as 
> expected.
> h2. 3. Lag comes from NearRealTime nature of Solr
> Solr provides Near Real Time search capability, which means that recently 
> updated documents can be queried in near real time, but not in real time. 
> This latency can be controlled either on the client side that requests the 
> update operation, by specifying the "commitWithin" parameter, or on the Solr 
> server side, via "autoCommit" and "autoSoftCommit" in 
> [solrconfig.xml|https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-Commits].
> Since committing and updating the index can be costly, it's recommended to 
> set this interval as long as the maximum tolerable latency allows.
> However, this can be problematic with GetSolr. For instance, as shown in the 
> simple NiFi flow below, GetSolr can miss updated documents:
> {code}
> t1: GetSolr queried
> t2: GenerateFlowFile set date = t2
> t3: PutSolrContentStream stored new doc
> t4: GetSolr queried again, from t1 to t4, but the new doc 
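
The bracket semantics and the timezone mismatch described in sections 1 and 2 of the quoted issue can be sketched in plain Java. This is only an illustration under stated assumptions, not GetSolr's actual code: the class name, the helper methods and the field name "last_modified" are hypothetical, and the filter is evaluated locally instead of being sent to Solr.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical sketch; none of these names come from the NiFi code base.
final class SolrDateRangeSketch {

    // Builds a filter in the mixed-bracket form: '{' excludes the lower
    // bound, ']' includes the upper bound.
    static String buildFilter(String dateField, Instant lastEnd, Instant now) {
        return dateField + ":{" + lastEnd + " TO " + now + "]";
    }

    // Local evaluation of the same range: lastEnd < docDate <= now.
    static boolean matches(Instant docDate, Instant lastEnd, Instant now) {
        return docDate.isAfter(lastEnd) && !docDate.isAfter(now);
    }

    // A date string without a timezone is interpreted as UTC.
    static Instant indexedAsUtc(LocalDateTime wallClock) {
        return wallClock.toInstant(ZoneOffset.UTC);
    }

    // A date string with an explicit zone is converted to the right instant.
    static Instant indexedWithZone(LocalDateTime wallClock, ZoneId zone) {
        return wallClock.atZone(zone).toInstant();
    }

    public static void main(String[] args) {
        ZoneId jst = ZoneId.of("Asia/Tokyo");
        // Query window 15:00 to 15:10 JST, i.e. 06:00 to 06:10 UTC.
        Instant lastEnd = indexedWithZone(LocalDateTime.of(2016, 12, 27, 15, 0), jst);
        Instant now = indexedWithZone(LocalDateTime.of(2016, 12, 27, 15, 10), jst);
        // Document updated at 15:05 JST wall-clock time.
        LocalDateTime updatedAt = LocalDateTime.of(2016, 12, 27, 15, 5);

        System.out.println(buildFilter("last_modified", lastEnd, now));
        // Indexed without a zone (15:05 UTC): outside the window.
        System.out.println(matches(indexedAsUtc(updatedAt), lastEnd, now));
        // Indexed with +0900 (06:05 UTC): inside the window.
        System.out.println(matches(indexedWithZone(updatedAt, jst), lastEnd, now));
        // Boundary behaviour: lower bound excluded, upper bound included.
        System.out.println(matches(lastEnd, lastEnd, now));
        System.out.println(matches(now, lastEnd, now));
    }
}
```

Because the upper bound of one run becomes the excluded lower bound of the next, a document whose timestamp sits exactly on the boundary is matched by one run only.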

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211171#comment-16211171
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user JohannesDaniel commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145727845
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -170,159 +210,213 @@ protected void init(final 
ProcessorInitializationContext context) {
 return this.descriptors;
 }
 
+final static Set propertyNamesForActivatingClearState = new 
HashSet();
+static {
+propertyNamesForActivatingClearState.add(SOLR_TYPE.getName());
+propertyNamesForActivatingClearState.add(SOLR_LOCATION.getName());
+propertyNamesForActivatingClearState.add(COLLECTION.getName());
+propertyNamesForActivatingClearState.add(SOLR_QUERY.getName());
+propertyNamesForActivatingClearState.add(DATE_FIELD.getName());
+propertyNamesForActivatingClearState.add(RETURN_FIELDS.getName());
+}
+
 @Override
 public void onPropertyModified(PropertyDescriptor descriptor, String 
oldValue, String newValue) {
-lastEndDatedRef.set(UNINITIALIZED_LAST_END_DATE_VALUE);
+if 
(propertyNamesForActivatingClearState.contains(descriptor.getName()))
+clearState.set(true);
 }
 
-@OnStopped
-public void onStopped() {
-writeLastEndDate();
-}
+@OnScheduled
+public void clearState(final ProcessContext context) throws 
IOException {
+if (clearState.getAndSet(false)) {
+context.getStateManager().clear(Scope.CLUSTER);
+final Map newStateMap = new 
HashMap();
 
-@OnRemoved
-public void onRemoved() {
-final File lastEndDateCache = new File(FILE_PREFIX + 
getIdentifier());
-if (lastEndDateCache.exists()) {
-lastEndDateCache.delete();
-}
-}
+newStateMap.put(STATE_MANAGER_CURSOR_MARK, "*");
 
-@Override
-public void onTrigger(ProcessContext context, ProcessSession session) 
throws ProcessException {
-final ComponentLog logger = getLogger();
-readLastEndDate();
-
-final SimpleDateFormat sdf = new 
SimpleDateFormat(LAST_END_DATE_PATTERN, Locale.US);
-sdf.setTimeZone(TimeZone.getTimeZone("GMT"));
-final String currDate = sdf.format(new Date());
-
-final boolean initialized = 
!UNINITIALIZED_LAST_END_DATE_VALUE.equals(lastEndDatedRef.get());
-
-final String query = context.getProperty(SOLR_QUERY).getValue();
-final SolrQuery solrQuery = new SolrQuery(query);
-solrQuery.setRows(context.getProperty(BATCH_SIZE).asInteger());
-
-// if initialized then apply a filter to restrict results from the 
last end time til now
-if (initialized) {
-StringBuilder filterQuery = new StringBuilder();
-filterQuery.append(context.getProperty(DATE_FIELD).getValue())
-.append(":{").append(lastEndDatedRef.get()).append(" 
TO ")
-.append(currDate).append("]");
-solrQuery.addFilterQuery(filterQuery.toString());
-logger.info("Applying filter query {}", new 
Object[]{filterQuery.toString()});
-}
+final String initialDate = 
context.getProperty(DATE_FILTER).getValue();
+if (StringUtils.isBlank(initialDate))
+newStateMap.put(STATE_MANAGER_FILTER, "*");
+else
+newStateMap.put(STATE_MANAGER_FILTER, initialDate);
 
-final String returnFields = 
context.getProperty(RETURN_FIELDS).getValue();
-if (returnFields != null && !returnFields.trim().isEmpty()) {
-for (String returnField : returnFields.trim().split("[,]")) {
-solrQuery.addField(returnField.trim());
-}
+context.getStateManager().setState(newStateMap, Scope.CLUSTER);
+
+id_field = null;
 }
+}
 
-final String fullSortClause = 
context.getProperty(SORT_CLAUSE).getValue();
-if (fullSortClause != null && !fullSortClause.trim().isEmpty()) {
-for (String sortClause : fullSortClause.split("[,]")) {
-String[] sortParts = sortClause.trim().split("[ ]");
-solrQuery.addSort(sortParts[0], 
SolrQuery.ORDER.valueOf(sortParts[1]));
-}
+@Override
+protected final Collection 
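
The state-reset logic added in the diff above (a flag set in onPropertyModified, consumed once when the processor is scheduled) can be sketched in isolation. This is a simplified illustration, not the code of the pull request: the in-memory map stands in for NiFi's cluster-scoped state, and the class, method, key and property names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the "flag now, clear on next schedule" pattern.
final class ClearStateSketch {

    // Properties whose modification invalidates the stored progress (assumed names).
    private static final Set<String> RESET_PROPERTIES =
            new HashSet<>(Arrays.asList("Solr Query", "Date Field", "Collection"));

    private final AtomicBoolean clearState = new AtomicBoolean(false);

    // Stand-in for the processor's cluster-scoped state.
    private final Map<String, String> state = new HashMap<>();

    // Called whenever a property changes; only relevant properties set the flag.
    void onPropertyModified(String propertyName) {
        if (RESET_PROPERTIES.contains(propertyName)) {
            clearState.set(true);
        }
    }

    // Called when the processor is scheduled; resets the state at most once per flag.
    void onScheduled(String initialDateFilter) {
        if (clearState.getAndSet(false)) {
            state.clear();
            state.put("cursorMark", "*");
            boolean blank = initialDateFilter == null || initialDateFilter.trim().isEmpty();
            state.put("filter", blank ? "*" : initialDateFilter);
        }
    }

    // Simulates a run storing its progress.
    void recordProgress(String cursorMark, String filter) {
        state.put("cursorMark", cursorMark);
        state.put("filter", filter);
    }

    Map<String, String> snapshot() {
        return new HashMap<>(state);
    }

    public static void main(String[] args) {
        ClearStateSketch processor = new ClearStateSketch();
        processor.recordProgress("someCursorMark", "2017-10-19T00:00:00Z");

        processor.onPropertyModified("Batch Size"); // not in the reset set
        processor.onScheduled(null);
        System.out.println(processor.snapshot());   // progress is kept

        processor.onPropertyModified("Solr Query"); // in the reset set
        processor.onScheduled(null);
        System.out.println(processor.snapshot());   // both keys reset to "*"
    }
}
```

Deferring the reset to scheduling time means several property edits in a row trigger a single reset, and edits to unrelated properties leave the stored progress untouched.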

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211151#comment-16211151
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user JohannesDaniel commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145721674
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -66,42 +79,72 @@
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrDocumentList;
 import org.apache.solr.common.SolrInputDocument;
+import org.apache.solr.common.params.CursorMarkParams;
 
-@Tags({"Apache", "Solr", "Get", "Pull"})
+@Tags({"Apache", "Solr", "Get", "Pull", "Records"})
 @InputRequirement(Requirement.INPUT_FORBIDDEN)
-@CapabilityDescription("Queries Solr and outputs the results as a 
FlowFile")
+@CapabilityDescription("Queries Solr and outputs the results as a FlowFile 
in the format of XML or using a Record Writer")
+@Stateful(scopes = {Scope.CLUSTER}, description = "Stores latest date of 
Date Field so that the same data will not be fetched multiple times.")
 public class GetSolr extends SolrProcessor {
 
-public static final PropertyDescriptor SOLR_QUERY = new 
PropertyDescriptor
-.Builder().name("Solr Query")
-.description("A query to execute against Solr")
+public static final String STATE_MANAGER_FILTER = 
"stateManager_filter";
+public static final String STATE_MANAGER_CURSOR_MARK = 
"stateManager_cursorMark";
+public static final AllowableValue MODE_XML = new 
AllowableValue("XML");
+public static final AllowableValue MODE_REC = new 
AllowableValue("Records");
+
+public static final PropertyDescriptor RETURN_TYPE = new 
PropertyDescriptor
+.Builder().name("Return Type")
+.displayName("Return Type")
--- End diff --

Most of the properties were already available in the prior GetSolr processor. 
I expected this to be critical for backwards compatibility. For the new 
properties I chose the same naming pattern.



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211144#comment-16211144
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user JohannesDaniel commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145720938
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -66,42 +79,72 @@
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrDocumentList;
 import org.apache.solr.common.SolrInputDocument;
+import org.apache.solr.common.params.CursorMarkParams;
 
-@Tags({"Apache", "Solr", "Get", "Pull"})
+@Tags({"Apache", "Solr", "Get", "Pull", "Records"})
 @InputRequirement(Requirement.INPUT_FORBIDDEN)
-@CapabilityDescription("Queries Solr and outputs the results as a 
FlowFile")
+@CapabilityDescription("Queries Solr and outputs the results as a FlowFile 
in the format of XML or using a Record Writer")
+@Stateful(scopes = {Scope.CLUSTER}, description = "Stores latest date of 
Date Field so that the same data will not be fetched multiple times.")
 public class GetSolr extends SolrProcessor {
 
-public static final PropertyDescriptor SOLR_QUERY = new 
PropertyDescriptor
-.Builder().name("Solr Query")
-.description("A query to execute against Solr")
+public static final String STATE_MANAGER_FILTER = 
"stateManager_filter";
+public static final String STATE_MANAGER_CURSOR_MARK = 
"stateManager_cursorMark";
+public static final AllowableValue MODE_XML = new 
AllowableValue("XML");
+public static final AllowableValue MODE_REC = new 
AllowableValue("Records");
--- End diff --

Additionally, this requires parsing the response JSON, as response parsing 
for the Schema API is not really implemented in SolrJ.



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211134#comment-16211134
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user JohannesDaniel commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145719121
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -66,42 +79,72 @@
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrDocumentList;
 import org.apache.solr.common.SolrInputDocument;
+import org.apache.solr.common.params.CursorMarkParams;
 
-@Tags({"Apache", "Solr", "Get", "Pull"})
+@Tags({"Apache", "Solr", "Get", "Pull", "Records"})
 @InputRequirement(Requirement.INPUT_FORBIDDEN)
-@CapabilityDescription("Queries Solr and outputs the results as a 
FlowFile")
+@CapabilityDescription("Queries Solr and outputs the results as a FlowFile 
in the format of XML or using a Record Writer")
+@Stateful(scopes = {Scope.CLUSTER}, description = "Stores latest date of 
Date Field so that the same data will not be fetched multiple times.")
 public class GetSolr extends SolrProcessor {
 
-public static final PropertyDescriptor SOLR_QUERY = new 
PropertyDescriptor
-.Builder().name("Solr Query")
-.description("A query to execute against Solr")
+public static final String STATE_MANAGER_FILTER = 
"stateManager_filter";
+public static final String STATE_MANAGER_CURSOR_MARK = 
"stateManager_cursorMark";
+public static final AllowableValue MODE_XML = new 
AllowableValue("XML");
+public static final AllowableValue MODE_REC = new 
AllowableValue("Records");
--- End diff --

The difficulty with this is that Solr provides various different field 
types for different kinds of data. For instance, an integer could be derived 
from an Int, TrieInt (version < 7.0) or Pint (version >= 7.0) field. This 
requires a comprehensive fieldtype-datatype mapping.
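
A fieldtype-datatype mapping of the kind mentioned above could start out as a simple lookup table. The following is a minimal sketch, assuming the mapping is keyed by the Solr field type class name; the DataType enum and the class name are hypothetical, and the table is deliberately incomplete.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; a real mapping would have to cover many more field types.
final class SolrFieldTypeMappingSketch {

    // Illustrative target types, not NiFi's own record field types.
    enum DataType { INT, LONG, BOOLEAN, STRING, DATE }

    private static final Map<String, DataType> BY_SOLR_CLASS = new HashMap<>();
    static {
        // Several Solr field types can map to the same data type.
        BY_SOLR_CLASS.put("solr.TrieIntField", DataType.INT);    // Solr < 7.0
        BY_SOLR_CLASS.put("solr.IntPointField", DataType.INT);   // Solr >= 7.0
        BY_SOLR_CLASS.put("solr.TrieLongField", DataType.LONG);
        BY_SOLR_CLASS.put("solr.LongPointField", DataType.LONG);
        BY_SOLR_CLASS.put("solr.TrieDateField", DataType.DATE);
        BY_SOLR_CLASS.put("solr.DatePointField", DataType.DATE);
        BY_SOLR_CLASS.put("solr.BoolField", DataType.BOOLEAN);
        BY_SOLR_CLASS.put("solr.StrField", DataType.STRING);
        BY_SOLR_CLASS.put("solr.TextField", DataType.STRING);
    }

    // Returns an empty Optional for field types the table does not know.
    static Optional<DataType> resolve(String solrFieldTypeClass) {
        return Optional.ofNullable(BY_SOLR_CLASS.get(solrFieldTypeClass));
    }

    public static void main(String[] args) {
        System.out.println(resolve("solr.TrieIntField"));
        System.out.println(resolve("solr.IntPointField"));
        System.out.println(resolve("solr.SomeCustomField"));
    }
}
```

Returning an Optional keeps the decision about unknown or custom field types (fail, skip, or fall back to a string) with the caller.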



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211099#comment-16211099
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user JohannesDaniel commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145712892
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -66,42 +79,72 @@
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrDocumentList;
 import org.apache.solr.common.SolrInputDocument;
+import org.apache.solr.common.params.CursorMarkParams;
 
-@Tags({"Apache", "Solr", "Get", "Pull"})
+@Tags({"Apache", "Solr", "Get", "Pull", "Records"})
 @InputRequirement(Requirement.INPUT_FORBIDDEN)
-@CapabilityDescription("Queries Solr and outputs the results as a 
FlowFile")
+@CapabilityDescription("Queries Solr and outputs the results as a FlowFile 
in the format of XML or using a Record Writer")
+@Stateful(scopes = {Scope.CLUSTER}, description = "Stores latest date of 
Date Field so that the same data will not be fetched multiple times.")
 public class GetSolr extends SolrProcessor {
 
-public static final PropertyDescriptor SOLR_QUERY = new 
PropertyDescriptor
-.Builder().name("Solr Query")
-.description("A query to execute against Solr")
+public static final String STATE_MANAGER_FILTER = 
"stateManager_filter";
+public static final String STATE_MANAGER_CURSOR_MARK = 
"stateManager_cursorMark";
+public static final AllowableValue MODE_XML = new 
AllowableValue("XML");
+public static final AllowableValue MODE_REC = new 
AllowableValue("Records");
--- End diff --

In principle yes, by using the Schema API. But I don't expect this to be 
easy. I suggest that we create a separate ticket for this, as it will require 
deeper consideration.



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211045#comment-16211045
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145699419
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -66,42 +79,72 @@
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrDocumentList;
 import org.apache.solr.common.SolrInputDocument;
+import org.apache.solr.common.params.CursorMarkParams;
 
-@Tags({"Apache", "Solr", "Get", "Pull"})
+@Tags({"Apache", "Solr", "Get", "Pull", "Records"})
 @InputRequirement(Requirement.INPUT_FORBIDDEN)
-@CapabilityDescription("Queries Solr and outputs the results as a 
FlowFile")
+@CapabilityDescription("Queries Solr and outputs the results as a FlowFile 
in the format of XML or using a Record Writer")
+@Stateful(scopes = {Scope.CLUSTER}, description = "Stores latest date of 
Date Field so that the same data will not be fetched multiple times.")
 public class GetSolr extends SolrProcessor {
 
-public static final PropertyDescriptor SOLR_QUERY = new 
PropertyDescriptor
-.Builder().name("Solr Query")
-.description("A query to execute against Solr")
+public static final String STATE_MANAGER_FILTER = 
"stateManager_filter";
+public static final String STATE_MANAGER_CURSOR_MARK = 
"stateManager_cursorMark";
+public static final AllowableValue MODE_XML = new 
AllowableValue("XML");
+public static final AllowableValue MODE_REC = new 
AllowableValue("Records");
--- End diff --

Just an idea. Configuring a schema for the writer manually can be 
cumbersome. I wonder if it's possible to load the schema from the target 
collection and then auto-generate a NiFi record schema from it. Do you think 
it's doable?



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211049#comment-16211049
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145698373
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -170,159 +210,213 @@ protected void init(final 
ProcessorInitializationContext context) {
 return this.descriptors;
 }
 
+final static Set propertyNamesForActivatingClearState = new 
HashSet();
+static {
+propertyNamesForActivatingClearState.add(SOLR_TYPE.getName());
+propertyNamesForActivatingClearState.add(SOLR_LOCATION.getName());
+propertyNamesForActivatingClearState.add(COLLECTION.getName());
+propertyNamesForActivatingClearState.add(SOLR_QUERY.getName());
+propertyNamesForActivatingClearState.add(DATE_FIELD.getName());
+propertyNamesForActivatingClearState.add(RETURN_FIELDS.getName());
+}
+
 @Override
 public void onPropertyModified(PropertyDescriptor descriptor, String 
oldValue, String newValue) {
-lastEndDatedRef.set(UNINITIALIZED_LAST_END_DATE_VALUE);
+if 
(propertyNamesForActivatingClearState.contains(descriptor.getName()))
+clearState.set(true);
 }
 
-@OnStopped
-public void onStopped() {
-writeLastEndDate();
-}
+@OnScheduled
+public void clearState(final ProcessContext context) throws 
IOException {
+if (clearState.getAndSet(false)) {
+context.getStateManager().clear(Scope.CLUSTER);
+final Map newStateMap = new 
HashMap();
 
-@OnRemoved
-public void onRemoved() {
-final File lastEndDateCache = new File(FILE_PREFIX + 
getIdentifier());
-if (lastEndDateCache.exists()) {
-lastEndDateCache.delete();
-}
-}
+newStateMap.put(STATE_MANAGER_CURSOR_MARK, "*");
 
-@Override
-public void onTrigger(ProcessContext context, ProcessSession session) 
throws ProcessException {
-final ComponentLog logger = getLogger();
-readLastEndDate();
-
-final SimpleDateFormat sdf = new 
SimpleDateFormat(LAST_END_DATE_PATTERN, Locale.US);
-sdf.setTimeZone(TimeZone.getTimeZone("GMT"));
-final String currDate = sdf.format(new Date());
-
-final boolean initialized = 
!UNINITIALIZED_LAST_END_DATE_VALUE.equals(lastEndDatedRef.get());
-
-final String query = context.getProperty(SOLR_QUERY).getValue();
-final SolrQuery solrQuery = new SolrQuery(query);
-solrQuery.setRows(context.getProperty(BATCH_SIZE).asInteger());
-
-// if initialized then apply a filter to restrict results from the 
last end time til now
-if (initialized) {
-StringBuilder filterQuery = new StringBuilder();
-filterQuery.append(context.getProperty(DATE_FIELD).getValue())
-.append(":{").append(lastEndDatedRef.get()).append(" 
TO ")
-.append(currDate).append("]");
-solrQuery.addFilterQuery(filterQuery.toString());
-logger.info("Applying filter query {}", new 
Object[]{filterQuery.toString()});
-}
+final String initialDate = 
context.getProperty(DATE_FILTER).getValue();
+if (StringUtils.isBlank(initialDate))
+newStateMap.put(STATE_MANAGER_FILTER, "*");
+else
+newStateMap.put(STATE_MANAGER_FILTER, initialDate);
 
-final String returnFields = 
context.getProperty(RETURN_FIELDS).getValue();
-if (returnFields != null && !returnFields.trim().isEmpty()) {
-for (String returnField : returnFields.trim().split("[,]")) {
-solrQuery.addField(returnField.trim());
-}
+context.getStateManager().setState(newStateMap, Scope.CLUSTER);
+
+id_field = null;
 }
+}
 
-final String fullSortClause = 
context.getProperty(SORT_CLAUSE).getValue();
-if (fullSortClause != null && !fullSortClause.trim().isEmpty()) {
-for (String sortClause : fullSortClause.split("[,]")) {
-String[] sortParts = sortClause.trim().split("[ ]");
-solrQuery.addSort(sortParts[0], 
SolrQuery.ORDER.valueOf(sortParts[1]));
-}
+@Override
+protected final Collection<ValidationResult> 

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211047#comment-16211047
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145696508
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -170,159 +210,213 @@ protected void init(final 
ProcessorInitializationContext context) {
 return this.descriptors;
 }
 
+final static Set<String> propertyNamesForActivatingClearState = new 
HashSet<String>();
+static {
+propertyNamesForActivatingClearState.add(SOLR_TYPE.getName());
+propertyNamesForActivatingClearState.add(SOLR_LOCATION.getName());
+propertyNamesForActivatingClearState.add(COLLECTION.getName());
+propertyNamesForActivatingClearState.add(SOLR_QUERY.getName());
+propertyNamesForActivatingClearState.add(DATE_FIELD.getName());
+propertyNamesForActivatingClearState.add(RETURN_FIELDS.getName());
+}
+
 @Override
 public void onPropertyModified(PropertyDescriptor descriptor, String 
oldValue, String newValue) {
-lastEndDatedRef.set(UNINITIALIZED_LAST_END_DATE_VALUE);
+if 
(propertyNamesForActivatingClearState.contains(descriptor.getName()))
+clearState.set(true);
 }
 
-@OnStopped
-public void onStopped() {
-writeLastEndDate();
-}
+@OnScheduled
+public void clearState(final ProcessContext context) throws 
IOException {
+if (clearState.getAndSet(false)) {
+context.getStateManager().clear(Scope.CLUSTER);
+final Map<String,String> newStateMap = new 
HashMap<String,String>();
 
-@OnRemoved
-public void onRemoved() {
-final File lastEndDateCache = new File(FILE_PREFIX + 
getIdentifier());
-if (lastEndDateCache.exists()) {
-lastEndDateCache.delete();
-}
-}
+newStateMap.put(STATE_MANAGER_CURSOR_MARK, "*");
 
-@Override
-public void onTrigger(ProcessContext context, ProcessSession session) 
throws ProcessException {
-final ComponentLog logger = getLogger();
-readLastEndDate();
-
-final SimpleDateFormat sdf = new 
SimpleDateFormat(LAST_END_DATE_PATTERN, Locale.US);
-sdf.setTimeZone(TimeZone.getTimeZone("GMT"));
-final String currDate = sdf.format(new Date());
-
-final boolean initialized = 
!UNINITIALIZED_LAST_END_DATE_VALUE.equals(lastEndDatedRef.get());
-
-final String query = context.getProperty(SOLR_QUERY).getValue();
-final SolrQuery solrQuery = new SolrQuery(query);
-solrQuery.setRows(context.getProperty(BATCH_SIZE).asInteger());
-
-// if initialized then apply a filter to restrict results from the 
last end time til now
-if (initialized) {
-StringBuilder filterQuery = new StringBuilder();
-filterQuery.append(context.getProperty(DATE_FIELD).getValue())
-.append(":{").append(lastEndDatedRef.get()).append(" 
TO ")
-.append(currDate).append("]");
-solrQuery.addFilterQuery(filterQuery.toString());
-logger.info("Applying filter query {}", new 
Object[]{filterQuery.toString()});
-}
+final String initialDate = 
context.getProperty(DATE_FILTER).getValue();
+if (StringUtils.isBlank(initialDate))
+newStateMap.put(STATE_MANAGER_FILTER, "*");
+else
+newStateMap.put(STATE_MANAGER_FILTER, initialDate);
 
-final String returnFields = 
context.getProperty(RETURN_FIELDS).getValue();
-if (returnFields != null && !returnFields.trim().isEmpty()) {
-for (String returnField : returnFields.trim().split("[,]")) {
-solrQuery.addField(returnField.trim());
-}
+context.getStateManager().setState(newStateMap, Scope.CLUSTER);
+
+id_field = null;
 }
+}
 
-final String fullSortClause = 
context.getProperty(SORT_CLAUSE).getValue();
-if (fullSortClause != null && !fullSortClause.trim().isEmpty()) {
-for (String sortClause : fullSortClause.split("[,]")) {
-String[] sortParts = sortClause.trim().split("[ ]");
-solrQuery.addSort(sortParts[0], 
SolrQuery.ORDER.valueOf(sortParts[1]));
-}
+@Override
+protected final Collection<ValidationResult> 

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211043#comment-16211043
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145688724
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -66,42 +79,72 @@
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrDocumentList;
 import org.apache.solr.common.SolrInputDocument;
+import org.apache.solr.common.params.CursorMarkParams;
 
-@Tags({"Apache", "Solr", "Get", "Pull"})
+@Tags({"Apache", "Solr", "Get", "Pull", "Records"})
 @InputRequirement(Requirement.INPUT_FORBIDDEN)
-@CapabilityDescription("Queries Solr and outputs the results as a 
FlowFile")
+@CapabilityDescription("Queries Solr and outputs the results as a FlowFile 
in the format of XML or using a Record Writer")
+@Stateful(scopes = {Scope.CLUSTER}, description = "Stores latest date of 
Date Field so that the same data will not be fetched multiple times.")
 public class GetSolr extends SolrProcessor {
 
-public static final PropertyDescriptor SOLR_QUERY = new 
PropertyDescriptor
-.Builder().name("Solr Query")
-.description("A query to execute against Solr")
+public static final String STATE_MANAGER_FILTER = 
"stateManager_filter";
+public static final String STATE_MANAGER_CURSOR_MARK = 
"stateManager_cursorMark";
+public static final AllowableValue MODE_XML = new 
AllowableValue("XML");
+public static final AllowableValue MODE_REC = new 
AllowableValue("Records");
+
+public static final PropertyDescriptor RETURN_TYPE = new 
PropertyDescriptor
+.Builder().name("Return Type")
+.displayName("Return Type")
+.description("Write Solr documents to FlowFiles as XML or 
using a Record Writer")
 .required(true)
-.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
+.allowableValues(MODE_XML, MODE_REC)
+.defaultValue(MODE_REC.getValue())
--- End diff --

The default value should be MODE_XML, as it was before.



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211046#comment-16211046
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145690789
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -66,42 +79,72 @@
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrDocumentList;
 import org.apache.solr.common.SolrInputDocument;
+import org.apache.solr.common.params.CursorMarkParams;
 
-@Tags({"Apache", "Solr", "Get", "Pull"})
+@Tags({"Apache", "Solr", "Get", "Pull", "Records"})
 @InputRequirement(Requirement.INPUT_FORBIDDEN)
-@CapabilityDescription("Queries Solr and outputs the results as a 
FlowFile")
+@CapabilityDescription("Queries Solr and outputs the results as a FlowFile 
in the format of XML or using a Record Writer")
+@Stateful(scopes = {Scope.CLUSTER}, description = "Stores latest date of 
Date Field so that the same data will not be fetched multiple times.")
 public class GetSolr extends SolrProcessor {
 
-public static final PropertyDescriptor SOLR_QUERY = new 
PropertyDescriptor
-.Builder().name("Solr Query")
-.description("A query to execute against Solr")
+public static final String STATE_MANAGER_FILTER = 
"stateManager_filter";
+public static final String STATE_MANAGER_CURSOR_MARK = 
"stateManager_cursorMark";
+public static final AllowableValue MODE_XML = new 
AllowableValue("XML");
+public static final AllowableValue MODE_REC = new 
AllowableValue("Records");
+
+public static final PropertyDescriptor RETURN_TYPE = new 
PropertyDescriptor
+.Builder().name("Return Type")
+.displayName("Return Type")
--- End diff --

Although I haven't seen a specific guideline or documentation for this, other 
processors tend to give `name` a lower-case, key-like value such as 
`return_type`, so that users can type it without worrying about spacing or 
case sensitivity, while `displayName` is a more verbose, human-readable name.

`name` matters more in the world of MiNiFi or other applications that talk 
to the NiFi API directly and programmatically.

I don't have a strong opinion here; I just wanted to share what those two 
are for.
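
To illustrate the distinction, here is a minimal, self-contained sketch in plain Java. It does not use the NiFi API: the `DescriptorSketch` class and its `toKey` helper are hypothetical stand-ins, and the lower-case-with-underscores convention is only what the comment above describes other processors as preferring.

```java
import java.util.Locale;

// Hypothetical stand-in for a property descriptor; this is NOT NiFi's PropertyDescriptor.
final class DescriptorSketch {
    final String name;        // stable, key-like identifier used by tools and APIs
    final String displayName; // human-readable label shown in the UI

    private DescriptorSketch(String name, String displayName) {
        this.name = name;
        this.displayName = displayName;
    }

    // Derives a key-like name ("return_type") from a label ("Return Type").
    static String toKey(String label) {
        return label.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", "_");
    }

    static DescriptorSketch of(String displayName) {
        return new DescriptorSketch(toKey(displayName), displayName);
    }

    public static void main(String[] args) {
        DescriptorSketch d = DescriptorSketch.of("Return Type");
        System.out.println(d.name + " -> " + d.displayName);
    }
}
```

The point of keeping the two apart is that the label can be reworded later without changing the key that scripts or other applications depend on.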



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211044#comment-16211044
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145696081
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211048#comment-16211048
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145697961
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210912#comment-16210912
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145678316
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -126,6 +126,14 @@
 .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
 .build();
 
+public static final PropertyDescriptor DATE_FILTER = new 
PropertyDescriptor
+.Builder().name("Initial Date Filter")
+.displayName("Initial Date Filter")
+.description("Date value to filter results. Documents with an 
earlier date will not be fetched. The format has to correspond to the date 
pattern of Solr 'YYYY-MM-DDThh:mm:ssZ'")
+.required(false)
+.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
+.build();
+
--- End diff --

You can change the processor description in the `@CapabilityDescription` 
annotation: 
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L72

If more detailed documentation is needed, a processor can also provide an 
`additionalDetails.html` file like this one:

https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/resources/docs/org.apache.nifi.processors.solr.PutSolrContentStream/additionalDetails.html



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210664#comment-16210664
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user JohannesDaniel commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145612359
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -126,6 +126,14 @@
 .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
 .build();
 
+public static final PropertyDescriptor DATE_FILTER = new 
PropertyDescriptor
+.Builder().name("Initial Date Filter")
+.displayName("Initial Date Filter")
+.description("Date value to filter results. Documents with an 
earlier date will not be fetched. The format has to correspond to the date 
pattern of Solr 'YYYY-MM-DDThh:mm:ssZ'")
+.required(false)
+.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
+.build();
+
--- End diff --

This property should make it fairly obvious how backward compatibility can 
be achieved. Additionally, I will describe it in the documentation. BTW: where 
can I change the descriptions of processor usage? I did not find them in the 
nifi-docs folder...
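
As a rough illustration of how an optional initial date could seed the stored filter value, the sketch below mirrors the blank check from the diff in plain Java. It does not use the NiFi or SolrJ APIs: `initialFilterValue` and `toFilterQuery` are hypothetical helper names, the field name `created` is made up, and the inclusive, open-ended range `[value TO *]` is an assumption about how the stored value might be turned into a Solr filter query.

```java
final class InitialDateFilterSketch {

    // Mirrors the blank check in the diff: no configured date means "match everything" ("*").
    static String initialFilterValue(String configuredDate) {
        if (configuredDate == null || configuredDate.trim().isEmpty()) {
            return "*";
        }
        return configuredDate;
    }

    // Assumed shape of the resulting filter query: dateField:[lowerBound TO *]
    static String toFilterQuery(String dateField, String lowerBound) {
        return dateField + ":[" + lowerBound + " TO *]";
    }

    public static void main(String[] args) {
        // Property left blank: the first query is unbounded.
        System.out.println(toFilterQuery("created", initialFilterValue(null)));
        // Property set: documents with an earlier date are skipped.
        System.out.println(toFilterQuery("created", initialFilterValue("2017-01-01T00:00:00Z")));
    }
}
```

Under these assumptions, a flow that previously relied on a remembered timestamp could be given that timestamp as the initial date so that it resumes from a known point instead of fetching everything again.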


> GetSolr can miss recently updated documents
> ---
>
> Key: NIFI-3248
> URL: https://issues.apache.org/jira/browse/NIFI-3248
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Extensions
>Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 
> 1.0.1
>Reporter: Koji Kawamura
>Assignee: Johannes Peter
> Attachments: nifi-flow.png, query-result-with-curly-bracket.png, 
> query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents 
> that have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once a 
> document's date field value becomes older than the last query timestamp, the 
> document can no longer be queried by GetSolr.
> This JIRA is for tracking the investigation of this behavior and the 
> discussion around it.
> Here are things that can cause this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be helpful|
> |3|Lag comes from Near Real Time nature of Solr|Should be documented at least, add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At first glance, mixing curly and square brackets looked strange 
> ([source code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
> But this difference has a meaning.
> The square bracket on the range query is inclusive and the curly bracket is 
> exclusive. If we used inclusive on both sides and a document had a timestamp 
> exactly on the boundary, it could be returned in two consecutive 
> executions, and we only want it in one.
> This is intentional, and it should stay as it is.
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates|].
> If the date field String value of an updated document represents a time 
> without a timezone, and NiFi is running in an environment using a timezone 
> other than UTC, GetSolr can't perform the date range query as users expect.
> Let's say NiFi is running with JST (UTC+9). A process added a document to Solr 
> at 15:00 JST, but the date field doesn't have a timezone, so Solr indexed it 
> as 15:00 UTC. Then GetSolr performs a range query at 15:10 JST, targeting any 
> documents updated from 15:00 to 15:10 JST. GetSolr formats dates using UTC, 
> i.e. 6:00 to 6:10 UTC. The updated document won't match the date 
> range filter.
> To avoid this, updated documents must have a proper timezone in the date 
> field string representation.
> If one uses NiFi expression language to set the current timestamp in that 
> date field, the following NiFi expression can be used:
> {code}
> ${now():format("yyyy-MM-dd'T'HH:mm:ss.SSSZ")}
> {code}
> It will produce a result like:
> {code}
> 2016-12-27T15:30:04.895+0900
> {code}
> Then it will be indexed in Solr as UTC and will be queried by GetSolr as 
> expected.
> h2. 3. Lag comes from Near Real Time nature of Solr
> Solr provides Near Real Time search capability, which means that recently 
> updated documents can be queried in near real time, but not in real time. 
> This 
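
The timezone scenario in section 2 of the quoted description can be reproduced with `java.time`. This is only a sketch of the arithmetic, not of what Solr or GetSolr do internally: the class name is hypothetical, the date 2016-12-27 is chosen for illustration, and JST is modeled as a fixed UTC+9 offset.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class TimezoneLagSketch {

    static final ZoneOffset JST = ZoneOffset.ofHours(9);

    // Wall-clock time of the update as seen by the process running in JST.
    static final LocalDateTime UPDATE_WALL_CLOCK = LocalDateTime.of(2016, 12, 27, 15, 0);

    // The instant stored if the date string carries no timezone and is read as UTC.
    static Instant indexedWithoutZone() {
        return UPDATE_WALL_CLOCK.toInstant(ZoneOffset.UTC);
    }

    // The instant at which the update really happened (15:00 JST = 06:00 UTC).
    static Instant actualInstant() {
        return UPDATE_WALL_CLOCK.toInstant(JST);
    }

    // The same wall-clock time written with an explicit offset, so it cannot be misread.
    static String withOffset() {
        return UPDATE_WALL_CLOCK.atOffset(JST)
                .format(DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ"));
    }

    public static void main(String[] args) {
        System.out.println("indexed without zone: " + indexedWithoutZone());
        System.out.println("actual instant:       " + actualInstant());
        System.out.println("skew:                 "
                + Duration.between(actualInstant(), indexedWithoutZone()));
        System.out.println("with offset:          " + withOffset());
    }
}
```

The nine-hour skew is why a query window of 6:00 to 6:10 UTC never sees a document stored as 15:00 UTC, while the offset-bearing string resolves to the intended instant.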

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209344#comment-16209344
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145416772
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/SolrProcessor.java
 ---
@@ -275,7 +275,7 @@ protected final boolean isBasicAuthEnabled() {
 }
 
 @Override
-protected final Collection<ValidationResult> customValidate(ValidationContext context) {
+protected Collection<ValidationResult> customValidate(ValidationContext context) {
--- End diff --

Good call. Then the method can be a non-abstract method in SolrProcessor 
that does nothing.



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209339#comment-16209339
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145416024
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/test/resources/solr/testCollection/conf/schema.xml
 ---
@@ -16,6 +16,16 @@
 
 
 
+
 
 
+
+
+
+
+
+
+
+<uniqueKey>id</uniqueKey>
--- End diff --

I agree, most indices have a unique key. I only asked because, according to the 
[Solr documentation](https://wiki.apache.org/solr/SchemaXml#The_Unique_Key_Field), 
a unique key is not mandatory. So I would prefer that the NiFi documentation 
state that a unique key is required for this processor to work properly.



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209331#comment-16209331
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user JohannesDaniel commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145415068
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/SolrProcessor.java
 ---
@@ -275,7 +275,7 @@ protected final boolean isBasicAuthEnabled() {
 }
 
 @Override
-protected final Collection<ValidationResult> customValidate(ValidationContext context) {
+protected Collection<ValidationResult> customValidate(ValidationContext context) {
--- End diff --

OK. In that case, I will also have to add this method to PutSolrContentStream.



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209326#comment-16209326
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145412969
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/SolrProcessor.java
 ---
@@ -275,7 +275,7 @@ protected final boolean isBasicAuthEnabled() {
 }
 
 @Override
-protected final Collection<ValidationResult> customValidate(ValidationContext context) {
+protected Collection<ValidationResult> customValidate(ValidationContext context) {
--- End diff --

I imagine this customValidate is marked `final` because the original author 
wanted to prevent sub-classes from skipping the validation code implemented 
here. You implemented the override within GetSolr and call 
`super.customValidate` from there, so it should be fine, but other sub-classes 
could forget to call `super.customValidate` if we remove the `final` keyword.

So I thought it might be safer to add an abstract method, such as 
`additionalCustomValidate`, to SolrProcessor, call it from customValidate, 
and let sub-classes implement their custom validation in it.
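
The suggestion above is essentially the template-method pattern. A minimal sketch follows, using simplified stand-in types rather than the real NiFi classes: `BaseProcessorSketch`, `GetSketch`, the `String` configuration and the `String` problem messages are all hypothetical, and only the name `additionalCustomValidate` comes from the comment.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Simplified stand-in for the shared base processor; not the real NiFi class. */
abstract class BaseProcessorSketch {

    /** Stays final, so a sub-class cannot skip the shared checks by overriding it. */
    protected final Collection<String> customValidate(String config) {
        List<String> problems = new ArrayList<>();
        if (config == null || config.isEmpty()) {
            problems.add("base: configuration must not be empty");
        }
        // The hook is always called from here; sub-classes never need super.customValidate.
        problems.addAll(additionalCustomValidate(config));
        return problems;
    }

    /** Hook for sub-class specific validation. */
    protected abstract Collection<String> additionalCustomValidate(String config);
}

/** Hypothetical sub-class that adds one check of its own. */
class GetSketch extends BaseProcessorSketch {
    @Override
    protected Collection<String> additionalCustomValidate(String config) {
        List<String> problems = new ArrayList<>();
        if (config == null || !config.contains("date-field")) {
            problems.add("get: a date field must be configured");
        }
        return problems;
    }
}
```

Making the hook non-abstract with an empty default, instead of abstract as sketched here, would spare sub-classes that have nothing to add from implementing it.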



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209300#comment-16209300
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user JohannesDaniel commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145408461
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/test/resources/solr/testCollection/conf/schema.xml
 ---
@@ -16,6 +16,16 @@
 
 
 
+
 
 
+
+
+
+
+
+
+
+<uniqueKey>id</uniqueKey>
--- End diff --

The uniqueKey field has to be part of the sorting. Well-configured Solr 
indexes always include this kind of field, as many things will not work properly 
without it. Actually, I have never seen a Solr index without one (and 
I have seen a lot ... ;).
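
One plausible reading of why the unique key belongs in the sort is that a date field alone does not give a total order: documents sharing a timestamp can come back in any relative order, which makes paging through results unreliable. The comment itself does not spell this out, so the sketch below is an interpretation. It uses a hypothetical in-memory `Doc` type, not Solr, to show a date-then-id ordering that is independent of input order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortTieBreakSketch {

    /** Hypothetical minimal document: a timestamp plus a unique id. */
    static final class Doc {
        final long timestamp;
        final String id;

        Doc(long timestamp, String id) {
            this.timestamp = timestamp;
            this.id = id;
        }
    }

    /** Total order: timestamp first, unique id as the tie-breaker. */
    static final Comparator<Doc> BY_DATE_THEN_ID =
            Comparator.comparingLong((Doc d) -> d.timestamp).thenComparing(d -> d.id);

    /** Returns the ids in sorted order without modifying the input list. */
    static List<String> sortedIds(List<Doc> docs) {
        List<Doc> copy = new ArrayList<>(docs);
        copy.sort(BY_DATE_THEN_ID);
        List<String> ids = new ArrayList<>();
        for (Doc d : copy) {
            ids.add(d.id);
        }
        return ids;
    }
}
```

Because ids are unique, two documents never compare as equal, so every arrival order produces the same sequence.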



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209290#comment-16209290
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user JohannesDaniel commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145405902
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/SolrProcessor.java
 ---
@@ -275,7 +275,7 @@ protected final boolean isBasicAuthEnabled() {
 }
 
 @Override
-protected final Collection<ValidationResult> customValidate(ValidationContext context) {
+protected Collection<ValidationResult> customValidate(ValidationContext context) {
--- End diff --

I did that within the GetSolr class.



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209280#comment-16209280
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user JohannesDaniel commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145404225
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -172,157 +203,196 @@ protected void init(final ProcessorInitializationContext context) {
 
 @Override
 public void onPropertyModified(PropertyDescriptor descriptor, String oldValue, String newValue) {
-lastEndDatedRef.set(UNINITIALIZED_LAST_END_DATE_VALUE);
+clearState.set(true);
--- End diff --

ok, no problem


> GetSolr can miss recently updated documents
> ---
>
> Key: NIFI-3248
> URL: https://issues.apache.org/jira/browse/NIFI-3248
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Extensions
>Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 
> 1.0.1
>Reporter: Koji Kawamura
>Assignee: Johannes Peter
> Attachments: nifi-flow.png, query-result-with-curly-bracket.png, 
> query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents 
> those have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once the 
> documents date field value becomes older than last query timestamp, the 
> document won't be able to be queried by GetSolr any more.
> This JIRA is for tracking the process of investigating this behavior, and 
> discussion on them.
> Here are things that can be a cause of this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be 
> helpful|
> |3|Lag comes from NearRealTIme nature of Solr|Should be documented at least, 
> add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At the first glance, using curly and square bracket in mix looked strange 
> ([source 
> code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
>  But these difference has a meaning.
> The square bracket on the range query is inclusive and the curly bracket is 
> exclusive. If we use inclusive on both sides and a document has a time stamp 
> exactly on the boundary then it could be returned in two consecutive 
> executions, and we only want it in one.
> This is intentional, and it should be as it is.
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC 
> representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates|].
>  If the date field String value of an updated document represents a time without 
> a timezone, and NiFi is running in an environment using a timezone other than 
> UTC, GetSolr can't perform the date range query as users expect.
> Let's say NiFi is running with JST (UTC+9). A process added a document to Solr 
> at 15:00 JST, but the date field doesn't have a timezone, so Solr indexed it 
> as 15:00 UTC. Then GetSolr performs a range query at 15:10 JST, targeting any 
> documents updated from 15:00 to 15:10 JST. GetSolr formats dates using UTC, 
> i.e. 6:00 to 6:10 UTC. The updated document won't match the date 
> range filter.
> To avoid this, updated documents must have a proper timezone in the date field's 
> string representation.
> If one uses NiFi Expression Language to set the current timestamp on that date 
> field, the following NiFi expression can be used:
> {code}
> ${now():format("yyyy-MM-dd'T'HH:mm:ss.SSSZ")}
> {code}
> It will produce a result like:
> {code}
> 2016-12-27T15:30:04.895+0900
> {code}
> Then it will be indexed in Solr with UTC and will be queried by GetSolr as 
> expected.
> h2. 3. Lag comes from Near Real Time nature of Solr
> Solr provides Near Real Time search capability, which means that recently 
> updated documents can be queried in near real time, but not in real time. 
> This latency can be controlled either on the client side, which requests the 
> update operation, by specifying the "commitWithin" parameter, or on the Solr 
> server side, with "autoCommit" and "autoSoftCommit" in 
> [solrconfig.xml|https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-Commits].
> Since committing and updating the index can be costly, it's recommended to set this 
> interval as long as the maximum tolerable latency allows.
> However, this can be problematic with GetSolr. For instance, as shown in the 
> simple 

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209279#comment-16209279
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user JohannesDaniel commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145404160
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -138,10 +168,11 @@ protected void init(final 
ProcessorInitializationContext context) {
 descriptors.add(SOLR_TYPE);
 descriptors.add(SOLR_LOCATION);
 descriptors.add(COLLECTION);
+descriptors.add(RETURN_TYPE);
+descriptors.add(RECORD_WRITER);
 descriptors.add(SOLR_QUERY);
-descriptors.add(RETURN_FIELDS);
-descriptors.add(SORT_CLAUSE);
--- End diff --

This should be safe, as the sorting only affects documents indexed after 
lastEndDate (documents indexed earlier are excluded by the filter query).



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209278#comment-16209278
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user JohannesDaniel commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145403786
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -66,42 +79,64 @@
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrDocumentList;
 import org.apache.solr.common.SolrInputDocument;
+import org.apache.solr.common.params.CursorMarkParams;
 
-@Tags({"Apache", "Solr", "Get", "Pull"})
+@Tags({"Apache", "Solr", "Get", "Pull", "Records"})
 @InputRequirement(Requirement.INPUT_FORBIDDEN)
-@CapabilityDescription("Queries Solr and outputs the results as a 
FlowFile")
+@CapabilityDescription("Queries Solr and outputs the results as a FlowFile 
in the format of XML or using a Record Writer")
+@Stateful(scopes = {Scope.LOCAL}, description = "Stores latest date of 
Date Field so that the same data will not be fetched multiple times.")
--- End diff --

Sorry, this would be the correct filter query:
 fq=dateField:[lastEndDate TO NOW]
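
A minimal sketch of how such a filter query could be assembled, assuming lastEndDate is stored as a timestamp with a zone offset (as produced by the NiFi expression in the issue description). The class and method names are hypothetical and not taken from the pull request; only java.time is used, so no Solr client is involved:

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of GetSolr: illustrates the UTC conversion
// and the inclusive range syntax discussed in the comment above.
class DateRangeFilterSketch {

    // Same pattern as the NiFi expression in the issue, e.g. 2016-12-27T15:30:04.895+0900
    private static final DateTimeFormatter OFFSET_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ");

    // Converts a timestamp carrying a zone offset into a UTC instant.
    static Instant toUtc(String timestampWithOffset) {
        return OffsetDateTime.parse(timestampWithOffset, OFFSET_FORMAT).toInstant();
    }

    // Builds dateField:[lastEndDate TO NOW]; square brackets are inclusive on both ends.
    static String buildFilterQuery(String dateField, Instant lastEndDate) {
        // Instant.toString() yields ISO-8601 in UTC with a trailing 'Z'.
        return dateField + ":[" + lastEndDate + " TO NOW]";
    }

    public static void main(String[] args) {
        Instant lastEndDate = toUtc("2016-12-27T15:30:04.895+0900");
        System.out.println("fq=" + buildFilterQuery("dateField", lastEndDate));
    }
}
```

With the sample timestamp from the issue, 15:30 at UTC+9 becomes 06:30 UTC, so the lower bound is sent to Solr in the representation it indexes.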



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209275#comment-16209275
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user JohannesDaniel commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r145403415
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -66,42 +79,64 @@
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrDocumentList;
 import org.apache.solr.common.SolrInputDocument;
+import org.apache.solr.common.params.CursorMarkParams;
 
-@Tags({"Apache", "Solr", "Get", "Pull"})
+@Tags({"Apache", "Solr", "Get", "Pull", "Records"})
 @InputRequirement(Requirement.INPUT_FORBIDDEN)
-@CapabilityDescription("Queries Solr and outputs the results as a 
FlowFile")
+@CapabilityDescription("Queries Solr and outputs the results as a FlowFile 
in the format of XML or using a Record Writer")
+@Stateful(scopes = {Scope.LOCAL}, description = "Stores latest date of 
Date Field so that the same data will not be fetched multiple times.")
--- End diff --

Do you really think it is necessary to read the file? Backwards 
compatibility could also be achieved by adding a filter query like 
fq=dateField:[* TO lastEndDate]. The user would only have to specify the value of 
lastEndDate, e.g. in a property of the processor.



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203452#comment-16203452
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r144533090
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -66,42 +79,64 @@
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrDocumentList;
 import org.apache.solr.common.SolrInputDocument;
+import org.apache.solr.common.params.CursorMarkParams;
 
-@Tags({"Apache", "Solr", "Get", "Pull"})
+@Tags({"Apache", "Solr", "Get", "Pull", "Records"})
 @InputRequirement(Requirement.INPUT_FORBIDDEN)
-@CapabilityDescription("Queries Solr and outputs the results as a 
FlowFile")
+@CapabilityDescription("Queries Solr and outputs the results as a FlowFile 
in the format of XML or using a Record Writer")
+@Stateful(scopes = {Scope.LOCAL}, description = "Stores latest date of 
Date Field so that the same data will not be fetched multiple times.")
--- End diff --

State scope should be CLUSTER, I think. Also, the capability description should 
mention that this processor is designed to run on the Primary Node only. Please 
refer to the ListHDFS processor documentation.

Or does this processor work nicely in a distributed fashion by utilizing 
multiple NiFi nodes against a Solr cluster?



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203447#comment-16203447
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r144532208
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -66,42 +79,64 @@
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrDocumentList;
 import org.apache.solr.common.SolrInputDocument;
+import org.apache.solr.common.params.CursorMarkParams;
 
-@Tags({"Apache", "Solr", "Get", "Pull"})
+@Tags({"Apache", "Solr", "Get", "Pull", "Records"})
 @InputRequirement(Requirement.INPUT_FORBIDDEN)
-@CapabilityDescription("Queries Solr and outputs the results as a 
FlowFile")
+@CapabilityDescription("Queries Solr and outputs the results as a FlowFile 
in the format of XML or using a Record Writer")
+@Stateful(scopes = {Scope.LOCAL}, description = "Stores latest date of 
Date Field so that the same data will not be fetched multiple times.")
--- End diff --

GetSolr used to use a local file to store lastEndDate. We need migration code 
so that lastEndDate is carried over to managed state when there is no state but 
the lastEndDate file exists.
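
A rough sketch of that migration rule, under stated assumptions: the legacy file is taken to be a Java properties file, the key name `LastEndDate` is a guess, and a plain `Map` stands in for NiFi's managed state. None of these details are confirmed by the diff:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch: a Map stands in for the processor's managed state.
class LastEndDateMigrationSketch {

    // Assumed key name, used for both the legacy file and the state map.
    static final String STATE_KEY = "LastEndDate";

    // Returns the state to use: existing state wins; otherwise the legacy
    // file is read if it exists; otherwise the state stays empty.
    static Map<String, String> migrate(Map<String, String> managedState, Path legacyFile) {
        Map<String, String> result = new HashMap<>(managedState);
        if (!result.isEmpty() || !Files.exists(legacyFile)) {
            return result; // nothing to migrate
        }
        Properties legacy = new Properties();
        try (Reader reader = Files.newBufferedReader(legacyFile, StandardCharsets.UTF_8)) {
            legacy.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String lastEndDate = legacy.getProperty(STATE_KEY);
        if (lastEndDate != null) {
            result.put(STATE_KEY, lastEndDate);
        }
        return result;
    }

    public static void main(String[] args) {
        Path legacyFile = java.nio.file.Paths.get("conf", ".getSolr-example"); // hypothetical path
        System.out.println(migrate(new HashMap<>(), legacyFile));
    }
}
```

In the real processor the result would be written back through the state manager, and the legacy file could then be removed or left in place; that choice is outside this sketch.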



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203450#comment-16203450
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r144526595
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/SolrProcessor.java
 ---
@@ -275,7 +275,7 @@ protected final boolean isBasicAuthEnabled() {
 }
 
 @Override
-protected final Collection<ValidationResult> customValidate(ValidationContext context) {
+protected Collection<ValidationResult> customValidate(ValidationContext context) {
--- End diff --

Shouldn't we add another protected method for subclasses to override instead?
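
One way to read this suggestion is as a template method: `customValidate` stays final in the base class and delegates to an overridable hook. The sketch below uses plain Java stand-ins (a `Map` for the validation context, `String` for validation results), and the hook name `additionalCustomValidation` as well as the property names are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the shared base processor; not the NiFi API.
abstract class SolrProcessorSketch {

    // Stays final so subclasses cannot skip the shared checks.
    protected final Collection<String> customValidate(Map<String, String> properties) {
        List<String> problems = new ArrayList<>();
        if (!properties.containsKey("Solr Location")) {
            problems.add("Solr Location is required");
        }
        problems.addAll(additionalCustomValidation(properties));
        return problems;
    }

    // Hook for subclasses; the default adds no extra checks.
    protected Collection<String> additionalCustomValidation(Map<String, String> properties) {
        return Collections.emptyList();
    }
}

// Stand-in subclass with a check specific to record output.
class GetSolrValidationSketch extends SolrProcessorSketch {

    @Override
    protected Collection<String> additionalCustomValidation(Map<String, String> properties) {
        boolean recordsRequested = "Records".equals(properties.get("Return Type"));
        if (recordsRequested && !properties.containsKey("Record Writer")) {
            return Collections.singletonList("Record Writer is required when Return Type is Records");
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        Map<String, String> properties = new HashMap<>();
        properties.put("Solr Location", "http://localhost:8983/solr");
        properties.put("Return Type", "Records");
        System.out.println(new GetSolrValidationSketch().customValidate(properties));
    }
}
```

Keeping the base method final means the common validation cannot be bypassed by accident, which dropping the modifier (as in the diff) would allow.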



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203448#comment-16203448
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r144530918
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -172,157 +203,196 @@ protected void init(final 
ProcessorInitializationContext context) {
 
 @Override
 public void onPropertyModified(PropertyDescriptor descriptor, String 
oldValue, String newValue) {
-lastEndDatedRef.set(UNINITIALIZED_LAST_END_DATE_VALUE);
+clearState.set(true);
--- End diff --

We'd probably like to clear state only when the following properties are 
changed. It would be bad UX if state were cleared when a user reconfigures the 
batch size.
- SOLR_TYPE
- SOLR_LOCATION
- COLLECTION
- SOLR_QUERY
- DATE_FIELD
- RETURN_FIELDS
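
A minimal sketch of that selective reset, assuming property names can stand in for the `PropertyDescriptor` constants listed above. The class is hypothetical and independent of the NiFi API, and `BATCH_SIZE` is an assumed name for the batch size property:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the processor's onPropertyModified handling.
class SelectiveStateResetSketch {

    // Properties whose change invalidates the stored lastEndDate (from the list above).
    private static final Set<String> STATE_RELEVANT = new HashSet<>(Arrays.asList(
            "SOLR_TYPE", "SOLR_LOCATION", "COLLECTION",
            "SOLR_QUERY", "DATE_FIELD", "RETURN_FIELDS"));

    private final AtomicBoolean clearState = new AtomicBoolean(false);

    // Flags the state for clearing only when a relevant property really changed.
    void onPropertyModified(String propertyName, String oldValue, String newValue) {
        if (STATE_RELEVANT.contains(propertyName) && !Objects.equals(oldValue, newValue)) {
            clearState.set(true);
        }
    }

    boolean shouldClearState() {
        return clearState.get();
    }

    public static void main(String[] args) {
        SelectiveStateResetSketch processor = new SelectiveStateResetSketch();
        processor.onPropertyModified("BATCH_SIZE", "100", "500");
        System.out.println("after batch size change: " + processor.shouldClearState());
        processor.onPropertyModified("SOLR_QUERY", "*:*", "type:doc");
        System.out.println("after query change: " + processor.shouldClearState());
    }
}
```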




[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203449#comment-16203449
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r144530989
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -138,10 +168,11 @@ protected void init(final 
ProcessorInitializationContext context) {
 descriptors.add(SOLR_TYPE);
 descriptors.add(SOLR_LOCATION);
 descriptors.add(COLLECTION);
+descriptors.add(RETURN_TYPE);
+descriptors.add(RECORD_WRITER);
 descriptors.add(SOLR_QUERY);
-descriptors.add(RETURN_FIELDS);
-descriptors.add(SORT_CLAUSE);
--- End diff --

Is it safe to remove an existing property? The existing code should either not 
sort results at all, or store the last sorted field value so that it can 
paginate properly when docs with the same date span more than one page. So I 
think it's safe.



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203446#comment-16203446
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r144527126
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/test/resources/solr/testCollection/conf/schema.xml
 ---
@@ -16,6 +16,16 @@
 
 
 
+
 
 
+
+
+
+
+
+
+
+id
--- End diff --

What if a Solr doc doesn't have a uniqueKey? Does this processor still work 
without a uniqueKey?



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203451#comment-16203451
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user ijokarumawak commented on a diff in the pull request:

https://github.com/apache/nifi/pull/2199#discussion_r144533800
  
--- Diff: 
nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java
 ---
@@ -172,157 +203,196 @@ protected void init(final 
ProcessorInitializationContext context) {
 
 @Override
 public void onPropertyModified(PropertyDescriptor descriptor, String 
oldValue, String newValue) {
-lastEndDatedRef.set(UNINITIALIZED_LAST_END_DATE_VALUE);
+clearState.set(true);
 }
 
-@OnStopped
-public void onStopped() {
-writeLastEndDate();
-}
+@OnScheduled
+public void onScheduled2(final ProcessContext context) throws 
IOException {
--- End diff --

Please rename the method so that it describes what it does, such as 
`clearState`. The annotation already explains when it's called.



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195076#comment-16195076
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

GitHub user JohannesDaniel opened a pull request:

https://github.com/apache/nifi/pull/2199

NIFI-3248: Improvement of GetSolr Processor

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

### For all changes:
- [ ] Is there a JIRA ticket associated with this PR? Is it referenced 
 in the commit message?

- [ ] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number 
you are trying to resolve? Pay particular attention to the hyphen "-" character.

- [ ] Has your PR been rebased against the latest commit within the target 
branch (typically master)?

- [ ] Is your initial contribution a single, squashed commit?

### For code changes:
- [ ] Have you ensured that the full suite of tests is executed via mvn 
-Pcontrib-check clean install at the root nifi folder?
- [ ] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
- [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file under nifi-assembly?
- [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found under nifi-assembly?
- [ ] If adding new Properties, have you added .displayName in addition to 
.name (programmatic access) for each of the new properties?

### For documentation related changes:
- [ ] Have you ensured that format looks appropriate for the output in 
which it is rendered?

### Note:
Please ensure that once the PR is submitted, you check travis-ci for build 
issues and submit an update to your PR as soon as possible.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/JohannesDaniel/nifi NIFI-3248

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/2199.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2199


commit 8a5f7e54edc5640655edd19f15d22fada6ca9900
Author: JohannesDaniel 
Date:   2017-10-05T20:57:53Z

NIFI-3248: Improvement of GetSolr Processor





[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-10-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194280#comment-16194280
 ] 

ASF GitHub Bot commented on NIFI-3248:
--

Github user JohannesDaniel closed the pull request at:

https://github.com/apache/nifi/pull/2196


> GetSolr can miss recently updated documents
> ---
>
> Key: NIFI-3248
> URL: https://issues.apache.org/jira/browse/NIFI-3248
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Extensions
>Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 
> 1.0.1
>Reporter: Koji Kawamura
>Assignee: Johannes Peter
> Attachments: nifi-flow.png, query-result-with-curly-bracket.png, 
> query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents 
> that have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once a 
> document's date field value becomes older than the last query timestamp, the 
> document can no longer be queried by GetSolr.
> This JIRA is for tracking the investigation of this behavior and the 
> discussion of it.
> Here are things that can cause this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be helpful|
> |3|Lag comes from Near Real Time nature of Solr|Should be documented at least, add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At first glance, mixing curly and square brackets looked strange 
> ([source 
> code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
>  But this difference has a meaning.
> The square bracket on the range query is inclusive and the curly bracket is 
> exclusive. If we use inclusive on both sides and a document has a time stamp 
> exactly on the boundary then it could be returned in two consecutive 
> executions, and we only want it in one.
> This is intentional, and it should be as it is.
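
The bracket semantics described above can be sketched as follows. This is an illustration under assumptions, not the GetSolr source: the class `DateRangeFilter` and its method names are hypothetical. It builds a range filter with an exclusive lower bound and an inclusive upper bound, and mirrors the same rule in plain Java to show that a timestamp on the boundary falls into exactly one of two consecutive windows.

```java
import java.time.Instant;

class DateRangeFilter {
    // '{' makes the lower bound exclusive, ']' makes the upper bound inclusive.
    static String build(String field, Instant lowerExclusive, Instant upperInclusive) {
        return field + ":{" + lowerExclusive + " TO " + upperInclusive + "]";
    }

    // Same rule evaluated locally: lower < ts <= upper.
    static boolean matches(Instant ts, Instant lowerExclusive, Instant upperInclusive) {
        return ts.isAfter(lowerExclusive) && !ts.isAfter(upperInclusive);
    }
}
```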
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC 
> representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates].
>  If the date field String value of an updated document represents a time without 
> a timezone, and NiFi is running in an environment using a timezone other than 
> UTC, GetSolr can't perform the date range query as users expect.
> Let's say NiFi is running with JST (UTC+9). A process added a document to Solr 
> at 15:00 JST, but the date field doesn't have a timezone, so Solr indexed it 
> as 15:00 UTC. Then GetSolr performs a range query at 15:10 JST, targeting any 
> documents updated from 15:00 to 15:10 JST. GetSolr formats dates using UTC, 
> i.e. 6:00 to 6:10 UTC. The updated document won't match the date 
> range filter.
> To avoid this, updated documents must have a proper timezone in the date field 
> string representation.
> If one uses NiFi expression language to set the current timestamp on that date 
> field, the following NiFi expression can be used:
> {code}
> ${now():format("yyyy-MM-dd'T'HH:mm:ss.SSSZ")}
> {code}
> It will produce a result like:
> {code}
> 2016-12-27T15:30:04.895+0900
> {code}
> Then it will be indexed in Solr with UTC and will be queried by GetSolr as 
> expected.
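
To make the JST example concrete, here is a small sketch that renders one instant both with an explicit offset and in UTC. It uses `java.time` purely for illustration; that choice and the class name `SolrDateExample` are assumptions, not what NiFi expression language or GetSolr use internally.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

class SolrDateExample {
    // Same pattern letters as the expression above: 'Z' prints an offset such as +0900.
    private static final DateTimeFormatter WITH_OFFSET =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ");

    static String withOffset(ZonedDateTime time) {
        return WITH_OFFSET.format(time);
    }

    // The same instant expressed in UTC, which is what a UTC-based range query compares against.
    static String asUtc(ZonedDateTime time) {
        return time.toInstant().toString();
    }
}
```

A value carrying the `+0900` offset identifies the same instant as its UTC form, whereas the same wall-clock digits without an offset would be read nine hours later.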
> h2. 3. Lag comes from Near Real Time nature of Solr
> Solr provides Near Real Time search capability, which means that recently 
> updated documents can be queried in near real time, but not in real time. 
> This latency can be controlled either on the client side, which requests the 
> update operation, by specifying the "commitWithin" parameter, or on the Solr 
> server side, with "autoCommit" and "autoSoftCommit" in 
> [solrconfig.xml|https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-Commits].
> Since committing and updating the index can be costly, it's recommended to set this 
> interval as long as the maximum tolerable latency allows.
> However, this can be problematic with GetSolr. For instance, as shown in the 
> simple NiFi flow below, GetSolr can miss updated documents:
> {code}
> t1: GetSolr queried
> t2: GenerateFlowFile set date = t2
> t3: PutSolrContentStream stored new doc
> t4: GetSolr queried again, from t1 to t4, but the new doc hasn't been indexed
> t5: Solr completed index
> t6: GetSolr queried again, from t4 to t6, the doc didn't match query
> {code}
> This behavior should be at least documented.
> Plus, it would be helpful to add a new configuration property to GetSolr, to 
> specify commit lag-time so that GetSolr aims older timestamp range to 
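
The 'commit lag-time' idea from the description can be sketched as a query window whose upper bound trails the current time by a configured lag. This is a hypothetical illustration, not the behavior of any released GetSolr: the class `LaggedWindow`, its fields, and the `commitLag` parameter are names introduced here.

```java
import java.time.Duration;
import java.time.Instant;

class LaggedWindow {
    final Instant lowerExclusive;
    final Instant upperInclusive;

    LaggedWindow(Instant lowerExclusive, Instant upperInclusive) {
        this.lowerExclusive = lowerExclusive;
        this.upperInclusive = upperInclusive;
    }

    // The next window starts where the previous one ended and stops at (now - commitLag).
    static LaggedWindow next(Instant previousEnd, Instant now, Duration commitLag) {
        Instant end = now.minus(commitLag);
        if (end.isBefore(previousEnd)) {
            end = previousEnd; // never move the window backwards
        }
        return new LaggedWindow(previousEnd, end);
    }

    // Exclusive lower bound, inclusive upper bound.
    boolean contains(Instant ts) {
        return ts.isAfter(lowerExclusive) && !ts.isAfter(upperInclusive);
    }
}
```

If the lag is at least as long as the time Solr needs to make a document searchable, a document's timestamp only enters a window after the document has become visible. The cost is that every document is delivered later by that lag.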

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-19 Thread Koji Kawamura (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172645#comment-16172645
 ] 

Koji Kawamura commented on NIFI-3248:
-

Hi [~jope], thanks for your update. Looking at (1), I wonder whether NiFi will be 
able to follow situations such as [Splitting 
Shards|https://cwiki.apache.org/confluence/display/solr/Collections+API], 
especially if the NiFi user has to specify shard names. Once a shard is split, the 
processor configuration has to be updated manually. The processor also has to 
handle state management properly.

If dealing with Solr's internal mechanisms (_version_ and sharding) is too complex 
from the NiFi codebase, I'd be fine with implementing it using the "StreamSolr" 
approach that Bryan mentioned earlier, and keeping the existing GetSolr 
implementation as simple as it is now.

(3) This is great!
(4) I like this enhancement. Since many Record-aware processors have been added 
separately alongside existing processors, it would also be reasonable to 
implement this capability in the new "StreamSolr" processor if we are going to 
add that. To support records, the processor has to work with 
RecordSetWriter and a schema, so processor properties and documentation may 
become more complex if we support this in GetSolr. Maybe I'm overly concerned 
about it, but adding many features to a single processor would make it 
harder to maintain over time.

> GetSolr can miss recently updated documents
> ---
>
> Key: NIFI-3248
> URL: https://issues.apache.org/jira/browse/NIFI-3248
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Extensions
>Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 
> 1.0.1
>Reporter: Koji Kawamura
>Assignee: Johannes Peter
> Attachments: nifi-flow.png, query-result-with-curly-bracket.png, 
> query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents 
> those have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once the 
> documents date field value becomes older than last query timestamp, the 
> document won't be able to be queried by GetSolr any more.
> This JIRA is for tracking the process of investigating this behavior, and 
> discussion on them.
> Here are things that can be a cause of this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be 
> helpful|
> |3|Lag comes from NearRealTIme nature of Solr|Should be documented at least, 
> add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At the first glance, using curly and square bracket in mix looked strange 
> ([source 
> code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
>  But these difference has a meaning.
> The square bracket on the range query is inclusive and the curly bracket is 
> exclusive. If we use inclusive on both sides and a document has a time stamp 
> exactly on the boundary then it could be returned in two consecutive 
> executions, and we only want it in one.
> This is intentional, and it should be as it is.
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC 
> representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates].
> If the date field string of an updated document represents a time without a 
> timezone, and NiFi runs in an environment using a timezone other than 
> UTC, GetSolr can't perform the date range query as users expect.
> Let's say NiFi is running with JST (UTC+9). A process added a document to Solr 
> at 15:00 JST, but the date field value has no timezone, so Solr indexed it 
> as 15:00 UTC. Then GetSolr performs a range query at 15:10 JST, targeting any 
> documents updated from 15:00 to 15:10 JST. GetSolr formats dates using UTC, 
> i.e. 6:00 to 6:10 UTC, so the updated document won't match the date 
> range filter.
> To avoid this, updated documents must carry a proper timezone in the date 
> field string representation.
> If one uses NiFi expression language to set the current timestamp in that 
> date field, the following NiFi expression can be used:
> {code}
> ${now():format("yyyy-MM-dd'T'HH:mm:ss.SSSZ")}
> {code}
> It will produce a result like:
> {code}
> 2016-12-27T15:30:04.895+0900
> {code}
> Then it will be indexed in Solr with UTC and will be queried by GetSolr as 
> expected.
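The JST example can be reproduced with a small Java sketch using only `java.time`. The class and method names are hypothetical and the code is not part of NiFi or Solr; it only shows how far apart the two interpretations of a zone-less timestamp are, and what a timestamp with an explicit offset looks like.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration; not part of NiFi or Solr.
class TimezoneSketch {

    // Date pattern with a numeric offset; Z renders an offset such as +0900.
    static final DateTimeFormatter WITH_OFFSET =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ");

    // Interprets a zone-less wall-clock time in the given zone and returns the instant.
    static Instant toInstant(LocalDateTime wallClock, ZoneId zone) {
        return wallClock.atZone(zone).toInstant();
    }

    // Formats the wall-clock time together with the offset of the given zone.
    static String formatWithOffset(LocalDateTime wallClock, ZoneId zone) {
        return wallClock.atZone(zone).format(WITH_OFFSET);
    }
}
```

Interpreting 15:00 in `Asia/Tokyo` gives 06:00 UTC, while interpreting the same wall-clock value as UTC gives an instant nine hours later, which is why the document falls outside the 6:00 to 6:10 UTC range.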
> h2. 3. Lag comes from Near Real Time nature of Solr
> Solr provides Near Real Time search capability, that means, the recently 
> updated documents can be queried in Near Real Time, but it's not real time. 
> This 

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-19 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172249#comment-16172249
 ] 

Johannes Peter commented on NIFI-3248:
--

Update:

I am almost done with the new processor implementation. Quick update: 

(1) In the meantime I had a short conversation with Cassandra Targett (Solr 
PMC), who helped me clarify some things about the field \_version\_. 
Unfortunately, it is not possible to convert a value of this field into a valid 
timestamp. The values of this field increase monotonically with indexing time, 
but only at shard level, not at collection level. I am sorry for the confusion. 
The processor therefore iterates over shards if (a) Solr runs in cloud mode and 
(b) \_version\_ is used to track document retrieval instead of a dedicated date 
field. Although this approach may require more queries, and therefore be slower 
if a collection comprises many shards, I implemented it to make the processor 
suitable for many more collections. The shard names currently have to be 
specified by a property, as I have not yet found a reliable way to determine 
them automatically (shard names != core names).
(2) I implemented an option to make the use of filter query caches 
configurable. 
(3) The processor now makes use of the StateManager. 
(4) I will add an option to convert results into records.
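
Since \_version\_ is only monotonic within a shard, the retrieval marker has to be kept per shard. A minimal sketch of that bookkeeping, assuming hypothetical class and method names (this is not the processor code from the pull request):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of per-shard tracking; not the actual processor.
class ShardVersionTracker {

    // Highest _version_ value already retrieved, keyed by shard name.
    private final Map<String, Long> lastVersionByShard = new HashMap<>();

    // Returns the versions of one shard that have not been seen yet
    // and advances the marker for that shard only.
    List<Long> newVersions(String shard, List<Long> versionsAscending) {
        long last = lastVersionByShard.getOrDefault(shard, Long.MIN_VALUE);
        List<Long> fresh = new ArrayList<>();
        for (long v : versionsAscending) {
            if (v > last) {
                fresh.add(v);
                last = v;
            }
        }
        lastVersionByShard.put(shard, last);
        return fresh;
    }
}
```

A single collection-wide marker would drop documents from a shard whose versions happen to be lower than the marker taken from another shard; keeping one marker per shard avoids that.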

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-04 Thread Bryan Bende (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152919#comment-16152919
 ] 

Bryan Bende commented on NIFI-3248:
---

I like the ideas and improvements here, a couple of thoughts I had:

1) It would be nice to incorporate a RecordWriter for the output of this 
processor. I remember when I originally worked on it, I wanted to just send out 
the raw response from Solr so that it could be in varying formats that you can 
request from Solr, but I couldn't get access to the underlying raw response in 
SolrJ, so instead I got the SolrDocuments and re-serialized them back to the 
Solr XML format. Would be nice to have a way to convert a SolrDocument to a 
Record and then pass it to a record writer.

2) Using cursor mark sounds like a good idea. You need to include the uniqueKey 
field from the schema when using cursor mark so it can act as a tie breaker... 
basically imagine you are sorting just on a field called create_date, and two 
documents have the same create_date, and those two documents fall on the 
boundary between page 1 and page 2, then the cursor won't know which of those 
docs should be the start of page 2 the next time, but if you include the 
uniqueKey as a secondary sort then it's a consistent tie breaker. In our case, 
the version field should always be unique, but I believe Solr actually throws 
an exception if you submit a query with cursor mark and don't use the uniqueKey 
in the sort clause, so just mentioning it.

3) Separate from this ticket, I think it would also be cool to look at building 
a "StreamSolr" processor (or some better name) that could leverage all of 
Solr's new streaming expressions which can do some really cool analytical type 
things:
https://lucene.apache.org/solr/guide/6_6/streaming-expressions.html#streaming-expressions

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-04 Thread Koji Kawamura (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152566#comment-16152566
 ] 

Koji Kawamura commented on NIFI-3248:
-

[~jope]
Thanks for the detailed clarifications.
(1) I cannot imagine use-cases where sorting by ID would be useful with 
GetSolr. Like you said, users want to see updated Solr documents in the target 
system to replace old ones. So the sort condition probably doesn't need to be 
configurable. What I was trying to say is that if we are going to change 
GetSolr behavior, migration of existing GetSolr processor instances in existing 
NiFi flows should be taken care of.
(4) As you mentioned, GetSolr does not have to cover all Solr capabilities. It 
should focus on getting newly updated documents from Solr.

> GetSolr can miss recently updated documents
> ---
>
> Key: NIFI-3248
> URL: https://issues.apache.org/jira/browse/NIFI-3248
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Extensions
>Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 
> 1.0.1
>Reporter: Koji Kawamura
>Assignee: Johannes Peter
> Attachments: nifi-flow.png, query-result-with-curly-bracket.png, 
> query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents 
> that have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once a 
> document's date field value becomes older than the last query timestamp, the 
> document can no longer be retrieved by GetSolr.
> This JIRA is for tracking the investigation of this behavior and the 
> discussion around it.
> Here are things that can be a cause of this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be helpful|
> |3|Lag comes from Near Real Time nature of Solr|Should be documented at least, add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At first glance, mixing curly and square brackets looked strange 
> ([source 
> code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
> But the difference is meaningful.
> The square bracket on the range query is inclusive and the curly bracket is 
> exclusive. If we use inclusive on both sides and a document has a time stamp 
> exactly on the boundary then it could be returned in two consecutive 
> executions, and we only want it in one.
> This is intentional, and it should be as it is.
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC 
> representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates].
> If the date field string of an updated document represents a time without a 
> timezone, and NiFi runs in an environment using a timezone other than 
> UTC, GetSolr can't perform the date range query as users expect.
> Let's say NiFi is running with JST (UTC+9). A process added a document to Solr 
> at 15:00 JST, but the date field value has no timezone, so Solr indexed it 
> as 15:00 UTC. Then GetSolr performs a range query at 15:10 JST, targeting any 
> documents updated from 15:00 to 15:10 JST. GetSolr formats dates using UTC, 
> i.e. 6:00 to 6:10 UTC, so the updated document won't match the date 
> range filter.
> To avoid this, updated documents must carry a proper timezone in the date 
> field string representation.
> If one uses NiFi expression language to set the current timestamp in that 
> date field, the following NiFi expression can be used:
> {code}
> ${now():format("yyyy-MM-dd'T'HH:mm:ss.SSSZ")}
> {code}
> It will produce a result like:
> {code}
> 2016-12-27T15:30:04.895+0900
> {code}
> Then it will be indexed in Solr with UTC and will be queried by GetSolr as 
> expected.
> h2. 3. Lag comes from Near Real Time nature of Solr
> Solr provides Near Real Time search capability, that means, the recently 
> updated documents can be queried in Near Real Time, but it's not real time. 
> This latency can be controlled either on the client side, which requests the 
> update operation, by specifying the "commitWithin" parameter, or on the Solr 
> server side with "autoCommit" and "autoSoftCommit" in 
> [solrconfig.xml|https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-Commits].
> Since committing and updating the index can be costly, it's recommended to 
> set this interval as long as the maximum tolerable latency allows.
> However, this can be problematic with GetSolr. For instance, as shown in the 
> simple NiFi flow below, GetSolr can miss 

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-04 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152432#comment-16152432
 ] 

Johannes Peter commented on NIFI-3248:
--

(1) Sorting by ID ensures that each document is retrieved once, even if the 
document is updated. Sorting by \_version\_ asc ensures that each version of a 
document is retrieved once, as updated documents are "appended" at the end. I 
personally expect that someone who uses Solr as a source wants updated Solr 
documents to replace old ones in the target system. However, we could make 
this configurable. 
(2) The parameter fq provides the same query capabilities as q and can be 
used in the same way. The essential difference is that q is mainly used to 
calculate relevancy, whereas fq is mainly used to filter and to improve 
performance. In this case, we don't need relevancy as we sort by indexing time. 
Nevertheless, I see the point that users expect a property where they can 
configure the main query. 
(3) \_version\_ behaves like a timestamp, so there should be little chance 
that two documents within a collection have the same value (in a cluster). I 
know that there is a way to convert it into a timestamp, but I first have to 
figure out how to do this exactly. Sorting by "\_version\_ asc" and using 
cursor marks should make the retrieval reliable to a very high degree. 
(4) I want to emphasize again that the logic and purpose of GetSolr don't 
cover the capabilities of Solr sufficiently. There should be an additional 
processor to use Solr not only as a source, but also as a query layer. Features 
like faceting, grouping, pivot (e. g. for analytical purposes), spellchecking 
(e. g. for OCR or NLP), etc. are not covered by GetSolr (and shouldn't be 
included, as the main focus should be on reliable retrieval). 
However, there should be a more flexible option to query Solr within workflows.

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-03 Thread Koji Kawamura (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152023#comment-16152023
 ] 

Koji Kawamura commented on NIFI-3248:
-

[~jope] Thanks for the detailed proposal to improve GetSolr behavior. Having 
looked at the shared document about pagination and cursors, I agree it would 
provide a more reliable way to keep getting updated documents from Solr 
(tailing updated docs), as written in the documentation.

1. Instead of using 'Date field', use '_version_' to sort. I agree. Just to 
provide backward compatibility, we can make 'Date Field' optional and, if 'Date 
Field' is not specified, use '_version_'. Users would expect existing GetSolr 
processors to keep behaving as they do now.
2. Always iterating through result. I agree with this idea. It should provide 
the best latency.
3. Use Cursor instead of 'start' parameter. Good idea. I looked at GetSolr 
and found that it does not use the NiFi Managed State feature. Instead, it 
writes a file under the conf dir locally. We should use Managed State instead 
of a local file. Also, if we support backward compatibility, we need some logic 
to convert the existing last date to an equivalent '_version_' (probably by 
querying the stored last date and using the last document's '_version_' 
subsequently?).
4. I agree with the idea of using fq for better performance.
5. 'users should not be enabled to change the parameters sort and q'. I agree 
with not letting users modify the sort condition. But I imagine some use-cases 
are only interested in tracking updates for a particular set of documents. So 
I would expect the new implementation to still support providing query options.

I have a question regarding the '_version_' value. Is it globally unique 
within a collection among all documents? The actual value looks like a unix 
epoch in microsecond resolution. By using '_version_' as sort condition and 
cursor, can we eliminate duplicate or missing documents completely? 

{quote}
Cursor mark values are computed based on the sort values of each document in 
the result, which means multiple documents with identical sort values will 
produce identical Cursor mark values if one of them is the last document on a 
page of results. In that situation, the subsequent request using that 
cursorMark would not know which of the documents with the identical mark values 
should be skipped. Requiring that the uniqueKey field be used as a clause in 
the sort criteria guarantees that a deterministic ordering will be returned, 
and that every cursorMark value will identify a unique point in the sequence of 
documents.
{quote}
https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html#using-cursors
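
The tie-breaker point in the quoted passage can be shown with a toy model in Java. It is a deliberate simplification and not how Solr implements cursors: the `Doc` class, the field names and the paging loop are hypothetical, and the "cursor" is simply the last document of the previous page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model of cursor paging; a simplification, not Solr's implementation.
class CursorTieBreakSketch {

    static final class Doc {
        final String id;
        final long createDate;

        Doc(String id, long createDate) {
            this.id = id;
            this.createDate = createDate;
        }
    }

    // Sort on the date only: documents with equal dates are indistinguishable.
    static final Comparator<Doc> BY_DATE = Comparator.comparingLong(d -> d.createDate);

    // Sort on the date, then on the unique id as a tie breaker.
    static final Comparator<Doc> BY_DATE_THEN_ID = BY_DATE.thenComparing(d -> d.id);

    // Pages through all docs, resuming each page strictly after the
    // sort position of the last document of the previous page.
    static List<String> fetchAll(List<Doc> docs, Comparator<Doc> order, int rows) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(order);
        List<String> seen = new ArrayList<>();
        Doc cursor = null;
        while (true) {
            List<Doc> page = new ArrayList<>();
            for (Doc d : sorted) {
                if (cursor != null && order.compare(d, cursor) <= 0) {
                    continue; // at or before the cursor position
                }
                if (page.size() == rows) {
                    break;
                }
                page.add(d);
            }
            if (page.isEmpty()) {
                return seen;
            }
            for (Doc d : page) {
                seen.add(d.id);
            }
            cursor = page.get(page.size() - 1);
        }
    }
}
```

With documents a(1), b(2), c(2), d(3) and two rows per page, paging by date alone skips c, because c shares its sort value with the last document of the first page. Adding the id as a secondary sort returns all four.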

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-03 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151784#comment-16151784
 ] 

Johannes Peter commented on NIFI-3248:
--

[~ijokarumawak], [~bbende]
I examined the current GetSolr implementation and found several issues, which 
I want to discuss:
(1) Currently, a date field needs to be included in the index schema and the 
Solr documents for indexing. Although this can be realized easily via Solr's 
TimestampUpdateProcessor, it would be better simply to use Solr's \_version\_ 
field for filtering subsequent retrievals. This field is included in every 
well-configured Solr index as it is required for several functionalities. By 
doing so, this processor could also be used for indexes that were not created 
with NiFi interactions in mind. 
(2) Iterating through a result set is only done when the processor runs for 
the first time. This will be problematic if the number of newly indexed 
documents in a trigger interval exceeds the configured batch size.
(3) Successively increasing the start parameter to retrieve Solr documents in 
batches comes with two problems in this context. First, this approach performs 
poorly for large collections. Second, updating the index during the 
iteration will probably lead to duplicates or a loss of documents if the 
positions of documents change due to newly indexed documents or deletions. 
Instead of increasing the start parameter, cursor marks should be used, and the 
sorting should be fixed to an ascending order of the time when documents were 
indexed (\_version\_ field). More details on this can be found here: 
https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html
(4) Using the fq-parameter instead of the q-parameter should improve the 
performance in some cases, as Solr is able to use caches for fq. The 
q-parameter should be fixed to "*:*". 

As a consequence, I suggest redesigning the GetSolr processor so that it 
mainly focuses on retrieving documents reliably. This can be done better by 
using cursor marks and the \_version\_ field. Additionally, users should not be 
enabled to change the parameters sort and q. The full query capabilities of 
Solr could be made available by integrating an additional processor, e. g. 
"FetchSolr".
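
As a rough sketch of what such a request could look like, the following Java snippet assembles the Solr request parameters named above (q, fq, sort, rows, cursorMark). The class, the method and the example filter are hypothetical, and the uniqueKey field is added to the sort only because Solr's cursor documentation requires it; this is not the processor implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the proposed request; not the processor code.
class CursorQuerySketch {

    // Assembles parameters for one cursor-based request.
    // uniqueKey is the schema's uniqueKey field name (an assumption of this sketch).
    static Map<String, String> params(String filterQuery, String uniqueKey,
                                      int rows, String cursorMark) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", "*:*");                  // fixed main query, no relevancy needed
        p.put("fq", filterQuery);           // user-supplied restriction, cacheable by Solr
        p.put("sort", "_version_ asc," + uniqueKey + " asc");
        p.put("rows", String.valueOf(rows));
        p.put("cursorMark", cursorMark);    // "*" for the first request
        return p;
    }
}
```

The first request uses cursorMark=*; each response carries a nextCursorMark that is sent with the following request, and the iteration ends once the returned value equals the one that was sent. No start parameter is involved.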

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-08-31 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149066#comment-16149066
 ] 

Johannes Peter commented on NIFI-3248:
--

[~ijokarumawak] Sure. I will start within next week.

> GetSolr can miss recently updated documents
> ---
>
> Key: NIFI-3248
> URL: https://issues.apache.org/jira/browse/NIFI-3248
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Extensions
>Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 
> 1.0.1
>Reporter: Koji Kawamura
>Assignee: Koji Kawamura
> Attachments: nifi-flow.png, query-result-with-curly-bracket.png, 
> query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents 
> that have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once a 
> document's date field value becomes older than the last query timestamp, the 
> document can no longer be retrieved by GetSolr.
> This JIRA is for tracking the investigation of this behavior and the 
> discussion around it.
> Here are things that can be a cause of this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be helpful|
> |3|Lag comes from Near Real Time nature of Solr|Should be documented at least, add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At first glance, mixing curly and square brackets looked strange 
> ([source 
> code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
> But the difference is meaningful.
> The square bracket on the range query is inclusive and the curly bracket is 
> exclusive. If we use inclusive on both sides and a document has a time stamp 
> exactly on the boundary then it could be returned in two consecutive 
> executions, and we only want it in one.
> This is intentional, and it should be as it is.
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC 
> representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates|].
>  If the date field string value of an updated document represents a time without 
> a timezone, and NiFi is running in an environment using a timezone other than 
> UTC, GetSolr can't perform the date range query as users expect.
> Let's say NiFi is running with JST (UTC+9). A process added a document to Solr 
> at 15:00 JST, but the date field doesn't have a timezone, so Solr indexed it 
> as 15:00 UTC. Then GetSolr performs a range query at 15:10 JST, targeting any 
> documents updated from 15:00 to 15:10 JST. GetSolr formats dates using UTC, 
> i.e. 6:00 to 6:10 UTC, so the updated document won't match the date 
> range filter.
> To avoid this, updated documents must have a proper timezone in the date field 
> string representation.
> If one uses NiFi Expression Language to set the current timestamp on that date 
> field, the following expression can be used:
> {code}
> ${now():format("yyyy-MM-dd'T'HH:mm:ss.SSSZ")}
> {code}
> It will produce a result like:
> {code}
> 2016-12-27T15:30:04.895+0900
> {code}
> It will then be indexed in Solr as UTC and will be queried by GetSolr as 
> expected.
> h2. 3. Lag comes from Near Real Time nature of Solr
> Solr provides Near Real Time search capability, which means that recently 
> updated documents can be queried in near real time, but not in real time. 
> This latency can be controlled either on the client side, which requests the 
> update operation, by specifying the "commitWithin" parameter, or on the Solr 
> server side, with "autoCommit" and "autoSoftCommit" in 
> [solrconfig.xml|https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-Commits].
> Since committing and updating the index can be costly, it's recommended to set this 
> interval as long as the maximum tolerable latency allows.
> However, this can be problematic with GetSolr. For instance, as shown in the 
> simple NiFi flow below, GetSolr can miss updated documents:
> {code}
> t1: GetSolr queried
> t2: GenerateFlowFile set date = t2
> t3: PutSolrContentStream stored new doc
> t4: GetSolr queried again, from t1 to t4, but the new doc hasn't been indexed
> t5: Solr completed index
> t6: GetSolr queried again, from t4 to t6, the doc didn't match query
> {code}
> This behavior should at least be documented.
> In addition, it would be helpful to add a new configuration property to GetSolr to 
> specify a commit lag-time, so that GetSolr queries an older timestamp range 
> for documents.
> {code}
> // with commit 

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-08-30 Thread Koji Kawamura (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147138#comment-16147138
 ] 

Koji Kawamura commented on NIFI-3248:
-

[~jope] No, I haven't considered using _version_. Actually, I haven't been 
able to work on this. Sorry for the delay. Are you interested in creating a 
pull request for this by any chance?
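
As a rough illustration of the _version_ idea discussed in this exchange: the sketch below assumes that Solr builds _version_ from the indexing node's clock in milliseconds, shifted left by 20 bits, with the low bits acting as a counter. That layout is an assumption about Solr internals rather than a documented contract, and the class and method names are hypothetical.

```java
import java.time.Instant;

// Hypothetical helper, not part of NiFi or SolrJ.
class SolrVersionClock {
    // Assumption: the low 20 bits of _version_ are a counter, the rest is epoch millis.
    static final int COUNTER_BITS = 20;

    // Builds a version-like value the way the assumed layout describes it.
    static long versionFor(long epochMillis, int counter) {
        return (epochMillis << COUNTER_BITS) | counter;
    }

    // Recovers the approximate indexing time from a positive _version_ value.
    static Instant approximateIndexTime(long version) {
        return Instant.ofEpochMilli(version >>> COUNTER_BITS);
    }

    public static void main(String[] args) {
        long version = versionFor(System.currentTimeMillis(), 0);
        System.out.println(version + " -> " + approximateIndexTime(version));
    }
}
```

Under that assumption the value reflects when the update was processed by Solr, not the content of the document's date field, so it would not depend on how the client formats its timestamps.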


[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-08-29 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16145306#comment-16145306
 ] 

Johannes Peter commented on NIFI-3248:
--

Have you considered using the Solr field "_version_" yet? It can be treated 
like a timestamp, and it can also be transformed into one. E.g. sorting by 
"_version_ desc" orders documents by their time of indexing. 


[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-01-06 Thread Koji Kawamura (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806843#comment-15806843
 ] 

Koji Kawamura commented on NIFI-3248:
-

[~bbende]  I'm going to create a PR shortly. Thanks!


[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-01-06 Thread Bryan Bende (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804647#comment-15804647
 ] 

Bryan Bende commented on NIFI-3248:
---

[~ijokarumawak] I understand now, moving both 'from' and 'to' makes sense, good 
idea!


[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-01-05 Thread Koji Kawamura (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15803259#comment-15803259
 ] 

Koji Kawamura commented on NIFI-3248:
-

[~bbende] Thanks for the comment. I was thinking that the new lag-time property 
would move both the 'from' and 'to' timestamps for GetSolr. If we only move the 'from' 
timestamp, we might need the document IDs fetched at the previous iteration, 
but if we also move the 'to' timestamp, I don't think we need that. In case someone 
sets the lag greater than the actual commit lag, the documents will take longer 
to be fetched, but each will be fetched only once, because the timestamp 
range filters don't overlap.

Maybe I'm missing some other concerns. Please let me know if it's better to 
overlap the range filter rather than simply sliding it into the past by the 
specified amount of time.
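
The sliding window described above can be sketched roughly as follows. This is a minimal illustration, not the GetSolr implementation: the class name, the `commitLag` parameter, and the `last_modified` field name are hypothetical, and the filter uses an exclusive lower bound with an inclusive upper bound as one way to keep consecutive ranges from overlapping.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of a query window shifted into the past by a commit lag.
class LaggedWindow {
    final Instant from; // exclusive lower bound
    final Instant to;   // inclusive upper bound

    LaggedWindow(Instant from, Instant to) {
        this.from = from;
        this.to = to;
    }

    // The next window starts where the previous one ended and stops at now - commitLag.
    static LaggedWindow next(Instant previousTo, Instant now, Duration commitLag) {
        Instant to = now.minus(commitLag);
        if (to.isBefore(previousTo)) {
            to = previousTo; // never move backwards; this yields an empty range
        }
        return new LaggedWindow(previousTo, to);
    }

    // Curly bracket is exclusive, square bracket is inclusive.
    String toFilterQuery(String dateField) {
        return dateField + ":{" + from + " TO " + to + "]";
    }

    public static void main(String[] args) {
        Instant previousTo = Instant.parse("2017-01-05T06:00:00Z");
        Instant now = Instant.parse("2017-01-05T06:10:00Z");
        LaggedWindow w = next(previousTo, now, Duration.ofMinutes(2));
        System.out.println(w.toFilterQuery("last_modified"));
    }
}
```

Because each window begins exactly at the previous upper bound, a lag that is larger than the real commit delay only postpones when a document is picked up; it does not cause the same document to match two windows.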


[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-01-03 Thread Bryan Bende (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15795156#comment-15795156
 ] 

Bryan Bende commented on NIFI-3248:
---

[~ijokarumawak] thanks for writing this up! I like the idea of the commit 
lag-time. I'm not sure of the best name for it, but maybe something like "Time Range 
Overlap"?

I was thinking that if we add that, we might also need to keep track of all the 
document ids from the previous iteration to avoid duplicates, in case someone 
sets the overlap greater than the actual commit lag. We could compare each id 
against the last batch and only emit the document if it wasn't in the last 
batch. Thoughts?
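
The id comparison suggested above could look roughly like the sketch below. It is only an illustration under simple assumptions: document ids are plain strings, only the immediately preceding batch is remembered, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: suppress documents already seen in the previous batch.
class OverlapDeduplicator {
    private Set<String> previousBatchIds = new HashSet<>();

    // Returns only the ids that were not part of the previous batch,
    // then remembers the full current batch for the next call.
    List<String> filterNew(List<String> fetchedIds) {
        List<String> fresh = new ArrayList<>();
        for (String id : fetchedIds) {
            if (!previousBatchIds.contains(id)) {
                fresh.add(id);
            }
        }
        // Remember everything fetched, including suppressed ids, so a document
        // that keeps matching the overlapping range is emitted only once.
        previousBatchIds = new HashSet<>(fetchedIds);
        return fresh;
    }

    public static void main(String[] args) {
        OverlapDeduplicator dedup = new OverlapDeduplicator();
        System.out.println(dedup.filterNew(Arrays.asList("a", "b")));
        System.out.println(dedup.filterNew(Arrays.asList("b", "c")));
    }
}
```

One trade-off of comparing ids alone is that a document updated again between two iterations would also be suppressed, so a real implementation might need to compare the id together with a timestamp or version.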


[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2016-12-26 Thread Koji Kawamura (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779382#comment-15779382
 ] 

Koji Kawamura commented on NIFI-3248:
-

Updated JIRA description to clearly explain concerns and possible solutions for 
those if needed.

> GetSolr can miss recently updated documents
> ---
>
> Key: NIFI-3248
> URL: https://issues.apache.org/jira/browse/NIFI-3248
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Extensions
>Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 
> 1.0.1
>Reporter: Koji Kawamura
> Attachments: nifi-flow.png, query-result-with-curly-bracket.png, 
> query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents 
> that have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once a 
> document's date field value becomes older than the last query timestamp, the 
> document can no longer be retrieved by GetSolr.
> This JIRA tracks the investigation of this behavior and the related 
> discussion.
> Here are the things that can cause this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be helpful|
> |3|Lag comes from the Near Real Time nature of Solr|Should be documented at least, add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At first glance, mixing curly and square brackets looked strange 
> ([source 
> code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
>  But this difference is deliberate.
> The square bracket in the range query is inclusive and the curly bracket is 
> exclusive. If both sides were inclusive and a document had a timestamp 
> exactly on the boundary, it could be returned in two consecutive 
> executions, and we only want it in one.
> This is intentional and should stay as it is.
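
The boundary argument above can be checked with a small sketch. It is not GetSolr code: the helper names are hypothetical, and it assumes the exclusive (curly) bracket sits on the lower bound and the inclusive (square) bracket on the upper bound, i.e. a filter shaped like `{start TO end]`.

```python
from datetime import datetime, timezone

def matches_mixed(ts, start, end):
    """Emulates a range filter shaped like {start TO end] (assumed bracket order)."""
    return start < ts <= end

def matches_inclusive(ts, start, end):
    """Emulates a range filter shaped like [start TO end]."""
    return start <= ts <= end

# Three consecutive query times, so two adjacent query windows.
t0 = datetime(2016, 12, 27, 6, 0, tzinfo=timezone.utc)
t1 = datetime(2016, 12, 27, 6, 10, tzinfo=timezone.utc)
t2 = datetime(2016, 12, 27, 6, 20, tzinfo=timezone.utc)

# A document stamped exactly on the boundary between the two windows.
doc_ts = t1

mixed_hits = [matches_mixed(doc_ts, t0, t1), matches_mixed(doc_ts, t1, t2)]
inclusive_hits = [matches_inclusive(doc_ts, t0, t1), matches_inclusive(doc_ts, t1, t2)]

print(mixed_hits.count(True))      # 1: returned by exactly one execution
print(inclusive_hits.count(True))  # 2: returned by both executions
```

With the mixed brackets, adjacent windows neither overlap nor leave a gap, so a document on the boundary is returned exactly once.
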
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC 
> representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates].
> If the date field string of an updated document represents a time without a 
> timezone, and NiFi is running in an environment whose timezone is not UTC, 
> GetSolr can't perform the date range query as users expect.
> Let's say NiFi is running in JST (UTC+9). A process adds a document to Solr 
> at 15:00 JST, but the date field has no timezone, so Solr indexes it as 
> 15:00 UTC. GetSolr then performs a range query at 15:10 JST, targeting any 
> documents updated from 15:00 to 15:10 JST. GetSolr formats dates in UTC, 
> i.e. 6:00 to 6:10 UTC, so the updated document won't match the date range 
> filter.
> To avoid this, updated documents must carry a proper timezone in the date 
> field's string representation.
> If one uses NiFi expression language to set the current timestamp on that 
> date field, the following NiFi expression can be used:
> {code}
> ${now():format("yyyy-MM-dd'T'HH:mm:ss.SSSZ")}
> {code}
> It will produce a result like:
> {code}
> 2016-12-27T15:30:04.895+0900
> {code}
> It will then be indexed in Solr as UTC and returned by GetSolr as 
> expected.
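
The timezone mismatch can be reproduced outside Solr with a rough sketch. It only models the behavior described above (a value without an offset ends up interpreted as UTC) and does not call Solr or NiFi. `to_utc` and `in_window` are hypothetical helpers, and the document time of 15:05 JST is an assumption chosen to sit strictly inside the 15:00 to 15:10 JST query window.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))

def to_utc(value):
    """Parse a timestamp string; a value without an offset is taken as UTC."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f%z")
    except ValueError:
        # No offset in the string: model it as being indexed as UTC.
        parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f")
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

# Query window 15:00 to 15:10 JST, which is 06:00 to 06:10 UTC.
window_start = datetime(2016, 12, 27, 15, 0, tzinfo=JST)
window_end = datetime(2016, 12, 27, 15, 10, tzinfo=JST)

def in_window(ts):
    """Exclusive lower bound, inclusive upper bound (assumed bracket order)."""
    return window_start < ts <= window_end

without_offset = in_window(to_utc("2016-12-27T15:05:00.000"))
with_offset = in_window(to_utc("2016-12-27T15:05:00.000+0900"))

print(without_offset)  # False: treated as 15:05 UTC, nine hours after the window
print(with_offset)     # True: 15:05 JST is 06:05 UTC, inside the window
```
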
> h2. 3. Lag comes from the Near Real Time nature of Solr
> Solr provides Near Real Time search, which means recently updated documents 
> become searchable shortly after an update, but not immediately. 
> This latency can be controlled either on the client side, by specifying the 
> "commitWithin" parameter on the update request, or on the Solr server side, 
> with "autoCommit" and "autoSoftCommit" in 
> [solrconfig.xml|https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-Commits].
> Since committing and updating the index can be costly, it's recommended to 
> set this interval as long as the maximum tolerable latency allows.
> However, this can be problematic with GetSolr. For instance, as shown in the 
> simple NiFi flow below, GetSolr can miss updated documents:
> {code}
> t1: GetSolr queried
> t2: GenerateFlowFile set date = t2
> t3: PutSolrContentStream stored new doc
> t4: GetSolr queried again, from t1 to t4, but the new doc hasn't been indexed
> t5: Solr completed index
> t6: GetSolr queried again, from t4 to t6, the doc didn't match query
> {code}
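
The timeline above can be replayed with a toy model. This is only a sketch under simplifying assumptions: times are the integers 1 to 6 standing for t1 to t6, a document becomes searchable at a single `visible_at` instant, and each query covers the range from the previous query time (exclusive) to the current one (inclusive). `run_queries` is a hypothetical helper, not part of GetSolr.

```python
def run_queries(query_times, doc_date, visible_at):
    """Return the query times at which the document is returned."""
    hits = []
    last = None  # no lower bound on the first query
    for now in query_times:
        searchable = visible_at <= now
        in_range = (last is None or last < doc_date) and doc_date <= now
        if searchable and in_range:
            hits.append(now)
        last = now  # the next query starts where this one ended
    return hits

t1, t2, t3, t4, t5, t6 = range(1, 7)

# Index completes at t5, after the t4 query has already moved the window on.
missed = run_queries([t1, t4, t6], doc_date=t2, visible_at=t5)

# For contrast: if the index had completed at t3, the t4 query would find it.
found = run_queries([t1, t4, t6], doc_date=t2, visible_at=t3)

print(missed)  # []
print(found)   # [4]
```

The document is lost because its date (t2) already falls in a range that was queried before the document became searchable, and no later range covers it again.
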
> This behavior should at least be documented.
> In addition, it would be helpful to add a new configuration property to 
> GetSolr that specifies a commit lag-time, so that GetSolr shifts its query 
> range back to older timestamps.
> {code}
> // with