[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents
[ https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152023#comment-16152023 ]

Koji Kawamura commented on NIFI-3248:
-------------------------------------

[~jope] Thanks for the detailed proposal to improve GetSolr behavior. Having read the shared document about pagination and cursors, I agree it would give more reliable results when GetSolr keeps fetching updated documents from Solr (tailing updated docs), as written in the documentation.

1. Instead of using 'Date field', use '_version_' to sort.
I agree. To provide backward compatibility, we can make 'Date Field' optional and, if 'Date Field' is not specified, use '_version_'. Users would expect the existing GetSolr processor to behave as it does today.

2. Always iterating through result.
I agree with this idea. It should provide the best latency.

3. Use Cursor instead of 'start' parameter.
Good idea. I looked at GetSolr and found that it does not use the NiFi Managed State feature. Instead, it writes a file locally under the conf dir. We should use Managed State instead of a local file. Also, if we support backward compatibility, we need some logic to convert the existing last date to an equivalent '_version_' (probably by querying with the stored last date and using the '_version_' of the last document from then on?).

4. I agree with the idea of using fq for better performance.

5. 'users should not be enabled to change the parameters sort and q'.
I agree with not letting users modify the sort condition. But I imagine some use cases are only interested in tracking updates for a particular set of documents, so I would expect the new implementation to still support providing query options.

I have a question regarding the '_version_' value. Is it globally unique within a collection among all documents? The actual value looks like a Unix epoch at microsecond resolution. By using '_version_' as the sort condition and cursor, can we eliminate duplicated or missing documents completely?
{quote}
Cursor mark values are computed based on the sort values of each document in the result, which means multiple documents with identical sort values will produce identical Cursor mark values if one of them is the last document on a page of results. In that situation, the subsequent request using that cursorMark would not know which of the documents with the identical mark values should be skipped. Requiring that the uniqueKey field be used as a clause in the sort criteria guarantees that a deterministic ordering will be returned, and that every cursorMark value will identify a unique point in the sequence of documents.
{quote}
https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html#using-cursors

> GetSolr can miss recently updated documents
> -------------------------------------------
>
>                 Key: NIFI-3248
>                 URL: https://issues.apache.org/jira/browse/NIFI-3248
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 1.0.1
>            Reporter: Koji Kawamura
>            Assignee: Johannes Peter
>         Attachments: nifi-flow.png, query-result-with-curly-bracket.png,
>                      query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents
> that have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once a
> document's date field value becomes older than the last query timestamp, the
> document can no longer be queried by GetSolr.
> This JIRA is for tracking the investigation of this behavior and the
> discussion of it.
> Here are things that can be a cause of this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be helpful|
> |3|Lag comes from NearRealTime nature of Solr|Should be documented at least, add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At first glance, mixing curly and square brackets looked strange
> ([source code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
> But this difference has a meaning.
> The square bracket on the range query is inclusive and the curly bracket is
> exclusive. If we use inclusive on both sides and a document has a timestamp
> exactly on the boundary, then it could be returned in two consecutive
> executions, and we only want it in one.
> This is intentional, and it should stay as it is.
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC
> representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates|].
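
The bracket semantics described in section 1 of the quoted issue can be illustrated with a small sketch. It is not taken from GetSolr: the class and helper names are hypothetical, plain `java.time.Instant` values stand in for Solr date fields, and the choice of an exclusive lower bound with an inclusive upper bound is an assumption (the issue text only says the two bracket types are mixed). The sketch counts how many of two consecutive query windows match a document stamped exactly on their shared boundary.

```java
import java.time.Instant;

public class RangeWindowSketch {
    // Hypothetical helper: true if ts is in {from TO to], i.e. from < ts <= to.
    static boolean inHalfOpen(Instant ts, Instant from, Instant to) {
        return ts.isAfter(from) && !ts.isAfter(to);
    }

    // Hypothetical helper: true if ts is in [from TO to], i.e. from <= ts <= to.
    static boolean inClosed(Instant ts, Instant from, Instant to) {
        return !ts.isBefore(from) && !ts.isAfter(to);
    }

    // Counts how many of the two consecutive windows (t0..t1 and t1..t2) match ts.
    static int matches(Instant ts, Instant t0, Instant t1, Instant t2, boolean halfOpen) {
        int n = 0;
        if (halfOpen ? inHalfOpen(ts, t0, t1) : inClosed(ts, t0, t1)) n++;
        if (halfOpen ? inHalfOpen(ts, t1, t2) : inClosed(ts, t1, t2)) n++;
        return n;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2017-09-03T12:00:00Z");
        Instant t1 = Instant.parse("2017-09-03T12:01:00Z");
        Instant t2 = Instant.parse("2017-09-03T12:02:00Z");
        // A document stamped exactly on the shared boundary t1.
        System.out.println("mixed brackets:  " + matches(t1, t0, t1, t2, true));
        System.out.println("square brackets: " + matches(t1, t0, t1, t2, false));
    }
}
```

With mixed brackets the boundary document is matched by exactly one window; with square brackets on both sides it is matched by both, which is the duplicate the issue text warns about.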
[jira] [Created] (NIFI-4348) isExpressionLanguagePresent throws NPE when attribute is null
Kay-Uwe Moosheimer created NIFI-4348:
-------------------------------------

             Summary: isExpressionLanguagePresent throws NPE when attribute is null
                 Key: NIFI-4348
                 URL: https://issues.apache.org/jira/browse/NIFI-4348
             Project: Apache NiFi
          Issue Type: Improvement
          Components: Core Framework
    Affects Versions: 1.3.0
            Reporter: Kay-Uwe Moosheimer
            Priority: Trivial


The following code throws an NPE:

PropertyValue property = context.getProperty(SOME_PROPERTY);
if (property.isExpressionLanguagePresent()) {

when the property is not set (NULL).

So I have to write

PropertyValue property = context.getProperty(SOME_PROPERTY);
if (property.isSet() && property.isExpressionLanguagePresent()) {

It would be great if the method isExpressionLanguagePresent() checked for NULL and then returned false.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
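
The behavior requested above can be sketched as follows. `SketchPropertyValue` is a hypothetical stand-in and not NiFi's `PropertyValue` implementation; only the method names `isSet()` and `isExpressionLanguagePresent()` come from the report, and the `${...}` detection is a deliberately simplistic assumption rather than NiFi's real Expression Language parsing.

```java
public class NullSafeCheckSketch {
    // Hypothetical stand-in for a property value; not NiFi's actual class.
    static final class SketchPropertyValue {
        private final String rawValue; // null when the property is not set

        SketchPropertyValue(String rawValue) {
            this.rawValue = rawValue;
        }

        boolean isSet() {
            return rawValue != null;
        }

        // Null-safe variant: an unset property simply contains no expression.
        boolean isExpressionLanguagePresent() {
            if (rawValue == null) {
                return false;
            }
            // Simplistic assumption: "${" followed later by "}" marks an expression.
            int start = rawValue.indexOf("${");
            return start >= 0 && rawValue.indexOf('}', start + 2) > start;
        }
    }

    public static void main(String[] args) {
        // No isSet() guard is needed at the call site any more.
        System.out.println(new SketchPropertyValue(null).isExpressionLanguagePresent());
        System.out.println(new SketchPropertyValue("${filename}").isExpressionLanguagePresent());
        System.out.println(new SketchPropertyValue("plain text").isExpressionLanguagePresent());
    }
}
```

Putting the null check inside the method keeps every caller from having to repeat the `isSet() &&` guard shown in the report.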
[jira] [Commented] (NIFI-4347) Extend documentation with double-click shortcuts info
[ https://issues.apache.org/jira/browse/NIFI-4347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151788#comment-16151788 ]

ASF GitHub Bot commented on NIFI-4347:
--------------------------------------

GitHub user yuri1969 opened a pull request:

https://github.com/apache/nifi/pull/2126

NIFI-4347 - Extend documentation with...

...double-click shortcuts info

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- [x] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [x] Has your PR been rebased against the latest commit within the target branch (typically master)?
- [x] Is your initial contribution a single, squashed commit?

### For code changes:
- [ ] Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
- [ ] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
- [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
- [ ] If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

### For documentation related changes:
- [x] Have you ensured that format looks appropriate for the output in which it is rendered?

### Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yuri1969/nifi NIFI-4347

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/2126.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2126

commit 01d5a4180e01256800e8dfd6fc38ab8b5e952637
Author: yuri1969 <1969yuri1...@gmail.com>
Date:   2017-09-03T12:25:39Z

NIFI-4347 - Extend documentation with...

...double-click shortcuts info

> Extend documentation with double-click shortcuts info
> -----------------------------------------------------
>
>                 Key: NIFI-4347
>                 URL: https://issues.apache.org/jira/browse/NIFI-4347
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Documentation & Website
>            Reporter: Yuri
>            Priority: Trivial
>
> Recent additions of double-click triggered shortcuts should be included in
> the documentation.
[jira] [Created] (NIFI-4347) Extend documentation with double-click shortcuts info
Yuri created NIFI-4347:
-----------------------

             Summary: Extend documentation with double-click shortcuts info
                 Key: NIFI-4347
                 URL: https://issues.apache.org/jira/browse/NIFI-4347
             Project: Apache NiFi
          Issue Type: Improvement
          Components: Documentation & Website
            Reporter: Yuri
            Priority: Trivial


Recent additions of double-click triggered shortcuts should be included in the documentation.
[jira] [Comment Edited] (NIFI-3248) GetSolr can miss recently updated documents
[ https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151784#comment-16151784 ]

Johannes Peter edited comment on NIFI-3248 at 9/3/17 12:17 PM:
---------------------------------------------------------------

[~ijokarumawak], [~bbende]

I examined the current GetSolr implementation and found several issues that I want to discuss:

(1) Currently, a date field needs to be included in the index schema and in the Solr documents for indexing. Although this can be realized easily via Solr's TimestampUpdateProcessor, it would be better simply to use Solr's \_version\_ field for filtering subsequent retrievals. This field is included in every well-configured Solr index, as it is required for several functionalities. That way, this processor could also be used for indexes that were not created with NiFi interactions in mind.

(2) The processor iterates through a result set only when it runs for the first time. This becomes a problem if the number of newly indexed documents in a trigger interval exceeds the configured batch size.

(3) Successively increasing the start parameter to retrieve Solr documents in batches has two problems in this context. First, it performs poorly for large collections. Second, updating the index during the iteration will probably lead to duplicates or lost documents when the positions of documents change due to newly indexed documents or deletions. Instead of increasing the start parameter, cursor marks should be used, and the sorting should be fixed to ascending order of the time when documents were indexed (\_version\_ field). More details on this can be found here:
https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html

(4) Using the fq parameter instead of the q parameter should improve performance in some cases, as Solr is able to use caches for fq. The q parameter should be fixed to "\*:\*".

As a consequence, I suggest redesigning the GetSolr processor so that it focuses mainly on retrieving documents reliably. This can be done better by using cursor marks and the \_version\_ field. Additionally, users should not be enabled to change the parameters sort and q. The full query capabilities of Solr could be made available by adding an additional processor, e.g. "FetchSolr".

> GetSolr can miss recently updated documents
> -------------------------------------------
>
>                 Key: NIFI-3248
>                 URL: https://issues.apache.org/jira/browse/NIFI-3248
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 1.0.1
>Reporte
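
Point (3) of the comment above can be illustrated with a small in-memory sketch. It does not talk to Solr and is not the proposed GetSolr implementation: the `Doc` class and the paging helpers are hypothetical, a `long` stands in for `_version_`, and a string id stands in for the uniqueKey tie-breaker that the Solr documentation requires in the sort. In real Solr the cursor is an opaque `cursorMark` token rather than the last document itself. The sketch deletes an already-fetched document between two page requests and compares what each paging strategy returns next.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class CursorPagingSketch {
    // Hypothetical in-memory document: only the two sort keys are modeled.
    static final class Doc {
        final long version; // stands in for Solr's _version_
        final String id;    // stands in for the uniqueKey field

        Doc(long version, String id) {
            this.version = version;
            this.id = id;
        }
    }

    // Sort by version ascending, with the unique id as a tie-breaker.
    static final Comparator<Doc> ORDER =
            Comparator.comparingLong((Doc d) -> d.version).thenComparing(d -> d.id);

    // Offset paging: skip the first `start` documents of the current index state.
    static List<Doc> pageByOffset(List<Doc> index, int start, int rows) {
        return index.stream().sorted(ORDER).skip(start).limit(rows)
                .collect(Collectors.toList());
    }

    // Cursor paging: return documents that sort strictly after `last` (null = from the start).
    static List<Doc> pageAfter(List<Doc> index, Doc last, int rows) {
        return index.stream().sorted(ORDER)
                .filter(d -> last == null || ORDER.compare(d, last) > 0)
                .limit(rows)
                .collect(Collectors.toList());
    }

    static List<String> ids(List<Doc> docs) {
        return docs.stream().map(d -> d.id).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Doc> index = new ArrayList<>();
        for (int i = 1; i <= 6; i++) {
            index.add(new Doc(i, "doc" + i));
        }

        // Both strategies read the same first page: doc1, doc2.
        List<Doc> first = pageAfter(index, null, 2);
        Doc last = first.get(first.size() - 1);

        // The index changes between requests: doc1 is deleted.
        index.remove(0);

        // Offset paging with start=2 now skips doc2 and doc3, so doc3 is never returned.
        System.out.println("offset: " + ids(pageByOffset(index, 2, 2)));
        // Cursor paging resumes right after doc2.
        System.out.println("cursor: " + ids(pageAfter(index, last, 2)));
    }
}
```

Because the cursor is defined by sort values rather than by position, deletions or insertions before the cursor do not shift what the next request returns.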
[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents
[ https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151784#comment-16151784 ]

Johannes Peter commented on NIFI-3248:
--------------------------------------

[~ijokarumawak], [~bbende]

I examined the current GetSolr implementation and found several issues that I want to discuss:

(1) Currently, a date field needs to be included in the index schema and in the Solr documents for indexing. Although this can be realized easily via Solr's TimestampUpdateProcessor, it would be better simply to use Solr's \_version\_ field for filtering subsequent retrievals. This field is included in every well-configured Solr index, as it is required for several functionalities. That way, this processor could also be used for indexes that were not created with NiFi interactions in mind.

(2) The processor iterates through a result set only when it runs for the first time. This becomes a problem if the number of newly indexed documents in a trigger interval exceeds the configured batch size.

(3) Successively increasing the start parameter to retrieve Solr documents in batches has two problems in this context. First, it performs poorly for large collections. Second, updating the index during the iteration will probably lead to duplicates or lost documents when the positions of documents change due to newly indexed documents or deletions. Instead of increasing the start parameter, cursor marks should be used, and the sorting should be fixed to ascending order of the time when documents were indexed (\_version\_ field). More details on this can be found here:
https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html

(4) Using the fq parameter instead of the q parameter should improve performance in some cases, as Solr is able to use caches for fq. The q parameter should be fixed to "*:*".

As a consequence, I suggest redesigning the GetSolr processor so that it focuses mainly on retrieving documents reliably. This can be done better by using cursor marks and the \_version\_ field. Additionally, users should not be enabled to change the parameters sort and q. The full query capabilities of Solr could be made available by adding an additional processor, e.g. "FetchSolr".

> GetSolr can miss recently updated documents
> -------------------------------------------
>
>                 Key: NIFI-3248
>                 URL: https://issues.apache.org/jira/browse/NIFI-3248
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 1.0.1
>            Reporter: Koji Kawamura
>            Assignee: Johannes Peter
>         Attachments: nifi-flow.png, query-result-with-curly-bracket.png,
>                      query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents
> that have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once a
> document's date field value becomes older than the last query timestamp, the
> document can no longer be queried by GetSolr.
> This JIRA is for tracking the investigation of this behavior and the
> discussion of it.
> Here are things that can be a cause of this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be helpful|
> |3|Lag comes from NearRealTime nature of Solr|Should be documented at least, add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At first glance, mixing curly and square brackets looked strange
> ([source code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
> But this difference has a meaning.
> The square bracket on the range query is inclusive and the curly bracket is
> exclusive. If we use inclusive on both sides and a document has a timestamp
> exactly on the boundary, then it could be returned in two consecutive
> executions, and we only want it in one.
> This is intentional, and it should stay as it is.
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC
> representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates|].
> If the date field String value of an updated document represents a time
> without a timezone, and NiFi is running in an environment using a timezone
> other than UTC, GetSolr can't perform the date range query as users expect.
> Let's say NiFi is running with JST (UTC+9). A process added a document to Solr
> at 15:00 JST. But the date field doesn't have a timezone. So, Solr index