[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-03 Thread Koji Kawamura (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152023#comment-16152023
 ] 

Koji Kawamura commented on NIFI-3248:
-

[~jope] Thanks for the detailed proposal to improve GetSolr behavior. Having 
read the shared documentation about pagination and cursors, I agree that 
cursors would provide a more reliable way to keep fetching updated documents 
from Solr (tailing updated docs), as that documentation describes.

1. Instead of using 'Date Field', use '_version_' to sort. I agree. To 
preserve backward compatibility, we can make 'Date Field' optional and use 
'_version_' only when 'Date Field' is not specified. Users would expect the 
existing GetSolr processor to keep behaving as it does today.
2. Always iterating through the result. I agree with this idea. It should 
provide the best latency.
3. Use a cursor instead of the 'start' parameter. Good idea. I looked at 
GetSolr and found that it does not use the NiFi Managed State feature. 
Instead, it writes a file under the conf dir locally. We should use Managed 
State instead of a local file. Also, if we support backward compatibility, we 
need some logic to convert the existing last date to an equivalent 
'_version_' (probably by querying with the stored last date and using the 
'_version_' of the last document from then on?).
4. I agree with the idea of using fq for better performance.
5. 'users should not be enabled to change the parameters sort and q'. I agree 
with not letting users modify the sort condition. But I imagine some use cases 
need to track updates for only a particular set of documents. So I would 
expect the new implementation to still support providing query options.

I have a question regarding the '_version_' value. Is it globally unique 
within a collection among all documents? The actual value looks like a Unix 
epoch in microsecond resolution. By using '_version_' as the sort condition 
and cursor, can we completely eliminate duplicated or missing documents? 

{quote}
Cursor mark values are computed based on the sort values of each document in 
the result, which means multiple documents with identical sort values will 
produce identical Cursor mark values if one of them is the last document on a 
page of results. In that situation, the subsequent request using that 
cursorMark would not know which of the documents with the identical mark values 
should be skipped. Requiring that the uniqueKey field be used as a clause in 
the sort criteria guarantees that a deterministic ordering will be returned, 
and that every cursorMark value will identify a unique point in the sequence of 
documents.
{quote}
https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html#using-cursors
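
To illustrate the quoted point, here is a minimal in-memory sketch in plain 
Java. It is not SolrJ and not the GetSolr implementation; the Doc class, the 
method names and the sample data are all hypothetical. It assumes two 
documents can share a sort value and compares a cursor built from the version 
alone with a cursor built from the version plus a unique id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CursorTieBreakSketch {

    /** Hypothetical stand-in for a Solr document: only the two sort fields matter here. */
    static final class Doc {
        final long version;
        final String id;

        Doc(long version, String id) {
            this.version = version;
            this.id = id;
        }
    }

    /** Total order: version first, then the unique id as tie-breaker. */
    static final Comparator<Doc> BY_VERSION_THEN_ID = new Comparator<Doc>() {
        @Override
        public int compare(Doc a, Doc b) {
            int byVersion = Long.compare(a.version, b.version);
            return byVersion != 0 ? byVersion : a.id.compareTo(b.id);
        }
    };

    /** Sample index (an assumption) in which two documents share the same version value. */
    static List<Doc> sampleIndex() {
        return Arrays.asList(new Doc(1, "a"), new Doc(2, "b"), new Doc(2, "c"), new Doc(3, "d"));
    }

    /** Cursor holds only the last version seen; the next page starts strictly after it. */
    static List<String> pageByVersionOnly(List<Doc> index, int rows) {
        return page(index, rows, false);
    }

    /** Cursor holds the full (version, id) sort key of the last document seen. */
    static List<String> pageByVersionAndId(List<Doc> index, int rows) {
        return page(index, rows, true);
    }

    private static List<String> page(List<Doc> index, int rows, boolean useIdTieBreak) {
        List<Doc> sorted = new ArrayList<Doc>(index);
        sorted.sort(BY_VERSION_THEN_ID);
        List<String> seen = new ArrayList<String>();
        Doc cursor = null; // null means "start of the result"
        while (true) {
            List<Doc> page = new ArrayList<Doc>();
            for (Doc d : sorted) {
                boolean afterCursor = cursor == null
                        || (useIdTieBreak
                                ? BY_VERSION_THEN_ID.compare(d, cursor) > 0
                                : d.version > cursor.version);
                if (afterCursor && page.size() < rows) {
                    page.add(d);
                }
            }
            if (page.isEmpty()) {
                return seen;
            }
            for (Doc d : page) {
                seen.add(d.id);
            }
            cursor = page.get(page.size() - 1);
        }
    }

    public static void main(String[] args) {
        System.out.println(pageByVersionOnly(sampleIndex(), 2));  // [a, b, d]
        System.out.println(pageByVersionAndId(sampleIndex(), 2)); // [a, b, c, d]
    }
}
```

With a page size of 2, the version-only cursor skips document "c" because it 
shares its version with the last document of the first page. Whether 
'_version_' values can actually collide within a collection is the open 
question above; the sketch only shows why a unique tie-breaker removes the 
ambiguity if they can.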

> GetSolr can miss recently updated documents
> ---
>
> Key: NIFI-3248
> URL: https://issues.apache.org/jira/browse/NIFI-3248
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Extensions
>Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 
> 1.0.1
>Reporter: Koji Kawamura
>Assignee: Johannes Peter
> Attachments: nifi-flow.png, query-result-with-curly-bracket.png, 
> query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents 
> that have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once a 
> document's date field value becomes older than the last query timestamp, the 
> document can no longer be retrieved by GetSolr.
> This JIRA is for tracking the investigation of this behavior and the 
> discussion around it.
> Here are things that can cause this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be helpful|
> |3|Lag comes from Near Real Time nature of Solr|Should be documented at least, add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At first glance, mixing curly and square brackets looked strange 
> ([source 
> code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
>  But this difference has a meaning.
> The square bracket in the range query is inclusive and the curly bracket is 
> exclusive. If we used inclusive on both sides and a document had a timestamp 
> exactly on the boundary, it could be returned in two consecutive 
> executions, and we only want it in one.
> This is intentional, and it should stay as it is.
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC 
> representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates].

[jira] [Created] (NIFI-4348) isExpressionLanguagePresent throws NPE when attribute is null

2017-09-03 Thread Kay-Uwe Moosheimer (JIRA)
Kay-Uwe Moosheimer created NIFI-4348:


 Summary: isExpressionLanguagePresent throws NPE when attribute is 
null
 Key: NIFI-4348
 URL: https://issues.apache.org/jira/browse/NIFI-4348
 Project: Apache NiFi
  Issue Type: Improvement
  Components: Core Framework
Affects Versions: 1.3.0
Reporter: Kay-Uwe Moosheimer
Priority: Trivial


The following code throws an NPE:

PropertyValue property = context.getProperty(SOME_PROPERTY);
if (property.isExpressionLanguagePresent()) {
    // ...
}

when the property is not set (null).
So I have to write

PropertyValue property = context.getProperty(SOME_PROPERTY);
if (property.isSet() && property.isExpressionLanguagePresent()) {
    // ...
}

It would be great if the method isExpressionLanguagePresent() checked for null 
and returned false in that case.
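
Until such a check exists, the workaround can be wrapped in a small helper. 
The following is only a sketch: PropertyLike is a hypothetical stand-in that 
mirrors the two method names used above so the example runs without NiFi on 
the classpath, and the "${" test is a simplification, not how NiFi detects 
Expression Language.

```java
public class ExpressionLanguageGuardSketch {

    /** Hypothetical stand-in exposing only the two methods named in the report. */
    interface PropertyLike {
        boolean isSet();
        boolean isExpressionLanguagePresent();
    }

    /** Returns false for an unset property instead of letting the second call throw. */
    static boolean hasExpressionLanguage(PropertyLike property) {
        return property != null && property.isSet() && property.isExpressionLanguagePresent();
    }

    /** Builds a stand-in whose EL check throws an NPE when the raw value is null. */
    static PropertyLike of(final String rawValue) {
        return new PropertyLike() {
            @Override
            public boolean isSet() {
                return rawValue != null;
            }

            @Override
            public boolean isExpressionLanguagePresent() {
                // Simplified detection; dereferences rawValue, so null triggers an NPE.
                return rawValue.contains("${");
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(hasExpressionLanguage(of(null)));          // false
        System.out.println(hasExpressionLanguage(of("${filename}"))); // true
        System.out.println(hasExpressionLanguage(of("plain text")));  // false
    }
}
```

The short-circuit order matters: isSet() is evaluated first, so the call that 
would throw is never reached for an unset property.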




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NIFI-4347) Extend documentation with double-click shortcuts info

2017-09-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151788#comment-16151788
 ] 

ASF GitHub Bot commented on NIFI-4347:
--

GitHub user yuri1969 opened a pull request:

https://github.com/apache/nifi/pull/2126

NIFI-4347 - Extend documentation with...

...double-click shortcuts info

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced 
 in the commit message?

- [x] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number 
you are trying to resolve? Pay particular attention to the hyphen "-" character.

- [x] Has your PR been rebased against the latest commit within the target 
branch (typically master)?

- [x] Is your initial contribution a single, squashed commit?

### For code changes:
- [ ] Have you ensured that the full suite of tests is executed via mvn 
-Pcontrib-check clean install at the root nifi folder?
- [ ] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
- [ ] If applicable, have you updated the LICENSE file, including the main 
LICENSE file under nifi-assembly?
- [ ] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found under nifi-assembly?
- [ ] If adding new Properties, have you added .displayName in addition to 
.name (programmatic access) for each of the new properties?

### For documentation related changes:
- [x] Have you ensured that format looks appropriate for the output in 
which it is rendered?

### Note:
Please ensure that once the PR is submitted, you check travis-ci for build 
issues and submit an update to your PR as soon as possible.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yuri1969/nifi NIFI-4347

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/2126.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2126


commit 01d5a4180e01256800e8dfd6fc38ab8b5e952637
Author: yuri1969 <1969yuri1...@gmail.com>
Date:   2017-09-03T12:25:39Z

NIFI-4347 - Extend documentation with...

...double-click shortcuts info




> Extend documentation with double-click shortcuts info
> -
>
> Key: NIFI-4347
> URL: https://issues.apache.org/jira/browse/NIFI-4347
> Project: Apache NiFi
>  Issue Type: Improvement
>  Components: Documentation & Website
>Reporter: Yuri
>Priority: Trivial
>
> Recent additions of double-click triggered shortcuts should be included in 
> the documentation.





[GitHub] nifi pull request #2126: NIFI-4347 - Extend documentation with...

2017-09-03 Thread yuri1969
GitHub user yuri1969 opened a pull request:

https://github.com/apache/nifi/pull/2126

NIFI-4347 - Extend documentation with...

...double-click shortcuts info

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (NIFI-4347) Extend documentation with double-click shortcuts info

2017-09-03 Thread Yuri (JIRA)
Yuri created NIFI-4347:
--

 Summary: Extend documentation with double-click shortcuts info
 Key: NIFI-4347
 URL: https://issues.apache.org/jira/browse/NIFI-4347
 Project: Apache NiFi
  Issue Type: Improvement
  Components: Documentation & Website
Reporter: Yuri
Priority: Trivial


Recent additions of double-click triggered shortcuts should be included in the 
documentation.





[jira] [Comment Edited] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-03 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151784#comment-16151784
 ] 

Johannes Peter edited comment on NIFI-3248 at 9/3/17 12:17 PM:
---

[~ijokarumawak], [~bbende]
I examined the current GetSolr implementation and found several issues that 
I want to discuss:
(1) Currently, a date field needs to be included in the index schema and in the 
Solr documents for indexing. Although this can be realized easily via Solr's 
TimestampUpdateProcessor, it would be better simply to use Solr's \_version\_ 
field for filtering subsequent retrievals. This field is included in every 
well-configured Solr index, as it is required for several functionalities. By 
doing so, this processor could also be used for indexes that were not created 
with NiFi interactions in mind. 
(2) Iterating through a result set is only done when the processor runs for 
the first time. This will be problematic if the number of newly indexed 
documents in a trigger interval exceeds the configured batch size.
(3) Successively increasing the start parameter to retrieve Solr documents in 
batches has two problems in this context. First, this approach performs 
poorly for large collections. Second, updating the index during the 
iteration will probably lead to duplicates or a loss of documents when the 
positions of documents change due to newly indexed documents or deletions. 
Instead of increasing the start parameter, cursor marks should be used, and the 
sorting should be fixed to ascending order of the time when documents were 
indexed (\_version\_ field). More details on this can be found here: 
https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html
(4) Using the fq parameter instead of the q parameter should improve 
performance in some cases, as Solr is able to use caches for fq. The 
q parameter should be fixed to "\*:\*". 
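
The second problem in (3) can be reproduced with a small in-memory simulation. 
This is only a sketch in plain Java, not SolrJ code: the index is a list of 
unique, ascending version values, and a single deletion happens after the 
first page has been read. The class name, method names and sample data are 
assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffsetVersusCursorSketch {

    /** Sample index (an assumption): unique version values in ascending order. */
    static List<Long> sample() {
        return Arrays.asList(10L, 20L, 30L, 40L, 50L);
    }

    /** Pages by a growing start offset; 'victim' is deleted after the first page. */
    static List<Long> readWithStartOffset(List<Long> versions, int rows, Long victim) {
        List<Long> index = new ArrayList<Long>(versions);
        List<Long> seen = new ArrayList<Long>();
        int start = 0;
        boolean deleted = false;
        while (start < index.size()) {
            int end = Math.min(start + rows, index.size());
            seen.addAll(index.subList(start, end));
            start = end;
            if (!deleted) {
                index.remove((Object) victim); // positions of later documents shift
                deleted = true;
            }
        }
        return seen;
    }

    /** Pages by "everything after the last value seen"; same deletion as above. */
    static List<Long> readWithCursor(List<Long> versions, int rows, Long victim) {
        List<Long> index = new ArrayList<Long>(versions);
        List<Long> seen = new ArrayList<Long>();
        Long cursor = null; // null means "start of the result"
        boolean deleted = false;
        while (true) {
            List<Long> page = new ArrayList<Long>();
            for (Long v : index) {
                if (page.size() < rows && (cursor == null || v > cursor)) {
                    page.add(v);
                }
            }
            if (page.isEmpty()) {
                return seen;
            }
            seen.addAll(page);
            cursor = page.get(page.size() - 1);
            if (!deleted) {
                index.remove((Object) victim);
                deleted = true;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(readWithStartOffset(sample(), 2, 10L)); // [10, 20, 40, 50]
        System.out.println(readWithCursor(sample(), 2, 10L));      // [10, 20, 30, 40, 50]
    }
}
```

With offset paging, deleting the first document shifts every later document 
one position down, so the document with version 30 is never returned. The 
cursor variant does not depend on positions and still returns it.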

As a consequence, I suggest redesigning the GetSolr processor so that it 
mainly focuses on retrieving documents reliably. This can be done better by 
using cursor marks and the \_version\_ field. Additionally, users should not be 
enabled to change the parameters sort and q. The full query capabilities of 
Solr could be made available by adding a separate processor, e.g. 
"FetchSolr".


> GetSolr can miss recently updated documents
> ---
>
> Key: NIFI-3248
> URL: https://issues.apache.org/jira/browse/NIFI-3248
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Extensions
>Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 
> 1.0.1
>Reporte

[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-03 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151784#comment-16151784
 ] 

Johannes Peter commented on NIFI-3248:
--

[~ijokarumawak], [~bbende]
I examined the current GetSolr implementation and found several issues that 
I want to discuss:
(1) Currently, a date field needs to be included in the index schema and in the 
Solr documents for indexing. Although this can be realized easily via Solr's 
TimestampUpdateProcessor, it would be better simply to use Solr's \_version\_ 
field for filtering subsequent retrievals. This field is included in every 
well-configured Solr index, as it is required for several functionalities. By 
doing so, this processor could also be used for indexes that were not created 
with NiFi interactions in mind. 
(2) Iterating through a result set is only done when the processor runs for 
the first time. This will be problematic if the number of newly indexed 
documents in a trigger interval exceeds the configured batch size.
(3) Successively increasing the start parameter to retrieve Solr documents in 
batches has two problems in this context. First, this approach performs 
poorly for large collections. Second, updating the index during the 
iteration will probably lead to duplicates or a loss of documents when the 
positions of documents change due to newly indexed documents or deletions. 
Instead of increasing the start parameter, cursor marks should be used, and the 
sorting should be fixed to ascending order of the time when documents were 
indexed (\_version\_ field). More details on this can be found here: 
https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html
(4) Using the fq parameter instead of the q parameter should improve 
performance in some cases, as Solr is able to use caches for fq. The 
q parameter should be fixed to "*:*". 

As a consequence, I suggest redesigning the GetSolr processor so that it 
mainly focuses on retrieving documents reliably. This can be done better by 
using cursor marks and the \_version\_ field. Additionally, users should not be 
enabled to change the parameters sort and q. The full query capabilities of 
Solr could be made available by adding a separate processor, e.g. 
"FetchSolr".

> GetSolr can miss recently updated documents
> ---
>
> Key: NIFI-3248
> URL: https://issues.apache.org/jira/browse/NIFI-3248
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Extensions
>Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 
> 1.0.1
>Reporter: Koji Kawamura
>Assignee: Johannes Peter
> Attachments: nifi-flow.png, query-result-with-curly-bracket.png, 
> query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents 
> that have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once a 
> document's date field value becomes older than the last query timestamp, the 
> document can no longer be retrieved by GetSolr.
> This JIRA is for tracking the investigation of this behavior and the 
> discussion around it.
> Here are things that can cause this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be helpful|
> |3|Lag comes from Near Real Time nature of Solr|Should be documented at least, add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At first glance, mixing curly and square brackets looked strange 
> ([source 
> code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
>  But this difference has a meaning.
> The square bracket in the range query is inclusive and the curly bracket is 
> exclusive. If we used inclusive on both sides and a document had a timestamp 
> exactly on the boundary, it could be returned in two consecutive 
> executions, and we only want it in one.
> This is intentional, and it should stay as it is.
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC 
> representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates].
>  If date field String value of an updated document represents time without 
> timezone, and NiFi is running on an environment using timezone other than 
> UTC, GetSolr can't perform date range query as users expect.
> Let's say NiFi is running with JST(UTC+9). A process added a document to Solr 
> at 15:00 JST. But the date field doesn't have timezone. So, Solr index