[jira] [Commented] (NIFI-5224) Add SolrClientService

2018-09-26 Thread Johannes Peter (JIRA)


[ 
https://issues.apache.org/jira/browse/NIFI-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628391#comment-16628391
 ] 

Johannes Peter commented on NIFI-5224:
--

Sorry [~mike.thomsen] for not responding; for some reason the JIRA 
notifications arrived in my spam folder. I have already started developing 
this, but haven't done too much, so go ahead. In the meanwhile, my second kid 
was born, who consumes my entire open source time ;)

> Add SolrClientService
> -
>
> Key: NIFI-5224
> URL: https://issues.apache.org/jira/browse/NIFI-5224
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Johannes Peter
>Assignee: Mike Thomsen
>Priority: Major
>
> The Solr CRUD functions that are currently included in SolrUtils should be 
> moved to a controller service. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NIFI-5224) Add SolrClientService

2018-05-24 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488556#comment-16488556
 ] 

Johannes Peter commented on NIFI-5224:
--

"CRUD" might be a bit misleading in this case. What I actually intended with 
this ticket is what [~bende] stated.
[~mike.thomsen]

> Add SolrClientService
> -
>
> Key: NIFI-5224
> URL: https://issues.apache.org/jira/browse/NIFI-5224
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>
> The Solr CRUD functions that are currently included in SolrUtils should be 
> moved to a controller service. 





[jira] [Assigned] (NIFI-5224) Add SolrClientService

2018-05-21 Thread Johannes Peter (JIRA)

 [ 
https://issues.apache.org/jira/browse/NIFI-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Peter reassigned NIFI-5224:


Assignee: Johannes Peter

> Add SolrClientService
> -
>
> Key: NIFI-5224
> URL: https://issues.apache.org/jira/browse/NIFI-5224
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>
> The Solr CRUD functions that are currently included in SolrUtils should be 
> moved to a controller service. 





[jira] [Commented] (NIFI-5224) Add SolrClientService

2018-05-21 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483177#comment-16483177
 ] 

Johannes Peter commented on NIFI-5224:
--

[~mike.thomsen] Related to this discussion: 
https://github.com/apache/nifi/pull/2517#discussion_r173344378

> Add SolrClientService
> -
>
> Key: NIFI-5224
> URL: https://issues.apache.org/jira/browse/NIFI-5224
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Johannes Peter
>Priority: Major
>
> The Solr CRUD functions that are currently included in SolrUtils should be 
> moved to a controller service. 





[jira] [Created] (NIFI-5224) Add SolrClientService

2018-05-21 Thread Johannes Peter (JIRA)
Johannes Peter created NIFI-5224:


 Summary: Add SolrClientService
 Key: NIFI-5224
 URL: https://issues.apache.org/jira/browse/NIFI-5224
 Project: Apache NiFi
  Issue Type: Improvement
Reporter: Johannes Peter


The Solr CRUD functions that are currently included in SolrUtils should be 
moved to a controller service. 





[jira] [Commented] (NIFI-5223) Allow the usage of expression language for properties of RecordSetWriters

2018-05-21 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483169#comment-16483169
 ] 

Johannes Peter commented on NIFI-5223:
--

[~mikerthomsen] Related to this discussion:
https://github.com/apache/nifi/pull/2675#discussion_r187770744
How could this be considered optional?
Where do you discuss such things? On the developers mailing list?

> Allow the usage of expression language for properties of RecordSetWriters
> -
>
> Key: NIFI-5223
> URL: https://issues.apache.org/jira/browse/NIFI-5223
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>
> To allow the usage of expression language for properties of RecordSetWriters, 
> the method createWriter of the interface RecordSetWriterFactory has to be 
> extended with a parameter providing a map of the FlowFile's variables. 





[jira] [Assigned] (NIFI-5223) Allow the usage of expression language for properties of RecordSetWriters

2018-05-21 Thread Johannes Peter (JIRA)

 [ 
https://issues.apache.org/jira/browse/NIFI-5223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Peter reassigned NIFI-5223:


Assignee: Johannes Peter

> Allow the usage of expression language for properties of RecordSetWriters
> -
>
> Key: NIFI-5223
> URL: https://issues.apache.org/jira/browse/NIFI-5223
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>
> To allow the usage of expression language for properties of RecordSetWriters, 
> the method createWriter of the interface RecordSetWriterFactory has to be 
> extended with a parameter providing a map of the FlowFile's variables. 





[jira] [Created] (NIFI-5223) Allow the usage of expression language for properties of RecordSetWriters

2018-05-21 Thread Johannes Peter (JIRA)
Johannes Peter created NIFI-5223:


 Summary: Allow the usage of expression language for properties of 
RecordSetWriters
 Key: NIFI-5223
 URL: https://issues.apache.org/jira/browse/NIFI-5223
 Project: Apache NiFi
  Issue Type: Improvement
Reporter: Johannes Peter


To allow the usage of expression language for properties of RecordSetWriters, 
the method createWriter of the interface RecordSetWriterFactory has to be 
extended with a parameter providing a map of the FlowFile's variables. 
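The intent can be sketched as follows. This is a minimal Python illustration with hypothetical names, not the actual NiFi `RecordSetWriterFactory` API: a factory that receives a FlowFile's variables can evaluate `${...}` placeholders in its configured properties.

```python
import re

def resolve_el(value, variables):
    """Replace ${name} placeholders with values from a FlowFile's variable map."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: variables.get(m.group(1), ""), value)

class SketchWriterFactory:
    """Hypothetical analog of a writer factory whose createWriter receives
    the FlowFile variables, so property values can use expression language."""
    def __init__(self, properties):
        self.properties = properties

    def create_writer(self, variables):
        # Evaluate every configured property against the FlowFile's variables.
        return {name: resolve_el(value, variables)
                for name, value in self.properties.items()}

factory = SketchWriterFactory({"Root Tag Name": "${xml.root}"})
writer_config = factory.create_writer({"xml.root": "PERSON"})
print(writer_config)  # {'Root Tag Name': 'PERSON'}
```

Without such a parameter, the factory has no access to per-FlowFile values at the time the writer is created, which is the gap this ticket describes.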





[jira] [Commented] (NIFI-5189) If a schema is accessed using 'Use 'Schema Text' Property', the name of the schema is not available

2018-05-14 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474006#comment-16474006
 ] 

Johannes Peter commented on NIFI-5189:
--

[~markap14] Will open a PR for this today

> If a schema is accessed using 'Use 'Schema Text' Property', the name of the 
> schema is not available
> ---
>
> Key: NIFI-5189
> URL: https://issues.apache.org/jira/browse/NIFI-5189
> Project: Apache NiFi
>  Issue Type: Bug
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>
> If a schema is accessed using 'Use 'Schema Text' Property', the Avro schema 
> object will be transformed to a RecordSchema using the method 
> AvroTypeUtil.create(Schema avroSchema). This method returns a RecordSchema 
> with an empty SchemaIdentifier. Therefore, the name of the schema cannot be 
> accessed. The method should at least return a RecordSchema with a 
> SchemaIdentifier containing the name of the schema. 





[jira] [Created] (NIFI-5189) If a schema is accessed using 'Use 'Schema Text' Property', the name of the schema is not available

2018-05-14 Thread Johannes Peter (JIRA)
Johannes Peter created NIFI-5189:


 Summary: If a schema is accessed using 'Use 'Schema Text' 
Property', the name of the schema is not available
 Key: NIFI-5189
 URL: https://issues.apache.org/jira/browse/NIFI-5189
 Project: Apache NiFi
  Issue Type: Bug
Reporter: Johannes Peter
Assignee: Johannes Peter


If a schema is accessed using 'Use 'Schema Text' Property', the Avro schema 
object will be transformed to a RecordSchema using the method 
AvroTypeUtil.create(Schema avroSchema). This method returns a RecordSchema with 
an empty SchemaIdentifier. Therefore, the name of the schema cannot be 
accessed. The method should at least return a RecordSchema with a 
SchemaIdentifier containing the name of the schema. 
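A sketch of the proposed fix, in Python with hypothetical names (the real code is Java's AvroTypeUtil): when converting an Avro schema, carry the schema's name into the identifier instead of leaving it empty.

```python
import json

def record_schema_from_avro(avro_schema_text):
    """Illustrative analog of converting an Avro schema to a record schema:
    the identifier keeps the Avro schema's name rather than being empty."""
    avro = json.loads(avro_schema_text)
    identifier = {"name": avro.get("name")}  # formerly an empty identifier
    fields = [(f["name"], f["type"]) for f in avro.get("fields", [])]
    return {"identifier": identifier, "fields": fields}

schema = record_schema_from_avro(
    '{"namespace": "nifi", "name": "PERSON", "type": "record", '
    '"fields": [{"name": "ID", "type": "string"}]}'
)
print(schema["identifier"]["name"])  # PERSON
```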





[jira] [Comment Edited] (NIFI-5113) Add XML record writer

2018-04-24 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451277#comment-16451277
 ] 

Johannes Peter edited comment on NIFI-5113 at 4/24/18 9:59 PM:
---

[~markap14]

Hi Mark,

I am wondering how we can solve the following issue:
Assuming we have the following record:

{code}
MapRecord[{ID=1, NAME=Cleve Butler, AGE=42}]
{code}

Defining a schema for this is straightforward, as long as all keys shall be 
tags and all values shall be characters:

Schema:
{code}
{
  "namespace": "nifi",
  "name": "PERSON",
  "type": "record",
  "fields": [
{ "name": "ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

Result:
{code}
<PERSON>
  <ID>1</ID>
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</PERSON>
{code}

However, I am wondering, how the schema can be defined to write XML with ID as 
attribute:

{code}
<PERSON ID="1">
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</PERSON>
{code}

One way could be to instruct users to define a prefix for attributes via a 
property. Let's assume, the value of the property is "ATTR_".

The schema then has to be defined like this:
Schema:
{code}
{
  "namespace": "nifi",
  "name": "PERSON",
  "type": "record",
  "fields": [
{ "name": "ATTR_ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

When WriteXMLResult is created, the schema is checked for fields starting with 
"ATTR_". Matching fields are replaced by fields without the prefix. The 
references to these fields are put into a list. When the above record is written 
to XML, the writer can check for each field, whether its reference is contained 
in the list. If that is the case, the field is written to the XML as attribute.

This is the best workaround I have identified so far. Do you have any other 
ideas? Are there already any plans to enhance records / schemas by metadata / 
attributes?
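The prefix mechanism described above can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual WriteXMLResult implementation: fields whose schema name starts with the prefix become attributes of the root tag, all other fields become child elements.

```python
ATTR_PREFIX = "ATTR_"  # assumed value of the user-defined prefix property

def write_record_as_xml(root, record, field_names, prefix=ATTR_PREFIX):
    """Sketch of the proposed behaviour: prefixed schema fields are written
    as XML attributes; the remaining fields are written as child elements."""
    attributes, children = [], []
    for name in field_names:
        if name.startswith(prefix):
            stripped = name[len(prefix):]  # e.g. ATTR_ID -> ID
            attributes.append(f'{stripped}="{record[stripped]}"')
        else:
            children.append(f"  <{name}>{record[name]}</{name}>")
    open_tag = f"<{root}" + "".join(f" {a}" for a in attributes) + ">"
    return "\n".join([open_tag, *children, f"</{root}>"])

xml = write_record_as_xml(
    "PERSON",
    {"ID": "1", "NAME": "Cleve Butler", "AGE": 42},
    ["ATTR_ID", "NAME", "AGE"],
)
print(xml)
```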



was (Author: jope):
[~markap14]

Hi Mark,

I am wondering how we can solve the following issue:
Assuming we have the following record:

{code}
MapRecord[{ID=1, NAME=Cleve Butler, AGE=42}]
{code}

Defining a schema for this is straightforward, as long as all keys shall be 
tags and all values shall be characters:

Schema:
{code}
{
  "namespace": "nifi",
  "name": "PERSON",
  "type": "record",
  "fields": [
{ "name": "ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

Result:
{code}
<PERSON>
  <ID>1</ID>
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</PERSON>
{code}

However, I am wondering, how the schema can be defined to write XML with ID as 
attribute:

{code}
<PERSON ID="1">
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</PERSON>
{code}

One way could be to instruct users to define a prefix for attributes via a 
property. Let's assume, the value of the property is "ATTR_".

The schema then has to be defined like this:
Schema:
{code}
{
  "namespace": "nifi",
  "name": "PERSON",
  "type": "record",
  "fields": [
{ "name": "ATTR_ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

When WriteXMLResult is created, the schema is checked for fields starting with 
"ATTR_". Matching fields are replaced by fields without the prefix. The 
reference to these fields is put into a list. When the above record is written 
to XML, the writer can check for each field, whether its reference is contained 
in the list. If that is the case, the field is written to the XML as attribute.

This is the best workaround I have identified so far. Do you have any other 
ideas? Are there already any plans to enhance records / schemas by metadata / 
attributes?


> Add XML record writer
> -
>
> Key: NIFI-5113
> URL: https://issues.apache.org/jira/browse/NIFI-5113
> Project: Apache NiFi
>  Issue Type: New Feature
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>
> Corresponding writer for the XML record reader





[jira] [Comment Edited] (NIFI-5113) Add XML record writer

2018-04-24 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451277#comment-16451277
 ] 

Johannes Peter edited comment on NIFI-5113 at 4/24/18 9:57 PM:
---

[~markap14]

Hi Mark,

I am wondering how we can solve the following issue:
Assuming we have the following record:

{code}
MapRecord[{ID=1, NAME=Cleve Butler, AGE=42}]
{code}

Defining a schema for this is straightforward, as long as all keys shall be 
tags and all values shall be characters:

Schema:
{code}
{
  "namespace": "nifi",
  "name": "PERSON",
  "type": "record",
  "fields": [
{ "name": "ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

Result:
{code}
<PERSON>
  <ID>1</ID>
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</PERSON>
{code}

However, I am wondering, how the schema can be defined to write XML with ID as 
attribute:

{code}
<PERSON ID="1">
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</PERSON>
{code}

One way could be to instruct users to define a prefix for attributes via a 
property. Let's assume, the value of the property is "ATTR_".

The schema then has to be defined like this:
Schema:
{code}
{
  "namespace": "nifi",
  "name": "PERSON",
  "type": "record",
  "fields": [
{ "name": "ATTR_ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

When WriteXMLResult is created, the schema is checked for fields starting with 
"ATTR_". Matching fields are replaced by fields without the prefix. The 
reference to these fields is put into a list. When the above record is written 
to XML, the writer can check for each field, whether its reference is contained 
in the list. If that is the case, the field is written to the XML as attribute.

This is the best workaround I have identified so far. Do you have any other 
ideas? Are there already any plans to enhance records / schemas by metadata / 
attributes?



was (Author: jope):
[~markap14]

Hi Mark,

I am wondering how we can solve the following issue:
Assuming we have the following record:

{code}
MapRecord[{ID=1, NAME=Cleve Butler, AGE=42}]
{code}

Defining a schema for this is straightforward, as long as all keys shall be 
tags and all values shall be characters:

Schema:
{code}
{
  "namespace": "nifi",
  "name": "PERSON",
  "type": "record",
  "fields": [
{ "name": "ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

Result:
{code}
<PERSON>
  <ID>1</ID>
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</PERSON>
{code}

However, I am wondering, how the schema can be defined to write XML with ID as 
attribute:

{code}
<PERSON ID="1">
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</PERSON>
{code}

One way could be to instruct users to define a prefix for attributes via a 
property. Let's assume, the value of the property is "ATTR_".

The schema then has to be defined like this:
Schema:
{code}
{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
{ "name": "ATTR_ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

When WriteXMLResult is created, the schema is checked for fields starting with 
"ATTR_". Matching fields are replaced by fields without the prefix. The 
reference to these fields is put into a list. When the above record is written 
to XML, the writer can check for each field, whether its reference is contained 
in the list. If that is the case, the field is written to the XML as attribute.

This is the best workaround I have identified so far. Do you have any other 
ideas? Are there already any plans to enhance records / schemas by metadata / 
attributes?


> Add XML record writer
> -
>
> Key: NIFI-5113
> URL: https://issues.apache.org/jira/browse/NIFI-5113
> Project: Apache NiFi
>  Issue Type: New Feature
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>
> Corresponding writer for the XML record reader





[jira] [Comment Edited] (NIFI-5113) Add XML record writer

2018-04-24 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451277#comment-16451277
 ] 

Johannes Peter edited comment on NIFI-5113 at 4/24/18 9:56 PM:
---

[~markap14]

Hi Mark,

I am wondering how we can solve the following issue:
Assuming we have the following record:

{code}
MapRecord[{ID=1, NAME=Cleve Butler, AGE=42}]
{code}

Defining a schema for this is straightforward, as long as all keys shall be 
tags and all values shall be characters:

Schema:
{code}
{
  "namespace": "nifi",
  "name": "PERSON",
  "type": "record",
  "fields": [
{ "name": "ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

Result:
{code}
<PERSON>
  <ID>1</ID>
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</PERSON>
{code}

However, I am wondering, how the schema can be defined to write XML with ID as 
attribute:

{code}
<PERSON ID="1">
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</PERSON>
{code}

One way could be to instruct users to define a prefix for attributes via a 
property. Let's assume, the value of the property is "ATTR_".

The schema then has to be defined like this:
Schema:
{code}
{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
{ "name": "ATTR_ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

When WriteXMLResult is created, the schema is checked for fields starting with 
"ATTR_". Matching fields are replaced by fields without the prefix. The 
reference to these fields is put into a list. When the above record is written 
to XML, the writer can check for each field, whether its reference is contained 
in the list. If that is the case, the field is written to the XML as attribute.

This is the best workaround I have identified so far. Do you have any other 
ideas? Are there already any plans to enhance records / schemas by metadata / 
attributes?



was (Author: jope):
[~markap14]

Hi Mark,

I am wondering how we can solve the following issue:
Assuming we have the following record:

{code}
MapRecord[{ID=1, NAME=Cleve Butler, AGE=42}]
{code}

Defining a schema for this is straightforward, as long as all keys shall be 
tags and all values shall be characters:

Schema:
{code}
{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
{ "name": "ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

Result:
{code}
<test>
  <ID>1</ID>
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</test>
{code}

However, I am wondering, how the schema can be defined to write XML with ID as 
attribute:

{code}
<test ID="1">
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</test>
{code}

One way could be to instruct users to define a prefix for attributes via a 
property. Let's assume, the value of the property is "ATTR_".

The schema then has to be defined like this:
Schema:
{code}
{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
{ "name": "ATTR_ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

When WriteXMLResult is created, the schema is checked for fields starting with 
"ATTR_". Matching fields are replaced by fields without the prefix. The 
reference to these fields is put into a list. When the above record is written 
to XML, the writer can check for each field, whether its reference is contained 
in the list. If that is the case, the field is written to the XML as attribute.

This is the best workaround I have identified so far. Do you have any other 
ideas? Are there already any plans to enhance records / schemas by metadata / 
attributes?


> Add XML record writer
> -
>
> Key: NIFI-5113
> URL: https://issues.apache.org/jira/browse/NIFI-5113
> Project: Apache NiFi
>  Issue Type: New Feature
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>
> Corresponding writer for the XML record reader





[jira] [Comment Edited] (NIFI-5113) Add XML record writer

2018-04-24 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451277#comment-16451277
 ] 

Johannes Peter edited comment on NIFI-5113 at 4/24/18 9:56 PM:
---

[~markap14]

Hi Mark,

I am wondering how we can solve the following issue:
Assuming we have the following record:

{code}
MapRecord[{ID=1, NAME=Cleve Butler, AGE=42}]
{code}

Defining a schema for this is straightforward, as long as all keys shall be 
tags and all values shall be characters:

Schema:
{code}
{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
{ "name": "ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

Result:
{code}
<test>
  <ID>1</ID>
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</test>
{code}

However, I am wondering, how the schema can be defined to write XML with ID as 
attribute:

{code}
<test ID="1">
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</test>
{code}

One way could be to instruct users to define a prefix for attributes via a 
property. Let's assume, the value of the property is "ATTR_".

The schema then has to be defined like this:
Schema:
{code}
{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
{ "name": "ATTR_ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

When WriteXMLResult is created, the schema is checked for fields starting with 
"ATTR_". Matching fields are replaced by fields without the prefix. The 
reference to these fields is put into a list. When the above record is written 
to XML, the writer can check for each field, whether its reference is contained 
in the list. If that is the case, the field is written to the XML as attribute.

This is the best workaround I have identified so far. Do you have any other 
ideas? Are there already any plans to enhance records / schemas by metadata / 
attributes?



was (Author: jope):
[~markap14]

Hi Mark,

I am wondering how I can solve the following issue:
Assuming we have the following record:

{code}
MapRecord[{ID=1, NAME=Cleve Butler, AGE=42}]
{code}

Defining a schema for this is straightforward, as long as all keys shall be 
tags and all values shall be characters:

Schema:
{code}
{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
{ "name": "ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

Result:
{code}
<test>
  <ID>1</ID>
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</test>
{code}

However, I am wondering, how the schema can be defined to write XML with ID as 
attribute:

{code}
<test ID="1">
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</test>
{code}

One way could be to instruct users to define a prefix for attributes via a 
property. Let's assume, the value of the property is "ATTR_".

The schema then has to be defined like this:
Schema:
{code}
{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
{ "name": "ATTR_ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

When WriteXMLResult is created, the schema is checked for fields starting with 
"ATTR_". Matching fields are replaced by fields without the prefix. The 
reference to these fields is put into a list. When the above record is written 
to XML, the writer can check for each field, whether its reference is contained 
in the list. If that is the case, the field is written to the XML as attribute.

This is the best workaround I have identified so far. Do you have any other 
ideas? Are there already any plans to enhance records / schemas by metadata / 
attributes?


> Add XML record writer
> -
>
> Key: NIFI-5113
> URL: https://issues.apache.org/jira/browse/NIFI-5113
> Project: Apache NiFi
>  Issue Type: New Feature
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>
> Corresponding writer for the XML record reader





[jira] [Commented] (NIFI-5113) Add XML record writer

2018-04-24 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451277#comment-16451277
 ] 

Johannes Peter commented on NIFI-5113:
--

[~markap14]

Hi Mark,

I am wondering how I can solve the following issue:
Assuming we have the following record:

{code}
MapRecord[{ID=1, NAME=Cleve Butler, AGE=42}]
{code}

Defining a schema for this is straightforward, as long as all keys shall be 
tags and all values shall be characters:

Schema:
{code}
{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
{ "name": "ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

Result:
{code}
<test>
  <ID>1</ID>
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</test>
{code}

However, I am wondering, how the schema can be defined to write XML with ID as 
attribute:

{code}
<test ID="1">
  <NAME>Cleve Butler</NAME>
  <AGE>42</AGE>
</test>
{code}

One way could be to instruct users to define a prefix for attributes via a 
property. Let's assume, the value of the property is "ATTR_".

The schema then has to be defined like this:
Schema:
{code}
{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
{ "name": "ATTR_ID", "type": "string" },
{ "name": "NAME", "type": "string" },
{ "name": "AGE", "type": "int" },
{ "name": "COUNTRY", "type": "string" }
  ]
}
{code}

When WriteXMLResult is created, the schema is checked for fields starting with 
"ATTR_". Matching fields are replaced by fields without the prefix. The 
reference to these fields is put into a list. When the above record is written 
to XML, the writer can check for each field, whether its reference is contained 
in the list. If that is the case, the field is written to the XML as attribute.

This is the best workaround I have identified so far. Do you have any other 
ideas? Are there already any plans to enhance records / schemas by metadata / 
attributes?


> Add XML record writer
> -
>
> Key: NIFI-5113
> URL: https://issues.apache.org/jira/browse/NIFI-5113
> Project: Apache NiFi
>  Issue Type: New Feature
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>
> Corresponding writer for the XML record reader





[jira] [Updated] (NIFI-4516) Add QuerySolr processor

2018-04-24 Thread Johannes Peter (JIRA)

 [ 
https://issues.apache.org/jira/browse/NIFI-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Peter updated NIFI-4516:
-
Summary: Add QuerySolr processor  (was: Add FetchSolr processor)

> Add QuerySolr processor
> ---
>
> Key: NIFI-4516
> URL: https://issues.apache.org/jira/browse/NIFI-4516
> Project: Apache NiFi
>  Issue Type: Improvement
>  Components: Extensions
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>  Labels: features
> Fix For: 1.7.0
>
>
> The processor shall be able
> * to query Solr within a workflow,
> * to make use of standard functionalities of Solr such as faceting, 
> highlighting, result grouping, etc.,
> * to make use of NiFi's expression language to build Solr queries, 
> * to handle results as records.





[jira] [Updated] (NIFI-4516) Add FetchSolr processor

2018-04-24 Thread Johannes Peter (JIRA)

 [ 
https://issues.apache.org/jira/browse/NIFI-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Peter updated NIFI-4516:
-
Fix Version/s: 1.7.0

> Add FetchSolr processor
> ---
>
> Key: NIFI-4516
> URL: https://issues.apache.org/jira/browse/NIFI-4516
> Project: Apache NiFi
>  Issue Type: Improvement
>  Components: Extensions
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>  Labels: features
> Fix For: 1.7.0
>
>
> The processor shall be able
> * to query Solr within a workflow,
> * to make use of standard functionalities of Solr such as faceting, 
> highlighting, result grouping, etc.,
> * to make use of NiFi's expression language to build Solr queries, 
> * to handle results as records.





[jira] [Updated] (NIFI-5113) Add XML record writer

2018-04-23 Thread Johannes Peter (JIRA)

 [ 
https://issues.apache.org/jira/browse/NIFI-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Peter updated NIFI-5113:
-
External issue URL:   (was: https://issues.apache.org/jira/browse/NIFI-4185)

> Add XML record writer
> -
>
> Key: NIFI-5113
> URL: https://issues.apache.org/jira/browse/NIFI-5113
> Project: Apache NiFi
>  Issue Type: New Feature
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>
> Corresponding writer for the XML record reader





[jira] [Updated] (NIFI-5113) Add XML record writer

2018-04-23 Thread Johannes Peter (JIRA)

 [ 
https://issues.apache.org/jira/browse/NIFI-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Peter updated NIFI-5113:
-
External issue URL: https://issues.apache.org/jira/browse/NIFI-4185

> Add XML record writer
> -
>
> Key: NIFI-5113
> URL: https://issues.apache.org/jira/browse/NIFI-5113
> Project: Apache NiFi
>  Issue Type: New Feature
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>
> Corresponding writer for the XML record reader





[jira] [Created] (NIFI-5113) Add XML record writer

2018-04-23 Thread Johannes Peter (JIRA)
Johannes Peter created NIFI-5113:


 Summary: Add XML record writer
 Key: NIFI-5113
 URL: https://issues.apache.org/jira/browse/NIFI-5113
 Project: Apache NiFi
  Issue Type: New Feature
Reporter: Johannes Peter
Assignee: Johannes Peter


Corresponding writer for the XML record reader





[jira] [Updated] (NIFI-5113) Add XML record writer

2018-04-23 Thread Johannes Peter (JIRA)

 [ 
https://issues.apache.org/jira/browse/NIFI-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Peter updated NIFI-5113:
-
Description: 
Corresponding writer for the XML record reader


  was:
Corresponding writer for the XML record reader

Related issue: https://issues.apache.org/jira/browse/NIFI-4185


> Add XML record writer
> -
>
> Key: NIFI-5113
> URL: https://issues.apache.org/jira/browse/NIFI-5113
> Project: Apache NiFi
>  Issue Type: New Feature
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>
> Corresponding writer for the XML record reader





[jira] [Updated] (NIFI-5113) Add XML record writer

2018-04-23 Thread Johannes Peter (JIRA)

 [ 
https://issues.apache.org/jira/browse/NIFI-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Peter updated NIFI-5113:
-
Description: 
Corresponding writer for the XML record reader

Related issue: https://issues.apache.org/jira/browse/NIFI-4185

  was:Corresponding writer for the XML record reader


> Add XML record writer
> -
>
> Key: NIFI-5113
> URL: https://issues.apache.org/jira/browse/NIFI-5113
> Project: Apache NiFi
>  Issue Type: New Feature
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>
> Corresponding writer for the XML record reader
> Related issue: https://issues.apache.org/jira/browse/NIFI-4185





[jira] [Resolved] (NIFI-5106) Add provenance reporting to GetSolr

2018-04-23 Thread Johannes Peter (JIRA)

 [ 
https://issues.apache.org/jira/browse/NIFI-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Peter resolved NIFI-5106.
--
Resolution: Fixed

> Add provenance reporting to GetSolr
> ---
>
> Key: NIFI-5106
> URL: https://issues.apache.org/jira/browse/NIFI-5106
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Minor
>






[jira] [Created] (NIFI-5106) Add provenance reporting to GetSolr

2018-04-22 Thread Johannes Peter (JIRA)
Johannes Peter created NIFI-5106:


 Summary: Add provenance reporting to GetSolr
 Key: NIFI-5106
 URL: https://issues.apache.org/jira/browse/NIFI-5106
 Project: Apache NiFi
  Issue Type: Improvement
Reporter: Johannes Peter
Assignee: Johannes Peter








[jira] [Resolved] (NIFI-4516) Add FetchSolr processor

2018-04-20 Thread Johannes Peter (JIRA)

 [ 
https://issues.apache.org/jira/browse/NIFI-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Peter resolved NIFI-4516.
--
Resolution: Fixed

> Add FetchSolr processor
> ---
>
> Key: NIFI-4516
> URL: https://issues.apache.org/jira/browse/NIFI-4516
> Project: Apache NiFi
>  Issue Type: Improvement
>  Components: Extensions
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>  Labels: features
>
> The processor shall be capable of:
> * querying Solr within a workflow,
> * making use of standard Solr functionality such as faceting, highlighting, 
> result grouping, etc.,
> * making use of NiFi's expression language to build Solr queries, 
> * handling results as records.





[jira] [Comment Edited] (NIFI-4185) Add XML record reader & writer services

2018-03-17 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403399#comment-16403399
 ] 

Johannes Peter edited comment on NIFI-4185 at 3/17/18 12:07 PM:


Hi [~pvillard],

for this reader, I have not planned to require an XSD schema. My intention is 
that it can be configured in the same way as readers for other formats. I 
therefore translate Avro definitions into the XML structures the reader 
expects. Generally, the reader expects an array containing zero, one, or more 
records. I use StAX because its pull-parsing logic suits the record-lookup 
requirement well.
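The pull-parsing idea can be illustrated with a small self-contained sketch using plain JDK StAX. The element names `record`, `field1`, and `field2` follow the examples in this thread and are assumptions for illustration; this is not NiFi's actual reader code:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class StaxRecordSketch {
    // Pulls one flat record per <record> element. The caller drives the
    // loop, which is why StAX suits record-by-record lookup.
    static List<Map<String, String>> readRecords(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<Map<String, String>> records = new ArrayList<>();
        Map<String, String> current = null;   // record being built
        String field = null;                  // field element currently open
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if ("record".equals(r.getLocalName())) {
                        current = new LinkedHashMap<>();
                    } else if (current != null) {
                        field = r.getLocalName();
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (field != null && !r.isWhiteSpace()) {
                        current.put(field, r.getText());
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if ("record".equals(r.getLocalName())) {
                        records.add(current);
                        current = null;
                    }
                    field = null;
                    break;
            }
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><record><field1>content</field1>"
                + "<field2>123</field2></record></root>";
        System.out.println(readRecords(xml));
        // prints [{field1=content, field2=123}]
    }
}
```

Because the consumer pulls events on demand, a record can be materialized and handed off as soon as its closing tag is seen, without buffering the whole document.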

BTW: Do you have an idea which XML structure the reader could expect when users 
define a map in their schema? Maybe something like this?

{code}
<record>
  <map>
    <key1>content</key1>
    <key2>{content or object}</key2>
    ...
  </map>
  ...
</record>
{code}
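As one possible answer to the map question, here is a sketch under the assumption that the direct child element names of a map element become the map keys; this is only a proposal for discussion, not a decided format, and the element names are illustrative:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

class XmlMapSketch {
    // Reads e.g. <map><key1>a</key1><key2>b</key2></map> into {key1=a, key2=b}:
    // child element names become map keys, element text becomes values.
    static Map<String, String> readMap(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        Map<String, String> map = new LinkedHashMap<>();
        String key = null;
        int depth = 0;
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    depth++;
                    if (depth == 2) {          // direct child of the map element
                        key = r.getLocalName();
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (key != null && !r.isWhiteSpace()) {
                        map.put(key, r.getText());
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    depth--;
                    key = null;
                    break;
            }
        }
        return map;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readMap("<map><key1>a</key1><key2>b</key2></map>"));
        // prints {key1=a, key2=b}
    }
}
```

One caveat of this encoding: map keys must then be valid XML element names, which an alternative `<entry key="...">` form would avoid.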


was (Author: jope):
Hi [~pvillard],

for this reader I have not planned to require an XSD schema. My intention is 
that it can be configured in the same way like readers of other formats. I 
therefore translate Avro definitions to XML structures that are expected by the 
reader. Generally, the reader expects an array containing zero, one or more 
records. I use StAX as its pulling logic suits well to the record-lookup 
requirement.

BTW: Do you have an idea which XML structure the reader could expect when users 
define a map in their schema? Maybe something like this?

{code}




content
{content or object}


...



{code}

> Add XML record reader & writer services
> ---
>
> Key: NIFI-4185
> URL: https://issues.apache.org/jira/browse/NIFI-4185
> Project: Apache NiFi
>  Issue Type: New Feature
>  Components: Extensions
>Affects Versions: 1.3.0
>Reporter: Andy LoPresto
>Assignee: Johannes Peter
>Priority: Major
>  Labels: json, records, xml
>
> With the addition of the {{RecordReader}} and {{RecordSetWriter}} paradigm, 
> XML conversion has not yet been targeted. This will replace the previous 
> ticket for XML to JSON conversion. 





[jira] [Comment Edited] (NIFI-4185) Add XML record reader & writer services

2018-03-17 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403399#comment-16403399
 ] 

Johannes Peter edited comment on NIFI-4185 at 3/17/18 12:07 PM:


Hi [~pvillard],

for this reader, I have not planned to require an XSD schema. My intention is 
that it can be configured in the same way as readers for other formats. I 
therefore translate Avro definitions into the XML structures the reader 
expects. Generally, the reader expects an array containing zero, one, or more 
records. I use StAX because its pull-parsing logic suits the record-lookup 
requirement well.

BTW: Do you have an idea which XML structure the reader could expect when users 
define a map in their schema? Maybe something like this?

{code}
<record>
  <map>
    <key1>content</key1>
    <key2>{content or object}</key2>
    ...
  </map>
  ...
</record>
{code}


was (Author: jope):
Hi [~pvillard],

for this reader I have not planned to require an XSD schema. My intention is 
that it can be configured in the same way like readers of other formats. I 
therefore translate Avro definitions to XML structures that are expected by the 
reader. Generally, the reader expects an array containing zero, one or more 
records. I use StAX as its pulling logic suits well to the record-lookup 
requirement.

BTW: Do you have an idea which XML structure the reader could expect when users 
define a map in their schema? Maybe something like this?

{code}




content
{content or object}


...



...

{code}

> Add XML record reader & writer services
> ---
>
> Key: NIFI-4185
> URL: https://issues.apache.org/jira/browse/NIFI-4185
> Project: Apache NiFi
>  Issue Type: New Feature
>  Components: Extensions
>Affects Versions: 1.3.0
>Reporter: Andy LoPresto
>Assignee: Johannes Peter
>Priority: Major
>  Labels: json, records, xml
>
> With the addition of the {{RecordReader}} and {{RecordSetWriter}} paradigm, 
> XML conversion has not yet been targeted. This will replace the previous 
> ticket for XML to JSON conversion. 





[jira] [Commented] (NIFI-4185) Add XML record reader & writer services

2018-03-17 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403399#comment-16403399
 ] 

Johannes Peter commented on NIFI-4185:
--

Hi [~pvillard],

for this reader, I have not planned to require an XSD schema. My intention is 
that it can be configured in the same way as readers for other formats. I 
therefore translate Avro definitions into the XML structures the reader 
expects. Generally, the reader expects an array containing zero, one, or more 
records. I use StAX because its pull-parsing logic suits the record-lookup 
requirement well.

BTW: Do you have an idea which XML structure the reader could expect when users 
define a map in their schema? Maybe something like this?

{code}
<record>
  <map>
    <key1>content</key1>
    <key2>content or object</key2>
    ...
  </map>
</record>
{code}

> Add XML record reader & writer services
> ---
>
> Key: NIFI-4185
> URL: https://issues.apache.org/jira/browse/NIFI-4185
> Project: Apache NiFi
>  Issue Type: New Feature
>  Components: Extensions
>Affects Versions: 1.3.0
>Reporter: Andy LoPresto
>Assignee: Johannes Peter
>Priority: Major
>  Labels: json, records, xml
>
> With the addition of the {{RecordReader}} and {{RecordSetWriter}} paradigm, 
> XML conversion has not yet been targeted. This will replace the previous 
> ticket for XML to JSON conversion. 





[jira] [Comment Edited] (NIFI-4185) Add XML record reader & writer services

2018-03-17 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403399#comment-16403399
 ] 

Johannes Peter edited comment on NIFI-4185 at 3/17/18 12:05 PM:


Hi [~pvillard],

for this reader, I have not planned to require an XSD schema. My intention is 
that it can be configured in the same way as readers for other formats. I 
therefore translate Avro definitions into the XML structures the reader 
expects. Generally, the reader expects an array containing zero, one, or more 
records. I use StAX because its pull-parsing logic suits the record-lookup 
requirement well.

BTW: Do you have an idea which XML structure the reader could expect when users 
define a map in their schema? Maybe something like this?

{code}
<record>
  <map>
    <key1>content</key1>
    <key2>{content or object}</key2>
    ...
  </map>
</record>
{code}


was (Author: jope):
Hi [~pvillard],

for this reader I have not planned to require an XSD schema. My intention is 
that it can be configured in the same way like readers of other formats. I 
therefore translate Avro definitions to XML structures that are expected by the 
reader. Generally, the reader expects an array containing zero, one or more 
records. I use StAX as its pulling logic suits well to the record-lookup 
requirement.

BTW: Do you have an idea which XML structure the reader could expect when users 
define a map in their schema? Maybe something like this?

{code}




content
content or object


...



{code}

> Add XML record reader & writer services
> ---
>
> Key: NIFI-4185
> URL: https://issues.apache.org/jira/browse/NIFI-4185
> Project: Apache NiFi
>  Issue Type: New Feature
>  Components: Extensions
>Affects Versions: 1.3.0
>Reporter: Andy LoPresto
>Assignee: Johannes Peter
>Priority: Major
>  Labels: json, records, xml
>
> With the addition of the {{RecordReader}} and {{RecordSetWriter}} paradigm, 
> XML conversion has not yet been targeted. This will replace the previous 
> ticket for XML to JSON conversion. 





[jira] [Assigned] (NIFI-4185) Add XML record reader & writer services

2018-03-11 Thread Johannes Peter (JIRA)

 [ 
https://issues.apache.org/jira/browse/NIFI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Peter reassigned NIFI-4185:


Assignee: Johannes Peter

> Add XML record reader & writer services
> ---
>
> Key: NIFI-4185
> URL: https://issues.apache.org/jira/browse/NIFI-4185
> Project: Apache NiFi
>  Issue Type: New Feature
>  Components: Extensions
>Affects Versions: 1.3.0
>Reporter: Andy LoPresto
>Assignee: Johannes Peter
>Priority: Major
>  Labels: json, records, xml
>
> With the addition of the {{RecordReader}} and {{RecordSetWriter}} paradigm, 
> XML conversion has not yet been targeted. This will replace the previous 
> ticket for XML to JSON conversion. 





[jira] [Comment Edited] (NIFI-4185) Add XML record reader & writer services

2018-03-11 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394487#comment-16394487
 ] 

Johannes Peter edited comment on NIFI-4185 at 3/11/18 12:47 PM:


[~alopresto]:
  Started implementing an XML Record Reader. Shall I create a separate ticket 
for this?

Similar to the JSON readers, the XML reader will expect either a single record, 
e. g. 
{code}
<record>
  <field1>content</field1>
  ...
</record>
{code}
or an array of records, e. g. 
{code}
<root>
  <record>
    <field1>content</field1>
    ...
  </record>
  <record>
    ...
  </record>
</root>
{code}

The reader will be aligned with common transformers. "Normal" fields (e.g. 
String, Integer) can be described by simple key-value pairs:

XML definition
{code}
<root>
  <record>
    <field1>content</field1>
    <field2>123</field2>
  </record>
</root>
{code}

Schema definition
{code}
{   "name": "testschema",
    "namespace": "nifi",
    "type": "record",
    "fields": [
   { "name": "field1", "type": "string" }, 
   { "name": "field2", "type": "int" } 
] 
}
{code}

Parsing attributes or nested fields requires defining nested records and a 
field name for the element content (optionally, a prefix for attributes can be 
defined):
 Property: CONTENT_FIELD=content_field
 Property: ATTRIBUTE_PREFIX=attr.

XML definition
{code}
<root>
  <record>
    <field1 attribute="...">some text</field1>
    <field2 attribute="...">
      <nested1>some nested text</nested1>
      <nested2>some other nested text</nested2>
    </field2>
  </record>
</root>
{code}

Schema definition
{code}
{ 
 "name": "testschema",
 "namespace": "nifi",
 "type": "record",
 "fields": [
  {
   "name": "field1", 
   "type": {
"name": "NestedRecord",
"type": "record",
"fields" : [
 {"name": "attr.attribute", "type": "string"},
 {"name": "content_field", "type": "string"}
]
   }
  },
  {
   "name": "field2", 
   "type": {
"name": "NestedRecord",
"type": "record",
"fields" : [
 {"name": "attr.attribute", "type": "string"},
 {"name": "nested1", "type": "string"},
 {"name": "nested2", "type": "string"}
]
   }
  }
 ]
}
{code}
What do you say?
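The CONTENT_FIELD / ATTRIBUTE_PREFIX convention can be sketched with plain JDK StAX. The property values `content_field` and `attr.` are taken from the examples above; the rest (class and method names, the single-element scope) is illustrative, not the actual implementation:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

class AttributeFieldSketch {
    static final String CONTENT_FIELD = "content_field"; // Property: CONTENT_FIELD
    static final String ATTRIBUTE_PREFIX = "attr.";      // Property: ATTRIBUTE_PREFIX

    // Reads a single element such as <field1 attribute="a">some text</field1>
    // into a nested-record map: {attr.attribute=a, content_field=some text}.
    static Map<String, String> readField(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        Map<String, String> record = new LinkedHashMap<>();
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                // XML attributes become prefixed record fields.
                for (int i = 0; i < r.getAttributeCount(); i++) {
                    record.put(ATTRIBUTE_PREFIX + r.getAttributeLocalName(i),
                               r.getAttributeValue(i));
                }
            } else if (event == XMLStreamConstants.CHARACTERS && !r.isWhiteSpace()) {
                // Element text lands in the configured content field.
                record.put(CONTENT_FIELD, r.getText());
            }
        }
        return record;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readField("<field1 attribute=\"a\">some text</field1>"));
        // prints {attr.attribute=a, content_field=some text}
    }
}
```

This keeps attributes and element text in one flat namespace per nested record, matching the schema fields `attr.attribute` and `content_field` in the definition above.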


was (Author: jope):
[~alopresto]:
  Started implementing an XML Record Reader. Shall I create a separate ticket 
for this?

Similar to the JSON readers, the XML reader will expect either a single record 
(e. g. content ... ) or an array of 
records (e. g. content ... 
 ... )

The reader will be aligned with common transformators. "Normal" fields (e. g. 
String, Integer) can be described by simple key-value pairs:

XML definition
{code}
 
   
     content
     123
   
 
{code}

Schema definition
{code}
{   "name": "testschema",
    "namespace": "nifi",
    "type": "record",
    "fields": [
   { "name": "field1", "type": "string" }, 
   { "name": "field2", "type": "int" } 
] 
}
{code}

Parsing of attributes or nested fields require the definition of nested records 
and a field name for the content (optional, a prefix for attributes can be 
defined):
 Property: CONTENT_FIELD=content_field
 Property: ATTRIBUTE_PREFIX=attr.

XML definition
{code}
 
   
     some text
     
       some nested text
       some other nested text
     
   
 
{code}

Schema definition
{code}
{ 
 "name": "testschema",
 "namespace": "nifi",
 "type": "record",
 "fields": [
  {
   "name": "field1", 
   "type": {
"name": "NestedRecord",
"type": "record",
"fields" : [
 {"name": "attr.attribute", "type": "string"},
 {"name": "content_field", "type": "string"}
]
   }
  },
  {
   "name": "field2", 
   "type": {
"name": "NestedRecord",
"type": "record",
"fields" : [
 {"name": "attr.attribute", "type": "string"},
 {"name": "nested1", "type": "string"},
 {"name": "nested2", "type": "string"}
]
   }
  }
 ]
}
{code}
What do you say?

> Add XML record reader & writer services
> ---
>
> Key: NIFI-4185
> URL: https://issues.apache.org/jira/browse/NIFI-4185
> Project: Apache NiFi
>  Issue Type: New Feature
>  Components: Extensions
>Affects Versions: 1.3.0
>Reporter: Andy LoPresto
>Priority: Major
>  Labels: json, records, xml
>
> With the addition of the {{RecordReader}} and {{RecordSetWriter}} paradigm, 
> XML conversion has not yet been targeted. This will replace the previous 
> ticket for XML to JSON conversion. 





[jira] [Comment Edited] (NIFI-4185) Add XML record reader & writer services

2018-03-11 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394487#comment-16394487
 ] 

Johannes Peter edited comment on NIFI-4185 at 3/11/18 12:45 PM:


[~alopresto]:
  Started implementing an XML Record Reader. Shall I create a separate ticket 
for this?

Similar to the JSON readers, the XML reader will expect either a single record 
(e.g. <record><field1>content</field1> ... </record>) or an array of 
records (e.g. <root><record><field1>content</field1> ... </record>
<record> ... </record></root>)

The reader will be aligned with common transformers. "Normal" fields (e.g. 
String, Integer) can be described by simple key-value pairs:

XML definition
{code}
<root>
  <record>
    <field1>content</field1>
    <field2>123</field2>
  </record>
</root>
{code}

Schema definition
{code}
{   "name": "testschema",
    "namespace": "nifi",
    "type": "record",
    "fields": [
   { "name": "field1", "type": "string" }, 
   { "name": "field2", "type": "int" } 
] 
}
{code}

Parsing attributes or nested fields requires defining nested records and a 
field name for the element content (optionally, a prefix for attributes can be 
defined):
 Property: CONTENT_FIELD=content_field
 Property: ATTRIBUTE_PREFIX=attr.

XML definition
{code}
<root>
  <record>
    <field1 attribute="...">some text</field1>
    <field2 attribute="...">
      <nested1>some nested text</nested1>
      <nested2>some other nested text</nested2>
    </field2>
  </record>
</root>
{code}

Schema definition
{code}
{ 
 "name": "testschema",
 "namespace": "nifi",
 "type": "record",
 "fields": [
  {
   "name": "field1", 
   "type": {
"name": "NestedRecord",
"type": "record",
"fields" : [
 {"name": "attr.attribute", "type": "string"},
 {"name": "content_field", "type": "string"}
]
   }
  },
  {
   "name": "field2", 
   "type": {
"name": "NestedRecord",
"type": "record",
"fields" : [
 {"name": "attr.attribute", "type": "string"},
 {"name": "nested1", "type": "string"},
 {"name": "nested2", "type": "string"}
]
   }
  }
 ]
}
{code}
What do you say?


was (Author: jope):
[~alopresto]:
  Started implementing an XML Record Reader. Shall I create a separate ticket 
for this?

Similar to the JSON readers, the XML reader will expect either a single record 
(e. g. content ... ) or an array of 
records (e. g. content ... 
 ... )

The reader will be aligned with common transformators. "Normal" fields (e. g. 
String, Integer) can be described by simple key-value pairs:

XML definition
{code}
 
   
     content
     123
   
 
{code}

Schema definition
{code}
{   "name": "testschema",
    "namespace": "nifi",
    "type": "record",
    "fields": [
   { "name": "field1", "type": "string" }, 
   { "name": "field2", "type": "int" } 
] 
}
{code}

Parsing of attributes or nested fields require the definition of nested records 
and a field name for the content (optional, a prefix for attributes can be 
defined):
 Property: CONTENT_FIELD=content_field
 Property: ATTRIBUTE_PREFIX=attr.

XML definition
{code}
 
   
     some text
     
       some nested text
       some other nested text
     
   
 
{code}

Schema definition
{code}
{  
"name": "testschema",
"namespace": "nifi",
"type": "record",
"fields": [
{
"name": "field1", 
"type": {
"name": "NestedRecord",
"type": "record",
"fields" : [
{"name": "attr.attribute", "type": 
"string"},
{"name": "content_field", "type": 
"string"}
]
}
},
{
"name": "field2", 
"type": {
"name": "NestedRecord",
"type": "record",
"fields" : [
{"name": "attr.attribute", "type": 
"string"},
{"name": "nested1", "type": "string"},
{"name": "nested2", "type": "string"}
]
}
}
]
}
{code}
What do you say?

> Add XML record reader & writer services
> ---
>
> Key: NIFI-4185
> URL: https://issues.apache.org/jira/browse/NIFI-4185
> Project: Apache NiFi
>  Issue Type: New Feature
>  Components: Extensions
>Affects Versions: 1.3.0
>Reporter: Andy LoPresto
>Priority: Major
>  Labels: json, records, xml
>
> With the addition of the {{RecordReader}} and {{RecordSetWriter}} paradigm, 
> XML conversion has not yet been targeted. This will replace the previous 
> ticket for XML to JSON conversion. 





[jira] [Comment Edited] (NIFI-4185) Add XML record reader & writer services

2018-03-11 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394487#comment-16394487
 ] 

Johannes Peter edited comment on NIFI-4185 at 3/11/18 12:43 PM:


[~alopresto]:
  Started implementing an XML Record Reader. Shall I create a separate ticket 
for this?

Similar to the JSON readers, the XML reader will expect either a single record 
(e.g. <record><field1>content</field1> ... </record>) or an array of 
records (e.g. <root><record><field1>content</field1> ... </record>
<record> ... </record></root>)

The reader will be aligned with common transformers. "Normal" fields (e.g. 
String, Integer) can be described by simple key-value pairs:

XML definition
{code}
<root>
  <record>
    <field1>content</field1>
    <field2>123</field2>
  </record>
</root>
{code}

Schema definition
{code}
{   "name": "testschema",
    "namespace": "nifi",
    "type": "record",
    "fields": [
   { "name": "field1", "type": "string" }, 
   { "name": "field2", "type": "int" } 
] 
}
{code}

Parsing attributes or nested fields requires defining nested records and a 
field name for the element content (optionally, a prefix for attributes can be 
defined):
 Property: CONTENT_FIELD=content_field
 Property: ATTRIBUTE_PREFIX=attr.

XML definition
{code}
<root>
  <record>
    <field1 attribute="...">some text</field1>
    <field2 attribute="...">
      <nested1>some nested text</nested1>
      <nested2>some other nested text</nested2>
    </field2>
  </record>
</root>
{code}

Schema definition
{code}
{  
"name": "testschema",
"namespace": "nifi",
"type": "record",
"fields": [
{
"name": "field1", 
"type": {
"name": "NestedRecord",
"type": "record",
"fields" : [
{"name": "attr.attribute", "type": 
"string"},
{"name": "content_field", "type": 
"string"}
]
}
},
{
"name": "field2", 
"type": {
"name": "NestedRecord",
"type": "record",
"fields" : [
{"name": "attr.attribute", "type": 
"string"},
{"name": "nested1", "type": "string"},
{"name": "nested2", "type": "string"}
]
}
}
]
}
{code}
What do you say?


was (Author: jope):
[~alopresto]:
  Started implementing an XML Record Reader. Shall I create a separate ticket 
for this?

Similar to the JSON readers, the XML reader will expect either a single record 
(e. g. content ... ) or an array of 
records (e. g. content ... 
 ... )

The reader will be aligned with common transformators. "Normal" fields (e. g. 
String, Integer) can be described by simple key-value pairs:

XML definition
{code}
 
   
     content
     123
   
 
{code}

Schema definition
{code}
{   "name": "testschema",
    "namespace": "nifi",
    "type": "record",
    "fields": [
   { "name": "field1", "type": "string" }, 
   { "name": "field2", "type": "int" } 
] 
}
{code}

Parsing of attributes or nested fields require the definition of nested records 
and a field name for the content (optional, a prefix for attributes can be 
defined):
 Property: CONTENT_FIELD=content_field
 Property: ATTRIBUTE_PREFIX=attr.

XML definition
{code}
 
   
     some text
     
       some nested text
       some other nested text
     
   
 
{code}

Schema definition
{code}
 { 
   "name": "testschema",
   "namespace": "nifi",
   "type": "record",
   "fields": [
     {
       "name": "field1", 
       "type": {
          "name": "NestedRecord",
          "type": "record",
          "fields" : [ 
   {  "name": "attr.attribute", "type": "string"  },
           {  "name": "content_field", "type": "string" }
         ]
       }
   },
   {
     "name": "field2", 
     "type": {
       "name": "NestedRecord",
       "type": "record",
       "fields" : [  
           {  "name": "attr.attribute", "type": "string"  },
           {  "name": "nested1", "type": "string"  },
           {  "name": "nested2", "type": "string"  }
        ]
       }
    }
   ]
 }
{code}
What do you say?

> Add XML record reader & writer services
> ---
>
> Key: NIFI-4185
> URL: https://issues.apache.org/jira/browse/NIFI-4185
> Project: Apache NiFi
>  Issue Type: New Feature
>  Components: Extensions
>Affects Versions: 1.3.0
>Reporter: Andy LoPresto
>Priority: Major
>  Labels: json, records, xml
>
> With the addition of the {{RecordReader}} and {{RecordSetWriter}} paradigm, 
> XML conversion has not yet been targeted. This will replace the previous 
> ticket for XML to JSON conversion. 




[jira] [Comment Edited] (NIFI-4185) Add XML record reader & writer services

2018-03-11 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394487#comment-16394487
 ] 

Johannes Peter edited comment on NIFI-4185 at 3/11/18 12:42 PM:


[~alopresto]:
  Started implementing an XML Record Reader. Shall I create a separate ticket 
for this?

Similar to the JSON readers, the XML reader will expect either a single record 
(e.g. <record><field1>content</field1> ... </record>) or an array of 
records (e.g. <root><record><field1>content</field1> ... </record>
<record> ... </record></root>)

The reader will be aligned with common transformers. "Normal" fields (e.g. 
String, Integer) can be described by simple key-value pairs:

XML definition
{code}
<root>
  <record>
    <field1>content</field1>
    <field2>123</field2>
  </record>
</root>
{code}

Schema definition
{code}
{   "name": "testschema",
    "namespace": "nifi",
    "type": "record",
    "fields": [
   { "name": "field1", "type": "string" }, 
   { "name": "field2", "type": "int" } 
] 
}
{code}

Parsing attributes or nested fields requires defining nested records and a 
field name for the element content (optionally, a prefix for attributes can be 
defined):
 Property: CONTENT_FIELD=content_field
 Property: ATTRIBUTE_PREFIX=attr.

XML definition
{code}
<root>
  <record>
    <field1 attribute="...">some text</field1>
    <field2 attribute="...">
      <nested1>some nested text</nested1>
      <nested2>some other nested text</nested2>
    </field2>
  </record>
</root>
{code}

Schema definition
{code}
 { 
   "name": "testschema",
   "namespace": "nifi",
   "type": "record",
   "fields": [
     {
       "name": "field1", 
       "type": {
          "name": "NestedRecord",
          "type": "record",
          "fields" : [ 
   {  "name": "attr.attribute", "type": "string"  },
           {  "name": "content_field", "type": "string" }
         ]
       }
   },
   {
     "name": "field2", 
     "type": {
       "name": "NestedRecord",
       "type": "record",
       "fields" : [  
           {  "name": "attr.attribute", "type": "string"  },
           {  "name": "nested1", "type": "string"  },
           {  "name": "nested2", "type": "string"  }
        ]
       }
    }
   ]
 }
{code}
What do you say?


was (Author: jope):
[~alopresto]:
  Started implementing an XML Record Reader. Shall I create a separate ticket 
for this?

Similar to the JSON readers, the XML reader will expect either a single record 
(e. g. content ... ) or an array of 
records (e. g. content ... 
 ... )

The reader will be aligned with common transformators. "Normal" fields (e. g. 
String, Integer) can be described by simple key-value pairs:

XML definition
{code}
 
   
     content
     123
   
 
{code}

Schema definition
{code}
{   "name": "testschema",
    "namespace": "nifi",
    "type": "record",
    "fields": [
   { "name": "field1", "type": "string" }, 
   { "name": "field2", "type": "int" } 
] 
}
{code}

Parsing of attributes or nested fields require the definition of nested records 
and a field name for the content (optional, a prefix for attributes can be 
defined):
 Property: CONTENT_FIELD=content_field
 Property: ATTRIBUTE_PREFIX=attr.

XML definition
 
   
     some text
     
       some nested text
       some other nested text
     
   
 

Schema definition
 { 
   "name": "testschema",
   "namespace": "nifi",
   "type": "record",
   "fields": [
     {
       "name": "field1", 
       "type":

{          "name": "NestedRecord",          "type": "record",          "fields" 
: [             \\{"name": "attr.attribute", "type": "string"}

,

            \{"name": "content_field", "type": "string"}

         ]
       }
   },
   {
     "name": "field2", 
     "type":

{       "name": "NestedRecord",       "type": "record",       "fields" : [      
      \\{"name": "attr.attribute", "type": "string"}

,

           \{"name": "nested1", "type": "string"},

           \{"name": "nested2", "type": "string"}

        ]
       }
    }
   ]
 }

What do you say?

> Add XML record reader & writer services
> ---
>
> Key: NIFI-4185
> URL: https://issues.apache.org/jira/browse/NIFI-4185
> Project: Apache NiFi
>  Issue Type: New Feature
>  Components: Extensions
>Affects Versions: 1.3.0
>Reporter: Andy LoPresto
>Priority: Major
>  Labels: json, records, xml
>
> With the addition of the {{RecordReader}} and {{RecordSetWriter}} paradigm, 
> XML conversion has not yet been targeted. This will replace the previous 
> ticket for XML to JSON conversion. 





[jira] [Comment Edited] (NIFI-4185) Add XML record reader & writer services

2018-03-11 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394487#comment-16394487
 ] 

Johannes Peter edited comment on NIFI-4185 at 3/11/18 12:39 PM:


[~alopresto]:
  Started implementing an XML Record Reader. Shall I create a separate ticket 
for this?

Similar to the JSON readers, the XML reader will expect either a single record 
(e.g. <record><field1>content</field1> ... </record>) or an array of 
records (e.g. <root><record><field1>content</field1> ... </record>
<record> ... </record></root>)

The reader will be aligned with common transformers. "Normal" fields (e.g. 
String, Integer) can be described by simple key-value pairs:

XML definition
{code}
<root>
  <record>
    <field1>content</field1>
    <field2>123</field2>
  </record>
</root>
{code}

Schema definition
{code}
{   "name": "testschema",
    "namespace": "nifi",
    "type": "record",
    "fields": [
   { "name": "field1", "type": "string" }, 
   { "name": "field2", "type": "int" } 
] 
}
{code}

Parsing attributes or nested fields requires defining nested records and a 
field name for the element content (optionally, a prefix for attributes can be 
defined):
 Property: CONTENT_FIELD=content_field
 Property: ATTRIBUTE_PREFIX=attr.

XML definition
<root>
  <record>
    <field1 attribute="...">some text</field1>
    <field2 attribute="...">
      <nested1>some nested text</nested1>
      <nested2>some other nested text</nested2>
    </field2>
  </record>
</root>

Schema definition
 { 
   "name": "testschema",
   "namespace": "nifi",
   "type": "record",
   "fields": [
     {
       "name": "field1", 
       "type":

{          "name": "NestedRecord",          "type": "record",          "fields" 
: [             \\{"name": "attr.attribute", "type": "string"}

,

            \{"name": "content_field", "type": "string"}

         ]
       }
   },
   {
     "name": "field2", 
     "type":

{       "name": "NestedRecord",       "type": "record",       "fields" : [      
      \\{"name": "attr.attribute", "type": "string"}

,

           \{"name": "nested1", "type": "string"},

           \{"name": "nested2", "type": "string"}

        ]
       }
    }
   ]
 }

What do you say?


> Add XML record reader & writer services
> ---
>
> Key: NIFI-4185
> URL: https://issues.apache.org/jira/browse/NIFI-4185
> Project: Apache NiFi
>  Issue Type: New Feature
>  Components: Extensions
>Affects Versions: 1.3.0
>Reporter: Andy LoPresto
>Priority: Major
>  Labels: json, records, xml
>
> With the addition of the {{RecordReader}} and {{RecordSetWriter}} paradigm, 
> XML conversion has not yet been targeted. This will replace the previous 
> ticket for XML to JSON conversion. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NIFI-4185) Add XML record reader & writer services

2018-03-11 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394487#comment-16394487
 ] 

Johannes Peter commented on NIFI-4185:
--

[~alopresto]:
 Started implementing an XML Record Reader. Shall I create a separate ticket 
for this?

Similar to the JSON readers, the XML reader will expect either a single record 
or an array of records as the root of the document.

The reader will be aligned with common transformers. "Normal" fields (e. g. 
String, Integer) can be described by simple key-value pairs:

XML definition
{code}
<record>
  <field1>content</field1>
  <field2>123</field2>
</record>
{code}

Schema definition
{code}
{
  "name": "testschema",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    { "name": "field1", "type": "string" },
    { "name": "field2", "type": "int" }
  ]
}
{code}

Parsing of attributes or nested fields requires the definition of nested records 
and a field name for the content (optionally, a prefix for attributes can be 
defined):
Property: CONTENT_FIELD=content_field
Property: ATTRIBUTE_PREFIX=attr.

XML definition
{code}
<record>
  <field1 attribute="...">some text</field1>
  <field2 attribute="...">
    <nested1>some nested text</nested1>
    <nested2>some other nested text</nested2>
  </field2>
</record>
{code}

Schema definition
{code}
{
  "name": "testschema",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    {
      "name": "field1",
      "type": {
        "name": "NestedRecord",
        "type": "record",
        "fields": [
          {"name": "attr.attribute", "type": "string"},
          {"name": "content_field", "type": "string"}
        ]
      }
    },
    {
      "name": "field2",
      "type": {
        "name": "NestedRecord",
        "type": "record",
        "fields": [
          {"name": "attr.attribute", "type": "string"},
          {"name": "nested1", "type": "string"},
          {"name": "nested2", "type": "string"}
        ]
      }
    }
  ]
}
{code}

What do you say?
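To make the proposed convention concrete, here is a rough, hypothetical sketch in Python (illustrative only, not the actual NiFi implementation; the attribute values "a1" and "a2" are invented) of how attributes would receive the configured prefix while element text lands in the configured content field:

```python
# Sketch of the proposed XML-to-record mapping: attributes get the
# ATTRIBUTE_PREFIX, element text goes into the CONTENT_FIELD.
import xml.etree.ElementTree as ET

ATTRIBUTE_PREFIX = "attr."
CONTENT_FIELD = "content_field"

def element_to_record(elem):
    """Convert one XML element into a dict per the proposed convention."""
    record = {}
    for field in elem:
        children = list(field)
        attrs = {ATTRIBUTE_PREFIX + k: v for k, v in field.attrib.items()}
        if not children and not attrs:
            # simple field: plain key-value pair
            record[field.tag] = field.text
        else:
            # attributes and/or nested elements: build a nested record
            nested = dict(attrs)
            if field.text and field.text.strip():
                nested[CONTENT_FIELD] = field.text.strip()
            for child in children:
                nested[child.tag] = child.text
            record[field.tag] = nested
    return record

xml = """<record>
  <field1 attribute="a1">some text</field1>
  <field2 attribute="a2">
    <nested1>some nested text</nested1>
    <nested2>some other nested text</nested2>
  </field2>
</record>"""
record = element_to_record(ET.fromstring(xml))
```

A real reader would of course validate against the supplied schema rather than infer structure from the XML alone.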



[jira] [Commented] (NIFI-4516) Add FetchSolr processor

2018-03-06 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387726#comment-16387726
 ] 

Johannes Peter commented on NIFI-4516:
--

Hi [~abhi.rohatgi],

thank you for your offer, but I am almost done with this.

> Add FetchSolr processor
> ---
>
> Key: NIFI-4516
> URL: https://issues.apache.org/jira/browse/NIFI-4516
> Project: Apache NiFi
>  Issue Type: Improvement
>  Components: Extensions
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Major
>  Labels: features
>
> The processor shall be capable 
> * to query Solr within a workflow,
> * to make use of standard functionalities of Solr such as faceting, 
> highlighting, result grouping, etc.,
> * to make use of NiFis expression language to build Solr queries, 
> * to handle results as records.





[jira] [Commented] (NIFI-4583) Restructure package nifi-solr-processors

2017-11-08 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-4583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16244399#comment-16244399
 ] 

Johannes Peter commented on NIFI-4583:
--

[~ijokarumawak] [~bbende] Do you agree?

> Restructure package nifi-solr-processors
> 
>
> Key: NIFI-4583
> URL: https://issues.apache.org/jira/browse/NIFI-4583
> Project: Apache NiFi
>  Issue Type: Improvement
>Reporter: Johannes Peter
>Assignee: Johannes Peter
>Priority: Minor
>
> Several functionalities currently implemented e. g. in GetSolr or 
> SolrProcessor should be made available for other processors or controller 
> services. A class SolrUtils should be created containing several static 
> methods. This includes the methods 
> - getRequestParams (PutSolrContentStream)
> - solrDocumentsToRecordSet (GetSolr) 
> - createSolrClient (SolrProcessor)
> and the inner class QueryResponseOutputStreamCallback (GetSolr)
> Some unit tests might be affected.
> The method declaration  
> protected SolrClient createSolrClient(final ProcessContext context, final 
> String solrLocation)
> should be changed to 
> public static SolrClient createSolrClient(final PropertyContext context, 
> final String solrLocation)
> to be suitable also for controller services.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (NIFI-4583) Restructure package nifi-solr-processors

2017-11-08 Thread Johannes Peter (JIRA)
Johannes Peter created NIFI-4583:


 Summary: Restructure package nifi-solr-processors
 Key: NIFI-4583
 URL: https://issues.apache.org/jira/browse/NIFI-4583
 Project: Apache NiFi
  Issue Type: Improvement
Reporter: Johannes Peter
Assignee: Johannes Peter
Priority: Minor


Several functionalities currently implemented e. g. in GetSolr or SolrProcessor 
should be made available for other processors or controller services. A class 
SolrUtils should be created containing several static methods. This includes 
the methods 
- getRequestParams (PutSolrContentStream)
- solrDocumentsToRecordSet (GetSolr) 
- createSolrClient (SolrProcessor)
and the inner class QueryResponseOutputStreamCallback (GetSolr)

Some unit tests might be affected.

The method declaration  
protected SolrClient createSolrClient(final ProcessContext context, final 
String solrLocation)
should be changed to 
public static SolrClient createSolrClient(final PropertyContext context, final 
String solrLocation)
to be suitable also for controller services.





[jira] [Created] (NIFI-4516) Add FetchSolr processor

2017-10-23 Thread Johannes Peter (JIRA)
Johannes Peter created NIFI-4516:


 Summary: Add FetchSolr processor
 Key: NIFI-4516
 URL: https://issues.apache.org/jira/browse/NIFI-4516
 Project: Apache NiFi
  Issue Type: Improvement
  Components: Extensions
Reporter: Johannes Peter
Assignee: Johannes Peter


The processor shall be capable of
* querying Solr within a workflow,
* making use of standard Solr features such as faceting, 
highlighting, result grouping, etc.,
* building Solr queries using NiFi's expression language, 
* handling results as records.





[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-19 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172249#comment-16172249
 ] 

Johannes Peter commented on NIFI-3248:
--

Update:

I am almost done with the new processor implementation.

(1) Meanwhile I had a conversation with Cassandra Targett (Solr PMC), and 
she helped me clarify some things about the field \_version\_. 
Unfortunately, it is not possible to convert a value of this field into a valid 
timestamp. The values of this field increase monotonically with 
indexing time, but only at shard level, not at collection level. I am sorry for 
the confusion. The processor therefore iterates over shards if (a) Solr runs in 
cloud mode and (b) \_version\_ is used to track document retrieval instead of a 
dedicated date field. Although this approach might require more queries and 
therefore be slower if collections comprise many shards, I implemented it to 
make the processor suitable for many more collections. The shard names 
currently have to be specified by property, as I have not yet found a reliable 
way to determine them automatically (shard names != core names).
(2) I made the use of filter query caches configurable. 
(3) The processor now makes use of the StateManager. 
(4) I will add an option to convert results into records.
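The shard-level monotonicity described in (1) can be illustrated with a small self-contained sketch (plain Java, invented version numbers, no Solr involved): a single collection-wide \_version\_ high-water mark taken after reading one shard would silently skip documents on a shard whose values happen to be smaller, which is why tracking has to be kept per shard.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of why _version_ must be tracked per shard: values grow
// monotonically inside each shard, but the two shards' ranges are unrelated.
public class ShardVersionSketch {
    static final Map<String, List<Long>> SHARDS = Map.of(
            "shard1", List.of(1700L, 1800L, 1900L),
            "shard2", List.of(50L, 60L, 70L));   // unrelated, smaller range

    // Count documents on one shard newer than the given high-water mark.
    public static long newDocs(String shard, long mark) {
        return SHARDS.get(shard).stream().filter(v -> v > mark).count();
    }

    public static void main(String[] args) {
        // Wrong: one collection-wide mark, taken after reading shard1.
        long globalMark = 1900L;
        System.out.println("shard2 docs seen with global mark: "
                + newDocs("shard2", globalMark)); // 0 -> all of shard2 skipped

        // Right: one high-water mark per shard.
        Map<String, Long> marks = new HashMap<>();
        for (String shard : SHARDS.keySet()) {
            long mark = marks.getOrDefault(shard, 0L);
            System.out.println(shard + " docs seen: " + newDocs(shard, mark));
            marks.put(shard, 1900L); // real code would store max seen _version_
        }
    }
}
```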

> GetSolr can miss recently updated documents
> ---
>
> Key: NIFI-3248
> URL: https://issues.apache.org/jira/browse/NIFI-3248
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Extensions
>Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 
> 1.0.1
>Reporter: Koji Kawamura
>Assignee: Johannes Peter
> Attachments: nifi-flow.png, query-result-with-curly-bracket.png, 
> query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents 
> that have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once a 
> document's date field value becomes older than the last query timestamp, the 
> document can no longer be queried by GetSolr.
> This JIRA is for tracking the investigation of this behavior and the 
> discussion around it.
> Here are things that can be a cause of this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be 
> helpful|
> |3|Lag comes from NearRealTime nature of Solr|Should be documented at least, 
> add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At first glance, using curly and square brackets in combination looked 
> strange ([source 
> code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
>  But the difference has a meaning.
> The square bracket on the range query is inclusive and the curly bracket is 
> exclusive. If we used inclusive brackets on both sides and a document had a 
> timestamp exactly on the boundary, it could be returned in two consecutive 
> executions, and we only want it in one.
> This is intentional, and it should stay as it is.
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC 
> representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates].
>  If the date field String value of an updated document represents time 
> without a timezone, and NiFi is running in an environment using a timezone 
> other than UTC, GetSolr can't perform the date range query as users expect.
> Let's say NiFi is running with JST(UTC+9). A process added a document to Solr 
> at 15:00 JST. But the date field doesn't have a timezone. So, Solr indexed it 
> as 15:00 UTC. Then GetSolr performs a range query at 15:10 JST, targeting any 
> documents updated from 15:00 to 15:10 JST. GetSolr formatted dates using UTC, 
> i.e. 6:00 to 6:10 UTC. The updated document won't be matched by the date 
> range filter.
> To avoid this, updated documents must have a proper timezone in the date 
> field's string representation.
> If one uses the NiFi expression language to set the current timestamp on that 
> date field, the following expression can be used:
> {code}
> ${now():format("yyyy-MM-dd'T'HH:mm:ss.SSSZ")}
> {code}
> It will produce a result like:
> {code}
> 2016-12-27T15:30:04.895+0900
> {code}
> Then it will be indexed in Solr with UTC and will be queried by GetSolr as 
> expected.
> h2. 3. Lag comes from NearRealTime nature of Solr
> Solr provides Near Real Time search capability; that means recently 
> updated documents can be queried in near real time, but not in real time. 
> This latency can be 

[jira] [Comment Edited] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-04 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152432#comment-16152432
 ] 

Johannes Peter edited comment on NIFI-3248 at 9/4/17 11:26 AM:
---

[~ijokarumawak]
(1) Sorting by ID ensures that each document is retrieved only once, even if 
the document is updated. Sorting by \_version\_ asc ensures that each version 
of a document is retrieved once, as updated documents are "appended" at the 
end. I personally expect that someone who uses Solr as a source wants to see 
updated Solr documents replace old ones in the target system. However, we 
could make this configurable. 
(2) The parameter fq provides the same query capabilities as q and can be 
used in the same way. The essential difference is that q is basically used to 
calculate relevancy, whereas fq is basically used to filter and to improve 
performance. In this case, we don't need relevancy, as we sort by indexing 
time. Nevertheless, I see the point that users expect a property where they 
can configure the main query. 
(3) \_version\_ behaves like a timestamp, so there should be little chance 
that two documents within a collection have the same value (in a cluster). I 
know that there is a way to convert it into a timestamp, but I first have to 
figure out how to do this exactly. Sorting by "\_version\_ asc" and using 
cursor marks should make the retrieval reliable to a very high degree. 
(4) I want to emphasize again that the logic and the purpose of GetSolr don't 
cover the capabilities of Solr sufficiently. There should be an additional 
processor to use Solr not only as a source, but also as a query layer. Features 
like faceting, grouping, pivoting (e. g. for analytical purposes), spellchecking 
(e. g. for OCR or NLP), etc. are not covered by GetSolr (and 
shouldn't be, as the processor should focus on reliable retrieval). 
However, there should be a more flexible option to query Solr within workflows.


was (Author: jope):
[~ijokarumawak]
(1) Sorting by ID ensures that each document is retrieved only once, even if 
the document is updated. Sorting by \_version\_ asc ensures that each version 
of a document is retrieved once, as updated documents are "appended" at the 
end. I personally expect that someone who uses Solr as a source wants to see 
updated Solr documents replace old ones in the target system. However, we 
could make this configurable. 
(2) The parameter fq provides the same query capabilities as q and can be 
used in the same way. The essential difference is that q is basically used to 
calculate relevancy, whereas fq is basically used to filter and to improve 
performance. In this case, we don't need relevancy, as we sort by indexing 
time. Nevertheless, I see the point that users expect a property where they 
can configure the main query. 
(3) \_version\_ behaves like a timestamp, so there should be little chance 
that two documents within a collection have the same value (in a cluster). I 
know that there is a way to convert it into a timestamp, but I first have to 
figure out how to do this exactly. Sorting by "\_version\_ asc" and using 
cursor marks should make the retrieval reliable to a very high degree. 
(4) I want to emphasize again that the logic and the purpose of GetSolr don't 
cover the capabilities of Solr sufficiently. There should be an additional 
processor to use Solr not only as a source, but also as a query layer. Features 
like faceting, grouping, pivoting (e. g. for analytical purposes), spellchecking 
(e. g. for OCR or NLP), etc. are not covered by GetSolr (and 
shouldn't be, as the main focus should remain on reliable retrieval). 
However, there should be a more flexible option to query Solr within workflows.


[jira] [Comment Edited] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-04 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152432#comment-16152432
 ] 

Johannes Peter edited comment on NIFI-3248 at 9/4/17 10:19 AM:
---

[~ijokarumawak]
(1) Sorting by ID ensures that each document is retrieved only once, even if 
the document is updated. Sorting by \_version\_ asc ensures that each version 
of a document is retrieved once, as updated documents are "appended" at the 
end. I personally expect that someone who uses Solr as a source wants to see 
updated Solr documents replace old ones in the target system. However, we 
could make this configurable. 
(2) The parameter fq provides the same query capabilities as q and can be 
used in the same way. The essential difference is that q is basically used to 
calculate relevancy, whereas fq is basically used to filter and to improve 
performance. In this case, we don't need relevancy, as we sort by indexing 
time. Nevertheless, I see the point that users expect a property where they 
can configure the main query. 
(3) \_version\_ behaves like a timestamp, so there should be little chance 
that two documents within a collection have the same value (in a cluster). I 
know that there is a way to convert it into a timestamp, but I first have to 
figure out how to do this exactly. Sorting by "\_version\_ asc" and using 
cursor marks should make the retrieval reliable to a very high degree. 
(4) I want to emphasize again that the logic and the purpose of GetSolr don't 
cover the capabilities of Solr sufficiently. There should be an additional 
processor to use Solr not only as a source, but also as a query layer. Features 
like faceting, grouping, pivoting (e. g. for analytical purposes), spellchecking 
(e. g. for OCR or NLP), etc. are not covered by GetSolr (and 
shouldn't be, as the main focus should remain on reliable retrieval). 
However, there should be a more flexible option to query Solr within workflows.


was (Author: jope):
(1) Sorting by ID ensures that each document is retrieved only once, even if 
the document is updated. Sorting by \_version\_ asc ensures that each version 
of a document is retrieved once, as updated documents are "appended" at the 
end. I personally expect that someone who uses Solr as a source wants to see 
updated Solr documents replace old ones in the target system. However, we 
could make this configurable. 
(2) The parameter fq provides the same query capabilities as q and can be 
used in the same way. The essential difference is that q is basically used to 
calculate relevancy, whereas fq is basically used to filter and to improve 
performance. In this case, we don't need relevancy, as we sort by indexing 
time. Nevertheless, I see the point that users expect a property where they 
can configure the main query. 
(3) \_version\_ behaves like a timestamp, so there should be little chance 
that two documents within a collection have the same value (in a cluster). I 
know that there is a way to convert it into a timestamp, but I first have to 
figure out how to do this exactly. Sorting by "\_version\_ asc" and using 
cursor marks should make the retrieval reliable to a very high degree. 
(4) I want to emphasize again that the logic and the purpose of GetSolr don't 
cover the capabilities of Solr sufficiently. There should be an additional 
processor to use Solr not only as a source, but also as a query layer. Features 
like faceting, grouping, pivoting (e. g. for analytical purposes), spellchecking 
(e. g. for OCR or NLP), etc. are not covered by GetSolr (and 
shouldn't be, as the main focus should remain on reliable retrieval). 
However, there should be a more flexible option to query Solr within workflows.


[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-04 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152432#comment-16152432
 ] 

Johannes Peter commented on NIFI-3248:
--

(1) Sorting by ID ensures that each document is retrieved once, even if the 
document is updated. Sorting by \_version\_ asc ensures that each version of a 
document is retrieved once, as updated documents are "appended" at the end. I 
personally expect that someone who uses Solr as a source wants to see updated 
Solr documents replace old ones in the target system. However, we could make 
this configurable. 
(2) The parameter fq provides the same query capabilities as q and can be 
used in the same way. The essential difference is that q is basically used to 
calculate relevancy, whereas fq is basically used to filter and to improve 
performance. In this case, we don't need relevancy, as we sort by indexing 
time. Nevertheless, I see the point that users expect a property where they 
can configure the main query. 
(3) \_version\_ behaves like a timestamp, so there should be little chance 
that two documents within a collection have the same value (in a cluster). I 
know that there is a way to convert it into a timestamp, but I first have to 
figure out how to do this exactly. Sorting by "\_version\_ asc" and using 
cursor marks should make the retrieval reliable to a very high degree. 
(4) I want to emphasize again that the logic and the purpose of GetSolr don't 
cover the capabilities of Solr sufficiently. There should be an additional 
processor to use Solr not only as a source, but also as a query layer. Features 
like faceting, grouping, pivoting (e. g. for analytical purposes), spellchecking 
(e. g. for OCR or NLP), etc. are not covered by GetSolr (and 
shouldn't be, as the main focus should remain on reliable retrieval). 
However, there should be a more flexible option to query Solr within workflows.


[jira] [Comment Edited] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-04 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152432#comment-16152432
 ] 

Johannes Peter edited comment on NIFI-3248 at 9/4/17 10:17 AM:
---

(1) Sorting by ID ensures that each document is retrieved only once, even if 
the document is updated. Sorting by \_version\_ asc ensures that each version 
of a document is retrieved once, as updated documents are "appended" at the 
end. I personally expect that someone who uses Solr as a source wants to see 
updated Solr documents replace old ones in the target system. However, we 
could make this configurable. 
(2) The parameter fq provides the same query capabilities as q and can be 
used in the same way. The essential difference is that q is basically used to 
calculate relevancy, whereas fq is basically used to filter and to improve 
performance. In this case, we don't need relevancy, as we sort by indexing 
time. Nevertheless, I see the point that users expect a property where they 
can configure the main query. 
(3) \_version\_ behaves like a timestamp, so there should be little chance 
that two documents within a collection have the same value (in a cluster). I 
know that there is a way to convert it into a timestamp, but I first have to 
figure out how to do this exactly. Sorting by "\_version\_ asc" and using 
cursor marks should make the retrieval reliable to a very high degree. 
(4) I want to emphasize again that the logic and the purpose of GetSolr don't 
cover the capabilities of Solr sufficiently. There should be an additional 
processor to use Solr not only as a source, but also as a query layer. Features 
like faceting, grouping, pivoting (e. g. for analytical purposes), spellchecking 
(e. g. for OCR or NLP), etc. are not covered by GetSolr (and 
shouldn't be, as the main focus should remain on reliable retrieval). 
However, there should be a more flexible option to query Solr within workflows.


was (Author: jope):
(1) Sorting by ID ensures that each document is retrieved once, even if the 
document is updated. Sorting by \_version\_ asc ensures that each version of a 
document is retrieved once, as updated documents are "appended" at the end. I 
personally expect that someone who uses Solr as a source wants to see updated 
Solr documents replace old ones in the target system. However, we could make 
this configurable. 
(2) The parameter fq provides the same query capabilities as q and can be 
used in the same way. The essential difference is that q is basically used to 
calculate relevancy, whereas fq is basically used to filter and to improve 
performance. In this case, we don't need relevancy, as we sort by indexing 
time. Nevertheless, I see the point that users expect a property where they 
can configure the main query. 
(3) \_version\_ behaves like a timestamp, so there should be little chance 
that two documents within a collection have the same value (in a cluster). I 
know that there is a way to convert it into a timestamp, but I first have to 
figure out how to do this exactly. Sorting by "\_version\_ asc" and using 
cursor marks should make the retrieval reliable to a very high degree. 
(4) I want to emphasize again that the logic and the purpose of GetSolr don't 
cover the capabilities of Solr sufficiently. There should be an additional 
processor to use Solr not only as a source, but also as a query layer. Features 
like faceting, grouping, pivoting (e. g. for analytical purposes), spellchecking 
(e. g. for OCR or NLP), etc. are not covered by GetSolr (and 
shouldn't be, as the main focus should remain on reliable retrieval). 
However, there should be a more flexible option to query Solr within workflows.


[jira] [Comment Edited] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-03 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151784#comment-16151784
 ] 

Johannes Peter edited comment on NIFI-3248 at 9/3/17 12:17 PM:
---

[~ijokarumawak], [~bbende]
I examined the current GetSolr implementation and found several issues that 
I want to discuss:
(1) Currently, a date field needs to be included in the index schema and in 
the Solr documents for indexing. Although this can easily be realized via 
Solr's TimestampUpdateProcessor, it would be better simply to use Solr's 
\_version\_ field for filtering subsequent retrievals. This field is included 
in every well-configured Solr index, as it is required for several 
functionalities. By doing so, this processor could also be used for indexes 
that were not created with NiFi interactions in mind. 
(2) Iterating through a result set is only done the first time the processor 
runs. This will be problematic if the amount of newly indexed documents 
in a trigger interval exceeds the configured batch size.
(3) Successively increasing the start parameter to retrieve Solr documents in 
batches comes with two problems in this context. First, it performs poorly 
for large collections. Second, updating the index during the iteration will 
probably lead to duplicates or a loss of documents if positions of documents 
change due to newly indexed documents or deletions. Instead of increasing the 
start parameter, cursor marks should be used, and the sorting should be fixed 
to an ascending order of the time when documents were indexed (\_version\_ 
field). More details on this can be found at 
https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html
(4) Using the fq parameter instead of the q parameter should improve the 
performance in some cases, as Solr is able to use caches for fq. The 
q parameter should be fixed to "\*:\*". 

As a consequence, I suggest redesigning the GetSolr processor so that it 
mainly focuses on retrieving documents reliably. This can best be done by 
using cursor marks and the \_version\_ field. Additionally, users should not 
be able to change the parameters sort and q. The full query capabilities of 
Solr could be made available by an additional processor, e. g. 
"FetchSolr".


was (Author: jope):
[~ijokarumawak], [~bbende]
I examined the current GetSolr implementation and found several issues that 
I want to discuss:
(1) Currently, a date field needs to be included in the index schema and in 
the Solr documents for indexing. Although this can easily be realized via 
Solr's TimestampUpdateProcessor, it would be better simply to use Solr's 
\_version\_ field for filtering subsequent retrievals. This field is included 
in every well-configured Solr index, as it is required for several 
functionalities. By doing so, this processor could also be used for indexes 
that were not created with NiFi interactions in mind. 
(2) Iterating through a result set is only done the first time the processor 
runs. This will be problematic if the amount of newly indexed documents 
in a trigger interval exceeds the configured batch size.
(3) Successively increasing the start parameter to retrieve Solr documents in 
batches comes with two problems in this context. First, it performs poorly 
for large collections. Second, updating the index during the iteration will 
probably lead to duplicates or a loss of documents if positions of documents 
change due to newly indexed documents or deletions. Instead of increasing the 
start parameter, cursor marks should be used, and the sorting should be fixed 
to an ascending order of the time when documents were indexed (\_version\_ 
field). More details on this can be found at 
https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html
(4) Using the fq parameter instead of the q parameter should improve the 
performance in some cases, as Solr is able to use caches for fq. The 
q parameter should be fixed to "*:*". 

As a consequence, I suggest redesigning the GetSolr processor so that it 
mainly focuses on retrieving documents reliably. This can best be done by 
using cursor marks and the \_version\_ field. Additionally, users should not 
be able to change the parameters sort and q. The full query capabilities of 
Solr could be made available by an additional processor, e. g. 
"FetchSolr".


[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-09-03 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151784#comment-16151784
 ] 

Johannes Peter commented on NIFI-3248:
--

[~ijokarumawak], [~bbende]
I examined the current GetSolr implementation and I found several issues, which 
I want to discuss:
(1) Currently, a date field needs to be included into the index schema and the 
Solr documents for indexing. Although this can be realized easily via Solrs' 
TimestampUpdateProcessor, it should be better, simply to use Solrs' \_version\_ 
field for filtering subsequent retrieval. This field is included in every 
well-configured Solr index as it is required for several functionalities. By 
doing so, this processor could also be used for indexes, which were not created 
considering NiFi interactions. 
(2) Iterating through a resultset will only be done if the processor runs the 
first time. This will be problematic if the amount of newly indexed documents 
in a trigger interval exceeds the configured batch size.
(3) Successively increasing the start parameter to retrieve Solr documents in 
batches causes two problems in this context. First, it performs poorly for 
large collections. Second, updating the index during the iteration will likely 
lead to duplicates or lost documents when document positions shift due to newly 
indexed or deleted documents. Instead of increasing the start parameter, cursor 
marks should be used, and the sort should be fixed to ascending order of the 
time when documents were indexed (the \_version\_ field). More details can be 
found here: 
https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html
(4) Using the fq parameter instead of the q parameter should improve 
performance in some cases, as Solr can use its filter cache for fq. The q 
parameter should be fixed to "*:*". 

As a consequence, I suggest redesigning the GetSolr processor so that it 
focuses mainly on retrieving documents reliably. This can best be done using 
cursor marks and the \_version\_ field. Additionally, users should not be able 
to change the sort and q parameters. The full query capabilities of Solr could 
be made available by an additional processor, e.g. "FetchSolr".
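The parameters proposed above (fixed q, fq on \_version\_, ascending sort, cursor marks) can be sketched as plain query-parameter construction. This is an illustrative sketch, not NiFi code; the class and method names are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CursorQuerySketch {
    // Build the query parameters a redesigned GetSolr might send.
    // Names and values are illustrative only.
    static Map<String, String> buildParams(long lastVersion, String cursorMark, int rows) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", "*:*");                                    // main query fixed, per point (4)
        p.put("fq", "_version_:{" + lastVersion + " TO *]");  // exclusive lower bound on _version_
        p.put("sort", "_version_ asc");                       // stable sort required for cursor marks
        p.put("cursorMark", cursorMark);                      // "*" on the first page
        p.put("rows", Integer.toString(rows));
        return p;
    }

    public static void main(String[] args) {
        System.out.println(buildParams(1576693800000000000L, "*", 100));
    }
}
```

Each response would carry a nextCursorMark that is fed back into the next request until it stops changing.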


[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-08-31 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149066#comment-16149066
 ] 

Johannes Peter commented on NIFI-3248:
--

[~ijokarumawak] Sure. I will start within the next week.

> GetSolr can miss recently updated documents
> ---
>
> Key: NIFI-3248
> URL: https://issues.apache.org/jira/browse/NIFI-3248
> Project: Apache NiFi
>  Issue Type: Bug
>  Components: Extensions
>Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1, 
> 1.0.1
>Reporter: Koji Kawamura
>Assignee: Koji Kawamura
> Attachments: nifi-flow.png, query-result-with-curly-bracket.png, 
> query-result-with-square-bracket.png
>
>
> GetSolr holds the last query timestamp so that it only fetches documents 
> that have been added or updated since the last query.
> However, GetSolr misses some of those updated documents, and once a 
> document's date field value becomes older than the last query timestamp, the 
> document can no longer be queried by GetSolr.
> This JIRA is for tracking the process of investigating this behavior, and 
> discussion on them.
> Here are things that can be a cause of this behavior:
> |#|Short description|Should we address it?|
> |1|Timestamp range filter, curly or square bracket?|No|
> |2|Timezone difference between update and query|Additional docs might be 
> helpful|
> |3|Lag comes from NearRealTime nature of Solr|Should be documented at least, 
> add 'commit lag-time'?|
> h2. 1. Timestamp range filter, curly or square bracket?
> At first glance, using curly and square brackets in mix looks strange 
> ([source 
> code|https://github.com/apache/nifi/blob/support/nifi-0.5.x/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/java/org/apache/nifi/processors/solr/GetSolr.java#L202]).
>  But this difference has a meaning.
> The square bracket on the range query is inclusive and the curly bracket is 
> exclusive. If we used inclusive bounds on both sides and a document had a 
> timestamp exactly on the boundary, it could be returned in two consecutive 
> executions, and we only want it in one.
> This is intentional, and it should be as it is.
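The mixed-bracket range filter described above can be illustrated with a small helper. This is a sketch only; the field name is hypothetical, and which bound GetSolr actually makes exclusive is in the linked source — here the upper bound is exclusive for illustration.

```java
public class RangeFilterSketch {
    // Build a Solr range filter where the lower bound is inclusive ('[')
    // and the upper bound is exclusive ('}'), so a document whose timestamp
    // falls exactly on a boundary is returned by only one of two
    // consecutive queries sharing that boundary.
    static String rangeFilter(String field, String from, String to) {
        return field + ":[" + from + " TO " + to + "}";
    }

    public static void main(String[] args) {
        // created:[2016-12-27T06:00:00Z TO 2016-12-27T06:10:00Z}
        System.out.println(rangeFilter("created", "2016-12-27T06:00:00Z", "2016-12-27T06:10:00Z"));
    }
}
```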
> h2. 2. Timezone difference between update and query
> Solr treats date fields as [UTC 
> representation|https://cwiki.apache.org/confluence/display/solr/Working+with+Dates].
>  If the date field String value of an updated document represents time 
> without a timezone, and NiFi is running in an environment with a timezone 
> other than UTC, GetSolr can't perform the date range query as users expect.
> Let's say NiFi is running with JST(UTC+9). A process added a document to Solr 
> at 15:00 JST. But the date field doesn't have timezone. So, Solr indexed it 
> as 15:00 UTC. Then GetSolr performs range query at 15:10 JST, targeting any 
> documents updated from 15:00 to 15:10 JST. GetSolr formatted dates using UTC, 
> i.e. 6:00 to 6:10 UTC. The updated document won't be matched with the date 
> range filter.
> To avoid this, updated documents must have proper timezone in date field 
> string representation.
> If one uses the NiFi expression language to set the current timestamp on 
> that date field, the following NiFi expression can be used:
> {code}
> ${now():format("yyyy-MM-dd'T'HH:mm:ss.SSSZ")}
> {code}
> It will produce a result like:
> {code}
> 2016-12-27T15:30:04.895+0900
> {code}
> Then it will be indexed in Solr with UTC and will be queried by GetSolr as 
> expected.
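The effect of the trailing Z pattern letter can be reproduced with java.time. This sketch is illustrative: it formats the same instant both with its local offset (as the NiFi expression would) and converted to UTC (as Solr stores it).

```java
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class TimezoneSketch {
    // Same pattern as the NiFi expression ${now():format("yyyy-MM-dd'T'HH:mm:ss.SSSZ")}.
    // The 'Z' pattern letter appends the numeric offset (e.g. +0900), which
    // lets Solr convert the value to UTC unambiguously.
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ");

    public static void main(String[] args) {
        ZonedDateTime jst = ZonedDateTime.of(2016, 12, 27, 15, 30, 4,
                895_000_000, ZoneId.of("Asia/Tokyo"));
        System.out.println(FMT.format(jst));                                   // 2016-12-27T15:30:04.895+0900
        System.out.println(FMT.format(jst.withZoneSameInstant(ZoneOffset.UTC))); // 2016-12-27T06:30:04.895+0000
    }
}
```

Without the offset in the string, Solr would read 15:00 local time as 15:00 UTC, which is exactly the mismatch described above.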
> h2. 3. Lag comes from NearRealTime nature of Solr
> Solr provides Near Real Time search capability; that means recently updated 
> documents can be queried in near real time, but not in real time. 
> This latency can be controlled either on the client side, which requests the 
> update operation with the "commitWithin" parameter, or on the Solr server 
> side via "autoCommit" and "autoSoftCommit" in 
> [solrconfig.xml|https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig#UpdateHandlersinSolrConfig-Commits].
> Since committing and updating the index can be costly, it's recommended to 
> set this interval as long as the maximum tolerable latency allows.
> However, this can be problematic with GetSolr. For instance, as shown in the 
> simple NiFi flow below, GetSolr can miss updated documents:
> {code}
> t1: GetSolr queried
> t2: GenerateFlowFile set date = t2
> t3: PutSolrContentStream stored new doc
> t4: GetSolr queried again, from t1 to t4, but the new doc hasn't been indexed
> t5: Solr completed index
> t6: GetSolr queried again, from t4 to t6, the doc didn't match query
> {code}
> This behavior should be at least documented.
> Plus, it would be helpful to add a new configuration property to GetSolr to 
> specify a commit lag-time, so that GetSolr targets an older timestamp range 
> when querying documents.
> {code}
> // with commit 
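The proposed commit lag-time amounts to shifting the query window's upper bound into the past. A minimal sketch of that idea, with an invented helper name (the actual property and its handling are only proposed above, not implemented):

```java
import java.time.Duration;
import java.time.Instant;

public class CommitLagSketch {
    // Hypothetical 'commit lag-time': query up to 'now minus lag' instead of
    // 'now', so documents still waiting on a Solr (soft) commit fall into a
    // later run's window instead of being skipped forever.
    static Instant queryEnd(Instant now, Duration commitLag) {
        return now.minus(commitLag);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2016-12-27T06:10:00Z");
        System.out.println(queryEnd(now, Duration.ofSeconds(30))); // 2016-12-27T06:09:30Z
    }
}
```

The lag would need to be at least as large as the configured autoSoftCommit interval for the scenario (t1..t6) above to be fully avoided.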

[jira] [Comment Edited] (NIFI-3248) GetSolr can miss recently updated documents

2017-08-29 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16145306#comment-16145306
 ] 

Johannes Peter edited comment on NIFI-3248 at 8/29/17 1:46 PM:
---

Have you considered using the Solr field \_version\_ yet? It can be treated 
like a timestamp, and it can also be converted to a timestamp. E.g., sorting by 
"\_version\_ desc" orders documents by their time of indexing. 


was (Author: jope):
Have you considered to use the Solr field \_version\_ yet? It can be treated 
like a timestamp. It also can be transformed to a timestamp. E. g. sorting for 
"_version_ desc" sorts documents depending on their time of indexing. 



[jira] [Commented] (NIFI-3248) GetSolr can miss recently updated documents

2017-08-29 Thread Johannes Peter (JIRA)

[ 
https://issues.apache.org/jira/browse/NIFI-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16145306#comment-16145306
 ] 

Johannes Peter commented on NIFI-3248:
--

Have you considered using the Solr field "_version_" yet? It can be treated 
like a timestamp, and it can also be converted to a timestamp. E.g., sorting by 
"_version_ desc" orders documents by their time of indexing. 
