[ 
https://issues.apache.org/jira/browse/NIFI-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325078#comment-15325078
 ] 

Daniel Cave edited comment on NIFI-1935 at 6/10/16 7:00 PM:
------------------------------------------------------------

I had to revisit the issues to remind myself of the cases involved.  
Previously I had problems with the processor not reading the schema from the 
attribute correctly, due to the way attributes are interpreted when they 
contain JSON (i.e. as a JSON string value keyed by the attribute name); that 
may since have been fixed.  I retested today against 0.x, and 
ConvertJSONToAvro says it cannot find the schema if you try to read it from 
an attribute, likely for the same reason AttributesToJSON fails to convert it 
properly: it can't read avro.binary and requires the value to be a true 
string.  Also, ConvertJSONToAvro doesn't register data provenance, which is a 
bug that needs its own ticket and fix.
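
For illustration, here is a minimal sketch of that failure mode (not NiFi 
code; the attribute values are hypothetical).  Avro's Schema.Parser only 
succeeds when the attribute holds the schema as plain JSON text, which is 
exactly what the avro.binary form is not:

    import org.apache.avro.Schema;

    public class AttributeSchemaCheck {
        public static void main(String[] args) {
            // A schema attribute as plain JSON text -- this parses fine.
            String asText = "{\"type\":\"record\",\"name\":\"r\","
                    + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}";
            // What a binary/garbled attribute value effectively yields.
            String asBinary = "";

            System.out.println(parse(asText));    // the parsed schema
            System.out.println(parse(asBinary));  // the parse failure
        }

        static Object parse(String attributeValue) {
            try {
                return new Schema.Parser().parse(attributeValue);
            } catch (RuntimeException e) {
                // Avro throws SchemaParseException (a RuntimeException)
                // on input that isn't valid schema JSON.
                return "not a valid schema: " + e;
            }
        }
    }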

However, the case for this processor is interrelated with the use cases for 
InferAvroSchema (which also required the changes to SplitJSON).  If I write 
the schema out as an attribute and the only use of that schema is to convert 
the JSON to Avro, then you are correct that the existing processor is 
sufficient.  However, writing the schema to an attribute presents other 
issues and limits its usefulness.  My use cases for the schema also involve 
using the same schema to generate SQL/CQL/Hive/etc. statements for ingestion, 
as well as sending the Avro to a schema registry and programmatically 
creating RDDs.  To do all this I need the schema as content, in its proper 
JSON form.  In theory, AttributesToJSON would do this, but it isn't language 
aware: it creates JSON of the form { <attributeName> : <attributeValue> }, 
where the attributeName is "schema" and the attributeValue is the Avro schema 
as a JSON string (there is a similar issue when using InvokeHTTP with the 
response as an attribute).  In practice, due to the way the schema exists in 
the attribute, it actually returns an empty string (since the attribute is in 
avro.binary form).  The processor seems to have been meant for simple 
attributes, not complex ones; putting an Avro schema in an attribute creates 
a case where the attribute content is itself Avro or JSON.  EvaluateJsonPath 
is also an option, but again, once you extract the Avro schema from the 
attribute you'll find it is no longer in a proper format and is neither valid 
JSON nor a valid schema.
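
To make the nesting concrete, here is a minimal sketch (Jackson used purely 
for illustration; this is not what AttributesToJSON does internally) of why a 
schema carried as an attribute value comes out as one escaped string, so a 
consumer has to parse the JSON twice to get the schema back:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.Collections;

    public class DoubleEncodedSchema {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            String schema = "{\"type\":\"record\",\"name\":\"r\","
                    + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}";

            // Attribute-style output: the schema is just a string value.
            String attrJson = mapper.writeValueAsString(
                    Collections.singletonMap("schema", schema));
            System.out.println(attrJson);
            // {"schema":"{\"type\":\"record\",...}"}  <- escaped, not
            // directly usable by a schema registry or statement generator.

            // The consumer must parse twice: the wrapper, then the value.
            String inner = mapper.readTree(attrJson).get("schema").asText();
            System.out.println(mapper.readTree(inner).get("name").asText());
            // prints: r
        }
    }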

Basically, there are four options to fix the issue:

1. Run the inference twice (once to content and once to attribute).  Not 
   desirable due to the overhead on small devices with sub-second 
   throughput, and it still involves fixing the schema attribute issue.
2. Upgrade ConvertJSONToAvro to handle schemas from either content or 
   attribute.
3. Make major changes to AttributesToJSON, or create a new complex version 
   of it.
4. Split Convert into Convert and ConvertDynamic, where the latter can 
   accept the schema from flowfile content (which has other use cases as 
   well); see the sketch after this paragraph.

I chose the last option as it created the fewest backwards-compatibility 
issues.  That is not to say it is necessarily the best choice of the four in 
all cases; it was the lesser of evils for the community, in my view, for 
now.  If you disagree, or I've missed a better way to extract the schema, 
then I'm certainly open to discussion and to revisiting my design for 
dynamically handling everything from source to sink (any sink).  Keep in 
mind that some sink processors require everything in flowfile content 
(PutCassandra), some are hybrid (ExecuteSQL), and some want everything in 
attributes (PutHiveQL).  Since those design-inconsistent processors are 
already in public use, I have to be able to interpret the Avro schema and 
create statements for any of the three in order to handle the JSON and sink 
or transfer it into any source, which means I need to apply different kinds 
of parsing depending on the sink.
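
For reference, here is the kind of content-based schema read that option 4 
implies, as a minimal sketch against the NiFi processor API (the class and 
method names here are mine, not the patch's):

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import org.apache.avro.Schema;
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.stream.io.StreamUtils;

    // Sketch only: read an Avro schema from flowfile *content* rather
    // than from an attribute, sidestepping the attribute-encoding
    // problems described above.
    public class ReadSchemaFromContentSketch {
        Schema readSchema(ProcessSession session, FlowFile schemaFlowFile) {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            // Stream the flowfile content out of the content repository.
            session.read(schemaFlowFile, in -> StreamUtils.copy(in, bytes));
            // Parse it as plain JSON text -- the same form a schema
            // registry or SQL/CQL statement generator can consume.
            return new Schema.Parser().parse(
                    new String(bytes.toByteArray(), StandardCharsets.UTF_8));
        }
    }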

Let me know what you think.  Also, keep in mind that in building this 
response I found three new bugs across at least three processors that need 
new tickets: evaluateAttributeExpressions() doesn't seem to be able to 
handle avro.binary (which I assume affects anything evaluating such an 
attribute, and may apply to normal blobs too); ConvertJsonToAvro isn't 
writing any data provenance on failure; and InferAvroSchema doesn't write 
JSON to the attribute in the right form.


> Added ConvertDynamicJsonToAvro processor
> ----------------------------------------
>
>                 Key: NIFI-1935
>                 URL: https://issues.apache.org/jira/browse/NIFI-1935
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>    Affects Versions: 1.0.0, 0.7.0
>            Reporter: Daniel Cave
>            Assignee: Alex Halldin
>            Priority: Minor
>             Fix For: 1.0.0, 0.7.0
>
>         Attachments: 
> 0001-NIFI-1935-Added-ConvertDynamicJSONToAvro.java.-Added.patch
>
>
> ConvertJsonToAvro required a predefined Avro schema to convert JSON and 
> required the presence of all fields on the incoming JSON.  
> ConvertDynamicJsonToAvro functions similarly, however it now accepts the JSON 
> and schema as incoming flowfiles and creates the Avro dynamically.
> This processor requires the InferAvroSchema processor in its upstream flow so 
> that it can use the original and schema flowfiles as input.  These two 
> flowfiles will have the unique attribute inferredAvroId set on them by 
> InferAvroSchema so that they can be properly matched in 
> ConvertDynamicJsonToAvro.
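
The description above implies a pairing protocol.  As a minimal sketch of it 
(the attribute name inferredAvroId comes from the description; everything 
else here is hypothetical, with a plain attribute map standing in for a 
FlowFile), a downstream processor could park the first flowfile of each pair 
until its mate with the same id arrives:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of matching an original JSON flowfile with its
    // schema flowfile via the shared inferredAvroId attribute.
    public class InferredIdMatcher {
        private final Map<String, Map<String, String>> waiting = new HashMap<>();

        /** Returns the mate's attributes once both halves arrived, else null. */
        Map<String, String> offer(Map<String, String> attributes) {
            String id = attributes.get("inferredAvroId");
            Map<String, String> mate = waiting.remove(id);
            if (mate == null) {
                waiting.put(id, attributes);  // first half: park it and wait
            }
            return mate;
        }
    }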



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
