[ https://issues.apache.org/jira/browse/NIFI-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763047#comment-16763047 ]

Mark Payne commented on NIFI-5938:
----------------------------------

[~joewitt] I believe so. I pushed a new commit just now to address the review 
feedback.

> Allow Record Readers to Infer Schema on Read
> --------------------------------------------
>
>                 Key: NIFI-5938
>                 URL: https://issues.apache.org/jira/browse/NIFI-5938
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 1.9.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The introduction of record-oriented processors was a huge improvement for 
> NiFi in terms of usability. However, they only improve usability if you have 
> a schema for your data. There have been several comments along the lines of 
> "I would really love to use the record-oriented processors, but I don't have 
> a schema for my data."
>
> Sometimes users have no schema because they don't want the burden of creating 
> one; the schema itself becomes a usability hurdle. This is especially true for 
> very large documents that contain many nested Records. Other times, users 
> cannot create a schema because they retrieve arbitrary data from some source 
> and have no idea in advance what the data will look like.
>
> We do not want to remove the notion of a schema, however. Schemas are a very 
> powerful construct for many use cases, and they give Processors a much 
> easier-to-use API. If we provide the ability to Infer the Schema on Read, 
> though, we can offer the best of both worlds. While we do have processors for 
> inferring schemas for JSON and CSV data, those are not always sufficient. They 
> cannot be used, for instance, by ConsumeKafkaRecord, ExecuteSQL, etc., because 
> those Processors need the schema at the point of ingest, before a separate 
> schema-inference processor could ever run on the data. Additionally, we have 
> no ability to infer a schema for XML, logs, etc.
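>
> To make "infer on read" concrete: a minimal, self-contained sketch of 
> type-sniffing a single JSON record. Jackson and all names here are purely 
> illustrative assumptions; this is not the inference engine the ticket 
> proposes.
> {code:java}
> import com.fasterxml.jackson.databind.JsonNode;
> import com.fasterxml.jackson.databind.ObjectMapper;
>
> import java.util.LinkedHashMap;
> import java.util.Map;
>
> // Walks one JSON object and guesses a field-name -> type mapping.
> // Nested objects recurse; anything unrecognized falls back to "string".
> public class JsonTypeSniffer {
>
>     public static Map<String, Object> inferTypes(JsonNode node) {
>         Map<String, Object> schema = new LinkedHashMap<>();
>         node.fields().forEachRemaining(entry -> {
>             JsonNode value = entry.getValue();
>             Object type;
>             if (value.isIntegralNumber()) {
>                 type = "long";
>             } else if (value.isFloatingPointNumber()) {
>                 type = "double";
>             } else if (value.isBoolean()) {
>                 type = "boolean";
>             } else if (value.isObject()) {
>                 type = inferTypes(value); // nested record
>             } else {
>                 type = "string";
>             }
>             schema.put(entry.getKey(), type);
>         });
>         return schema;
>     }
>
>     public static void main(String[] args) throws Exception {
>         JsonNode record = new ObjectMapper().readTree(
>             "{\"id\": 1, \"score\": 2.5, \"user\": {\"name\": \"ada\"}}");
>         // Prints: {id=long, score=double, user={name=string}}
>         System.out.println(inferTypes(record));
>     }
> }
> {code}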
>
> Finally, we need to consider processors that are designed to manipulate the 
> data, such as UpdateRecord, JoltTransformRecord, LookupRecord (when used for 
> enrichment), and QueryRecord. These Processors follow a typical pattern of 
> "get the reader's schema, then provide it to the writer in order to get the 
> writer's schema." This means that if the Record Writer inherits the record's 
> schema, and we infer that schema, then any newly added fields will simply be 
> dropped by the writer, because the writer's schema doesn't know about those 
> fields. As a result, we need to ensure that we first transform the first 
> record, get the schema of the transformed record, and then pass that schema 
> to the Writer, so that the Writer inherits the schema describing the data 
> after transformation, as the sketch below illustrates.
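>
> A rough sketch of that reordering, using plain Maps as stand-in records (the 
> transform and record helpers are hypothetical, not NiFi's Record API):
> {code:java}
> import java.util.LinkedHashMap;
> import java.util.List;
> import java.util.Map;
>
> // Demonstrates why the writer schema must come from the TRANSFORMED record:
> // a transform that adds a field would otherwise see that field dropped.
> public class TransformThenInferDemo {
>
>     // Hypothetical enrichment step: adds a "fullName" field.
>     static Map<String, Object> transform(Map<String, Object> record) {
>         Map<String, Object> out = new LinkedHashMap<>(record);
>         out.put("fullName", record.get("first") + " " + record.get("last"));
>         return out;
>     }
>
>     static Map<String, Object> record(String first, String last) {
>         Map<String, Object> r = new LinkedHashMap<>();
>         r.put("first", first);
>         r.put("last", last);
>         return r;
>     }
>
>     public static void main(String[] args) {
>         List<Map<String, Object>> records =
>             List.of(record("Ada", "Lovelace"), record("Alan", "Turing"));
>
>         // Naive order: derive the writer schema from the reader's inferred
>         // schema. "fullName" is absent, so the writer would drop it.
>         System.out.println("reader schema: " + records.get(0).keySet());
>
>         // Correct order: transform the first record FIRST, then derive the
>         // writer schema from the transformed record.
>         System.out.println("writer schema: " + transform(records.get(0)).keySet());
>         // reader schema: [first, last]
>         // writer schema: [first, last, fullName]
>     }
> }
> {code}
>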
> Design/Implementation Goals should include:
> - High Performance: inferring the schema should degrade performance as 
> little as is feasible.
> - Usability: users should be able to infer schemas with as little 
> configuration as is reasonable.
> - Ease of Development: code should be written in a way that makes it easy for 
> new Record Readers to provide schema inference that is fast, efficient, 
> correct, and consistent with how the other readers infer schemas (see the 
> interface sketch after this list).
> - Implementations: At a minimum, we should provide the ability to infer 
> schemas for JSON, XML, and CSV data.
> - Backward Compatibility: The new feature should not break backward 
> compatibility for any Record Reader.
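>
> For the Ease of Development goal, one hypothetical shape such a shared 
> abstraction could take (illustrative only; the actual interfaces may 
> differ): each reader exposes its raw parse events, and a per-format engine 
> folds them into a schema, so a new reader implements only the source side.
> {code:java}
> import java.io.IOException;
>
> // Hypothetical sketch of a shared inference abstraction. Each Record Reader
> // exposes its raw parse events (a JSON node, a CSV row, an XML element...)
> // as a RecordSource; a single engine per format folds those events into a
> // schema, keeping inference fast and consistent across readers.
> interface RecordSource<T> {
>     T next() throws IOException; // null once the input is exhausted
> }
>
> interface SchemaInferenceEngine<T> {
>     RecordSchema inferSchema(RecordSource<T> source) throws IOException;
> }
>
> // Stand-in for the framework's schema type.
> interface RecordSchema {
> }
> {code}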



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
