[ https://issues.apache.org/jira/browse/NIFI-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763047#comment-16763047 ]
Mark Payne commented on NIFI-5938:
----------------------------------

[~joewitt] I believe so. I pushed a new commit just now to address the review feedback.

> Allow Record Readers to Infer Schema on Read
> --------------------------------------------
>
>                 Key: NIFI-5938
>                 URL: https://issues.apache.org/jira/browse/NIFI-5938
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 1.9.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The introduction of record-oriented processors was a huge improvement for NiFi in terms of usability. However, they only improve usability if you have a schema for your data. There have been several comments along the lines of "I would really love to use the record-oriented processors, but I don't have a schema for my data."
>
> Sometimes users have no schema because they don't want to bother with creating one; the schema becomes a usability issue. This is especially true for very large documents that contain many nested Records. Other times, users cannot create a schema because they retrieve arbitrary data from some source and have no idea what the data will look like.
>
> We do not want to remove the notion of a schema, however. Schemas are a very powerful construct for many use cases, and they give Processors a much easier-to-use API. If we provide the ability to infer the schema on read, though, we can offer the best of both worlds. While we do have processors for inferring schemas for JSON and CSV data, those are not always sufficient. They cannot be used, for instance, by ConsumeKafkaRecord, ExecuteSQL, etc., because those Processors need the schema before retrieving the data. Additionally, we have no ability to infer a schema for XML, logs, etc.
>
> Finally, we need to consider processors that are designed to manipulate the data.
> For example, UpdateRecord, JoltTransformRecord, LookupRecord (when used for enrichment), and QueryRecord. These Processors follow a typical pattern of "get the reader's schema, then provide it to the writer in order to get the writer's schema." This means that if the Record Writer inherits the record's schema, and we infer that schema, then any newly added fields will simply be dropped by the writer, because the writer's schema doesn't know about those fields. As a result, we need to ensure that we first transform the first record, get the schema for the transformed record, and then pass that transformed record's schema to the Writer, so that the Writer inherits the schema describing the data after transformation.
>
> Design/Implementation Goals should include:
> - High Performance: users should be impacted as little as is feasible.
> - Usability: users should be able to infer schemas with as little configuration as is reasonable.
> - Ease of Development: code should be written in a way that makes it easy for new Record Readers to provide schema inference that is fast, efficient, correct, and consistent with how the other readers infer schemas.
> - Implementations: at a minimum, we should provide the ability to infer schemas for JSON, XML, and CSV data.
> - Backward Compatibility: the new feature should not break backward compatibility for any Record Reader.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
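The "transform first, then derive the writer's schema" pattern quoted above can be sketched as follows. This is an illustrative Python sketch, not NiFi's actual Java implementation: the function names (`infer_schema`, `enrich`), the toy type mapping, and the enrichment step are all assumptions made for the example.

```python
def infer_schema(record):
    """Infer a simple {field: type-name} schema from one record.
    Nested dicts become nested schemas, loosely mirroring how a
    record reader could infer structure from JSON on read."""
    schema = {}
    for name, value in record.items():
        if isinstance(value, dict):
            schema[name] = infer_schema(value)  # nested Record
        elif isinstance(value, bool):           # bool before int: bool is an int subtype
            schema[name] = "boolean"
        elif isinstance(value, int):
            schema[name] = "long"
        elif isinstance(value, float):
            schema[name] = "double"
        else:
            schema[name] = "string"
    return schema

def enrich(record):
    """Stand-in for an UpdateRecord/LookupRecord-style transform that
    adds a field the reader's inferred schema knows nothing about."""
    out = dict(record)
    out["fullName"] = f"{record['first']} {record['last']}"
    return out

records = [{"first": "Ada", "last": "Lovelace", "age": 36}]

# Naive pattern: the writer inherits the reader's inferred schema,
# so the added "fullName" field would be silently dropped.
reader_schema = infer_schema(records[0])
assert "fullName" not in reader_schema

# Correct pattern: transform the first record FIRST, then take the
# writer's schema from the transformed record.
first_transformed = enrich(records[0])
writer_schema = infer_schema(first_transformed)
assert "fullName" in writer_schema
```

The key point is only the order of operations: the writer's schema must describe the data as it looks after transformation, not as it was read.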