[ https://issues.apache.org/jira/browse/NIFI-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765220#comment-16765220 ]

ASF subversion and git services commented on NIFI-5938:
-------------------------------------------------------

Commit 36c0a99e91c492329c0df69ee4ae961e57295f84 in nifi's branch 
refs/heads/master from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=36c0a99 ]

NIFI-5938: Added ability to infer record schema on read from JsonTreeReader, 
JsonPathReader, XML Reader, and CSV Reader.
 - Updates to make UpdateRecord and RecordPath automatically update the Record 
schema when performing updates, and to perform the updates on the first record 
in UpdateRecord before obtaining the Writer Schema. This allows the Writer to 
inherit the Schema of the updated Record instead of the Schema of the Record as 
it was when it was read.
 - Updated JoltTransformRecord so that the schema is inferred on the first 
transformed object before passing the schema to the Record Writer, so that if 
the writer inherits the schema from the record, the schema that is inherited is 
the transformed schema
 - Updated LookupRecord to allow for Record fields to be arbitrarily added
 - Implemented ContentClaimInputStream
 - Added controller service for caching schemas
 - Updated QueryRecord to cache schemas automatically up to some number of 
schemas, which will significantly improve throughput in many cases, especially 
with inferred schemas.

NIFI-5938: Updated AvroTypeUtil so that if creating an Avro Schema using a 
field name that is not valid for Avro, it creates a Schema that uses a 
different, valid field name and adds an alias for the given field name so that 
the fields are still looked up appropriately. Fixed a bug in finding the 
appropriate Avro field when aliases are used. Updated ContentClaimInputStream 
so that if mark() is called followed by multiple calls to reset(), each 
reset() call is successful instead of failing after the first one. (The 
JavaDoc for InputStream indicates that an implementation is free to do 
either; an InputStream may even allow reset() to return to the beginning of 
the stream without mark() ever having been called, if it chooses to do so.)
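The mark()/reset() contract described above can be observed with a stock JDK stream as well; ByteArrayInputStream, like the fixed ContentClaimInputStream, honors multiple reset() calls after a single mark():

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Demonstrates that an InputStream may allow reset() to be called more
// than once after mark(): each reset() returns to the marked position.
public class MarkResetDemo {

    public static String readTwiceAfterMark() {
        InputStream in = new ByteArrayInputStream(
                "abcdef".getBytes(StandardCharsets.UTF_8));
        try {
            in.read();               // consume 'a'
            in.mark(16);             // mark at position 1 ('b')
            byte[] buf = new byte[2];
            in.read(buf);            // reads "bc"
            in.reset();              // first reset: back to 'b'
            in.read(buf);            // reads "bc" again
            in.reset();              // second reset also succeeds
            return new String(buf, StandardCharsets.UTF_8) + (char) in.read();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```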

NIFI-5938: Added another unit test for AvroTypeUtil
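The AvroTypeUtil field-name handling described above can be sketched roughly as follows. This is a hypothetical helper, not the actual NiFi code: it maps an arbitrary name to one matching Avro's `[A-Za-z_][A-Za-z0-9_]*` naming rule, with the original name expected to be registered as an alias on the resulting Avro field:

```java
// Hypothetical sketch of the renaming strategy: produce an Avro-legal
// field name; the caller registers the original name as an alias so
// lookups by the original name still succeed.
public class AvroNameSketch {

    public static String toValidAvroName(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 1);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            boolean digit = (c >= '0' && c <= '9');
            if (i == 0 && digit) {
                sb.append('_').append(c);   // names may not start with a digit
            } else if (letter || digit || c == '_') {
                sb.append(c);               // already legal
            } else {
                sb.append('_');             // replace illegal character
            }
        }
        return sb.length() == 0 ? "_" : sb.toString();
    }
}
```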

NIFI-5938: If using an inferred schema in the CSV Reader, do not consider the 
first record as a header line. Also addressed a bug in 
StandardConfigurationContext that was exposed by CSVReader, in which calling 
getProperty(PropertyDescriptor) did not properly look up the canonical 
representation of the Property Descriptor from the component before attempting 
to get a default value

Signed-off-by: Matthew Burgess <mattyb...@apache.org>

This closes #3253


> Allow Record Readers to Infer Schema on Read
> --------------------------------------------
>
>                 Key: NIFI-5938
>                 URL: https://issues.apache.org/jira/browse/NIFI-5938
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 1.9.0
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The introduction of record-oriented processors was a huge improvement for 
> NiFi in terms of usability. However, they only improve usability if you have 
> a schema for your data. There have been several comments along the lines of 
> "I would really love to use the record-oriented processors, but I don't have 
> a schema for my data."
> Sometimes users have no schema because they don't want to bother with 
> creating the schemas. The schema becomes a usability issue. This is 
> especially true for very large documents that contain a lot of nested 
> Records. Other times, users cannot create a schema because they retrieve 
> arbitrary data from some source, and they have no idea what the data will 
> look like.
> We do not want to remove the notion of a schema, however. Schemas provide 
> a very powerful construct for many use cases, and they give Processors a 
> much easier-to-use API. If we provide the ability to Infer the Schema on 
> Read, though, we can provide the best of both worlds. While we do have 
> processors for inferring schemas for JSON and CSV data, those are not always 
> sufficient. They cannot be used, for instance, by ConsumeKafkaRecord, 
> ExecuteSQL, etc., because those Processors need the schema at the time the 
> data is read. 
> Additionally, we have no ability to infer a schema for XML, logs, etc.
> Finally, we need to consider processors that are designed to manipulate the 
> data. For example, UpdateRecord, JoltTransformRecord, LookupRecord (when used 
> for enrichment), and QueryRecord. These Processors follow a typical pattern 
> of "get reader's schema, then provide it to the writer in order to get 
> writer's schema." This means that if the Record Writer inherits the record's 
> schema, and we infer that schema, then any newly added fields will simply be 
> dropped by the writer because the writer's schema doesn't know about those 
> fields. As a result, we need to ensure that we first transform the first 
> record, get the schema for the transformed record, and then pass that 
> transformed record's schema to the Writer, so that the Writer inherits the 
> schema describing data after transformation.
> Design/Implementation Goals should include:
> - High performance: the performance impact on users should be as small as 
> is feasible.
> - Usability: users should be able to infer schemas with as little 
> configuration as is reasonable.
> - Ease of Development: code should be written in a way that makes it easy for 
> new Record Readers to provide schema inference that is fast, efficient, 
> correct, and consistent with how the other readers infer schemas.
> - Implementations: At a minimum, we should provide the ability to infer 
> schemas for JSON, XML, and CSV data.
> - Backward Compatibility: The new feature should not break backward 
> compatibility for any Record Reader.
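The "transform first, then get the writer's schema" pattern from the description above can be sketched as follows. This is a hypothetical illustration, not NiFi's API: the writer's field set is derived from the first transformed record, so fields added by the transform survive:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: transform every record first, then derive the
// writer's field names from the FIRST TRANSFORMED record rather than the
// raw input record, so that newly added fields are not dropped.
public class TransformFirstSketch {

    public static List<String> writerFieldNames(
            List<Map<String, Object>> records,
            UnaryOperator<Map<String, Object>> transform) {
        List<Map<String, Object>> transformed = new ArrayList<>();
        for (Map<String, Object> record : records) {
            transformed.add(transform.apply(record));
        }
        // The schema (here, just the field names) comes from the
        // transformed data, not the data as it was read.
        return new ArrayList<>(transformed.get(0).keySet());
    }
}
```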



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
