Re: AVRO is the only output format with ExecuteSQL

Matt Burgess Tue, 07 Aug 2018 11:17:30 -0700

Yeah that's definitely doable, most of the logic for writing a
ResultSet to a Flow File is localized (currently to JdbcCommon but
also in ResultSetRecordSet), so I wouldn't think it would be too much
of a refactor. What are folks' thoughts on whether to add a Record Writer
property to the existing ExecuteSQL or subclass it to a new processor
called ExecuteSQLRecord? The former is more consistent with how the
SiteToSite reporting tasks work, but this is a processor. The latter
is more consistent with the way we've done other record processors,
and the benefit there is that we don't have to add a bunch of
documentation to fields that will be ignored (such as the Use Avro
Logical Types property, which we wouldn't need in an ExecuteSQLRecord).
Having said that, we will want to offer the same options in the Avro
Reader/Writer, but Peter is working on that under NIFI-5405 [1].


Thanks,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-5405
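
The subclassing option discussed in this message (and in Andy's quoted
reply below) can be sketched in a few lines. This is only an illustration
of the shape of the refactor, not NiFi code: it is written in Python
against the standard-library sqlite3 module, and every name in it
(on_trigger, write_result_set, json_writer, csv_writer, the "fixed-format"
tag standing in for Avro) is a hypothetical stand-in. The parent keeps all
of the SQL logic and the subclass overrides only the step that writes the
result set.

```python
import csv
import io
import json
import sqlite3


class ExecuteSQL:
    """Hypothetical stand-in for the base processor: owns all the SQL logic."""

    def __init__(self, connection):
        self.connection = connection

    def on_trigger(self, query):
        # Shared logic: run the query, collect column names and rows.
        cursor = self.connection.execute(query)
        columns = [description[0] for description in cursor.description]
        rows = cursor.fetchall()
        # Only this step differs between the two processors.
        return self.write_result_set(columns, rows)

    def write_result_set(self, columns, rows):
        # Stand-in for the existing fixed output format (Avro in the thread).
        # A tagged tuple keeps the sketch free of third-party libraries.
        return ("fixed-format", columns, rows)


class ExecuteSQLRecord(ExecuteSQL):
    """Hypothetical subclass: delegates to the parent, swaps only the writer."""

    def __init__(self, connection, record_writer):
        super().__init__(connection)
        self.record_writer = record_writer

    def write_result_set(self, columns, rows):
        return self.record_writer(columns, rows)


def json_writer(columns, rows):
    # One JSON array holding one object per row.
    return json.dumps([dict(zip(columns, row)) for row in rows])


def csv_writer(columns, rows):
    # Header line followed by one line per row.
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return out.getvalue()
```

With this split, a fix to the query handling in on_trigger applies to both
classes, which speaks to the "bugs to be fixed in two places" concern
raised further down the thread. The single-processor alternative would
instead keep one class and treat record_writer as an optional setting that
falls back to the fixed format when unset.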

On Tue, Aug 7, 2018 at 2:06 PM Andy LoPresto <alopre...@apache.org> wrote:
>
> Matt,
>
> Would extending the core ExecuteSQL processor with an ExecuteSQLRecord 
> processor also work? I wonder about discoverability if only one processor is 
> present and in other places we explicitly name the processors which handle 
> records as such. If the ExecuteSQL processor handled all the SQL logic, and 
> the ExecuteSQLRecord processor just delegated most of the processing in its 
> #onTrigger() method to super, do you foresee any substantial difficulties? It 
> might require some refactoring of the parent #onTrigger() to service methods.
>
>
> Andy LoPresto
> alopre...@apache.org
> alopresto.apa...@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Aug 7, 2018, at 10:25 AM, Andrew Grande <apere...@gmail.com> wrote:
>
> As a side note, one has to have a serious justification _not_ to use 
> record-based processors. The benefits, including performance, are too 
> numerous to call out here.
>
> Andrew
>
> On Tue, Aug 7, 2018, 1:15 PM Mark Payne <marka...@hotmail.com> wrote:
>>
>> Boris,
>>
>> Using a Record-based processor does not mean that you need to define a 
>> schema upfront. This is
>> necessary if the source itself cannot provide a schema. However, since it is 
>> pulling structured data
>> and the schema can be inferred from the database, you wouldn't need to. As 
>> Matt was saying, your
>> Record Writer can simply be configured to Inherit Record Schema. It can then 
>> write the schema to
>> the "avro.schema" attribute or you can choose "Do Not Write Schema". This 
>> would still allow the data
>> to be written in JSON, CSV, etc.
>>
>> You could also have the Record Writer choose to write the schema using the 
>> "avro.schema" attribute,
>> as mentioned above, and then have any down-stream processors read the schema 
>> from this attribute.
>> This would allow you to use any record-oriented processors you'd like 
>> without having to define the
>> schema yourself, if you don't want to.
>>
>> Thanks
>> -Mark
>>
>>
>>
>> On Aug 7, 2018, at 12:37 PM, Boris Tyukin <bo...@boristyukin.com> wrote:
>>
>> thanks for all the responses! it means I am not the only one interested in 
>> this topic.
>>
>> Record-aware version would be really nice, but a lot of times I do not want 
>> to use record-based processors since I need to define a schema for 
>> input/output upfront and just want to run SQL query and get whatever results 
>> back. It just adds an extra step that will be subject to break/support.
>>
>> Similar to Kafka processors, it is nice to have an option of record-based 
>> processor vs. message oriented processor. But if one processor can do it 
>> all, it is even better :)
>>
>>
>> On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <mattyb...@apache.org> wrote:
>>>
>>> I'm definitely interested in supporting a record-aware version as well
>>> (I wrote the Jira up last year [1] but haven't gotten around to
>>> implementing it), however I agree with Peter's comment on the Jira.
>>> Since ExecuteSQL is an oft-touched processor, if we had two processors
>>> that only differed in how the output is formatted, it could be harder
>>> to maintain (bugs to be fixed in two places, e.g.). I think we should
>>> add an optional RecordWriter property to ExecuteSQL, and the
>>> documentation would reflect that if it is not set, the output will be
>>> Avro with embedded schema as it has always been. If the RecordWriter
>>> is set, either the schema can be hardcoded, or they can use "Inherit
>>> Record Schema" even though there's no reader, and that would mimic the
>>> current behavior where the schema is inferred from the database
>>> columns and used for the writer. There is precedent for this pattern
>>> in the SiteToSite reporting tasks.
>>>
>>> To Bryan's point about history, Avro at the time was the most
>>> descriptive of the solutions because it maintains the schema and
>>> datatypes with the data, unlike JSON, CSV, etc. Also before the record
>>> readers/writers, as Bryan said, you pretty much had to split,
>>> transform, merge. We just need to make that processor (and others with
>>> specific input/output formats) "record-aware" for better performance.
>>>
>>> Regards,
>>> Matt
>>>
>>> [1] https://issues.apache.org/jira/browse/NIFI-4517
>>> On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bbe...@gmail.com> wrote:
>>> >
>>> > I would also add that the pattern of splitting to 1 record per flow
>>> > file was common before the record processors existed, and generally
>>> > this can/should be avoided now in favor of processing/manipulating
>>> > records in place, and keeping them together in large batches.
>>> >
>>> >
>>> >
>>> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <apere...@gmail.com> wrote:
>>> > > Careful, that makes too much sense, Joe ;)
>>> > >
>>> > >
>>> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <joe.w...@gmail.com> wrote:
>>> > >>
>>> > >> i think we just need to make an ExecuteSqlRecord processor.
>>> > >>
>>> > >> thanks
>>> > >>
>>> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mikerthom...@gmail.com> 
>>> > >> wrote:
>>> > >>>
>>> > >>> My guess is that it is due to the fact that Avro is the only record 
>>> > >>> type
>>> > >>> that can match SQL pretty closely feature to feature on data types.
>>> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <bo...@boristyukin.com>
>>> > >>> wrote:
>>> > >>>>
>>> > >>>> I've been wondering since I started learning NiFi why ExecuteSQL
>>> > >>>> processor only returns AVRO formatted data. All community examples 
>>> > >>>> I've seen
>>> > >>>> then convert AVRO to json and pretty much all of them then split 
>>> > >>>> json to
>>> > >>>> multiple flows.
>>> > >>>>
>>> > >>>> I found myself doing the same thing over and over and over again.
>>> > >>>>
>>> > >>>> Since everyone is doing it, is there a strong reason why AVRO is 
>>> > >>>> liked
>>> > >>>> so much? And why everyone continues doing this 3 step pattern rather 
>>> > >>>> than
>>> > >>>> providing users with an option to output json instead and another 
>>> > >>>> option to
>>> > >>>> output one flowfile or multiple (one per record).
>>> > >>>>
>>> > >>>> thanks
>>> > >>>> Boris
>>
>>
>
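
Mark's point in the thread above, that the schema can come from the
database instead of being defined upfront, can also be sketched briefly.
This is a rough illustration under stated assumptions, not how NiFi
implements it: it uses Python's sqlite3 module, the SQL_TO_AVRO type
mapping is simplified and assumed, and infer_schema and query_to_flow_file
are hypothetical helper names. The "avro.schema" attribute name is the one
mentioned in the thread.

```python
import json
import sqlite3

# Simplified, assumed mapping from SQLite declared types to Avro primitives.
SQL_TO_AVRO = {"INTEGER": "long", "REAL": "double", "TEXT": "string", "BLOB": "bytes"}


def infer_schema(connection, table):
    """Build an Avro-style record schema from the table's column metadata."""
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    fields = []
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    for _cid, name, sql_type, notnull, _default, _pk in connection.execute(
        f"PRAGMA table_info({table})"
    ):
        avro_type = SQL_TO_AVRO.get(sql_type.upper(), "string")
        # Nullable columns become a union with "null".
        field_type = avro_type if notnull else ["null", avro_type]
        fields.append({"name": name, "type": field_type})
    return {"type": "record", "name": table, "fields": fields}


def query_to_flow_file(connection, table, write_schema=True):
    """Return (content, attributes) for a hypothetical flow file."""
    schema = infer_schema(connection, table)
    columns = [field["name"] for field in schema["fields"]]
    rows = connection.execute(f"SELECT * FROM {table}").fetchall()
    # The content is plain JSON; the schema travels separately, if at all.
    content = json.dumps([dict(zip(columns, row)) for row in rows])
    attributes = {"avro.schema": json.dumps(schema)} if write_schema else {}
    return content, attributes
```

Passing write_schema=False corresponds loosely to the "Do Not Write
Schema" choice: the JSON content is unchanged and no schema attribute is
attached, so nothing has to be defined by hand in either case.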
