Matt,

Would extending the core ExecuteSQL processor with an ExecuteSQLRecord 
processor also work? I wonder about discoverability if only one processor is 
present, given that elsewhere we explicitly name the processors which handle 
records as such. If the ExecuteSQL processor handled all the SQL logic, and the 
ExecuteSQLRecord processor just delegated most of the processing in its 
#onTrigger() method to super, do you foresee any substantial difficulties? It 
might require some refactoring of the parent #onTrigger() into service methods.
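
Roughly, the shape I have in mind is sketched below. These classes are 
illustrative stand-ins rather than actual code from the codebase; they just 
show the delegation pattern, assuming the shared logic can be pulled into 
protected service methods:

    // Stand-in sketch, not real NiFi code: ExecuteSQL's #onTrigger() logic
    // refactored into service methods so a subclass can override one step.
    class ExecuteSql {

        public void onTrigger(String query) {
            String resultSet = executeQuery(query); // shared JDBC work
            writeResults(resultSet);                // serialization hook
        }

        private String executeQuery(String query) {
            return "rows-for(" + query + ")";       // placeholder for real JDBC
        }

        // Default behavior: Avro with an embedded schema.
        protected void writeResults(String resultSet) {
            System.out.println("Avro with embedded schema: " + resultSet);
        }
    }

    // The subclass delegates the shared processing to super and overrides
    // only the serialization step to use a configured RecordSetWriter.
    class ExecuteSqlRecord extends ExecuteSql {

        @Override
        public void onTrigger(String query) {
            super.onTrigger(query); // delegate most of the processing
        }

        @Override
        protected void writeResults(String resultSet) {
            System.out.println("via configured RecordSetWriter: " + resultSet);
        }
    }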


Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Aug 7, 2018, at 10:25 AM, Andrew Grande <apere...@gmail.com> wrote:
> 
> As a side note, one has to have a serious justification _not_ to use 
> record-based processors. The benefits, including performance, are too 
> numerous to call out here.
> 
> Andrew
> 
> On Tue, Aug 7, 2018, 1:15 PM Mark Payne <marka...@hotmail.com> wrote:
> Boris,
> 
> Using a Record-based processor does not mean that you need to define a schema 
> upfront. That is only necessary if the source itself cannot provide one. 
> Since ExecuteSQL pulls structured data whose schema can be inferred from the 
> database, you wouldn't need to. As Matt was saying, your Record Writer can 
> simply be configured to Inherit Record Schema. It can then write the schema 
> to the "avro.schema" attribute, or you can choose "Do Not Write Schema". This 
> would still allow the data to be written as JSON, CSV, etc.
> 
> You could also have the Record Writer write the schema to the "avro.schema" 
> attribute, as mentioned above, and then have any downstream processors read 
> the schema from that attribute. This would let you use any record-oriented 
> processors you'd like without ever having to define the schema yourself.
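> 
> As a concrete illustration (the property names here are from memory and may 
> vary slightly by NiFi version), a JsonRecordSetWriter set up this way would 
> look something like:
> 
>     Schema Access Strategy : Inherit Record Schema
>     Schema Write Strategy  : Set 'avro.schema' Attribute  (or: Do Not Write Schema)
> 
> and any downstream record reader would then use:
> 
>     Schema Access Strategy : Use 'avro.schema' Attribute
> 
> so the schema travels with the data without ever being defined by hand.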
> 
> Thanks
> -Mark
> 
> 
> 
>> On Aug 7, 2018, at 12:37 PM, Boris Tyukin <bo...@boristyukin.com> wrote:
>> 
>> thanks for all the responses! it means I am not the only one interested in 
>> this topic.
>> 
>> A record-aware version would be really nice, but a lot of the time I do not 
>> want to use record-based processors, since I would need to define a schema 
>> for input/output upfront when I just want to run a SQL query and get 
>> whatever results back. It adds an extra step that can break and has to be 
>> supported.
>> 
>> Similar to the Kafka processors, it is nice to have the option of a 
>> record-based processor vs. a message-oriented one. But if one processor can 
>> do it all, even better :)
>> 
>> 
>> On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <mattyb...@apache.org> wrote:
>> I'm definitely interested in supporting a record-aware version as well
>> (I wrote the Jira up last year [1] but haven't gotten around to
>> implementing it); however, I agree with Peter's comment on the Jira.
>> Since ExecuteSQL is an oft-touched processor, having two processors
>> that differ only in how the output is formatted could make it harder
>> to maintain (e.g., bugs would have to be fixed in two places). I think
>> we should add an optional RecordWriter property to ExecuteSQL, and the
>> documentation would reflect that if it is not set, the output will be
>> Avro with an embedded schema, as it has always been. If the RecordWriter
>> is set, the schema can either be hardcoded, or "Inherit Record Schema"
>> can be used even though there's no reader, which would mimic the
>> current behavior where the schema is inferred from the database
>> columns and used for the writer. There is precedent for this pattern
>> in the SiteToSite reporting tasks.
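>> 
>> As a rough sketch (the property name and wording below are illustrative,
>> not final), the optional property on ExecuteSQL might look like:
>> 
>>     import org.apache.nifi.components.PropertyDescriptor;
>>     import org.apache.nifi.serialization.RecordSetWriterFactory;
>> 
>>     // Optional: when unset, ExecuteSQL keeps emitting embedded-schema Avro.
>>     static final PropertyDescriptor RECORD_WRITER = new PropertyDescriptor.Builder()
>>             .name("esql-record-writer")
>>             .displayName("Record Writer")
>>             .description("If specified, query results are written using this "
>>                     + "Record Writer; otherwise the output is Avro with an "
>>                     + "embedded schema, as before.")
>>             .identifiesControllerService(RecordSetWriterFactory.class)
>>             .required(false)
>>             .build();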
>> 
>> To Bryan's point about history, Avro was at the time the most
>> descriptive of the available formats because it keeps the schema and
>> data types with the data, unlike JSON, CSV, etc. Also, before the record
>> readers/writers, as Bryan said, you pretty much had to split,
>> transform, merge. We just need to make that processor (and others with
>> specific input/output formats) "record-aware" for better performance.
>> 
>> Regards,
>> Matt
>> 
>> [1] https://issues.apache.org/jira/browse/NIFI-4517
>> On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bbe...@gmail.com> wrote:
>> >
>> > I would also add that the pattern of splitting to one record per FlowFile
>> > was common before the record processors existed. Generally this can and
>> > should be avoided now, in favor of processing and manipulating records in
>> > place and keeping them together in large batches.
>> >
>> >
>> >
>> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <apere...@gmail.com> wrote:
>> > > Careful, that makes too much sense, Joe ;)
>> > >
>> > >
>> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <joe.w...@gmail.com> wrote:
>> > >>
>> > >> I think we just need to make an ExecuteSqlRecord processor.
>> > >>
>> > >> thanks
>> > >>
>> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mikerthom...@gmail.com> wrote:
>> > >>>
>> > >>> My guess is that it is because Avro is the only record type that can
>> > >>> match SQL pretty closely, feature for feature, on data types.
>> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>> > >>>>
>> > >>>> I've been wondering since I started learning NiFi why the ExecuteSQL
>> > >>>> processor only returns Avro-formatted data. All the community examples
>> > >>>> I've seen convert the Avro to JSON, and pretty much all of them then
>> > >>>> split the JSON into multiple flows.
>> > >>>>
>> > >>>> I found myself doing the same thing over and over and over again.
>> > >>>>
>> > >>>> Since everyone is doing it, is there a strong reason why Avro is liked
>> > >>>> so much? And why does everyone continue using this three-step pattern
>> > >>>> rather than giving users an option to output JSON instead, and another
>> > >>>> option to output one flowfile or multiple (one per record)?
>> > >>>>
>> > >>>> thanks
>> > >>>> Boris
> 
