Re: AVRO is the only output format with ExecuteSQL

Mike Thomsen Mon, 13 Aug 2018 05:47:58 -0700

Boris,

Yeah, you can fork either his branch or his entire repo and try it out.
Also, usual caveat: user beware until it passes code review...


Mike

On Mon, Aug 13, 2018 at 8:36 AM Boris Tyukin <bo...@boristyukin.com> wrote:

> Matt, you are awesome! 15 files changed and 3k lines of code - man, do not
> tell me you did that in just a few days :)
>
> since it has not been merged yet with the master, can I just use your
> personal branch to compile entire nifi? or is it better to cherry pick your
> commit into master? I would like to try it out
>
> Boris
>
> On Fri, Aug 10, 2018 at 4:55 PM Matt Burgess <mattyb...@apache.org> wrote:
>
>> Boris et al,
>>
>> I put up a PR [1] to add ExecuteSQLRecord and QueryDatabaseTableRecord
>> under NIFI-4517, in case anyone wants to play around with it :)
>>
>> Regards,
>> Matt
>>
>> [1] https://github.com/apache/nifi/pull/2945
>> On Tue, Aug 7, 2018 at 8:30 PM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>> >
>> > Matt, you rock!! thank you!!
>> >
>> > On Tue, Aug 7, 2018 at 5:16 PM Matt Burgess <mattyb...@gmail.com>
>> wrote:
>> >>
>> >> Sounds good, it makes the underlying code a bit more complicated but I
>> >> see from y’all’s points that a “separate” processor is a better user
>> >> experience. I’m knee deep in it as we speak, hope to have a PR up in a
>> >> few days.
>> >>
>> >> Thanks,
>> >> Matt
>> >>
>> >>
>> >> On Aug 7, 2018, at 5:07 PM, Andrew Grande <apere...@gmail.com> wrote:
>> >>
>> >> I'd really like to see the Record suffix on the processor for
>> >> discoverability, as already mentioned.
>> >>
>> >> Andrew
>> >>
>> >> On Tue, Aug 7, 2018, 2:16 PM Matt Burgess <mattyb...@apache.org>
>> wrote:
>> >>>
>> >>> Yeah that's definitely doable, most of the logic for writing a
>> >>> ResultSet to a Flow File is localized (currently to JdbcCommon but
>> >>> also in ResultSetRecordSet), so I wouldn't think it would be too much
>> >>> refactoring. What are folks' thoughts on whether to add a Record Writer
>> >>> property to the existing ExecuteSQL or subclass it to a new processor
>> >>> called ExecuteSQLRecord? The former is more consistent with how the
>> >>> SiteToSite reporting tasks work, but this is a processor. The latter
>> >>> is more consistent with the way we've done other record processors,
>> >>> and the benefit there is that we don't have to add a bunch of
>> >>> documentation to fields that will be ignored (such as the Use Avro
>> >>> Logical Types property which we wouldn't need in an ExecuteSQLRecord).
>> >>> Having said that, we will want to offer the same options in the Avro
>> >>> Reader/Writer, but Peter is working on that under NIFI-5405 [1].
>> >>>
>> >>> Thanks,
>> >>> Matt
>> >>>
>> >>> [1] https://issues.apache.org/jira/browse/NIFI-5405
>> >>>
>> >>> On Tue, Aug 7, 2018 at 2:06 PM Andy LoPresto <alopre...@apache.org>
>> wrote:
>> >>> >
>> >>> > Matt,
>> >>> >
>> >>> > Would extending the core ExecuteSQL processor with an
>> >>> > ExecuteSQLRecord processor also work? I wonder about discoverability
>> >>> > if only one processor is present and in other places we explicitly
>> >>> > name the processors which handle records as such. If the ExecuteSQL
>> >>> > processor handled all the SQL logic, and the ExecuteSQLRecord
>> >>> > processor just delegated most of the processing in its #onTrigger()
>> >>> > method to super, do you foresee any substantial difficulties? It
>> >>> > might require some refactoring of the parent #onTrigger() to service
>> >>> > methods.
>> >>> >
>> >>> >
>> >>> > Andy LoPresto
>> >>> > alopre...@apache.org
>> >>> > alopresto.apa...@gmail.com
>> >>> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>> >>> >
>> >>> > On Aug 7, 2018, at 10:25 AM, Andrew Grande <apere...@gmail.com>
>> wrote:
>> >>> >
>> >>> > As a side note, one has to have a serious justification _not_ to use
>> >>> > record-based processors. The benefits, including performance, are too
>> >>> > numerous to call out here.
>> >>> >
>> >>> > Andrew
>> >>> >
>> >>> > On Tue, Aug 7, 2018, 1:15 PM Mark Payne <marka...@hotmail.com>
>> wrote:
>> >>> >>
>> >>> >> Boris,
>> >>> >>
>> >>> >> Using a Record-based processor does not mean that you need to
>> >>> >> define a schema upfront. That is only necessary if the source
>> >>> >> itself cannot provide a schema. However, since the processor is
>> >>> >> pulling structured data and the schema can be inferred from the
>> >>> >> database, you wouldn't need to. As Matt was saying, your Record
>> >>> >> Writer can simply be configured to Inherit Record Schema. It can
>> >>> >> then write the schema to the "avro.schema" attribute or you can
>> >>> >> choose "Do Not Write Schema". This would still allow the data to
>> >>> >> be written in JSON, CSV, etc.
>> >>> >>
>> >>> >> You could also have the Record Writer choose to write the schema
>> >>> >> using the "avro.schema" attribute, as mentioned above, and then
>> >>> >> have any downstream processors read the schema from this
>> >>> >> attribute. This would allow you to use any record-oriented
>> >>> >> processors you'd like without having to define the schema
>> >>> >> yourself, if you don't want to.
>> >>> >>
>> >>> >> Thanks
>> >>> >> -Mark
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> On Aug 7, 2018, at 12:37 PM, Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>> >>> >>
>> >>> >> thanks for all the responses! it means I am not the only one
>> >>> >> interested in this topic.
>> >>> >>
>> >>> >> Record-aware version would be really nice, but a lot of times I do
>> >>> >> not want to use record-based processors since I need to define a
>> >>> >> schema for input/output upfront and just want to run SQL query and
>> >>> >> get whatever results back. It just adds an extra step that can
>> >>> >> break and has to be supported.
>> >>> >>
>> >>> >> Similar to Kafka processors, it is nice to have an option of
>> >>> >> record-based processor vs. message oriented processor. But if one
>> >>> >> processor can do it all, it is even better :)
>> >>> >>
>> >>> >>
>> >>> >> On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <mattyb...@apache.org>
>> wrote:
>> >>> >>>
>> >>> >>> I'm definitely interested in supporting a record-aware version as
>> >>> >>> well (I wrote the Jira up last year [1] but haven't gotten around
>> >>> >>> to implementing it), however I agree with Peter's comment on the
>> >>> >>> Jira. Since ExecuteSQL is an oft-touched processor, if we had two
>> >>> >>> processors that only differed in how the output is formatted, it
>> >>> >>> could be harder to maintain (bugs to be fixed in two places,
>> >>> >>> e.g.). I think we should add an optional RecordWriter property to
>> >>> >>> ExecuteSQL, and the documentation would reflect that if it is not
>> >>> >>> set, the output will be Avro with embedded schema as it has always
>> >>> >>> been. If the RecordWriter is set, either the schema can be
>> >>> >>> hardcoded, or they can use "Inherit Record Schema" even though
>> >>> >>> there's no reader, and that would mimic the current behavior where
>> >>> >>> the schema is inferred from the database columns and used for the
>> >>> >>> writer. There is precedent for this pattern in the SiteToSite
>> >>> >>> reporting tasks.
>> >>> >>>
>> >>> >>> To Bryan's point about history, Avro at the time was the most
>> >>> >>> descriptive of the solutions because it maintains the schema and
>> >>> >>> datatypes with the data, unlike JSON, CSV, etc. Also before the
>> >>> >>> record readers/writers, as Bryan said, you pretty much had to
>> >>> >>> split, transform, merge. We just need to make that processor (and
>> >>> >>> others with specific input/output formats) "record-aware" for
>> >>> >>> better performance.
>> >>> >>>
>> >>> >>> Regards,
>> >>> >>> Matt
>> >>> >>>
>> >>> >>> [1] https://issues.apache.org/jira/browse/NIFI-4517
>> >>> >>> On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bbe...@gmail.com>
>> wrote:
>> >>> >>> >
>> >>> >>> > I would also add that the pattern of splitting to 1 record per
>> >>> >>> > flow file was common before the record processors existed, and
>> >>> >>> > generally this can/should be avoided now in favor of
>> >>> >>> > processing/manipulating records in place, and keeping them
>> >>> >>> > together in large batches.
>> >>> >>> >
>> >>> >>> >
>> >>> >>> >
>> >>> >>> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <
>> apere...@gmail.com> wrote:
>> >>> >>> > > Careful, that makes too much sense, Joe ;)
>> >>> >>> > >
>> >>> >>> > >
>> >>> >>> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <joe.w...@gmail.com>
>> wrote:
>> >>> >>> > >>
>> >>> >>> > >> i think we just need to make an ExecuteSqlRecord processor.
>> >>> >>> > >>
>> >>> >>> > >> thanks
>> >>> >>> > >>
>> >>> >>> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <
>> mikerthom...@gmail.com> wrote:
>> >>> >>> > >>>
>> >>> >>> > >>> My guess is that it is due to the fact that Avro is the only
>> >>> >>> > >>> record type that can match sql pretty closely feature to
>> >>> >>> > >>> feature on data types.
>> >>> >>> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <
>> bo...@boristyukin.com>
>> >>> >>> > >>> wrote:
>> >>> >>> > >>>>
>> >>> >>> > >>>> I've been wondering since I started learning NiFi why
>> >>> >>> > >>>> ExecuteSQL processor only returns AVRO formatted data. All
>> >>> >>> > >>>> community examples I've seen then convert AVRO to json and
>> >>> >>> > >>>> pretty much all of them then split json to multiple flows.
>> >>> >>> > >>>>
>> >>> >>> > >>>> I found myself doing the same thing over and over and over
>> >>> >>> > >>>> again.
>> >>> >>> > >>>>
>> >>> >>> > >>>> Since everyone is doing it, is there a strong reason why
>> >>> >>> > >>>> AVRO is liked so much? And why does everyone continue doing
>> >>> >>> > >>>> this 3 step pattern rather than providing users with an
>> >>> >>> > >>>> option to output json instead and another option to output
>> >>> >>> > >>>> one flowfile or multiple (one per record)?
>> >>> >>> > >>>>
>> >>> >>> > >>>> thanks
>> >>> >>> > >>>> Boris
>> >>> >>
>> >>> >>
>> >>> >
>>
>
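
For readers of the archive, the subclass approach discussed above (ExecuteSQL owning the SQL logic, ExecuteSQLRecord delegating most of #onTrigger() to super and changing only how rows are written) can be sketched roughly as below. This is plain Java with no NiFi dependency, and it is not the code from the PR: every class and method name here (SqlProcessorSketch, SqlRecordProcessorSketch, CsvWriterSketch, writeRows) is hypothetical, rows are modeled as simple maps rather than a JDBC ResultSet, and the default output is only a placeholder string standing in for Avro.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.function.Function;

// Small demo: the same rows go through both hypothetical processors.
class DelegationDemo {
    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1);
        row.put("name", "a");
        List<Map<String, Object>> rows = List.of(row);

        System.out.println(new SqlProcessorSketch().onTrigger(rows));
        System.out.print(new SqlRecordProcessorSketch(CsvWriterSketch::write).onTrigger(rows));
    }
}

// Hypothetical stand-in for ExecuteSQL: owns the shared logic.
class SqlProcessorSketch {
    // Plays the role of onTrigger(). Shared steps (connection handling,
    // running the query, error routing) would live here, so a bug fix
    // lands in one place only. Only the output step is delegated.
    public final String onTrigger(List<Map<String, Object>> rows) {
        return writeRows(rows);
    }

    // Default output step. A placeholder string stands in for Avro here.
    protected String writeRows(List<Map<String, Object>> rows) {
        return "avro-placeholder: " + rows.size() + " rows";
    }
}

// Hypothetical stand-in for ExecuteSQLRecord: overrides only the output step
// and hands the rows to whatever writer it was configured with.
class SqlRecordProcessorSketch extends SqlProcessorSketch {
    private final Function<List<Map<String, Object>>, String> recordWriter;

    SqlRecordProcessorSketch(Function<List<Map<String, Object>>, String> recordWriter) {
        this.recordWriter = recordWriter;
    }

    @Override
    protected String writeRows(List<Map<String, Object>> rows) {
        return recordWriter.apply(rows);
    }
}

// Toy writer: header from the first row's keys, one line per row.
// No quoting or escaping, so it is only suitable for illustration.
class CsvWriterSketch {
    static String write(List<Map<String, Object>> rows) {
        if (rows.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        out.append(String.join(",", rows.get(0).keySet())).append('\n');
        for (Map<String, Object> row : rows) {
            StringJoiner line = new StringJoiner(",");
            for (Object value : row.values()) {
                line.add(String.valueOf(value));
            }
            out.append(line.toString()).append('\n');
        }
        return out.toString();
    }
}
```

The point of this shape is that a fix to the shared query path lands in one class, which speaks to the maintenance worry raised in the thread, while the Record-suffixed subclass stays discoverable as its own processor.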

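The point made in the thread that the schema can be inferred from the database columns, so nobody has to define one upfront, can be illustrated with a small sketch. It assumes the column names and java.sql.Types codes have already been read (in real JDBC code they would come from ResultSetMetaData), and it uses a deliberately tiny, made-up type mapping that is not the one NiFi's JdbcCommon applies. SchemaSketch and its methods are hypothetical names. The output is an Avro record schema as JSON, the kind of text a Record Writer could place in the "avro.schema" attribute.

```java
import java.sql.Types;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Small demo: two columns in, one Avro schema string out.
class SchemaInferenceDemo {
    public static void main(String[] args) {
        Map<String, Integer> columns = new LinkedHashMap<>();
        columns.put("id", Types.INTEGER);
        columns.put("name", Types.VARCHAR);
        System.out.println(SchemaSketch.toAvroSchemaJson("orders", columns));
    }
}

// Hypothetical helper, not part of NiFi.
class SchemaSketch {
    // Illustrative mapping from a few java.sql.Types codes to Avro
    // primitive type names. Anything unlisted falls back to "string".
    static String avroTypeFor(int sqlType) {
        switch (sqlType) {
            case Types.INTEGER: return "int";
            case Types.BIGINT:  return "long";
            case Types.BOOLEAN: return "boolean";
            case Types.DOUBLE:  return "double";
            default:            return "string";
        }
    }

    // Builds an Avro record schema as JSON. Every field is written as a
    // ["null", type] union, since database columns may contain NULLs.
    // Names are not escaped, so this assumes plain identifiers.
    static String toAvroSchemaJson(String recordName, Map<String, Integer> columns) {
        StringJoiner fields = new StringJoiner(",", "[", "]");
        for (Map.Entry<String, Integer> column : columns.entrySet()) {
            fields.add("{\"name\":\"" + column.getKey() + "\",\"type\":[\"null\",\""
                    + avroTypeFor(column.getValue()) + "\"]}");
        }
        return "{\"type\":\"record\",\"name\":\"" + recordName
                + "\",\"fields\":" + fields.toString() + "}";
    }
}
```

Because the schema is derived from the query result itself, a new column in the SELECT shows up in the schema without any configuration change, which is the "extra step" concern raised earlier in the thread.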