Hey Peter,

Finally getting a chance to take a look at some of this stuff. Very cool that
you're jumping in on this! I think this has the potential to enable a lot of
great new possibilities very quickly and easily. Matt B. and I discussed this
concept a bit on the mailing list a while back but didn't get into great
detail. He may have some great thoughts that he can contribute, as well. I
jotted down some thoughts that I had on the design as I looked over it...

* I think we should not really think of these as record conversions, but
rather as two distinct concepts - record parsing and record serialization.
This opens up a lot more possibilities, as we could use a Record Parser
service for a more generic PutHBase processor, for instance. Likewise, we
could use the Record Serialization service with a source processor like
ExecuteSQL. And then a processor that wants to convert from one format to
another could use both a parser and a serializer. I don't think the
RecordConversionService is really needed; I think it makes more sense to just
have a single processor do conversion, and these service APIs would be used by
that processor.
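
Just to sketch what I mean - the names RecordParser/RecordSerializer and their
signatures here are purely illustrative, not a proposal for the final API:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.nifi.controller.ControllerService;

    interface RecordParser extends ControllerService {
        // Parse the incoming bytes into a RecordSet, converting on read.
        RecordSet parse(InputStream in) throws IOException;
    }

    interface RecordSerializer extends ControllerService {
        // Write the given records out in this service's target format.
        void serialize(RecordSet records, OutputStream out) throws IOException;
    }

A format-conversion processor would then just reference one service of each
type and pipe the parser's output into the serializer.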

* Rather than having RecordSet expose a Map<String, Field> getSchema() method
and a lot of "getSupportsXYZ()"-type methods, I think we should instead have a
"Schema getSchema()" method and move the description of what is supported
into the Schema object.
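
Something along these lines (again, just illustrative - DataType is a
placeholder for whatever type descriptor we settle on):

    interface Schema {
        int getFieldCount();
        String getFieldName(int fieldIndex);
        DataType getFieldType(int fieldIndex);
        boolean supportsNestedRecords();  // capability flags live here now
    }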

* I have written record parsing libraries before, and while the natural
tendency is to create a structure such as Record that holds a bunch of Field
objects, this can actually cripple performance when running on large volumes
of records. For instance, if you were running just 50 MB/sec of incoming data,
and let's say that you have fields that average 100 bytes each, you're talking
half a million Field objects per second that get created and have to be
garbage collected. And these are fairly conservative numbers - you could
easily reach many millions of Field objects per second. So rather than having
Record return a bunch of Field objects, I would recommend accessor methods on
Record such as:

String getString(int fieldIndex);
String getString(String fieldName);
int getInt(int fieldIndex);
int getInt(String fieldName);
DataType getType(int fieldIndex);  // or whatever type descriptor we settle on
String getName(int fieldIndex);

* I would also expect that calling getString(1) would convert the first field
to a String by using String.valueOf or something like that. I would expect
some sort of inferred conversion, but the impl you have here just does a
type-cast, which can lead to a lot of redundant code in the clients of the
library.
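
For example, a rough sketch of what I mean (getValue here is a hypothetical
raw accessor, not something in your branch):

    public String getString(int fieldIndex) {
        final Object value = getValue(fieldIndex);
        // Coerce rather than cast, so callers don't have to handle each
        // underlying type themselves.
        return (value == null) ? null : String.valueOf(value);
    }

    public int getInt(int fieldIndex) {
        final Object value = getValue(fieldIndex);
        if (value instanceof Number) {
            return ((Number) value).intValue();
        }
        // Fall back to parsing the string form for things like CHAR columns.
        return Integer.parseInt(String.valueOf(value));
    }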

* If Record is going to have a 'getUnderlyingObject' method that returns an
Object, might it make more sense to make Record generic and have that method
return an object of type T?
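
i.e., something like:

    interface Record<T> {
        T getUnderlyingObject();
    }

Then an Avro-backed implementation could be a Record<GenericRecord>, for
instance, and callers would get the native type back without a cast.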

I know this is quite a bit of feedback. Hopefully not too overwhelming :) 
There's a whole lot of good stuff here, and I think this is going
to turn out to be an awesome addition to NiFi. Very excited about it!


-Mark



> On Aug 23, 2016, at 1:00 PM, Peter Wicks (pwicks) <[email protected]> wrote:
> 
> Matt,
> 
> I've re-architected it a bit. I'll admit that I haven't tested this new 
> design, but it makes sense; I wanted to get your feedback. Let me know if 
> this somewhat lines up with your vision.
> 
> https://github.com/patricker/nifi/tree/RowProcessorService
> 
> Then we can have cool processors like `RecordSetConverter` that can convert 
> between arbitrary types, and `RecordSetToSQL` etc...  Could be really cool.
> 
> --Peter
> 
> -----Original Message-----
> From: Peter Wicks (pwicks) 
> Sent: Monday, August 22, 2016 8:57 PM
> To: '[email protected]' <[email protected]>
> Subject: RE: Looking for feedback on my WIP Design
> 
> Matt,
> 
> When you put it that way it sounds like we'll be taking over the world by 
> breakfast...
> 
> It sounds like this has a lot of other potential uses, such as a generic 
> conversion processor instead of all these one-off a-to-b conversion 
> processors that are floating around. I hadn't really thought that far, but I 
> can see you went there straight away.
> 
> I'll put some more work into it along the lines you outlined and check back 
> in.
> 
> Thanks,
>  Peter
> 
> -----Original Message-----
> From: Matt Burgess [mailto:[email protected]] 
> Sent: Monday, August 22, 2016 7:58 PM
> To: [email protected]
> Subject: Re: Looking for feedback on my WIP Design
> 
> Peter,
> 
> First of all, great work! I couldn't find an Apache Jira for this, but I 
> remember seeing something in the dev email list about perhaps having a 
> ControllerService for arbitrary conversions.
> 
> I took a look at the commit; first things first, looks good for the use case, 
> thanks much!  A handful of notes:
> 
> - As you know, the critical part is your "RowConverter".  To be most useful, 
> we should make sure that, as Java has done over time with its supported SQL 
> types, we can refer to structured types in some common language.  So 
> "startNewFile()" might be better as "newRecord()", and
> addRow() might be better expressed as a NiFi-defined interface.
> ResultSets could have a simple wrapper implementing the generic interface 
> that would let the code consume it the same as any other object.
> 
> - For a common language, perhaps we could refer to structured data as 
> "records" and individual fields (also perhaps nested) as "fields".
> 
> - With a common NiFi-defined API for getting records and fields, we could 
> implement all the converters you mention without any explicit dependency on 
> an implementation NAR/JAR.
> 
> - We should avoid the explicit conversion of input format to any intermediate 
> format; in other words, the interface should follow the Adapter or Facade 
> pattern, and should convert-on-read.
> 
> - I'm sure I've forgotten some formats, but I'll list them for discussion 
> with respect to conversions:
> 
> 1) XML
> 2) JSON
> 3) CSV (and TSV and other text-based delimited files)
> 4) SQL ResultSet
> 5) CQL ResultSet (Cassandra)
> 6) Elasticsearch result set
> 7) Avro
> 8) Parquet
> 
> If we can cover these with a single interface and 8(-ish) implementations, 
> then I think we're in good shape for world domination
> :) Not being sarcastic, I'm saying let's make it happen!
> 
> Regards,
> Matt
> 
> On Mon, Aug 22, 2016 at 9:32 PM, Peter Wicks (pwicks) <[email protected]> 
> wrote:
>> I'm working on a change to QueryDatabaseTable (and eventually would apply to 
>> ExecuteSQL) to allow users to choose the output format, so something besides 
>> just Avro.  I'm planning to put in a ticket soon for my work.
>> 
>> Goals:
>> 
>> -          If this update goes out, no user should be affected, as defaults 
>> will work the way they have before.
>> 
>> -          Don't want to muck up dependencies of standard processors with 
>> lots of other libraries to support multiple formats. As such I have 
>> implemented it using Controller Services to make the converters pluggable.
>> 
>> -          Would like to support these output formats to start:
>> 
>> o   Avro (This one is already in my commit, as I copy/pasted the code over; 
>> but the logic has now been moved to an appropriate library, which I like)
>> 
>> o   ORC (Would be implemented in the HIVE library)
>> 
>> o   JSON (No idea where I would put this processor, unless it's in a new 
>> module)
>> 
>> o   Delimited (No idea where I would put this processor, unless it's in a 
>> new module)
>> 
>> Here is my commit: 
>> https://github.com/patricker/nifi/commit/01a79804f60b6be0f86499949712cd9118fb4f7f
>> 
>> I'd appreciate feedback on design/implementation.  I have the guts in there 
>> of how I was planning to implement this.
>> 
>> Thanks,
>>  Peter
