Hi Steve,

Regarding the possibility of having both properties (one for file-based configuration and one for configuration entered in the UI), you can have a look at InvokeScriptedProcessor [1], which takes the same approach.
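To make the two-property idea concrete, here is a minimal sketch in plain Java with no NiFi dependencies. The names (PipelineSource, resolve, and the two settings) are hypothetical and are not taken from InvokeScriptedProcessor or from the processor under discussion. It assumes the file path and the inline body are mutually exclusive and rejects any configuration that does not set exactly one of them.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: picks the XProc document from one of two settings.
public class PipelineSource {

    // Returns the pipeline text from the inline body or from the file.
    // Throws IllegalArgumentException unless exactly one setting is non-blank.
    public static String resolve(String filePath, String inlineBody) {
        boolean hasFile = filePath != null && !filePath.trim().isEmpty();
        boolean hasBody = inlineBody != null && !inlineBody.trim().isEmpty();
        if (hasFile == hasBody) {
            throw new IllegalArgumentException(
                "Set exactly one of: pipeline file, pipeline body");
        }
        if (hasBody) {
            return inlineBody;
        }
        try {
            // Assumption: the pipeline file is UTF-8 encoded.
            return new String(Files.readAllBytes(Paths.get(filePath)),
                              StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real processor, the same check would more naturally sit in the processor's validation step, so that a bad configuration is reported in the UI before the processor is started.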

Anyway, I'm looking forward to your contribution; it looks like a nice addition to XML processing!

[1] https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-scripting-bundle/nifi-scripting-processors/src/main/java/org/apache/nifi/processors/script/InvokeScriptedProcessor.java

2017-03-07 17:34 GMT+01:00 Steve Lawrence <[email protected]>:

> Oleg and Joe,
>
> Thanks for your quick responses. This clarified a lot of my questions.
> I'll make the suggested changes and create a bug in JIRA to track and
> start the process and see if the community would accept it.
>
> Regarding the file based versus provided configuration, in practice,
> XProc configurations are fairly large and complicated so it might seem a
> little awkward to copy/paste an XProc config into the NiFi property.
> Using just a path is similar to how the TransformXml processor works
> (XSLT and XProc are very similar in complexity). But I wouldn't be against
> making it configurable somehow. Perhaps have separate options for file
> vs XProc config? Or we could allow the option to be either a file or
> XProc config, and we can programmatically determine if the option looks
> like a valid file or a valid XProc config. Or is that uncommon for NiFi
> properties?
>
> Regarding memory management, you are correct that all inputs and outputs
> are in memory at once. I don't think XMLCalabash has options to do
> something different, e.g. file based. I'll investigate this and if this
> is the case I'll make sure to document this.
>
> Thanks,
> - Steve
>
>
> On 03/07/2017 10:42 AM, Joe Witt wrote:
> > Steve,
> >
> > First, thanks for raising the discussion, and it's awesome that you've
> > built your own processor to help leverage the XML Calabash software.
> > A couple of quick thoughts from a short scan:
> >
> > - Processor naming:
> >   Try to come up with a name that is of the form 'verb' 'subject'.
> > For this processor it seems like PipelineXML or ProcessXML is
> > appropriate. It is in the description and tags for the processor that
> > you'd want to put things like 'XProc', 'XMLCalabash', 'XML', etc.
> >
> > - Thread safety:
> >   XML processing is notoriously slow. You probably want to go the
> > extra mile to make it support multiple threads and remove trigger
> > serially. You can either create the necessary XMLCalabash objects on
> > demand with each trigger call or if this is expensive then you can
> > operate on batches of flowfiles at a time (slightly less cool these
> > streamy days) or you could have a small pool/cache of these objects
> > which are lazily inited then reused on subsequent calls. All the
> > necessary lifecycle hooks are in place on the processor for any of
> > these patterns.
> >
> > - Interest in a contrib to the community:
> >   XML is indeed quite common and often people want to work with it.
> > Provided there is a healthy contribution and all licensing and notice
> > aspects are in order then I think we'd be quite happy to help you turn
> > it into a contribution to the apache nifi community. If you decide
> > not to go that route this is ok as well but obviously we'd like to
> > help you contribute to the community itself if possible.
> >
> > - File based versus provided configuration:
> >   Consider allowing the user to enter/paste in a pipeline
> > configuration directly into the property as an alternative to relying
> > on a file reference. By having the configuration entered directly it
> > greatly eases the burden on an administrator having to put that config
> > somewhere on all systems in a nifi cluster and further it means the
> > users through the web UI can easily tweak their pipelines.
> >
> > - Provide a sample configuration/template using it:
> >   It would be awesome if you could write a blog or something that
> > shows this thing in all its glory. How to set it up, sample data, a
> > pipeline, and the results. That would be very helpful.
> >
> > - Handling of 'original' flowfile
> >   Consider having an 'original' relationship which you send the
> > original flowfile down rather than removing it in the session if all
> > goes well. We've found that folks often like to use that relationship
> > after the processing is successful or they can just terminate it. But
> > it gives them the control.
> >
> > - Memory management
> >   Can you describe the memory management aspects of this processor?
> > Will it load the original document in memory fully and will it have
> > all outputs in memory at once? This is a common challenge with XML
> > stuff. This will need to be well described on the processor so users
> > can be careful to consider how many instances/threads/etc. to use.
> >
> > I noticed you did a really nice job of accounting for flowfiles and
> > ensuring provenance would work here. Nice job!
> >
> > Thanks
> > Joe
> >
> > On Tue, Mar 7, 2017 at 10:17 AM, Steve Lawrence <[email protected]> wrote:
> >> We have developed a NiFi processor that uses XMLCalabash [1] to add
> >> support for XProc [2] processing. XProc is an XML transformation
> >> language that defines an XML pipeline, allowing for complex validation,
> >> transformation, and routing of XML data within the pipeline, using
> >> existing XML technologies such as RelaxNG, Schematron, XSD Schema,
> >> XQuery, XSLT, XPath and custom XProc transformations.
> >>
> >> This new processor is mostly straightforward, but we had some questions
> >> regarding the specific implementation and the handling of non-thread
> >> safe code. The code is available for viewing here:
> >>
> >>
> >> https://opensource.ncsa.illinois.edu/bitbucket/projects/DFDL/repos/nifi-xproc/browse
> >>
> >> In this processor, a property is created to provide an XProc file, which
> >> defines the pipeline input and output "ports". XML goes into an input
> >> port, goes through the pipeline, and one or more XML documents exit at
> >> specified output ports. This NiFi processor maps each output port to a
> >> dynamic NiFi relationship. It does this mapping in the
> >> onPropertyModified method when the XProc file property is changed. This
> >> method also stores the XMLCalabash XRuntime and XPipeline objects (which
> >> do all the pipeline work) in volatile member variables to be used later
> >> in onTrigger. The members are saved here to avoid recreating them in
> >> each call to onTrigger. Is this an acceptable place to do that? It seems
> >> this normally happens in an @OnScheduled method or in the first call to
> >> onTrigger, however the objects must be created in onPropertyModified to
> >> get the output ports, so this does avoid recreating the same objects
> >> multiple times. Also note that the same objects are created in the
> >> XML_PIPELINE_VALIDATOR but are not saved due to the validator being
> >> static, so there is already some duplication. Is there a standard way to
> >> avoid duplication/is this an acceptable way to handle this?
> >>
> >> The other concern we have is that the XPipeline and XRuntime objects
> >> created by XML Calabash are not thread safe. To resolve this issue, the
> >> processor is annotated with @TriggerSerially. Is this the correct
> >> solution, or is there some other preferred method? Perhaps ThreadLocal
> >> or a thread safe pool of XPipeline objects is preferred?
> >>
> >> Lastly, is this something the devs would be interested in pulling into
> >> NiFi, and if not, what could be changed to achieve this? The code is
> >> licensed as Apache v2 and we would be happy to contribute the code to
> >> NiFi if deemed acceptable.
> >>
> >> Thanks,
> >> - Steve
> >>
> >> [1] http://xmlcalabash.com/
> >> [2] https://www.w3.org/TR/xproc/
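
As an illustration of the "small pool/cache of lazily inited objects" option from the thread-safety discussion quoted above, here is a rough sketch in plain Java. LazyPool and its methods are hypothetical names, not NiFi or XMLCalabash APIs. The sketch assumes an instance can be reused safely once the thread that borrowed it has finished; whether that holds for XPipeline would need to be checked against XMLCalabash.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical pool for objects that are expensive to build and not thread safe.
// Each borrowed instance is used by exactly one thread at a time.
public class LazyPool<T> {
    private final ConcurrentLinkedQueue<T> idle = new ConcurrentLinkedQueue<>();
    private final Supplier<T> factory;

    public LazyPool(Supplier<T> factory) {
        this.factory = factory;
    }

    // Borrows an idle instance (creating one only if none is idle),
    // runs the work with it, and returns it to the pool afterwards.
    public <R> R with(Function<T, R> work) {
        T obj = idle.poll();
        if (obj == null) {
            obj = factory.get();
        }
        try {
            return work.apply(obj);
        } finally {
            // Assumption: the instance is still usable even if work failed.
            idle.offer(obj);
        }
    }

    // Drops all idle instances, e.g. when the configuration changes.
    public void clear() {
        idle.clear();
    }
}
```

With this shape, the number of live instances never exceeds the number of threads that have been inside `with` at the same time, so no explicit size limit is needed. A processor could call `with(...)` from onTrigger and call `clear()` when the pipeline property changes or the processor stops.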
