Re: NiFi XProc Processor

Pierre Villard Tue, 07 Mar 2017 10:06:37 -0800

Hi Steve,

Regarding the possibility of having two properties (one for file-based
configuration and one for configuration entered in the UI), you can have a
look at InvokeScriptedProcessor [1], which takes the same approach.
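
To make the idea concrete, here is a minimal sketch in plain Java of how
"exactly one of file or body" could be checked and resolved. It is only an
illustration: the class name XProcConfigSource and the property labels
'XProc File' and 'XProc Body' are hypothetical, and it deliberately avoids
the NiFi API. In a real processor this kind of check would more likely live
in the processor's custom validation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of NiFi or XMLCalabash.
final class XProcConfigSource {

    private XProcConfigSource() {}

    // True when the property value is unset or contains only whitespace.
    static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }

    // Returns an error message when zero or both properties are set,
    // or null when exactly one of them is set.
    static String validate(String filePath, String body) {
        boolean hasFile = !isBlank(filePath);
        boolean hasBody = !isBlank(body);
        if (hasFile == hasBody) {
            // Property names are assumptions made for this sketch.
            return "Exactly one of 'XProc File' or 'XProc Body' must be set";
        }
        return null;
    }

    // Returns the pipeline text, either as entered inline or read from the file.
    static String resolve(String filePath, String body) {
        String error = validate(filePath, body);
        if (error != null) {
            throw new IllegalArgumentException(error);
        }
        if (!isBlank(body)) {
            return body;
        }
        Path path = Paths.get(filePath.trim());
        try {
            return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read XProc file " + path, e);
        }
    }
}
```

With separate properties there is no need to guess whether a value is a path
or a pipeline document, which keeps validation messages unambiguous.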


Anyway, I'm looking forward to your contribution, it looks like a nice
addition to XML processing!

[1]
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-scripting-bundle/nifi-scripting-processors/src/main/java/org/apache/nifi/processors/script/InvokeScriptedProcessor.java

2017-03-07 17:34 GMT+01:00 Steve Lawrence <[email protected]>:

> Oleg and Joe,
>
> Thanks for your quick responses. This clarified a lot of my questions.
> I'll make the suggested changes and create a bug in JIRA to start
> the process and see if the community would accept it.
>
> Regarding the file based versus provided configuration, in practice,
> XProc configurations are fairly large and complicated so it might seem a
> little awkward to copy/paste an XProc config into the NiFi property.
> Using just a path is similar to how the TransformXml processor works
> (XSLT and XProc are very similar in complexity). But I wouldn't be against
> making it configurable somehow. Perhaps have separate options for file
> vs XProc config? Or we could allow the option to be either a file or
> XProc config, and we can programmatically determine if the option looks
> like a valid file or a valid XProc config. Or is that uncommon for NiFi
> properties?
>
> Regarding memory management, you are correct that all inputs and outputs
> are in memory at once. I don't think XMLCalabash has options to do
> something different, e.g. file based. I'll investigate this and if this
> is the case I'll make sure to document this.
>
> Thanks,
> - Steve
>
>
> On 03/07/2017 10:42 AM, Joe Witt wrote:
> > Steve,
> >
> > First, thanks for raising the discussion, and it's awesome that you've
> > built your own processor to help leverage the xml calabash software.
> > A couple of quick thoughts from a short scan:
> >
> > - Processor naming:
> >   Try to come up with a name that is of the form 'verb' 'subject'.
> > For this processor it seems like PipelineXML or ProcessXML is
> > appropriate.  It is in the description and tags for the processor that
> > you'd want to put things like 'XProc', 'XMLCalabash', 'XML', etc.
> >
> > - Thread safety:
> >   XML processing is notoriously slow.  You probably want to go the
> > extra mile to make it support multiple threads and remove trigger
> > serially.  You can either create the necessary XMLCalabash objects on
> > demand with each trigger call or if this is expensive then you can
> > operate on batches of flowfiles at a time (slightly less cool these
> > streamy days) or you could have a small pool/cache of these objects
> > which are lazily inited then reused on subsequent calls.  All the
> > necessary lifecycle hooks are in place on the processor for any of
> > these patterns.
> >
> > - Interest in a contrib to the community:
> >   XML is indeed quite common and often people want to work with it.
> > Provided there is a healthy contribution and all licensing and notice
> > aspects are in order then I think we'd be quite happy to help you turn
> > it into a contribution to the apache nifi community.  If you decide
> > not to go that route this is ok as well but obviously we'd like to
> > help you contribute to the community itself if possible.
> >
> > - File based versus provided configuration:
> >   Consider allowing the user to enter/paste in a pipeline
> > configuration directly into the property as an alternative to relying
> > on a file reference.  By having the configuration entered directly it
> > greatly eases the burden on an administrator having to put that config
> > somewhere on all systems in a nifi cluster and further it means the
> > users through the web UI can easily tweak their pipelines.
> >
> > - Provide a sample configuration/template using it:
> >   It would be awesome if you could write a blog or something that
> > shows this thing in all its glory.  How to set it up, sample data, a
> > pipeline, and the results.  That would be very helpful.
> >
> > - Handling of 'original' flowfile:
> >   Consider having an 'original' relationship which you send the
> > original flowfile down rather than removing it in the session if all
> > goes well. We've found that folks often like to use that relationship
> > after the processing is successful or they can just terminate it.  But
> > it gives them the control.
> >
> > - Memory management:
> >   Can you describe the memory management aspects of this processor?
> > Will it load the original document in memory fully and will it have
> > all outputs in memory at once?  This is a common challenge with XML
> > stuff.  This will need to be well described on the processor so users
> > can be careful to consider how many instances, threads, etc. to use.
> >
> > I noticed you did a really nice job of accounting for flowfiles and
> > ensuring provenance would work here.  Nice job!
> >
> > Thanks
> > Joe
> >
> > On Tue, Mar 7, 2017 at 10:17 AM, Steve Lawrence <[email protected]> wrote:
> >> We have developed a NiFi processor that uses XMLCalabash [1] to add
> >> support for XProc [2] processing. XProc is an XML transformation
> >> language that defines an XML pipeline, allowing for complex validation,
> >> transformation, and routing of XML data within the pipeline, using
> >> existing XML technologies such as RelaxNG, Schematron, XSD Schema,
> >> XQuery, XSLT, XPath and custom XProc transformations.
> >>
> >> This new processor is mostly straightforward, but we had some questions
> >> regarding the specific implementation and the handling of non-thread
> >> safe code. The code is available for viewing here:
> >>
> >>
> >> https://opensource.ncsa.illinois.edu/bitbucket/projects/DFDL/repos/nifi-xproc/browse
> >>
> >> In this processor, a property is created to provide an XProc file, which
> >> defines the pipeline input and output "ports". XML goes into an input
> >> port, goes through the pipeline, and one or more XML documents exit at
> >> specified output ports. This NiFi processor maps each output port to a
> >> dynamic NiFi relationship. It does this mapping in the
> >> onPropertyModified method when the XProc file property is changed. This
> >> method also stores the XMLCalabash XRuntime and XPipeline objects (which
> >> do all the pipeline work) in volatile member variables to be used later
> >> in onTrigger. The members are saved here to avoid recreating them in
> >> each call to onTrigger. Is this an acceptable place to do that? It seems
> >> this normally happens in an @OnScheduled method or in the first call to
> >> onTrigger, however the objects must be created in onPropertyModified to
> >> get the output ports, so this does avoid recreating the same objects
> >> multiple times. Also note that the same objects are created in the
> >> XML_PIPELINE_VALIDATOR but are not saved due to the validator being
> >> static, so there is already some duplication. Is there a standard way to
> >> avoid duplication/is this an acceptable way to handle this?
> >>
> >> The other concern we have is that the XPipeline and XRuntime objects
> >> created by XML Calabash are not thread safe. To resolve this issue, the
> >> processor is annotated with @TriggerSerially. Is this the correct
> >> solution, or is there some other preferred method? Perhaps ThreadLocal
> >> or a thread safe pool of XPipeline objects is preferred?
> >>
> >> Lastly, is this something the devs would be interested in pulling into
> >> NiFi, and if not, what could be changed to achieve this? The code is
> >> licensed as Apache v2 and we would be happy to contribute the code to
> >> NiFi if deemed acceptable.
> >>
> >> Thanks,
> >> - Steve
> >>
> >> [1] http://xmlcalabash.com/
> >> [2] https://www.w3.org/TR/xproc/
>
>
