Sorry for taking so long.

Daniel Fagerstrom wrote:
The discussion about input pipelines can be divided in two parts:
1. Improving the handling of the input stream in Cocoon. This is needed for web services; it is also needed to make it possible to implement a writable cocoon: protocol, something that IMO would be very useful for reusing functionality in Cocoon, especially from blocks.

2. The second part of the proposal is to use two pipelines, executed in sequence, to respond to input in Cocoon. The first pipeline (called the input pipeline) is responsible for reading the input, from request parameters or from the input stream, transforming it to an appropriate format and storing it in e.g. a session parameter, a file or a db. After the input pipeline there is an ordinary (output) pipeline that is responsible for generating the response. The output pipeline is executed after the execution of the input pipeline has completed; as a consequence, actions and selections in the output pipeline can depend e.g. on whether the handling of input succeeded and on the data that was stored by the input pipeline.

Here I will focus on your comments on the second part of the proposal.
Ok.

I'm leaving a bunch of stuff uncut because I don't know where to cut the context.

> Daniel Fagerstrom wrote:
<snip/>
>> In Sitemaps
>> -----------
>>
>> In a sitemap an input pipeline could be used e.g. for implementing a
>> web service:
>>
>> <match pattern="myservice">
>> <generate type="xml">
>> <parameter name="scheme" value="myInputFormat.scm"/>
>> </generate>
>> <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
>> <serialize type="dom-session" non-terminating="true">
>> <parameter name="dom-name" value="input"/>
>> </serialize>
>> <select type="pipeline-state">
>> <when test="success">
>> <act type="my-business-logic"/>
>> <generate type="xsp" src="collectTheResult.xsp"/>
>> <serialize type="xml"/>
>> </when>
>> <when test="non-valid">
>> <!-- produce an error document -->
>> </when>
>> </select>
>> </match>
>>
>> Here we have first an input pipeline that reads and validates xml
>> input, transforms it to some appropriate format and stores the result
>> as a dom-tree in a session attribute. A serializer normally means that
>> the pipeline should be executed and the sitemap exited afterwards. I
>> used the attribute non-terminating="true" to mark that the input
>> pipeline should be executed but that there is more to do in the
>> sitemap afterwards.
>>
>> After the input pipeline there is a selector that selects the output
>> pipeline depending on whether the input pipeline succeeded or not.
>> This use of selection has some relation to the discussion about
>> pipe-aware selection (see [3] and the references therein). It would
>> solve at least my main use cases for pipe-aware selection, without
>> having its drawbacks: Stefano considered pipe-aware selection a mix
>> of concerns; selection should be based on meta data (pipeline state)
>> rather than on data (pipeline content). There were also some people
>> who didn't like my use of buffering of all input to the pipe-aware
>> selector. IMO the use of selectors above solves both of these issues.
>>
>> The output pipeline starts with an action that takes care of the
>> business logic for the application. This is IMHO a more legitimate use
>> for actions than the current mix of input handling and business logic.
>
>
> Wouldn't the following pipeline achieve the same functionality you want
> without requiring changes to the architecture?
>
> <match pattern="myservice">
> <generate type="payload"/>
> <transform type="validator">
> <parameter name="scheme" value="myInputFormat.scm"/>
> </transform>
> <select type="pipeline-state">
> <when test="valid">
> <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
> <transform type="my-business-logic"/>
> <serialize type="xml"/>
> </when>
> <otherwise>
> <!-- produce an error document -->
> </otherwise>
> </select>
> </match>

Yes, it would achieve about the same functionality as I want, and it could easily be implemented with the help of the small extensions to the sitemap interpreter that I implemented for pipe-aware selection [3].

I think it could be interesting to do a detailed comparison of our proposals: how the input stream and validation are handled, how the selection based on pipeline state is performed, whether storage of the input is done in a serializer or in a transformer, and how the new output is created.
Ok, let's go.

Input Stream
------------

For input stream handling you used

<generate type="payload"/>

Is the payload generator equivalent to the StreamGenerator? Or does it do something more, like switching parsers depending on the mime type of the input stream?
I really don't think this is important. We are basically discussing whether the current sitemap architecture is good enough for what you want.

Once the Cocoon Environment is more balanced toward input, you can have a uber-payload-generator that does everything and brews beer, or you can have your own small personal generator that does what you want.

My point was: why ask for two pipelines when you can do the same thing with one?

I used

<generate type="xml"/>

The idea is that if no src attribute is given, the sitemap interpreter automatically connects the generator to the input stream of the environment (the input stream from the http request in the servlet case; in other cases it is less clear). This behavior was inspired by the handling of standard input in unix pipelines.
Hmmm, interesting concept indeed, but I wonder if it's really meaningful in our context. I mean, maybe there are generators that don't need src and don't rely on input. But an idiotic TimeGenerator is the only one I can think of... and that really doesn't stand up as an argument, does it?

Nicola Ken proposed:

<generate type="xml" src="inputstream://"/>

I prefer this solution to mine as it doesn't require any change to the sitemap interpreter; I also believe that it is easier to understand as it is more explicit. It also (as Nicola Ken has explained) gives a good SoC: the uri in the src attribute describes where to read the resource from, e.g. input stream, file, cvs, http, ftp, etc., and the generator is responsible for how to parse the resource. If we develop an input stream protocol, all the work invested in the existing generators can immediately be reused in web services.
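To make that SoC concrete, here is a minimal sketch (all names hypothetical, none of this is existing Cocoon API) of what the division of labour boils down to: the protocol only locates and delivers the bytes, and the generator decides how to parse them, whatever the source.

import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch, not Cocoon API: the protocol side only locates and
 *  delivers bytes; the generator side decides how to parse them. */
public class InputStreamProtocolSketch {

    /** The whole job of an "inputstream:" protocol: hand over the request body. */
    static InputStream resolve(InputStream requestBody) {
        return requestBody; // locate and deliver only, no parsing here
    }

    /** Generator side: how to parse is its responsibility, whatever the source. */
    static void generate(InputStream in) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        spf.newSAXParser().parse(new InputSource(in), new DefaultHandler());
    }

    public static void main(String[] args) throws Exception {
        generate(resolve(System.in)); // e.g. pipe an xml document into stdin
    }
}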
It is true that it reduces the number of required generators. But there is something about this that disturbs me, even if I can't really tell you what it is rationally... hmmm...

Validation
----------

Should validation be part of the parsing of input as in:

<generate type="xml">
<parameter name="scheme" value="myInputFormat.scm"/>
</generate>

or should it be a separate transformation step:

<transform type="validator">
<parameter name="scheme" value="myInputFormat.scm"/>
</transform>

or maybe the responsibility of the protocol, as Nicola Ken proposed in one of his posts:

<generate type="xml" src="inputstream:myInputFormat.scm"/>

This is not a question about architecture but rather one about finding "best practices".

I don't think validation should be part of the protocol.
I disagree. Quite strongly, actually. Consider xinclude or any xml expansion that changes the stream infoset. You could have valid templates and valid fragments and still have invalid results (namespaces make the whole thing very tricky... and in the future we'll need the ability to mix tons of them, think FO+SVG+MathML for a normal example).

Now, if our xml-processing architecture is balanced enough, people might want to use xinclude transformers to juice-up their SOAP-processing pipelines. At that point, where do you validate?

Keeping the validation at a separate level helps because:

1) validation becomes explicit and infoset-transparent, in the spirit of RelaxNG.

2) multiple validation is possible (in the spirit of Xpipe)

3) pipeline authors are more aware of validation issues as pipeline processing stages.

It means that the protocol has to take care of the parsing, and that would muddle the SoC that Nicola Ken has argued for in his other posts, where the protocol is responsible for locating and delivering the stream and the generator is responsible for parsing it.
Well, the problem is that relating the concept of validation to the concept of parsing and infoset production/augmentation is a *MISTAKE* that the XML specification perpetuated from the SGML days.

Please, let's stop it once and for all. Putting validation as an implicit stage of parsing would set us back at least 5 years in markup technology design.

Should validation be part of the generator or a transform step? I don't know.
Transformation, for the simple reason that you might need to validate a pipeline more than once.
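
For illustration, here is roughly what validation-as-a-transformation-stage looks like in plain SAX terms: a minimal sketch using the standard javax.xml.validation API (not Cocoon's validator components, and the schema file name is made up). The point is that the validating stage is itself a ContentHandler that events flow through, so it composes like any other transformer and can appear in the chain more than once.

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ValidatingStage {
    public static void main(String[] args) throws Exception {
        Schema schema = SchemaFactory
            .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new File("myInputFormat.xsd")));

        // A ValidatorHandler is itself a ContentHandler: events pass through
        // it to the next stage while being checked, transformer-style.
        ValidatorHandler stage = schema.newValidatorHandler();
        stage.setContentHandler(new DefaultHandler()); // downstream consumer

        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(stage);
        reader.parse(args[0]); // throws on validity errors by default
    }
}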

If the input is not xml, as for the ParserGenerator, I guess that the validation must take place in the generator. If the xml parser validates the input as part of the parsing, it is more practical to let the generator be responsible for validation (IIRC Xerces2 has an internal pipeline structure and performs validation in a transformer-like way, so for Xerces2 it would probably be as efficient to do validation in a transformer as in a generator).
Note that the fact of including the *location* of a schema inside a document is another huge mistake, perpetuated because XML failed to describe schema catalogs.

A document should indicate what "type" of document it is (something like the public DTD identifier) and let the system find out *how* to validate that document type.

Otherwise it seems to give better SoC to separate the parsing and the validation steps, so that we can have one validation transformer for each schema language.
No, if the description of the document is done properly (NOTE: even JClark hasn't yet figured out a way to address the issue).

I would do it like this

<?xml version="1.0"?>
<document xml:type="http://apache.org/document/1.1/">
...
</document>

and then it's up to the processor to understand how to validate a document type indicated by that URI.

NOTE: it's not a namespace URI, but an identifier for the type of document that we are using. Of course, the same identifier can be used in both cases. For example

<?xml version="1.0"?>
<d:document
xml:type="http://apache.org/document/1.1/"
xmlns:d="http://apache.org/document/1.1/">
...
</d:document>

In some cases it might be practical to augment the xml document with error information, to be able to give more exact user feedback on where the errors are located. For such applications it seems more natural to me to have validation in a transformer.

A question that might have architectural consequences is how the validation step should report validation errors.
Agreed.

If the input is not parseable at all, there is not much more to do than throwing an exception and letting the ordinary internal error handler report the situation. If some of the elements or attributes in the input have the wrong type, we probably want to return more detailed feedback than just the internal error page. Some possible validation error reporting mechanisms are: storing an error report object in the environment, e.g. in the object model; augmenting the xml document with error reporting attributes or elements; throwing an exception object that contains a detailed error description object; or a combination of these mechanisms.

Mixing data and state information was considered a bad practice in the discussion about pipe-aware selection (see references in [3]); that rules out using only augmentation of the xml document as the error reporting mechanism. Throwing an exception would AFAIU lead to difficulties in giving customized error reports. So I believe it would be best to put some kind of state-describing object in the environment and possibly combine this with augmentation of the xml document.
Yes, that would be my assumption too. And in case there is the need to incorporate those validation mistakes back into the content, a transformer (maybe even an XSLT stylesheet) can do that.

This seems the cleanest solution to me.
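
For concreteness, a minimal sketch of such a state-describing object (the class, the key, and the Map standing in for the object model are all hypothetical):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical state descriptor that a validating stage leaves behind. */
public class ValidationResult {

    /** Well-known object model key the stage publishes itself under. */
    public static final String KEY = "validation-result";

    private final List<String> errors = new ArrayList<String>();

    public void addError(String message) { errors.add(message); } // e.g. "line 12: bad date"
    public boolean isValid() { return errors.isEmpty(); }
    public List<String> getErrors() { return Collections.unmodifiableList(errors); }

    /** Called by the validating stage; selectors and error pages read it back later. */
    public void storeIn(Map<String, Object> objectModel) { objectModel.put(KEY, this); }
}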

Pipe State Dependent Selection
------------------------------

For selecting the response based on whether the input document is valid or not, you suggest the following:

...
<transform type="validator">
<parameter name="scheme" value="myInputFormat.scm"/>
</transform>
<select type="pipeline-state">
<when test="valid">
<transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
...

As I mentioned earlier, this could easily be implemented with the "pipe-aware selection" code I submitted in [3]. Let us see how it would work:

The PipelineStateSelector cannot be executed at pipeline construction time as ordinary selectors are.
Gosh, you're right, I didn't think about that.

The pipeline before the selector, including the ValidatorTransformer, must have been executed before the selection is performed. This can be implemented by letting the PipelineStateSelector implement a special marker interface, say PipelineStateAware, so that it can get special treatment in the selection part of the sitemap interpreter.
yes

When the sitemap interpreter gets a PipelineStateAware selector, it first ends the currently constructed pipeline with a serializer that stores its sax input in e.g. a dom-tree; the pipeline is processed and the dom-tree with the cached result is stored in e.g. the object model. In the next step the selector is executed, and it can base its decision on the result from the first part of the pipeline. If the ValidatorTransformer puts a validation result descriptor in the object model, the PipelineStateSelector can perform tests on this result descriptor. In the last step a new pipeline is constructed where the generator reads from the stored dom-tree, and in the example above, the first transformer will be an XSLTTransformer.
we are reaching the point where pipeline selection cannot be processed "a-priori" but must include information on the run-time environment.

As much as I didn't like pipe-aware selection, I do agree that validation-aware selection is a special pipe-aware selection, but it *IS* very important and must be taken into consideration.

Hmmm, this kinda sheds a totally different light on the concept of selection (which has an interesting side effect in making selectors and matchers even more different than they are today).
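
To pin the idea down, a minimal sketch of the marker interface treatment (PipelineStateAware and PipelineStateSelector are the hypothetical names from above; the object model is approximated by a plain Map, and ValidationResult is the descriptor sketched earlier):

import java.util.Map;

/** Marker interface: tells the sitemap interpreter that the pipeline built
 *  so far must be executed before this selector can be asked anything. */
interface PipelineStateAware {}

/** Selects on the state descriptor a previous stage left in the object model. */
public class PipelineStateSelector implements PipelineStateAware {

    /** Loosely modeled on the Selector contract; test is the <when test="..."> value. */
    public boolean select(String test, Map<String, Object> objectModel) {
        ValidationResult state =
            (ValidationResult) objectModel.get(ValidationResult.KEY);
        if (state == null) {
            return false; // nothing recorded: no branch matches
        }
        if ("valid".equals(test)) return state.isValid();
        if ("non-valid".equals(test)) return !state.isValid();
        return false;
    }
}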

An alternative and more explicit way to describe the pipeline state dependent selection above is:

...
<transform type="validator">
<parameter name="scheme" value="myInputFormat.scm"/>
</transform>
<serialize type="object-model-dom" non-terminating="true">
<parameter name="name" value="validated-input"/>
</serialize>
<select type="pipeline-state">
<when test="valid">
<generate type="object-model-dom">
<parameter name="name" value="validated-input"/>
</generate>
<transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
...

Here the extensions to the current Cocoon semantics are put in the serializer instead of the selector. The sitemap interpreter treats a non-terminating serializer as an ordinary serializer in the sense that it puts the serializer at the end of the current pipeline and executes it. The difference is that instead of returning to the caller of the sitemap interpreter, it creates a new current pipeline and continues to interpret the components after the serializer, in this case a selector. The sitemap interpreter will also ignore the output stream of the serializer; the serializer is supposed to have side effects. The new current pipeline will then get an ObjectModelDOMGenerator as generator and an XSLTTransformer as its first transformer.
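
In interpreter terms the treatment would be something like this minimal sketch (all the types are stand-ins, this is not Cocoon code):

import java.util.List;

/** Hypothetical sketch of the proposed interpreter treatment. */
public class NonTerminatingSerializerSketch {

    /** Stub for a sitemap statement; only what the sketch needs. */
    interface Statement {
        boolean isNonTerminatingSerializer();
    }

    /** Stub for a pipeline under construction. */
    interface Pipeline {
        void add(Statement s);
        void process(); // execute the pipeline built so far
    }

    interface PipelineFactory {
        Pipeline newPipeline();
    }

    static void interpret(List<Statement> statements, PipelineFactory pipelines) {
        Pipeline current = pipelines.newPipeline();
        for (Statement s : statements) {
            current.add(s);
            if (s.isNonTerminatingSerializer()) {
                current.process();                 // run for its side effects only,
                current = pipelines.newPipeline(); // then keep interpreting
            }
        }
        current.process(); // the last pipeline produces the actual response
    }
}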
No, I'm sorry but I don't like this. I totally don't like the abuse of serializers for this concept of 'intermediate-non-sax-stream' components. It's potentially very dangerous; I see an incredible potential for abuse.

What do others think about this concept of pipelining pipelines? Isn't this kind of recursion the mark of FS?

I prefer this construction to the more implicit one because it is more obvious what it does, and also because it gives more freedom in how to store the user input.
True, but it also gives people more ability to abuse the system. Think about internal pipelines, and views, and resources and aggregation... have you thought about all the potential uses of this pipeline pipelining in all current sitemap use cases?

You are, in fact, proposing a *MAJOR* change in the way the pipelines are set up. In short, more freedom and less pipeline granularity... but sometimes it's good to make it harder for people to come up with something... so they *THINK* about it.

Maybe I'm being too conservative, but I'm very afraid of all those unplanned (and unwanted) changes that these new chained pipelines could produce... (besides, how do you stop them from wanting more than two pipelines? should we? would you also like to chain a pipeline with a reader and then another pipeline?)

Some people seem to prefer to store user input in Java beans; in some applications session parameters might be a better place than the object model.
I've seen the ugliest sitemaps coming out of exactly that concept of storing everything in the session and then parsing it back into the pipeline... believe me, it's more abused than used correctly as it is right now.



Pipelines with Side Effects
---------------------------

A common pattern in pipelines that handle input (at least in the applications that I write) is that the first half of the pipeline takes care of the input and ends with a transformer that stores it. The transformer can be e.g. the SQLTransformer (with insert or update statements), the WriteDOMSessionTransformer or the SourceWritingTransformer. These transformers have side effects: they store something, and they return an xml document that tells whether they succeeded or not. A conclusion from the threads about pipe-aware selection was that sending meta data, like whether the operation succeeded or not, in the pipeline is a bad practice, and especially that we shouldn't allow selection based on such content. Given that these transformers basically translate xml input to a binary format and generate an xml output that we are supposed to ignore, it would IMO be more natural to see them as some kind of serializer.

The next half of the pipeline creates the response; here it is less obvious what transformer to use. I normally use an XSLTTransformer, typically ignoring its input stream and only creating an xml document that is rendered into e.g. html in a subsequent transformer.

I think that it would be more natural to replace the pattern:

...
<transform type="store something, return state info"/>
<transform type="create a response document, ignore input"/>
...

with

...
<serialize type="store something, put state info in the environment"
non-terminating="true"/>
<generate type="create a response document" src="response document"/>
...

If we also give the serializer a destination attribute, all the existing serializers could be used for storing input in files etc.

...
<serialize type="xml" dest="xmldb://..." non-terminating="true"/>
Now, let me ask you something: how much have you been playing with the FlowScript?

A while ago I proposed the ability to call a pipeline from the flowscript while specifying the output stream that the serializer should use. Basically, the flow can now use a pipeline as a tool to do stuff without necessarily being tied to the client.

In all your discussion you have been placing a bunch of flow logic (how to move from one pipeline to the next) into the sitemap. I'd suggest moving it where it belongs (the flow) and letting the sitemap do its job (defining pipelines that others can use).

Why? Well, while the concept of stateless output is inherently declarative, the concept of stateless input + output is declarative for the match and procedural for its internals.

So, I wonder, why don't we leave the declarative part to the sitemaps and use the flow as our procedural glue?

...

This would give the same SoC that I argued in favour of in the context of input: the serializer is responsible for how to serialize from xml to the binary data format, and the destination is responsible for where to store the data.
This can be achieved with a flow method that includes a way to specify the output stream (or a WriteableSource, probably better) that the serializer has to use.
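
Something with this shape, say (the helper and its signature are hypothetical, meant only to show the flow owning the glue between pipelines, not an existing Cocoon interface):

import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.Map;

/** Hypothetical flow-level helper. */
interface PipelineRunner {
    /** Process the pipeline matching uri, serializing into out instead of
     *  the client's output stream. */
    void processToStream(String uri, Map<String, Object> params, OutputStream out)
        throws Exception;
}

/** Usage sketch: the flow drives two pipelines and owns the glue between them. */
class FlowGlueSketch {
    void handleService(PipelineRunner runner, Map<String, Object> params,
                       OutputStream client) throws Exception {
        // 1. run the "input" pipeline, capturing its output privately
        ByteArrayOutputStream stored = new ByteArrayOutputStream();
        runner.processToStream("internal/store-input", params, stored);

        // 2. decide procedurally which response pipeline to run; this is the
        //    logic the sitemap-only variants push into pipe-aware selectors
        String response = stored.size() > 0 ? "service/ok" : "service/error";
        runner.processToStream(response, params, client);
    }
}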

Conclusion
----------

I am afraid that I pose more questions than I answer in this RT. Many of them are of a "best practice" character, do not have any architectural consequences, and do not have to be answered right now. There are however some questions that need an answer:

How should pipeline components, like the validation transformer, report state information? Placing some kind of state object in the object model would be one possibility, but I don't know.
The real problem is not where to store the data, IMO, but the fact that you have shown that there is a serious need for run-time selection that can't be addressed with today's architecture.

We seem to agree that there is a need for selection in pipelines based on the state of the computation in the part of the pipeline that precedes the selection.
Yes. I finally got to this conclusion.

Here we have two proposals:

1. Introduce pipeline state aware selectors (e.g. by letting the selector implement a marker interface), and give such selectors special treatment in the sitemap interpreter.

2. Extend the semantics of serializers so that the sitemap interpreter can continue to interpret the sitemap after a serializer (e.g. via a new non-terminating attribute for serializers).

I prefer the second proposal.
I prefer the first :)

Both proposals can be implemented with no backward compatibility problems at all, by requiring the selectors or serializers that need the extended semantics to implement a special marker interface, and by adding code that reacts to the marker interface in the sitemap interpreter.
Yes, I see that.

To use serializers more generally for storing things, as I proposed above, the Serializer interface would need to extend the SitemapModelComponent interface.
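
For concreteness, this is the shape of what I mean (simplified stand-ins for the real interfaces in org.apache.cocoon; a sketch, not a patch):

import java.util.Map;

/** Stand-in: in Cocoon the setup() signature also takes a SourceResolver
 *  and Parameters; abbreviated here to what the sketch needs. */
interface SitemapModelComponent {
    void setup(Map<String, Object> objectModel, String src) throws Exception;
}

/** Stand-in: in Cocoon a Serializer is a SAX consumer with an output stream. */
interface Serializer {}

/** A serializer that can see the object model and a destination, and can
 *  therefore store what it serializes and report state, as proposed above. */
abstract class StoringSerializer implements Serializer, SitemapModelComponent {
    protected Map<String, Object> objectModel;
    protected String destination;

    public void setup(Map<String, Object> objectModel, String src) {
        this.objectModel = objectModel;
        this.destination = src; // e.g. "xmldb://..." from a dest attribute
    }
}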
Don't know about that. I like serializers the way they are, but I'd like to be able to detach them from the client output stream but using the flowscript.

--
Stefano Mazzocchi <[EMAIL PROTECTED]>
--------------------------------------------------------------------


