Re: [RT] A new Forrest implementation?

Tim Williams Tue, 15 Aug 2006 07:22:43 -0700

On 8/15/06, Ross Gardler <[EMAIL PROTECTED]> wrote:

Tim Williams wrote:
> On 8/14/06, Ross Gardler <[EMAIL PROTECTED]> wrote:
>
>> This is a Random Thought. The ideas contained within are not fully
>> developed and are bound to have lots of holes. The idea is to promote
>> healthy discussion, so please, everyone, dive in and discuss.


...

> I think the Cocoon community has recognized the monolithic-ness of the
> framework.  Stefano brought it up[1] and I think the responses are
> encouraging - though the maven promises leave *very* much to be
> desired as it has effectively stopped me from even attempting to build
> their trunk.

It has been discussed a great many times. Some progress has been made,
but I very much doubt it will happen in a time frame sufficient to help
Forrest. The thread you link to is certainly not the first that
highlighted this issue.

>> What Forrest Does
>> =================
>>
>> Input -> Input Processing -> Internal Format -> Output Processing ->
>> Output Format
>>
>> To do this we need to:
>>
>> - locate the source document
>> - determine the format of the input document
>> - decide which input plugin to use
>> - generate the internal format using the input plugin
>> - decide what output plugin we need
>> - generate the output format using the output plugin
>>
>> Let's look at each of these in turn
>
>
> Oversimplified but we'll see where you go with this...

Please expand. Please add in the complexities that you see so that we
can examine them.

>> Locate the source document
>> --------------------------
>>
>> To do this we use the locationmap, this is Forrest technology.
>
>
> A lot of Avalon and Excalibur, plus a very little Cocoon for context,
> all (things considered) wrapped up by a very little bit of Forrest
> code.  I'm just suggesting that we've done nothing but wrap some
> stuff here - "forrest technology" is a stretch.  To recreate it, we
> could get context elsewhere but we'd need an equivalent to
> avalon/excalibur I think.

Come on, are you really claiming that we need Avalon+Excalibur+Cocoon to
create a hashmap of possible matches to any given string?


I'm saying the matching/selection does not come from Forrest code.
They would need to be implemented.  Source resolution/validity does
not come from Forrest code; it would need to be implemented.

All we need is pattern matching followed by a lookup then a lookup. See
my pseudo code later in the original post. The *concept* of the
Locationmap is Forrest technology and it can be reproduced without any
of the baggage Cocoon requires us to bring along.
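
The "pattern matching followed by a lookup" idea can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class name, the {1} placeholder syntax and the example patterns are hypothetical, and source validity checks and caching are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from Forrest or Cocoon.
class LocationmapSketch {
    // One locationmap entry: a request pattern and a location template.
    private static final class Entry {
        final Pattern pattern;
        final String template;

        Entry(Pattern pattern, String template) {
            this.pattern = pattern;
            this.template = template;
        }
    }

    // Entries are tried in registration order; the first match wins.
    private final List<Entry> entries = new ArrayList<>();

    // Register a regular expression and a template using {1}, {2}, ... for groups.
    void add(String regex, String template) {
        entries.add(new Entry(Pattern.compile(regex), template));
    }

    // Match the whole request against each pattern, then fill in the template.
    Optional<String> resolve(String request) {
        for (Entry entry : entries) {
            Matcher m = entry.pattern.matcher(request);
            if (m.matches()) {
                String location = entry.template;
                for (int i = 1; i <= m.groupCount(); i++) {
                    String group = m.group(i) == null ? "" : m.group(i);
                    location = location.replace("{" + i + "}", group);
                }
                return Optional.of(location);
            }
        }
        return Optional.empty();
    }
}
```

With an entry mapping docs/(.+)\.html to xdocs/{1}.xml, a request for docs/index.html resolves to xdocs/index.xml, and a request that matches no entry yields an empty result.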

>> Decide which input plugin to use
>> ---------------------------------
>>
>> This is done by resolving the processing request via the Cocoon sitemap.
>> But why?
>>
>> Each input type should only be processed by a single input plugin, there
>> should be no need for complex pipeline semantics to discover which
>> plugin to apply to a document, all we should need to do is look up the
>> type of document in a plugins table.
>
>
> And aggregates?  The end result isn't from a single document but an
> aggregate of multiple data URIs - at least that's the dispatcher
> plan as I understand it.

All aggregates are about requesting multiple input sources and merging
them together. Therefore aggregates do not belong here, they belong in
the output plugin stage (so I'll come back to this later)

> A Cocoon transformer levies pretty
> minimal requirement: an XMLConsumer/XMLProducer (easy and natural, sax
> event handlers and a single method respectively) and some simple
> lifecycle contract methods needed for being a part of the managed
> environment.

I really should have been talking about the complexities of writing a
generator, as we very rarely need to write transformers. Try writing a
generator that, for example, uses Hibernate to communicate with a
relational database.


Same thing, except it's just a producer and not also a consumer.  The
code to do this will be almost exactly the same in any other
SAX-event-streaming approach.  But anyway...

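import java.util.List;
import org.apache.cocoon.generation.AbstractGenerator;
import org.hibernate.Session;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Person and HibernateUtil are application classes that are not shown here.
public class HibernateGenerator extends AbstractGenerator
{
  // Streams the committer list into the pipeline as SAX events.
  public void generate() throws SAXException {
    // SAX requires an Attributes object even for attribute-free elements.
    AttributesImpl noAttrs = new AttributesImpl();
    contentHandler.startDocument();
    contentHandler.startElement("", "committers", "committers", noAttrs);

    List committers = listCommitters();

    for (int i = 0; i < committers.size(); i++) {
      Person indCommitter = (Person) committers.get(i);
      char[] name = indCommitter.getName().toCharArray();
      contentHandler.startElement("", "committer", "committer", noAttrs);
      contentHandler.startElement("", "name", "name", noAttrs);
      contentHandler.characters(name, 0, name.length);
      contentHandler.endElement("", "name", "name");
      contentHandler.endElement("", "committer", "committer");
    }

    contentHandler.endElement("", "committers", "committers");
    contentHandler.endDocument();
  }

  // Loads all committers inside a single transaction.
  private List listCommitters() {
    Session session = HibernateUtil.getSessionFactory().getCurrentSession();
    session.beginTransaction();
    List result = session.createQuery("from Committers").list();
    session.getTransaction().commit();
    return result;
  }
}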

No comments on code quality here;)  I guess the point here is that you
can come up with a complex "generator" requirement, but you've already
admitted that SAX event-streaming is the way to go.  If this is true,
then the complexity of turning some source content into SAX events
will ultimately remain.

[Note: I've got no experience with Hibernate so this example is
strictly based on their docs.]
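
For comparison, roughly the same event stream can be produced with nothing but the SAX and JAXP classes that ship with the JDK. This is a minimal sketch, assuming the committer names have already been loaded (the database access is omitted); the class and method names are made up for illustration.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch: a SAX producer with no Cocoon or Avalon dependency.
class CommitterSaxProducer {
    // Emit one <committer><name>...</name></committer> per name to any SAX consumer.
    static void generate(List<String> names, ContentHandler handler) throws SAXException {
        AttributesImpl noAttrs = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "committers", "committers", noAttrs);
        for (String name : names) {
            handler.startElement("", "committer", "committer", noAttrs);
            handler.startElement("", "name", "name", noAttrs);
            handler.characters(name.toCharArray(), 0, name.length());
            handler.endElement("", "name", "name");
            handler.endElement("", "committer", "committer");
        }
        handler.endElement("", "committers", "committers");
        handler.endDocument();
    }

    // Serialize the events to a string using the JDK's identity transformer.
    static String toXml(List<String> names) {
        try {
            SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler serializer = factory.newTransformerHandler();
            serializer.getTransformer()
                .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.setResult(new StreamResult(out));
            generate(names, serializer);
            return out.toString();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The generate method is almost line for line the body of the Cocoon generator, which supports the point that the work of turning source content into SAX events stays the same whichever framework hosts it.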

> I think being in some sort of managed environment (e.g.
> Spring) is likely needed in any real-world approach.  So I'd turn this
> around and ask where is the complexity?

First complexity: building Cocoon

Second complexity: building any component that has additional dependencies

Third complexity: deploying a new (non-trivial) component within a plugin

Fourth complexity: a community that is pulling in many different directions

There are many more but I will leave it at that. If you don't agree then
I suggest you actually try it before arguing the case. You can then tell
me where I am going wrong.


"Actually try" what?  Surely you can be more constructive than
questioning my credibility here?  I've built Cocoon before.  I am
unable to do so now after the Mavenization.  I've expressed that
frustration here and on the Cocoon list.  Building Cocoon is complex,
I agree.  The code inside the TreeProcessor is complex, I agree.  The
standard components (Generator, Transformer, etc.) are not that
complex.  What is it that you'd like me to "actually try" and I'll
respond.

Of course, it can be argued that 1-3 are because Forrest was built
against a much older version of Cocoon and has failed to keep up (for
example, why are plugins not Cocoon blocks?). I would respond that this is
because of the fourth complexity.

So, then it can be argued that we should be contributing to Cocoon and
helping resolve the fourth complexity. That may be the outcome of this
RT, it may not.


sounds reasonable.

>> Decide what output plugin to use
>> --------------------------------
>>
>> This is done by examining the requested URL. The actual selection of the
>> output plugin is done within the Cocoon sitemap. I have all the same
>> arguments here as I do for input plugins, this only needs to be a simple
>> lookup, not a complex pipeline operation.
>
>
> I get the feeling you're basing this on the simplest use-case
> imaginable.  The output plugin is about the format of the output not
> the content of the output.  The sitemap benefits here allow for more
> complex processing (e.g. user profiling, smart content delivery, etc.)

I disagree. The sitemap is a way of *configuring* this complex
processing, it is not the processing itself. The sitemap has become an
XML programming language and I hate it for that reason.

Have you ever dived in to the implementation and tried to do anything
useful in there?


Again, what implementation?  I've looked inside to the Treeprocessor
code in Cocoon, yes, and it is difficult to grasp.  I did this when
doing the LM mounting stuff to see how mountnodes were implemented in
the sitemap - I like to think this was useful.  I see no reason why
the average user would care about this stuff though.

The fact that the sitemap had become a programming language is one
reason why Cocoon came up with the flow engine (e.g. to get rid of
actions). But if you use the flow engine then you are programming with
Javascript, it's only a small step from there to Java. So are there any
benefits in using Javascript over Java?

In my opinion the answer is a resounding no, at least for our use case.

>> Generate the output format
>> --------------------------
>>
>> This is typically done by an XSLT transformation and/or by a third party
>> library (e.g. FOP). I have the same arguments here as I do for the
>> generation of internal format documents, in fact the parts of Cocoon we
>> use are identical in both cases.
>
>
> Yeah, output is just a transformer.  Same thoughts as above.

OK, back to aggregation since I argued earlier that it belongs here.

Aggregation is nothing more than the collation of a number of resources
in response to a single request. It turns a single request to a number
of requests. Each individual request is handled just like any other
request. So what you have is a locationmap something like this:

<map match="foo/bar/**">
   <aggregate>
     <location src="..." required="true"/>
     <location src="..." required="false"/>
   </aggregate>
</map>


Fair enough, move the aggregation to the Locationmap.  This looks very
similar to the sitemap though, no?
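
One way to read the required="true"/"false" attributes in that fragment is sketched below. It is an assumption, not something spelled out in this thread, that a missing optional source is skipped while a missing required one fails the whole request; the class names and the fetch function are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of an aggregate: one request fans out to several sources.
class AggregateSketch {
    // Mirrors <location src="..." required="..."/>.
    static final class Location {
        final String src;
        final boolean required;

        Location(String src, boolean required) {
            this.src = src;
            this.required = required;
        }
    }

    // Fetch each location in order. Missing optional sources are skipped;
    // a missing required source aborts the aggregation.
    static List<String> collect(List<Location> locations,
                                Function<String, Optional<String>> fetch) {
        List<String> parts = new ArrayList<>();
        for (Location location : locations) {
            Optional<String> content = fetch.apply(location.src);
            if (content.isPresent()) {
                parts.add(content.get());
            } else if (location.required) {
                throw new IllegalStateException(
                    "Required source missing: " + location.src);
            }
        }
        return parts;
    }
}
```

Each call to fetch stands for an ordinary request handled like any other; how the collected parts are merged into one document is left open here.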

>> Caching
>> -------
>>
>> Cocoon's caching mechanism is pretty good, but it has its limitations
>> within Forrest. In particular, we have discovered that the Locationmap
>> cannot be cached efficiently using the Cocoon mechanisms.
>
>
> This may be true. We had a novice working on LM caching at the time
> and I've learned quite a bit since then.  I'd like to re-evaluate this
> before I'm willing to agree with such a bold statement.

This illustrates my point exactly. I looked at this too and also failed
to get a better solution.

The reason I failed (and I guess the same for you) is that the code is
just so complex and jumbled that it's next to impossible to find one's
way around once one gets past the API.


I've documented my challenges somewhere.  It had to do with the timing
of getCacheKey() and getValidity() for mounted maps I think - I'd have
to go back and look.

>> This is now
>> one of the key bottlenecks in Forrest.
>
>
> Based on?  I'd like to see this profiling data.  Knowing that the LM
> is our way ahead I've been worried about squeezing every ounce where
> we could but I was still under the impression that it isn't a
> consequential performance bottleneck.

Try building the Cocoon docs. It's set up on a Forrestbot in our zone.
Even when co-located on the same physical machine as the source for the
content it takes over 30 minutes to build. It really is a horrible solution.


My question was really whether you confirmed that the locationmap is
the reason for this slowness?  I suspect it is not and, thus, not a
"key bottleneck" in Forrest.

If you want to profile it then you can get the forrest site from the
Cocoon-Whiteboard.


I'll take a look to see if it's really the Locationmap that's the culprit there.

This is an extreme example case, but one that is quite common in my
experience using Forrest to do real document processing (as opposed to
web site generation).


I'm disagreeing with your conclusion - that the LM is a key
contributor to the performance problems.  I am not disagreeing with
the performance problem itself.  For example, I think a much larger
contributing factor is that we re-generate everything for changes that
really impact only a small part of a site.  This has nothing to do
with Cocoon baggage; we just have an implementation that isn't very
efficient.

>> We could work with Cocoon on their caching mechanism but there seems
>> little interest in this since our use case here is quite unusual. Of
>> course, we can do the work ourselves and add it to Cocoon. But why not
>> use a caching mechanism more suited to our needs?
>
>
> So it's not 100% suitable so it's worthless?  It fits 98%
> of our needs so I don't see this as a compelling argument.

That's unfair. I'm saying it is not perfect, therefore it is not
necessary to use it. I did not say it is not perfect so let's get rid of
it. Please take this in the context of all the other problems I am
highlighting rather than considering it as a single point.

Besides it doesn't work for the locationmap, so in fact it is not used
in some of the processing of every single request we make. That's
considerably more than "2%".


Yeah, it's a baby-and-the-bathwater thing I think. I'd rather figure out
how to solve our problem with the current cache mechanism than see
this as a reason to re-implement all of Forrest.  I'm just saying that
of all the things that might motivate me to be involved in a
re-implementation, this one doesn't strongly resonate with me.

>> Ready Made Transformations
>> --------------------------
>>

...

> You seem to be
> suggesting that Cocoon requires some big overhead to do transforms and
> that's simply not the case.

That's right, I call 40Mb of bloat a fairly big overhead for doing XSLT
transformations.

This time I really am oversimplifying, but I hope you see my point -
certainly that is how my customers see it. As a result I ended up, in
most cases, writing a series of Java components that I wired together
manually and plugged directly into whatever framework they were using.
This RT is about doing this in a more flexible and reusable way.


Your customers are likely just intimidated by the
Cocoon-learning-curve itself rather than 40Mb of jar files.  Many of
the libraries would be needed regardless I think.  Avalon would need
to be replaced with another container that would likely be larger in
size at least.  batik, fop, jtidy, excalibur, etc, are all still
needed.

>> This complexity makes it difficult for newcomers to get started in using
>> Forrest for anything other than basic XSLT transformations.
>

...

>  My point is that newcomers are
> going to find it difficult to deal with any framework that attempts to
> achieve anything beyond the simplistic.

Yes, but if the framework is designed to do one job (publishing in our
case) then it is simpler to understand than if it is designed to do
every job (as with Cocoon).

>> The end result is that we have only one type of user - those doing XSLT
>> transformations.
>>
>> Plugin Selection
>> ----------------
>>
>> This is done through the sitemap. This is perhaps where the biggest
>> advantage of Cocoon in our context can be found. The sitemap is a really
>> flexible way of describing a processing path.
>>
>> However, it also provides loads of stuff we simply don't need when all
>> we are doing is transforming from one document structure to another. This
>> makes it complex to new users (although having our own sitemap
>> documentation would help here).
>>
>> Finally, as discussed in the previous section, we don't need a complex
>> pipeline definition for our processing, we just need to plug an input
>> plugin to an output plugin via our internal format and that is it. We
>> have no need for all the sitemap bells and whistles.
>
>
> I'm struggling to figure out what you think is forcing us into our
> current apparently overly complex solution.  Is it the sitemap grammar
> that is complex?

Not the grammar itself (although I do hate the fact that we are now
programming using the sitemap). The complexity is in the processing of that
grammar, which results in the selection of the processing path to take.


I don't understand.  Treeprocessor? NodeBuilders? Matchers?

All we need to do is select the right plugins and make them work
together. Look at how many internal pipeline requests there are to do
this in Forrest now (it's even worse if we use the dispatcher).

This is overly complex for what is ultimately a couple of lookups.
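
Read literally, the "couple of lookups" amounts to something like the sketch below. It is a deliberately naive illustration: plain strings stand in for SAX streams, the class and method names are hypothetical, and aggregation, caching and source resolution are all ignored, which is exactly where opinions in this thread differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: plugin selection as two table lookups.
class TwoLookupSketch {
    // Input plugins keyed by source type, output plugins keyed by requested format.
    private final Map<String, UnaryOperator<String>> inputPlugins = new HashMap<>();
    private final Map<String, UnaryOperator<String>> outputPlugins = new HashMap<>();

    void registerInput(String sourceType, UnaryOperator<String> plugin) {
        inputPlugins.put(sourceType, plugin);
    }

    void registerOutput(String requestedFormat, UnaryOperator<String> plugin) {
        outputPlugins.put(requestedFormat, plugin);
    }

    // Source -> internal format -> output format, with one lookup per stage.
    String process(String sourceType, String source, String requestedFormat) {
        UnaryOperator<String> input = inputPlugins.get(sourceType);
        UnaryOperator<String> output = outputPlugins.get(requestedFormat);
        if (input == null || output == null) {
            throw new IllegalArgumentException(
                "No plugin for " + sourceType + " -> " + requestedFormat);
        }
        return output.apply(input.apply(source));
    }
}
```

Registering a "txt" input plugin and an "html" output plugin is enough to serve a txt-to-html request; anything the two tables do not cover is rejected rather than routed through further pipeline matching.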


I'll hopefully find time later to look at your pseudo-code and maybe
it'll make more sense to me.  Right now, I'm just seeing what goes on
as much more than a "couple lookups".

> Learning curves aside, I'd rather sit on top of a framework that
> supports a more complex solution than is my current problem because
> experience has shown me that the initial requirements grow and I don't
> want to have to port when that growth happens.

This is exactly why I hate "catch all" frameworks. They try to be all
things to all people. I prefer to use what I need now and look at
expanding things when I find a use case that requires it. How can you
know in advance that the framework you choose is going to be adequate
for the job in hand? How do you know you won't need Struts, or Ruby On
Rails, or Wicket or SpringMVC or whatever?

This is personal opinion and we should really leave it at the door.
Different people for different things. Our job is to decide what is best
for the project not for us as individuals. I'll just leave you with one
thought...

If I'm going hiking I do not struggle carrying a family tent on my back
just because I may have some more children at some point in the future.


Ok, we'll drop this line of thought as you suggest...

>> Conclusion
>> ----------
>>
>> Cocoon does not, IMHO, bring enough benefits to outweigh the overhead of
>> using it.
>>
>> That overhead is:
>>
>> - bloat (all those jars we don't need)
>
>
> this is going to be addressed with maven (argghhh) and/or osgi someday
> - it's a recognized issue by many cocooners.

"someday" is the operative word there. I've been waiting too long.


C'mon, you're an OS veteran here.  Patches welcome, right?

If we reject this RT based on this argument then I want to see Forrest
developers helping Cocoon sort this out rather than standing by waiting
for it to happen.


Ok, I threw the "maven" thing in with fingers crossed.  I'd rather
they go back to ant personally, maven is silly.  I have a high-speed
connection and it takes forever to download libs each time I *attempt*
to build only to see it fail 10 minutes into it.  Argghhh...

>> - complex code (think of your first attempt to write a transformer)
>
>
> I've never written a transformer.  I suspect that I could do it in a
> day or less though depending upon the requirements.  It's simply
> implementing XMLConsumer by handling SAX events, not that
> extraordinary for a SAX-stream-based framework.  How do the many other
> pipeline frameworks do transforms if not by handling SAX events?

Yes, transformers are simple. I should have picked non-trivial
generators as discussed above. Especially since this is a more common
requirement in the real world. That is, we need input plugins to interface
with existing corporate legacy code.

>> - complex configuration (sitemap, locationmap, xconf)
>
>
> Like component managers nowadays, we've failed to strike a good
> balance between flexibility (configurability) and ease of use.

I really can't agree with the "like component managers nowadays" part.
Have you actually worked with something like Spring? It is unbelievably
simple.

>> - based on Avalon which is pretty much dead as a project
>
>
> They are at least partially migrated to Spring for management
> purposes.   I understood that as a move to eventually migrate fully
> from Avalon to Spring.

Don't be fooled by the "headlines". Look into the code. Until the Avalon
jars are gone then my point stands. Until someone here gets into the
Cocoon code and starts trying to disentangle things then my point stands.


Until the Avalon jars are gone?  That's not fair really.  That's black
and white and doesn't allow for a comprehension of progress.

Let's take a look at the progress...

<removed unnecessary junk>
final class AvalonServiceManager
   implements ServiceManager, BeanFactoryAware {

   protected BeanFactory beanFactory;

   public void setBeanFactory(BeanFactory beanFactory) throws BeansException {
       this.beanFactory = beanFactory;
   }

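   // Avalon role names are used directly as Spring bean names.
   public boolean hasService(String role) {
       return this.beanFactory.containsBean(role);
   }

   public Object lookup(String role) throws ServiceException {
       return this.beanFactory.getBean(role);
   }

   // release(Object), which ServiceManager also requires, is among the parts removed above.
}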

Looks to me like the headlines were correct in this case.  More or
less a light wrapping around Spring.  Spring is doing the heavy
lifting behind the scenes.  It's a whole lot of work to rip out the
Avalon interfaces so I understand the desire to just wrap it for now.

Why don't I do that? I have other things to do, I need Forrest to be
useful, I don't use, and have never used, Cocoon independently of
Forrest (at least not commercially).

>> So Should We Re-Implement Forrest without Cocoon?
>> =================================================
>>
>> In order to find an answer to this question let's consider how we might
>> re-implement Forrest without Cocoon:
>>
>> Locate the source document
>> --------------------------
>>
>> We do this through the locationmap and can continue to do so. We would
>> need to write a new locationmap parser though.  This would simply do the
>> following (note, no consideration of caching at this stage, but there
>> are a number of potential cache points in the pseudo code below):
>
>
> Assumes that matching and selection have already been implemented
> somewhere?

Yes, the way I see it, regular expressions are pretty standard and well
supported.

...

>> Generate the internal document
>> ------------------------------
>>
>> Since the plugins are now loaded via a component manager our
>> transformation classes are POJOs that are independent of any particular
>> execution environment, therefore, there is no need to do anything
>> clever here.
>
>
> I don't understand.  They need input/output contracts, right?  There
> aren't standards defined for such things so it is execution
> environment dependent.  The concept of a POJO is honestly really gray
> to me.  I view Cocoon's transformation classes as POJO's.  I've tried
> to grasp this POJO concept before and gotten lost. The Java community
> certainly has a knack for the creation of buzzwords with blurry
> meaning.

I'm not really using POJO in the correct context here.  All a plugin
needs is a method to do its stuff. This could be called "execute". The
input would be a SAX stream (for which there are multiple standard
implementations), the output would also be a sax stream.

There is no dependency on anything else. Even the container manager in
use would be independent from the plugins and could be replaced at any time.
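
With only the JDK's SAX classes, a plugin of that shape might look like the sketch below. It assumes the "execute" contract means SAX events in and SAX events out; UpperCasePlugin and its execute method are hypothetical names, and the upper-casing is just a stand-in for real processing.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical plugin: consumes SAX events, upper-cases text, passes the rest through.
class UpperCasePlugin extends XMLFilterImpl {
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        char[] upper = new String(ch, start, length)
            .toUpperCase(Locale.ROOT).toCharArray();
        super.characters(upper, 0, upper.length);
    }

    // Wire parser -> plugin -> serializer and run the document through.
    static String execute(String xml) {
        try {
            SAXParserFactory parsers = SAXParserFactory.newInstance();
            parsers.setNamespaceAware(true);
            XMLReader reader = parsers.newSAXParser().getXMLReader();

            SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler serializer = factory.newTransformerHandler();
            serializer.getTransformer()
                .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.setResult(new StreamResult(out));

            UpperCasePlugin plugin = new UpperCasePlugin();
            plugin.setParent(reader);             // events come from the parser
            plugin.setContentHandler(serializer); // and go on to the serializer
            plugin.parse(new InputSource(new StringReader(xml)));
            return out.toString();
        } catch (ParserConfigurationException | TransformerConfigurationException
                 | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The plugin itself depends on nothing beyond org.xml.sax; whatever container creates it and wires its parent and content handler can be swapped without touching the class.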


Again, strictly talking about the components, what you describe above
as a "plugin" is an implementation of XMLProducer and XMLConsumer.
I'm not seeing the benefit/difference but don't waste time on
responding until I actually put the effort into looking at your
pseudo-code.

>> So is this interesting or not?
>
>
> Not so far...  I'm not convinced.  I think you're implicitly
> describing an oversimplified use-case, overstating the complexity of
> Cocoon, and glossing over what we get from Cocoon.  More to come...

Tim, you have argued against my points, are there none that you see
merit in? It would be helpful if you could highlight any points that you
feel are valid, even just by saying "yes, OK". This will enable us to
pull the good stuff out of this thread and to let the bad stuff just rot
away.


Fair enough, I'll try to do this when I respond tonight to the other
half of your first mail.

--tim
