Tom, first of all, many thanks for your polite and useful reply. 

I used a somewhat inflammatory tone on purpose to test the 'heat
dissipation' capabilities of this community (which is a big part of my
job as an ASF sponsor), and the results indicate that my sponsoring job
will definitely be an easy one :)

I already had the feeling this was the case (Sam did as well; we talked
about it privately), but if even discussing the very technical foundation
of the effort doesn't create negative energy, there is nothing that will
destroy this community.

Enough about the community points.

Now, with my ASF hat removed, back to the technical points.

Tom Bradford wrote:
> 
> Stefano Mazzocchi wrote:
> > I see a native XML database as an incredibly great DBMS for
> > semi-structured data and an incredibly poor DBMS for structured data.
> 
> I don't think anyone's debating that, though I wouldn't use the label
> 'incredibly poor' for structured data, especially since the definition
> of what structured data is can't be answered by relational DBs
> either...  I don't consider normalization and joins as being structure,
> so much as I consider it to be a rigid decomposition of structure.

Good point.

> > Corba? no thanks, I need WebDAV.
> 
> As much as all of us hate it, CORBA absolutely has its uses.  We could
> never get away with wire-compression if we were using a 'service the
> world' WebDAV style approach.  Wire compression has bought us
> performance gains, though not enough to justify keeping it exclusively.

My point was not to remove CORBA from the picture (BTW, is there
anybody here who is using XIndice from CORBA in a real-life
application?) but to indicate my impression that the time would have
been better spent on a WebDAV connector. No offense intended, just a
consideration from the document-oriented world, where CORBA will never
even enter.
 
> > Joins? no thanks, I need document fragment aggregation.
> 
> In the context of XML, I think these are the same.

In terms of functionality, you might be right; in terms of performance,
well, I'm not as optimistic as you seem to be on this.

> > XMLSchemas? no thanks, I need infoset-neutral RelaxNG validation.
> 
> Personally, and I'm just reiterating things I've said in the past, I
> hate W3C XML Schemas, and many others do as well.  

Yep. I've never heard anybody say the opposite.

> I don't want to have
> to put ourselves in a position where we're forced to make a choice on
> any one validation mechanism to the detriment of our users.  

That's a good point, but again, I'm invoking the Darwinian
evolutionary process of this effort: do what people ask for, not what
architectural elegance suggests or the W3C recommends.

> So if we
> can continue to push validation to the client application, that's the
> track we should take... for a couple of important reasons: (1)
> Performance... validation is slow, Bogging down the server to perform it
> can only cause problems, and (2) Choice: If we standardize on W3C
> Schemas, then we exclude support for other schema specifications.  I
> think that's unwise, especially with the major backlash that XML Schemas
> has received.

I agree with you that the engine internals shouldn't deal with
validation, just like Cocoon doesn't validate stuff by default.

The content management system I'd like to have could be built in two
ways:

 1) single layer: XIndice includes all the required functionality.

 2) double layer: XIndice is the db engine, something else wraps it and
performs CMS operations like access control, workflow management, data
validation, versioning, etc.

Separation of concerns clearly indicates that the second option is the
best. This has been my view of the issue since May 2000, when I first
took a serious look at dbXML as the engine for such a system.
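To make the second option concrete, here is a minimal sketch of the split in Python. Every name in it (XmlEngine, Cms, the methods) is a hypothetical illustration, not a real Xindice API: the engine only stores and retrieves XML, while the wrapper layer adds the policy concerns on top.

```python
# Layer 1: a bare storage engine -- stores and retrieves XML, nothing else.
class XmlEngine:
    def __init__(self):
        self.docs = {}

    def store(self, path, doc):
        self.docs[path] = doc

    def get(self, path):
        return self.docs[path]


# Layer 2: the CMS wraps the engine and adds policy around it.
class Cms:
    def __init__(self, engine):
        self.engine = engine

    def save(self, user, path, doc):
        if user != "editor":            # access control lives in this layer...
            raise PermissionError(user)
        # ...as would validation, workflow, and versioning
        self.engine.store(path, doc)


cms = Cms(XmlEngine())
cms.save("editor", "/press/release-1", "<press-release/>")
print(cms.engine.get("/press/release-1"))
```

The point is only the shape of the contract: everything above the store/get line can evolve independently of the engine, which is exactly what the solid-contract argument below requires.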

This is why I wanted XIndice over at Apache: separation of concerns is a
great way to do parallel design, increase productivity, and give users
more choice, but it can't work without *solid* contracts between the
systems that interoperate.

So, what I'm asking is *NOT* to turn XIndice into a CMS, not at all!
What I'd like to see is XIndice remaining *very* abstract on the XML
model, without sacrificing performance, and making it possible to
implement more complex systems on top.

> > If you have structured data, you can't beat the relational model. This
> > is the result of 50 years of database research: do we *really* believe
> > we are smarter/wiser/deeper-thinkers than all the people that worked on
> > the database industry since the 50's?
> 
> One might argue that the relational database industry hasn't learned
> very much in the decades that it's been around.  Not that I'm saying XML
> databases are better, but relational databases were created to solve the
> problems of the databases of their time.  That time has passed.  There
> are still a lot of applications that have the problem that relational
> databases are trying to solve, but there are many applications that have
> the problem that XML databases are trying to solve.  Further still,
> there are apps that no database can adequately solve.

Absolutely. Still, please, let's try to avoid a pissing contest with the
RDBMS communities and instead lead the way in those areas where the
relational model fits only with a very bad twist.

For example, I've seen a clever implementation of an XML database on top
of a relational DB using the parent-child relation of nodes. The problem
was transforming XPath queries into SQL queries with one inner join per
'slash' in the XPath. Go figure the performance :)
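To illustrate why, here is a toy Python sketch of that kind of mapping. The nodes(id, parent_id, name) table layout and the translation scheme are my assumptions about how such an implementation could work, not the actual code I saw; the point is that every '/' step adds another self-join:

```python
# Toy XPath-to-SQL translator over a nodes(id, parent_id, name) table:
# each step in the path becomes one more INNER JOIN on the same table.
def xpath_to_sql(path):
    steps = [s for s in path.strip("/").split("/") if s]
    sql = "SELECT n0.id FROM nodes n0"
    for i in range(1, len(steps)):
        sql += (" INNER JOIN nodes n%d ON n%d.parent_id = n%d.id"
                % (i, i, i - 1))
    where = " AND ".join("n%d.name = '%s'" % (i, name)
                         for i, name in enumerate(steps))
    return sql + " WHERE " + where

# A three-step path already costs two self-joins; deep documents get ugly fast.
print(xpath_to_sql("/press/press-releases/press-release"))
```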
 
> > I see two big fields where XIndice can make a difference (and this is
> > the reason why I wanted this project to move under Apache in the first
> > place!):
> >
> >  - web services
> >  - content management systems
> 
> Don't forget health care, legal documents, and scientific applications.

These are all examples of the above two.

> These are three areas where Xindice has organically found a home in
> since its creation.

Of course.
 
> >  - one big tree with nodes flavor (following .NET blue/red nodes):
> > follows the design patterns of file systems with folders, files,
> > symlinks and such. [great would be the ability to dump the entire thing
> > as a huge namespaced XML file to allow easy backup and duplication]
> 
> >  - node-granular and ACL-based authorization and security [great would
> > be the ability to make nodes 'transparent' for those people who don't
> > have access to see them]
> >
> >  - file system-like direct access (WebDAV instead of useless XUpdate!)
> > [great for editing solutions since XUpdate requires the editor to get
> > the document, perform the diff and send the diff, while the same
> > operation can be performed by the server with one less connection, this
> > is what CVS does!]
> 
> Woah!  Stop right there.  XUpdate is far from useless, and your
> explanation of how it works, in the context of Xindice, is incorrect.

No, I think you didn't get my point (see below).

> When you perform an XUpdate query, it's sent to the server which
> performs all of the work.  Never is a document sent to the client except
> for a summary of how many nodes were touched by the update.  It actually
> performs very well, because you can modify every single document in a
> collection, taking several different actions, with a single command.

XUpdate is a way to express deltas, differences between trees.

In the data-centric world, people are used to sending deltas: replace
this number with that one, append this new address, remove this credit
card from the valid list.

In the document-centric world, people are used to thinking in terms of
files, not their diffs.

CVS is a great system because it does all the differential processing on
documents by itself, transparently.

Now, the use of a delta-oriented update language isn't necessarily bad
as a 'wire transport' (much like CVS sends compressed diffs between the
client and the server), but it definitely isn't useful by itself without
some application-level adaptation.
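For reference, a wire-level delta in XUpdate looks roughly like this (a sketch following the XUpdate working draft; the element names and namespace URI are from that draft, while the addresses data is invented for illustration):

```xml
<xupdate:modifications version="1.0"
    xmlns:xupdate="http://www.xmldb.org/xupdate">
  <!-- replace one value in place, data-centric style -->
  <xupdate:update select="/addresses/address[1]/town">San Mateo</xupdate:update>
  <!-- drop a whole node -->
  <xupdate:remove select="/addresses/address[2]"/>
</xupdate:modifications>
```

Perfectly natural as a transport, but not something a document author should ever have to write by hand.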

Now, let me give you a scenario I'd like to see happen: imagine this
CMS implemented, providing a WebDAV view of your database.

You connect to this 'web folder' (Windows, Linux, and Mac OS X all come
with the ability to mount WebDAV hosts as if they were file system
folders), browse it, and save your file from your favorite XML editor
(or even from something like Adobe Illustrator for SVG).

The CMS will check your access rights (after authentication or using a
client-side certificate, whatever), perform the necessary steps
defined on that folder by the workflow configuration (for example,
sending email to the editor and marking the document with a status of
'to be reviewed'), and save the document.

Now, can I use XIndice to provide the storage system underneath this
CMS?

For example, in order to have a WebDAV view I need 'node flavors': the
ability to say that a node is a 'folder' (currently done with
collections), a 'document', a 'document fragment', or a symlink to
another document fragment.

How can I perform access control at the node level without duplicating
the information at the CMS level? How can I perform versioning without
having to duplicate every document entirely?

Currently, whenever the CMS saves something on top of another document,
it has to fetch the existing document, compute the diff, build the
XUpdate, and send that.

I'm not asking to remove XUpdate from the feature list, but to provide
the appropriate tools for the different uses.
 
> >  - internal aggregation of document fragments (the equivalent of file
> > system symlinks) [content aggregation at the database level will be much
> > faster than aggregation at the publishing level, very useful for content
> > that must be included in the same place... should replace the notion of
> > XML entities]
> 
> We have this functionality in a very experimental form.  It's called
> AutoLinking.  It's been around for a while, but it's going away at some
> point, to be replaced by XQuery.  The problem with it is that you have
> to modify the structure of your XML content, so it can't be treated as
> data.  XQuery will allow this aggregation using the data in the
> documents rather than instructions within the document.  Beyond that,
> there's nothing stopping somebody from using XLink, its just not a task
> that the server will perform because of the passive nature of XLinks.

Yes, you are right that XQuery includes this functionality, but I
suggest you consider the following scenario:

<db:database xmlns:db="xindice#internal" xmlns:cms="CMS">

 <legal db:type="folder">
  <copyright db:type="document" db:version="10.2"
db:last-modified="20010223">
    This is copyright info and blah blah...
  </copyright>
 </legal>

 <press db:type="folder">
  <press-releases db:type="folder">
   <press-release date="20010212" author="blah" 
     db:type="document" db:version="10.2" db:last-modified="20010213" 
     cms:status="published">
    <title>XIndice 2.0 released!</title>
    <content>
     <p>blah blah blah</p>
     <p><db:link href="/legal/copyright[text()]"/></p>
    </content>
   </press-release>
  </press-releases>
 </press>
 
</db:database>

Then you can ask for the document

 /press/press-releases/press-release[@date = '20010212']

and you get

 <press-release>
  <title>XIndice 2.0 released!</title>
  <content>
   <p>blah blah blah</p>
   <p>This is copyright info and blah blah...</p>
  </content>
 </press-release>

which allows your users to skip probably 200 pages of XQuery syntax to
accomplish the same task (and probably be much faster, too!).
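A toy Python sketch of that server-side expansion (the resolve_links function, the in-memory store, and treating 'xindice#internal' as a namespace URI are all assumptions drawn from the example above, not real Xindice behavior):

```python
import xml.etree.ElementTree as ET

DB = "{xindice#internal}"  # namespace URI taken from the example (assumption)

# toy "database": link href -> the text it points at
store = {"/legal/copyright[text()]": "This is copyright info and blah blah..."}

def resolve_links(root, store):
    """Replace every db:link element with the content it references."""
    for parent in list(root.iter()):
        children = list(parent)
        for idx, child in enumerate(children):
            if child.tag == DB + "link":
                text = store[child.get("href")] + (child.tail or "")
                parent.remove(child)
                # splice the resolved text where the link element was
                if idx == 0:
                    parent.text = (parent.text or "") + text
                else:
                    prev = children[idx - 1]
                    prev.tail = (prev.tail or "") + text
    return root

doc = ET.fromstring(
    '<content xmlns:db="xindice#internal">'
    '<p>blah blah blah</p>'
    '<p><db:link href="/legal/copyright[text()]"/></p>'
    '</content>')
print(ET.tostring(resolve_links(doc, store), encoding="unicode"))
```

The expansion is a pure tree operation over data the server already holds, which is why doing it inside the engine should be cheap.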

> >  - native metadata support (last modified time, author, etc..) [vital
> > for any useful caching system around the engine!]
> 
> Some of this is already available, there's no way to expose it currently
> though.
>
> >  - node-granular event triggers [inverts the control of the database:
> > when something happens the database does something, useful mostly to
> > avoid expensive validity lookup for cached resources]
> 
> We talked about this early on in developing the product, but decided to
> put it on a back burner for a while... probably for the same reason we
> decided to shelve any specification validation system.

Without appropriate hooks for caches, any data storage system is
destined not to scale in real-life systems.

I suggest you place the above two features very high on the todo list,
or you'll find people very disappointed when they start hitting
scalability problems and you can't offer them solutions to avoid
saturation.
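To sketch what a node-granular trigger buys a cache (all names here are hypothetical; nothing is an existing Xindice API): the store pushes an event when a node changes and the cache simply drops the entry, so no validity lookup ever has to hit the database.

```python
# Inversion of control: the database calls registered listeners on change.
class NodeStore:
    def __init__(self):
        self.nodes = {}
        self.listeners = []

    def on_update(self, callback):
        self.listeners.append(callback)

    def update(self, path, value):
        self.nodes[path] = value
        for cb in self.listeners:   # "when something happens, the db does something"
            cb(path)


class Cache:
    def __init__(self, store):
        self.entries = {}
        store.on_update(self.invalidate)

    def invalidate(self, path):
        # cheap eviction instead of an expensive validity lookup per request
        self.entries.pop(path, None)


store = NodeStore()
cache = Cache(store)
cache.entries["/press/press-release"] = "<cached/>"
store.update("/press/press-release", "<new/>")
print("/press/press-release" in cache.entries)
```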

> > In short: I'd like to have a file system able to decompose XML documents
> > and store each single node as a file, scale to billions of nodes and
> > perform fast queries with XPath-like syntaxes.
> 
> This is not too far from where we are at the moment.  Nodes are
> individually addressable, but we cluster them into Documents for
> atomicity, much like an object database will cluster objects together in
> a way that ensures optimal I/O performance.
> 
> > This is my vision.
> 
> Now if this can work within the framework of my vision then nobody'll
> get hurt. :-)

Absolutely! That's why this project is here in the first place :)

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<[EMAIL PROTECTED]>                             Friedrich Nietzsche
--------------------------------------------------------------------

