Re: how-to query an xml repository efficiently

andre . davignon Tue, 08 Sep 2009 09:51:11 -0700

We do it the same way here.

Thanks for this excellent post and all these explanations and use cases.


André

-----Original Message-----
From: Mark Diggory <mdigg...@apache.org>

Date: Tue, 8 Sep 2009 09:03:04 
To: <users@cocoon.apache.org>
Subject: Re: how-to query an xml repository efficiently


I use Solr as well.  I would summarize the differences between
using Solr+Cocoon and eXist+Cocoon as follows:

Solr: good for term indexing large amounts of content while not
retaining the structural nature of that XML content or necessarily
having to store it.

eXist: good for storing large amounts of XML content and retaining
structure. XQuery syntax also provides a powerful way to evaluate the
structure of the XML content.

If you are seeking to provide user search across precalculated terms
from your XML content then use Solr. You will be parsing your XML and
extracting the terms you wish to provide search on in Solr prior to
your users executing such searches. The Solr response format can be
configured on the Solr server via XSLT and other languages. Solr's
search syntax is that of Lucene, and Solr can provide faceting and
other search-engine-centric metrics on the search results.
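
As a rough illustration of the Lucene syntax and faceting mentioned
above, here is a minimal Python sketch that only builds a Solr select
URL (it does not send it). The host, port, handler path, and the field
names description, category, and price are assumptions borrowed from the
product example quoted later in this thread; q, fq, facet, and
facet.field are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Assumed location of a Solr select handler; adjust to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_search_url(text, category=None, min_price=None, max_price=None):
    """Build a Solr query URL with a facet on the (assumed) category field.

    Note: `text` is not escaped for Lucene special characters here.
    """
    params = [
        ("q", f"description:{text}"),   # Lucene field:term syntax
        ("facet", "true"),              # ask Solr for facet counts
        ("facet.field", "category"),    # one count per distinct category value
    ]
    if category is not None:
        # Filter query restricting results to one category.
        params.append(("fq", f'category:"{category}"'))
    if min_price is not None or max_price is not None:
        # Lucene range syntax; "*" leaves one side of the range open.
        low = "*" if min_price is None else min_price
        high = "*" if max_price is None else max_price
        params.append(("fq", f"price:[{low} TO {high}]"))
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_search_url("cookies", category="food", max_price=5)
print(url)
```

The facet counts returned for category are what would populate a
dropdown of all possible values, while the price range needs no prior
knowledge of the data.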

If you don't know what queries you (or your users) want to use, or
expect to want to evaluate some structure of your XML in those
queries, then use eXist as a database solution for greater
flexibility in this area. There you can use XQuery to write SQL-like
FLWOR expressions and control the serialization to XML at the same
time.
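
To make the FLWOR idea concrete, here is a small Python sketch that
wraps an XQuery in a request URL for eXist's REST interface. It is only
a sketch under assumptions: the host, port, and the /db/products
collection are placeholders, and the element names follow the sample
product document quoted later in this thread. The _query and _howmany
parameters belong to eXist's REST API; nothing is actually sent.

```python
from urllib.parse import urlencode

# Assumed eXist REST endpoint and collection path; both are placeholders.
EXIST_REST_URL = "http://localhost:8080/exist/rest/db/products"

# FLWOR = for, let, where, order by, return. The return clause controls
# how each hit is serialized to XML.
FLWOR_QUERY = """
for $p in collection('/db/products')//product
let $price := xs:decimal($p/price)
where $p/category = 'food' and $price lt 5
order by $price
return <hit id="{$p/id}">{$p/description/text()}</hit>
""".strip()


def build_query_url(xquery, max_results=10):
    """Build a GET URL that passes the XQuery via the _query parameter."""
    params = {"_query": xquery, "_howmany": str(max_results)}
    return EXIST_REST_URL + "?" + urlencode(params)


query_url = build_query_url(FLWOR_QUERY)
print(query_url)
```

The where clause plays the role of SQL's WHERE, and the query can be
changed at any time without re-indexing, which is the flexibility
described above.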

I think they both have their place and could even fill important
search and query requirements within the same application.  Solr will
give you a simpler search/result syntax while eXist will provide more
database capability.  Both have a good Client API in REST and multiple
client library implementations in various languages.  Both are highly
configurable.

For instance, you could place your XML repository into eXist and
use XQuery to extract the contents that should be indexed into
Solr for simple search and discovery of that content.  You could then
also use eXist to retrieve the required fragments of the content for a
more detailed representation of your Solr results rather than having
to parse the entire XML document into memory to do so.
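
The extraction step could look roughly like the following Python
sketch, which turns one product document into a Solr XML update message
(<add><doc><field name="...">). It uses the sample product quoted later
in this thread; the list of indexed fields is an assumption and would
have to match whatever your Solr schema.xml defines.

```python
import xml.etree.ElementTree as ET

# Sample document taken from the product example quoted in this thread.
PRODUCT_XML = """<product>
 <id>xxxx</id>
 <description>grandma's cookies</description>
 <category>food</category>
 <price>2.0</price>
</product>"""

# Fields to copy into the index; the Solr schema is assumed to define them.
INDEXED_FIELDS = ("id", "description", "category", "price")


def to_solr_add(product_xml):
    """Convert one product document into a Solr <add><doc> update message."""
    product = ET.fromstring(product_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name in INDEXED_FIELDS:
        value = product.findtext(name)
        if value is not None:
            # Only the selected terms are indexed, not the full structure.
            field = ET.SubElement(doc, "field", name=name)
            field.text = value.strip()
    return ET.tostring(add, encoding="unicode")


solr_message = to_solr_add(PRODUCT_XML)
print(solr_message)
```

A message like this would then be POSTed to Solr's update handler and
followed by a commit, while the full 500kb document stays in eXist (or
on the file system) for detailed retrieval.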

Mark

2009/9/8 "DAVIGNON Andre - CETE NP/DIODé/PANDOC"
<andre.davig...@developpement-durable.gouv.fr>:
> Robby,
>
> One more thing about this subject.
>
> You can do all that stuff directly with Cocoon / Lucene with Java code only,
> but Solr offers rich possibilities for index configuration via schema.xml, and
> the index can be managed with an HTTP client inside Cocoon through the Solr XML /
> HTTP API. Or in Java code with the SolrJ API if you prefer.
>
> André (not David ;-) )
>
>
> On 08/09/2009 11:12, Robby Pelssers (via Internet, sent through
> users-return-97980-andre.davignon=developpement-durable.gouv...@cocoon.apache.org)
> wrote:
>>
>> You all convinced me to investigate the SOLR path further ;-)
>>  I already installed SOLR yesterday but I probably did not spend enough
>> time playing with it.  That's why I ask the
>> experts on this mailing list ;-)
>>
>> David's answer "The facet search functionality in Solr can give access
>> to all possible values in the index of your data for a given property so
>> the user can pick one among them, then find all matching data." was the
>> missing piece of the puzzle.
>>
>> Thx a lot guys !!
>>
>> Robby
>>
>> -----Original Message-----
>> From: Jeroen Reijn [mailto:j.re...@onehippo.com]
>> Sent: Tuesday, September 08, 2009 10:45 AM
>> To: users@cocoon.apache.org
>> Subject: Re: how-to query an xml repository efficiently
>>
>> Hi Robby,
>>
>> in this case I even think SOLR would be a great match for this use case.
>>
>> You can push XML with an HTTP client to SOLR and let SOLR index the
>> information. See the post.jar that comes with the SOLR example. It pushes
>> XML to the SOLR app, which indexes it based on your configuration.
>>
>> The great thing is that you can even configure all kinds of facets based
>> on what is stored in such a product file, so you can create a nice facet
>> view in your webapp.
>>
>> A couple of years ago I was looking at some Forrest components [1], which
>> were made for using SOLR from a Cocoon point of view. They help you to
>> perform queries against a SOLR instance from your sitemap and get an XML
>> response back.
>>
>> Regards,
>>
>> Jeroen
>>
>> [1]http://wiki.apache.org/solr/SolrForrest
>>
>> Robby Pelssers wrote:
>>>
>>> Hi Jeroen and others who replied to my mail...  Let me further explain
>>> my use case and existing infrastructure.
>>>
>>> My customer stores their product data in XML files on the file system, e.g.:
>>> ${repofolder}/
>>>        products/
>>>                product-1/
>>>                        product-1.xml
>>>                        product-1-image.jpg
>>>                        ...
>>>                product-2/
>>>                        product-2.xml
>>>                        product-2-image.jpg
>>>                ...
>>>
>>> This is a simplified representation but as you see there is no concept
>>> of an xml database.
>>>
>>> Now let's start with a small fictitious example for product-1.xml:
>>> <product>
>>>  <id>xxxx</id>
>>>  <description>grandma's cookies</description>
>>>  <category>food</category>
>>>  <price>2.0</price>
>>> </product>
>>>
>>> From a functional point of view they want to be able to search for
>>> products based on some criteria.  So I'll have to build a small
>>> search form containing:
>>>        - Dropdown with all possible categories
>>>        - textbox to search for part of description
>>>        - price "between / equal to / greater than / less than" search
>>>          functionality
>>>
>>> So for certain "Filter"-criteria I'll have to get all possible values so
>>> they can pick one and for others I don't need to know anything about the
>>> actual data.
>>>
>>> The actual product xml-files are +- 500kb on average and I'm talking
>>> about LOTS of products so I have to consider performance upfront.
>>>
>>> SOLR seems good for indexing static html files etc but I don't get the
>>> impression it can offer the necessary functionality for this use case.
>>>
>>> Any comments??
>>>
>>> Cheers,
>>> Robby
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Jeroen Reijn [mailto:j.re...@onehippo.com]
>>> Sent: Tuesday, September 08, 2009 9:01 AM
>>> To: users@cocoon.apache.org
>>> Subject: Re: how-to query an xml repository efficiently
>>>
>>> Hi Robby,
>>>
>>> do you perhaps have any more specs on what kind of XML database it is?
>>>
>>> At our company we have experience with an Apache Slide backed database,
>>> which we used for storing XML files and let Slide index them with
>>> Lucene. Then based on DASL queries we could search the repository really
>>> quickly.
>>>
>>> Next to DASL I know there are also XML databases that can use XQuery
>>> to perform fast searches on their XML content.
>>>
>>> Regards,
>>>
>>> Jeroen
>>>
>>> Robby Pelssers wrote:
>>>>
>>>> Hi all,
>>>>
>>>>
>>>> I have the following use case.  The customer has an XML repository which is
>>>> nothing more than a directory on the filesystem which contains
>>>> subdirectories containing one or more XML files.  They now want to query
>>>> those XML files on some predefined criteria which might change over time...
>>>>
>>>>
>>>> I'm looking for a solution which results in high performance search and
>>>> some things that came to my mind were
>>>>
>>>> *         extracting information and storing it in a database (e.g. HSQLDB)
>>>> *         using Lucene
>>>>
>>>>
>>>> Is there somewhere detailed documentation available on using these? And
>>>> what would you recommend for my use case?
>>>>
>>>>
>>>> I already found some stuff but no real quick-start material.
>>>>
>>>> http://cocoon.apache.org/2.1/userdocs/concepts/xmlsearching.html
>>>>
>>>> http://cocoon.apache.org/2.2/blocks/hsqldb-client/1.0/
>>>>
>>>> http://cocoon.apache.org/2.2/blocks/hsqldb-server/1.0/
>>>>
>>>>
>>>> Thx in advance,
>>>>
>>>> Robby Pelssers
>>>>
>>>>
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscr...@cocoon.apache.org
>>> For additional commands, e-mail: users-h...@cocoon.apache.org
>>>
