Re: Hierarchical document

2003-10-21 Thread Peter Keegan
One way to implement hierarchical documents is through the use of
predefined phrases. Consider the 2 hierarchies:

1. Kids_and_Teens/Computers/Software/Games
2. Computers/Software/Freeware

When indexing a document belonging to (1), add these terms at consecutive
positions (position increment = 1): "dir:Top dir:Kids_and_Teens dir:Computers
dir:Software dir:Games dir:Bottom"

For documents belonging to (2), add: "dir:Top dir:Computers dir:Software
dir:Freeware dir:Bottom"

The terms "dir:Top" and "dir:Bottom" can be used to anchor a query
to a specific portion of the hierarchy.

Thus, a query containing the phrase "dir:Computers dir:Software" would
match documents in both (1) and (2) (and perhaps others), but a query for
"dir:Top dir:Kids_and_Teens dir:Computers dir:Software" would target only
'Computers/Software' documents from the 'Kids_and_Teens' top-level directory.
(The PhraseQuery slop factor would be set to 0.)
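
The anchored-phrase idea can be sketched in plain Python. This is only a
simulation of an exact (slop 0) phrase match over consecutive term positions;
it does not use Lucene's API, and the helper names (path_terms, phrase_match,
search) are made up for illustration.

```python
# Plain-Python simulation of an exact (slop 0) phrase match.
# It mimics the idea only; none of these helpers exist in Lucene.

def path_terms(path):
    """Turn 'A/B/C' into the anchored term sequence dir:Top ... dir:Bottom."""
    return ["dir:Top"] + ["dir:" + part for part in path.split("/")] + ["dir:Bottom"]

def phrase_match(doc_terms, phrase):
    """True if `phrase` occurs in `doc_terms` as a run of consecutive terms."""
    n = len(phrase)
    return any(doc_terms[i:i + n] == phrase for i in range(len(doc_terms) - n + 1))

# The two hierarchies from the message, indexed as term sequences.
docs = {
    1: path_terms("Kids_and_Teens/Computers/Software/Games"),
    2: path_terms("Computers/Software/Freeware"),
}

def search(phrase_text):
    """Ids of documents containing the whitespace-separated phrase."""
    phrase = phrase_text.split()
    return sorted(doc_id for doc_id, terms in docs.items() if phrase_match(terms, phrase))

print(search("dir:Computers dir:Software"))
print(search("dir:Top dir:Kids_and_Teens dir:Computers dir:Software"))
```

The unanchored phrase finds both documents, while the phrase starting with
"dir:Top" finds only the one under Kids_and_Teens.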

Peter

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Hierarchical document

2003-10-20 Thread Tatu Saloranta
On Monday 20 October 2003 16:41, Erik Hatcher wrote:
> One more thought related to this subject - once a nice scheme for
> representing hierarchies within a Lucene index emerges, having XPath as
> a query language would rock!  Has anyone implemented O/R or XPath-like
> query expressions on top of Lucene?

Not me... but at some point I think I briefly mentioned that someone with
extra time might want to write a very simple JDBC driver to be used with
Lucene. Obviously it would be very minimal for queries (and might need
to invent new SQL operators for some searches), but it could also expose
metadata about the index. Should be an interesting exercise at least. :-)
Plus, if done properly, tools like DBVis could be used for simple Lucene
testing as well.

If so, who knows; perhaps that would make it even easier to prototype
replacing the home-grown SQL-bound search functionality of apps with Lucene.

Most of the above would just be a nice little hack, though. :-)

-+ Tatu +-





Re: Hierarchical document

2003-10-20 Thread Tatu Saloranta
On Monday 20 October 2003 10:31, Erik Hatcher wrote:
> On Monday, October 20, 2003, at 11:06  AM, Tom Howe wrote:
> There is not a more "lucene" way to do this - it's really up to you to
> be creative with this.  I'm sure there are folks that have implemented
> something along these lines on top of Lucene.  In fact, I have a
> particular interest in doing so at some point myself.  This is very
> similar to the object-relational issues surrounding relational
> databases - turning a pretty flat structure into an object graph.
> There are several ideas that could be explored by playing tricks with
> fields, such as giving them a hierarchical naming structure and
> querying at the level you like (think Field.Keyword and PrefixQuery,
> for example), and using a field to indicate type and narrowing queries
> to documents of the desired type.
>
> I'm interested to see what others have done in this area, or what ideas
> emerge about how to accomplish this.

I'm planning to do something similar. In my case the problem is a bit simpler:
documents have associated products, and products form a hierarchy.
Searches should be able to match not only direct matches (searching for a
product finds articles associated with that product), but also indirect ones
via membership (the product is a member of a product group, and the group
matches). The product hierarchy also has variable depth.

To do searches using non-leaf hierarchy items (groups), all actual product
items/groups associated with docs are expanded to full ids when
indexing (i.e. they contain the path from the root, up to and including the
node, each node component having its own unique id).
Thus, when searching for an intermediate node (product grouping),
a match occurs since that node id is part of the path to products that are in
the group (either directly or as members of sub-groups).
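
A minimal sketch of that expansion, in plain Python rather than Lucene. The
parent table, the ids, and the document names are invented for illustration;
only the idea of indexing every id on the root-to-node path comes from the
description above.

```python
# Hypothetical product hierarchy: child id -> parent id (None marks a root).
PARENT = {
    "@1": None,      # root group
    "@12": "@1",     # sub-group
    "@1253": "@12",  # product
    "@2": None,      # another root group
    "@27": "@2",     # product
}

def path_ids(node):
    """Return the ids from the root down to and including `node`."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return list(reversed(path))

def index_docs(doc_products):
    """Inverted index: node id -> set of doc ids whose product path contains it."""
    index = {}
    for doc_id, products in doc_products.items():
        for product in products:
            for node in path_ids(product):
                index.setdefault(node, set()).add(doc_id)
    return index

index = index_docs({"doc-a": ["@1253"], "doc-b": ["@12"], "doc-c": ["@27"]})
print(sorted(index["@1"]))  # docs reachable through the root group "@1"
```

Searching for the group id "@1" then finds doc-a and doc-b, even though
neither is associated with "@1" directly.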

Since no such path is stored (directly) in the database, this also allows me
to do queries that would be impossible to do in the database (I could add
similar path/full id fields for search purposes, of course). Thus, the Lucene
index is optimized for searching, and the database structure for editing
and retrieval of data.

Another thing to keep in mind is that, at least for metadata, it may make sense
to use a specialized analyzer, one that recognizes specific ids and
stores them as separate tokens, instead of using some standard plain text
analyzer. This way it is possible to separate ids from textual words (by
using prefixes, for example "@1253" or "#13945"); this allows for accurate
matching based on the identity of associated metadata selections.
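
As a rough illustration of such id-aware tokenizing, here is a plain-Python
sketch. It is not a Lucene Analyzer; the rule that an id is "@" or "#"
followed by digits is an assumption taken from the two examples above, and
the function names are hypothetical.

```python
import re

# Assumed convention: an id is '@' or '#' followed by digits; anything else
# made of word characters is an ordinary text token.
TOKEN = re.compile(r"[@#]\d+|\w+")

def tokenize(text):
    """Split text into tokens, keeping prefixed ids intact and lowercasing words."""
    return [t if t[0] in "@#" else t.lower() for t in TOKEN.findall(text)]

def split_ids(tokens):
    """Separate id tokens from plain word tokens."""
    ids = [t for t in tokens if t[0] in "@#"]
    words = [t for t in tokens if t[0] not in "@#"]
    return ids, words

tokens = tokenize("Manual for @1253, see #13945 or call 1253")
ids, words = split_ids(tokens)
print(ids)
print(words)
```

Because the prefix stays attached, the id "@1253" and the plain number "1253"
end up as different tokens and cannot match each other by accident.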
 
-+ Tatu +-




Re: Hierarchical document

2003-10-20 Thread Erik Hatcher
One more thought related to this subject - once a nice scheme for 
representing hierarchies within a Lucene index emerges, having XPath as 
a query language would rock!  Has anyone implemented O/R or XPath-like 
query expressions on top of Lucene?

	Erik



Re: Hierarchical document

2003-10-20 Thread Erik Hatcher
On Monday, October 20, 2003, at 11:06  AM, Tom Howe wrote:
> contain Section and Study information and then, if a user wants a set of
> Study documents, just aggregate them after the search by hand or is
> there a more "lucene" way of doing this?  I'm trying to avoid storing
> too much redundant information to implement this kind of hierarchical
> structure, but that may not be possible.  I hope I'm being somewhat
> clear with my question.

There is not a more "lucene" way to do this - it's really up to you to
be creative with this.  I'm sure there are folks that have implemented 
something along these lines on top of Lucene.  In fact, I have a 
particular interest in doing so at some point myself.  This is very 
similar to the object-relational issues surrounding relational 
databases - turning a pretty flat structure into an object graph.  
There are several ideas that could be explored by playing tricks with 
fields, such as giving them a hierarchical naming structure and 
querying at the level you like (think Field.Keyword and PrefixQuery, 
for example), and using a field to indicate type and narrowing queries 
to documents of the desired type.
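
One way to picture the hierarchical-naming idea is the plain-Python sketch
below. It stands in for an untokenized path field queried by prefix plus a
type field; the field names ("path", "type"), the path syntax, and the sample
documents are all assumptions, and no Lucene classes are used.

```python
# Each document carries one untokenized path value plus a type marker.
# Field names and sample values are illustrative assumptions.
docs = [
    {"id": 1, "type": "study",    "path": "/study1"},
    {"id": 2, "type": "section",  "path": "/study1/sectionA"},
    {"id": 3, "type": "datafile", "path": "/study1/sectionA/file1"},
    {"id": 4, "type": "datafile", "path": "/study2/sectionB/file9"},
]

def prefix_search(prefix, doc_type=None):
    """Ids of docs whose path starts with `prefix`, optionally restricted by type."""
    return [
        d["id"] for d in docs
        if d["path"].startswith(prefix)
        and (doc_type is None or d["type"] == doc_type)
    ]

print(prefix_search("/study1"))               # everything under study1
print(prefix_search("/study1/", "datafile"))  # only DataFile docs under study1
```

Ending the prefix with the separator ("/study1/") avoids also matching a
sibling such as "/study10" when only descendants are wanted.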

I'm interested to see what others have done in this area, or what ideas 
emerge about how to accomplish this.

	Erik


Re: Hierarchical document

2003-10-20 Thread Ulrich Mayring
Tom Howe wrote:
> Right, but my concern is that each of these levels are really different
> documents.

That's how I understood it. Just add a keyword for each level.

If you want to use just one field, then you can also think of your 
structure as a tree like this:

         doc1           Level 1
        /    \
     doc2    doc3       Level 2
      |
     doc4               Level 4
      |
     doc5               Level 8

This way, if your user says he wants to search in Level 6, you know that 
he really wants to search in levels 2 and 4, because that is the only 
way that you can get 6 (2+4). This enables you to work with just one 
field, because every document only needs to add its level to the index.

If you don't need this flexibility and always want to search from the
current level on down, then it is even easier: modify your query to
search where docLevel >= currentLevel.

In any event, all your indexer has to do is add each document's level to 
the index.
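
Both variants can be sketched in a few lines of plain Python. The document
names and levels follow the tree above; the function names are hypothetical
and nothing here is Lucene API.

```python
# Levels are powers of two (1, 2, 4, 8), so any sum of levels identifies
# exactly one set of levels.
docs = {"doc1": 1, "doc2": 2, "doc3": 2, "doc4": 4, "doc5": 8}

def levels_in(requested):
    """Decompose a requested value into the power-of-two levels it contains."""
    return [bit for bit in (1, 2, 4, 8) if requested & bit]

def search_levels(requested):
    """Docs whose level is one of the levels encoded in `requested`."""
    return sorted(name for name, level in docs.items() if level & requested)

def search_from(current_level):
    """Docs at `current_level` or below it in the tree (docLevel >= currentLevel)."""
    return sorted(name for name, level in docs.items() if level >= current_level)

print(levels_in(6))      # 6 = 2 + 4
print(search_levels(6))
print(search_from(4))
```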

Ulrich





RE: Hierarchical document

2003-10-20 Thread Tom Howe
>> Hi,
>> I have a very hierarchical document structure where each level of the
>> hierarchy contains indexable information.  It looks like this:
>>
>>   Study ->
>>     Section ->
>>       DataFile ->
>>         Variable.
>>
>> The goal is to create a situation where a user can execute a search at
>> any level and the search would include all of the information below it
>> in the hierarchy and retrieve the proper aggregated document.
>
> Say, you're on the level of Study/Section, then in indexing add the
> fields "study" and "section" and set them to, say, "true". When
> searching, just search where those two fields are "true".
>
> Ulrich

Right, but my concern is that each of these levels is really a different
document.  So I guess the question should have been: do I need to
create documents for the lowest common denominator and then aggregate
them into higher-level documents by hand, or make several document
types with redundant information and search by document type, or create
multiple indices (one for each level) with redundant information?  In
other words, should I just add a bunch of DataFile documents that
contain Section and Study information and then, if a user wants a set of
Study documents, aggregate them after the search by hand, or is
there a more "lucene" way of doing this?  I'm trying to avoid storing
too much redundant information to implement this kind of hierarchical
structure, but that may not be possible.  I hope I'm being somewhat
clear with my question.
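
For the "aggregate by hand" option, a minimal plain-Python sketch might look
like this. The hit structure, field names, and ids are hypothetical; it only
assumes that each DataFile hit carries the ids of its Section and Study.

```python
from collections import defaultdict

# Hypothetical search hits: each DataFile hit carries its Section and Study ids.
hits = [
    {"datafile": "df1", "section": "secA", "study": "study1"},
    {"datafile": "df2", "section": "secA", "study": "study1"},
    {"datafile": "df7", "section": "secC", "study": "study2"},
]

def aggregate(hits, level):
    """Group DataFile hits by the given ancestor field, preserving hit order."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit[level]].append(hit["datafile"])
    return dict(groups)

print(aggregate(hits, "study"))    # one entry per matching Study
print(aggregate(hits, "section"))  # one entry per matching Section
```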

Thanks again,
Tom






Re: Hierarchical document

2003-10-20 Thread Ulrich Mayring
Tom Howe wrote:
> Hi,
> I have a very hierarchical document structure where each level of the
> hierarchy contains indexable information.  It looks like this:
>
>   Study ->
>     Section ->
>       DataFile ->
>         Variable.
>
> The goal is to create a situation where a user can execute a search at
> any level and the search would include all of the information below it
> in the hierarchy and retrieve the proper aggregated document.

Say, you're on the level of Study/Section, then in indexing add the
fields "study" and "section" and set them to, say, "true". When
searching, just search where those two fields are "true".
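
A small plain-Python sketch of the flag-field suggestion follows. The field
names come from the message above; the sample documents and the at_level
helper are made up, and this is not Lucene code.

```python
# Each document sets a flag field for every level it belongs to.
docs = [
    {"id": 1, "study": "true"},
    {"id": 2, "study": "true", "section": "true"},
    {"id": 3, "study": "true", "section": "true", "datafile": "true"},
]

def at_level(*fields):
    """Ids of docs where every named field is set to "true"."""
    return [d["id"] for d in docs if all(d.get(f) == "true" for f in fields)]

print(at_level("study", "section"))
print(at_level("datafile"))
```

In this sketch, documents from deeper levels also match a Study/Section
search, since they carry the same flags.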

Ulrich





Hierarchical document

2003-10-20 Thread Tom Howe
Hi, 
I have a very hierarchical document structure where each level of the
hierarchy contains indexable information.  It looks like this:  

  Study ->
    Section ->
      DataFile ->
        Variable.

The goal is to create a situation where a user can execute a search at
any level and the search would include all of the information below it
in the hierarchy and retrieve the proper aggregated document.  In other
words, someone could search for a Study using a word that appears in
several DataFiles in the study and a single study document could be
returned.  At the same time, someone could search for a DataFile and
each of the matching DataFile documents would be returned.  Is there a
good way to do this other than using multiple indexes? 

Thanks,
Tom

