Re: Hierarchical document
One way to implement hierarchical documents is through the use of predefined phrases. Consider these two hierarchies:

1. Kids_and_Teens/Computers/Software/Games
2. Computers/Software/Freeware

When indexing a document belonging to (1), add these terms at consecutive positions (position increment of 1):

"dir:Top dir:Kids_and_Teens dir:Computers dir:Software dir:Games dir:Bottom"

For documents belonging to (2), add:

"dir:Top dir:Computers dir:Software dir:Freeware dir:Bottom"

The terms "dir:Top" and "dir:Bottom" can be used to anchor a query to a specific portion of the hierarchy. Thus, a query containing the phrase:

"dir:Computers dir:Software"

would match documents in both (1) and (2) (and perhaps others), but a query for:

"dir:Top dir:Kids_and_Teens dir:Computers dir:Software"

would target only 'Computers/Software' documents from the 'Kids_and_Teens' top-level directory. (The PhraseQuery slop would be set to 0, so the terms must be adjacent and in order.)

Peter

----- Original Message -----
From: "Tatu Saloranta" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, October 20, 2003 8:24 PM
Subject: Re: Hierarchical document

> On Monday 20 October 2003 10:31, Erik Hatcher wrote:
> > On Monday, October 20, 2003, at 11:06 AM, Tom Howe wrote:
> > There is not a more "lucene" way to do this - its really up to you to
> > be creative with this. I'm sure there are folks that have implemented
> > something along these lines on top of Lucene. In fact, I have a
> > particular interest in doing so at some point myself. This is very
> > similar to the object-relational issues surrounding relational
> > databases - turning a pretty flat structure into an object graph.
> > There are several ideas that could be explored by playing tricks with
> > fields, such as giving them a hierarchical naming structure and
> > querying at the level you like (think Field.Keyword and PrefixQuery,
> > for example), and using a field to indicate type and narrowing queries
> > to documents of the desired type.
> >
> > I'm interested to see what others have done in this area, or what ideas
> > emerge about how to accomplish this.
>
> I'm planning to do something similar. In my case problem is bit simpler;
> documents have associated products, and products form a hierarchy.
> Searches should be able to match not only direct matches (searching
> product, article associated with product), but also indirect ones via
> membership (product member of a product group, matching group).
> Product hierarchy also has variable depth.
>
> To do searches using non-leaf hierarchy items (groups), all actual product
> items/groups associated with docs are expanded to full ids when
> indexing (ie. they contain path from root, up to and including node,
> each node component having its own unique id).
> Thus, when searching for an intermediate node (product grouping),
> match occurs since that node id is part of path to products that are in
> the group (either directly or as members of sub-groups).
>
> Since no such path is stored (directly) in database, this also allows me to do
> queries that would be impossible to do in database (I could add similar
> path/full id fields for search purposes of course). Thus, Lucene index is
> optimized for searching purposes, and database structure for editing
> and retrieval of data.
>
> Another thing to keep in mind is that at least for metadata it may make sense
> to use specialized analyzer, one that allows tokenizing using specific ids
> to store ids as separate tokens; instead of using some standard plain text
> analyzer. This way it is possible to separate ids from textual words (by
> using prefixes, for example, "@1253" or "#13945"); this allows for accurate
> matching based on identity of associated metadata selections.
>
> -+ Tatu +-

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
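To make the anchored-phrase idea from the top of this message concrete, here is a minimal sketch in plain Python. It is not Lucene code: `path_terms`, `phrase_match`, and `search` are hypothetical helpers that only model what an exact phrase match with zero slop would do over the "dir:" terms, and the two documents are the example hierarchies from the message.

```python
def path_terms(path):
    """Turn 'A/B/C' into the anchored sequence dir:Top dir:A dir:B dir:C dir:Bottom."""
    return ["dir:Top"] + ["dir:" + part for part in path.split("/")] + ["dir:Bottom"]


def phrase_match(doc_terms, phrase):
    """True if `phrase` occurs in `doc_terms` as a contiguous, in-order run (slop 0)."""
    n = len(phrase)
    return any(doc_terms[i:i + n] == phrase for i in range(len(doc_terms) - n + 1))


# The two example hierarchies, indexed as term sequences.
DOCS = {
    "games": path_terms("Kids_and_Teens/Computers/Software/Games"),
    "freeware": path_terms("Computers/Software/Freeware"),
}


def search(phrase):
    """Return the names of all documents whose terms contain the phrase."""
    wanted = phrase.split()
    return sorted(name for name, terms in DOCS.items() if phrase_match(terms, wanted))


if __name__ == "__main__":
    # Unanchored: matches the sub-path wherever it occurs.
    print(search("dir:Computers dir:Software"))
    # Anchored at the top: only the Kids_and_Teens branch.
    print(search("dir:Top dir:Kids_and_Teens dir:Computers dir:Software"))
```

Anchoring with "dir:Top" is what separates a top-level 'Computers' from a 'Computers' nested under another directory; without the anchor both are matched.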
Re: Hierarchical document
On Monday 20 October 2003 16:41, Erik Hatcher wrote:
> One more thought related to this subject - once a nice scheme for
> representing hierarchies within a Lucene index emerges, having XPath as
> a query language would rock! Has anyone implemented O/R or XPath-like
> query expressions on top of Lucene?

Not me... but at some point I think I briefly mentioned that someone with extra time might want to write a very simple JDBC driver to be used with Lucene. Obviously it would be very minimal for queries (and might need to invent new SQL operators for some searches), but it could also expose metadata about the index. Should be an interesting exercise at least. :-)

Plus, if done properly, tools like DBVis could be used for simple Lucene testing as well. If so, who knows; perhaps that would make it even easier to prototype Lucene-based replacements for the home-grown, SQL-bound search functionality of apps.

Most of the above would just be a nice little hack, though. :-)

-+ Tatu +-
Re: Hierarchical document
On Monday 20 October 2003 10:31, Erik Hatcher wrote:
> On Monday, October 20, 2003, at 11:06 AM, Tom Howe wrote:
> There is not a more "lucene" way to do this - its really up to you to
> be creative with this. I'm sure there are folks that have implemented
> something along these lines on top of Lucene. In fact, I have a
> particular interest in doing so at some point myself. This is very
> similar to the object-relational issues surrounding relational
> databases - turning a pretty flat structure into an object graph.
> There are several ideas that could be explored by playing tricks with
> fields, such as giving them a hierarchical naming structure and
> querying at the level you like (think Field.Keyword and PrefixQuery,
> for example), and using a field to indicate type and narrowing queries
> to documents of the desired type.
>
> I'm interested to see what others have done in this area, or what ideas
> emerge about how to accomplish this.

I'm planning to do something similar. In my case the problem is a bit simpler; documents have associated products, and products form a hierarchy. Searches should be able to match not only direct matches (searching for a product finds the articles associated with that product), but also indirect ones via membership (a product is a member of a product group, and the search is for the group). The product hierarchy also has variable depth.

To do searches using non-leaf hierarchy items (groups), all actual product items/groups associated with docs are expanded to full ids when indexing (i.e. they contain the path from the root, up to and including the node, with each node component having its own unique id). Thus, when searching for an intermediate node (product grouping), a match occurs since that node id is part of the path to the products that are in the group (either directly or as members of sub-groups).

Since no such path is stored (directly) in the database, this also allows me to do queries that would be impossible to do in the database (I could add similar path/full id fields for search purposes, of course). Thus, the Lucene index is optimized for searching, and the database structure for editing and retrieval of data.

Another thing to keep in mind is that, at least for metadata, it may make sense to use a specialized analyzer, one that stores ids as separate tokens, instead of some standard plain-text analyzer. This way it is possible to separate ids from textual words (by using prefixes, for example "@1253" or "#13945"); this allows for accurate matching based on the identity of the associated metadata selections.

-+ Tatu +-
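The path-expansion scheme described in this message can be sketched as follows. This is a plain-Python illustration under stated assumptions, not code from the thread: the `PARENT` tree, the document names, and the helpers `full_path`, `build_index`, and `search` are all hypothetical, and a dictionary stands in for the Lucene index. The "@" prefix on ids follows the prefix convention mentioned above.

```python
# Hypothetical product tree: node id -> parent id (None marks the root).
PARENT = {
    "@1": None,        # root: all products
    "@10": "@1",       # product group
    "@11": "@1",       # another product group
    "@100": "@10",     # sub-group of @10
    "@1253": "@100",   # leaf product
    "@1300": "@11",    # leaf product
}

# Hypothetical documents and the product nodes they are associated with.
DOC_PRODUCTS = {
    "article-a": ["@1253"],
    "article-b": ["@1300"],
    "article-c": ["@10"],
}


def full_path(node_id):
    """Return the ids from the root down to and including node_id."""
    path = []
    while node_id is not None:
        path.append(node_id)
        node_id = PARENT[node_id]
    return path[::-1]


def build_index():
    """Index every document under each id on the path to its products."""
    index = {}
    for doc, products in DOC_PRODUCTS.items():
        for product in products:
            for node_id in full_path(product):
                index.setdefault(node_id, set()).add(doc)
    return index


INDEX = build_index()


def search(node_id):
    """Documents associated with node_id directly or via any descendant."""
    return sorted(INDEX.get(node_id, ()))


if __name__ == "__main__":
    print(full_path("@1253"))
    print(search("@10"))
```

Because the expansion happens at indexing time, a search for a group is a single term lookup regardless of how deep the hierarchy is; the cost is that moving a node in the tree means reindexing the documents below it.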
Re: Hierarchical document
One more thought related to this subject - once a nice scheme for representing hierarchies within a Lucene index emerges, having XPath as a query language would rock! Has anyone implemented O/R or XPath-like query expressions on top of Lucene?

On Monday, October 20, 2003, at 12:31 PM, Erik Hatcher wrote:
> On Monday, October 20, 2003, at 11:06 AM, Tom Howe wrote:
> > contain Section and Study information and then, if a user wants a set
> > of Study documents, just aggregate them after the search by hand or is
> > there a more "lucene" way of doing this? I'm trying to avoid storing
> > too much redundant information to implement this kind of hierarchical
> > structure, but that may not be possible. I hope I'm being somewhat
> > clear with my question.
>
> There is not a more "lucene" way to do this - its really up to you to
> be creative with this. I'm sure there are folks that have implemented
> something along these lines on top of Lucene. In fact, I have a
> particular interest in doing so at some point myself. This is very
> similar to the object-relational issues surrounding relational
> databases - turning a pretty flat structure into an object graph.
> There are several ideas that could be explored by playing tricks with
> fields, such as giving them a hierarchical naming structure and
> querying at the level you like (think Field.Keyword and PrefixQuery,
> for example), and using a field to indicate type and narrowing queries
> to documents of the desired type.
>
> I'm interested to see what others have done in this area, or what ideas
> emerge about how to accomplish this.
>
> Erik
Re: Hierarchical document
On Monday, October 20, 2003, at 11:06 AM, Tom Howe wrote:
> contain Section and Study information and then, if a user wants a set
> of Study documents, just aggregate them after the search by hand or is
> there a more "lucene" way of doing this? I'm trying to avoid storing
> too much redundant information to implement this kind of hierarchical
> structure, but that may not be possible. I hope I'm being somewhat
> clear with my question.

There is not a more "lucene" way to do this - it's really up to you to be creative with this. I'm sure there are folks that have implemented something along these lines on top of Lucene. In fact, I have a particular interest in doing so at some point myself. This is very similar to the object-relational issues surrounding relational databases - turning a pretty flat structure into an object graph.

There are several ideas that could be explored by playing tricks with fields, such as giving them a hierarchical naming structure and querying at the level you like (think Field.Keyword and PrefixQuery, for example), and using a field to indicate type and narrowing queries to documents of the desired type.

I'm interested to see what others have done in this area, or what ideas emerge about how to accomplish this.

Erik
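One way to read the "hierarchical naming plus type field" suggestion is sketched below in plain Python. It does not use Lucene: the `path` and `type` field names, the sample documents, and `prefix_search` are assumptions made for illustration, with `str.startswith` standing in for a prefix query over an untokenized keyword field.

```python
# Hypothetical documents: one per level, each with an untokenized path
# "keyword" and a field naming the level (document type).
DOCS = [
    {"type": "study", "path": "/study1"},
    {"type": "section", "path": "/study1/sectionA"},
    {"type": "datafile", "path": "/study1/sectionA/file1"},
    {"type": "datafile", "path": "/study1/sectionB/file2"},
    {"type": "datafile", "path": "/study2/sectionA/file3"},
]


def prefix_search(prefix, doc_type=None):
    """Paths starting with `prefix`, optionally narrowed to one document type."""
    return [
        doc["path"]
        for doc in DOCS
        if doc["path"].startswith(prefix)
        and (doc_type is None or doc["type"] == doc_type)
    ]


if __name__ == "__main__":
    # Everything at or below section A of study 1.
    print(prefix_search("/study1/sectionA"))
    # Only the DataFile documents anywhere under study 1.
    print(prefix_search("/study1/", doc_type="datafile"))
```

Note the trailing slash in the second query: a bare "/study1" prefix would also match a sibling named "/study10", so the separator is part of the prefix whenever descendants only are wanted.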
Re: Hierarchical document
Tom Howe wrote:
> Right, but my concern is that each of these levels are really different
> documents.

That's how I understood it. Just add a keyword for each level. If you want to use just one field, then you can also think of your structure as a tree like this:

       doc1         Level 1
      /    \
   doc2    doc3     Level 2
       |
     doc4           Level 4
       |
     doc5           Level 8

This way, if your user says he wants to search in Level 6, you know that he really wants to search in levels 2 and 4, because that is the only way that you can get 6 (2+4). This enables you to work with just one field, because every document only needs to add its level to the index.

If you don't need this flexibility and always want to search from the current level on down, then it is even easier. You then modify your query to search where docLevel >= currentLevel.

In any event, all your indexer has to do is add each document's level to the index.

Ulrich
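The power-of-two level numbering above behaves like a bitmask, which a short plain-Python sketch can show. The document-to-level mapping is taken from the tree in this message; `levels_in`, `search_levels`, and `search_from` are hypothetical helper names, and a dictionary stands in for the single indexed level field.

```python
# Level stored per document, one power of two per tree level.
DOC_LEVEL = {"doc1": 1, "doc2": 2, "doc3": 2, "doc4": 4, "doc5": 8}


def levels_in(mask):
    """Split a combined level value into its power-of-two levels, e.g. 6 -> [2, 4]."""
    return [1 << i for i in range(mask.bit_length()) if mask & (1 << i)]


def search_levels(mask):
    """Documents whose level is one of the levels encoded in `mask`."""
    return sorted(doc for doc, level in DOC_LEVEL.items() if level & mask)


def search_from(current_level):
    """Documents at `current_level` or below it (docLevel >= currentLevel)."""
    return sorted(doc for doc, level in DOC_LEVEL.items() if level >= current_level)


if __name__ == "__main__":
    print(levels_in(6))
    print(search_levels(6))
    print(search_from(4))
```

Each sum of distinct powers of two decomposes in exactly one way, which is why "Level 6" can only mean levels 2 and 4. Against a real index, the decomposed levels would presumably become one term clause each, combined with OR.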
RE: Hierarchical document
>> Hi,
>> I have a very hierarchical document structure where each level of the
>> hierarchy contains indexable information. It looks like this:
>>
>> Study ->
>> Section ->
>> DataFile ->
>> Variable.
>>
>> The goal is to create a situation where a user can execute a search at
>> any level and the search would include all of the information below it
>> in the hierarchy and retrieve the proper aggregated document.

>Say, you're on the level of Study/Section, then in indexing add the
>fields "study" and "section" and set them to, say, "true". When
>searching, just search where those two fields are "true".

>Ulrich

Right, but my concern is that each of these levels is really a different document. So I guess the question should have been: do I need to create documents for the lowest common denominator and then aggregate them into higher-level documents by hand, or make several document types with redundant information and search by document type, or create multiple indices (one for each level) with redundant information?

In other words, should I just add a bunch of DataFile documents that contain Section and Study information and then, if a user wants a set of Study documents, just aggregate them after the search by hand, or is there a more "lucene" way of doing this? I'm trying to avoid storing too much redundant information to implement this kind of hierarchical structure, but that may not be possible. I hope I'm being somewhat clear with my question.

Thanks again,
Tom
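For reference, the "index DataFile documents and aggregate by hand" option asked about above might look roughly like this. It is a plain-Python sketch, not Lucene code: the sample records, the field names, and the functions `search_datafiles` and `search_studies` are all hypothetical, and a naive word match stands in for the actual search.

```python
from collections import defaultdict

# Hypothetical DataFile-level documents, each carrying its Study and
# Section as redundant fields.
DATAFILES = [
    {"id": "df1", "study": "S1", "section": "A", "text": "household income survey"},
    {"id": "df2", "study": "S1", "section": "B", "text": "income by region"},
    {"id": "df3", "study": "S2", "section": "A", "text": "rainfall by region"},
]


def search_datafiles(word):
    """DataFile-level search: every DataFile whose text contains the word."""
    return [doc for doc in DATAFILES if word in doc["text"].split()]


def search_studies(word):
    """Study-level search: run the DataFile search, then group hits by study."""
    grouped = defaultdict(list)
    for hit in search_datafiles(word):
        grouped[hit["study"]].append(hit["id"])
    return dict(grouped)


if __name__ == "__main__":
    print([doc["id"] for doc in search_datafiles("income")])
    print(search_studies("region"))
```

The redundancy here is limited to the Study and Section identifiers on each DataFile record; the grouping step after the search is what turns several DataFile hits into a single Study result.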
Re: Hierarchical document
Tom Howe wrote:
> Hi,
> I have a very hierarchical document structure where each level of the
> hierarchy contains indexable information. It looks like this:
>
> Study -> Section -> DataFile -> Variable.
>
> The goal is to create a situation where a user can execute a search at
> any level and the search would include all of the information below it
> in the hierarchy and retrieve the proper aggregated document.

Say, you're on the level of Study/Section, then in indexing add the fields "study" and "section" and set them to, say, "true". When searching, just search where those two fields are "true".

Ulrich
Hierarchical document
Hi,

I have a very hierarchical document structure where each level of the hierarchy contains indexable information. It looks like this:

Study -> Section -> DataFile -> Variable.

The goal is to create a situation where a user can execute a search at any level and the search would include all of the information below it in the hierarchy and retrieve the proper aggregated document. In other words, someone could search for a Study using a word that appears in several DataFiles in the study, and a single Study document would be returned. At the same time, someone could search for a DataFile, and each of the matching DataFile documents would be returned.

Is there a good way to do this other than using multiple indexes?

Thanks,
Tom