Re: Simple Faceted Searching out of the box
Nah, he didn't write any DB-JDBC-Lucene examples in Lucene in Action, and neither did Erik. Lazy guys, I tell you! Btw, your situation/problem is very common. Just treat your DB as the truth, and your Solr/Lucene instance as a mechanism to quickly find bits of that truth. Design your system so that in case you lose your Solr/Lucene instance, you can revive it from the truth, even if you have to do it from scratch. RDBMS: a transactional and relational beast that holds the data. Solr/Lucene: quick text lookup. Otis

----- Original Message ----- From: Tim Archambault [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Friday, September 22, 2006 4:15:42 PM Subject: Re: Simple Faceted Searching out of the box

Okay. We are all on the same page. I just don't express myself as well in programming-speak yet. I'm going to read up on Otis' Lucene in Action tonight. I'd swear he had an example of how to inject records into a Lucene index using Java and SQL. Maybe I'm wrong though.

On 9/22/06, Walter Underwood [EMAIL PROTECTED] wrote: [...]
Re: Simple Faceted Searching out of the box
: now I have simple Lucene Indexes that basically re-created once daily and : that simply isn't doing the job for about 30% of my content.

Do you mean it takes too long to index all your content so you can only do part of it, or do you mean it's not indexing some of your content well?

: For indexing news articles for instance, I want the article, all reader : comments, photos, links, multimedia files associated with the article to be : indexed together as one entity so that if Chris Hostetter commented on the : high cost of heating oil in Maine article, I can find the article by : searching on your name, etc

This is a great example of the last 20% of the problem I was talking about ... knowing *when* to reindex a modified record. Even if you have a perfect mechanism for identifying/flattening all of the data that should go in a Document, and a perfect method for detecting when any of that data has changed, it probably isn't practical/efficient to reindex every time ... you might want to say that creating/deleting or modifying the core aspects of a news article (ie: title, dek, byline, body, categories, publish date) should trigger an immediate index update, but for things like user comments it might make more sense to have a batch process that runs every N minutes and reindexes any article that has had comments added in the last N minutes ... except maybe you want to be more responsive to comments added to recent articles, so maybe you configure two separate instances of that cron job, one where N is small but it only looks at articles published today, and another where N is larger and it looks at older articles.

...these are the kinds of tradeoffs that typically have to be made between indexing data quickly and getting good performance out of your index, and it's why I've never tried to build a general purpose indexer for Solr -- the needs of different indexes are too different for it to make much sense.
Besides: if it were that easy, Google would have a hosted solution with a REST API and everyone would just use them to search their sites. :) -Hoss
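The two-cron-job idea above (a tight window for today's articles, a looser one for older ones) can be sketched as a single selection function. This is only an illustration of the tradeoff Hoss describes; the record fields (`id`, `published`, `last_comment`) and the window sizes are hypothetical, not anything from Solr itself.

```python
from datetime import datetime, timedelta

def articles_to_reindex(articles, now,
                        recent_n=timedelta(minutes=5),
                        older_n=timedelta(hours=1)):
    """Union of the two hypothetical cron jobs: a small window (N) for
    articles published today, a larger window for older articles. An
    article is picked for reindexing when a comment landed inside its
    window."""
    picked = []
    for art in articles:
        published_today = art["published"].date() == now.date()
        window = recent_n if published_today else older_n
        if now - art["last_comment"] <= window:
            picked.append(art["id"])
    return picked
```

Each job would then feed the picked ids to whatever rebuilds those documents in Solr.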
Re: Simple Faceted Searching out of the box
I have a couple of questions from some online newspaper folks who are interested in Solr and are trying to understand how and why it came to be. I think inherent in these questions is the underlying theme I hear all the time, and that is: Solr is not a content management system. It's a search engine. What I really wonder about CNet is how they manage their content and how Solr fits into their overall architecture -- is it an add-on? a purpose-built hammer to handle a specific problem they were having? was it something they wanted ... or instead something they needed to do, despite preferring something else? Another question asked of me was: Will Solr ever connect with datasources directly? Thanks in advance for any feedback I can supply the folks. Tim

On 9/10/06, Chris Hostetter [EMAIL PROTECTED] wrote:

: What is faceted browsing? Maybe an example of a site interface

Whoops! ... sorry about that, I tend to get ahead of myself. The examples Erik pointed out are very representative, but there are more subtle ways faceted searching can come into play -- for example, if you look at these two search results...

http://shopper-search.cnet.com/search?q=gta
http://shopper-search.cnet.com/search?q=ipod

...the categories in the left nav change based on what you search on, because we treat category as a facet, and the individual categories as possible constraints ... we don't show the user the exact count of how many products match in each category, but we use that information to determine the order of the categories (or whether we should include a category in the list at all).

: website and this would be a great way to break out content. Kind of greys : the lines between what is search and what is browsing categories, which is a : great thing actually. Thanks for the help.

Even without facets, browsing a set of documents is just a search for all documents (or depending on who you talk to: searching is just browsing with a special user-entered constraint on the text facet) -Hoss
Re: Simple Faceted Searching out of the box
On 9/22/06, Tim Archambault [EMAIL PROTECTED] wrote: I have a couple of questions from some online newspaper folks who are interested in Solr and are trying to understand how and why it came to be. [...] What I really wonder about CNet is how they manage their content and how Solr fits into their overall architecture -- is it an add-on? a purpose-built hammer to handle a specific problem they were having? was it something they wanted ... or instead something they needed to do, despite preferring something else?

Putting on my CNET hat for a little history: We had a search server... a very thin layer built around a proprietary search engine, used in a ton of places, for search-box type functionality and direct generation of dynamic content. That search engine was being discontinued by the vendor, so a replacement was needed. RFPs were put out, and all the commercial alternatives were examined, but the licensing costs for the number of servers we were talking about were exorbitant. So we decided to build our own...

The replacement: ATOMICS, a MySQL/Apache hybrid. http://conferences.oreillynet.com/cs/mysqluc2005/view/e_sess/7066 It works well for many of the search collections we have that don't need much in the way of full-text search (MySQL does have full-text capabilities, but nothing like Lucene). Backup plan: something based on Lucene.

SOLAR really started out as a pure backup plan... just in case ATOMICS had problems in some areas. I had joined CNET a week earlier, and the task of building something Lucene-based was luckily handed to me as I didn't have any other responsibilities yet. Pretty much no requirements except for the preference of something that spoke HTTP/XML that could be put behind a load-balancer and scaled. ATOMICS was pretty much done by the time I started on SOLAR, and was rapidly deployed across CNET.

SOLAR had a tough time gaining traction until someone hit a problem that ATOMICS couldn't easily handle: faceted browsing. There was finally something concrete to aim for, and filter caching, docsets, autowarming, custom query handlers, etc. were rapidly added to make it possible to write custom plugins that could actually do the faceting logic. The result: http://www.mail-archive.com/java-user@lucene.apache.org/msg02645.html It sounds like Hoss might go into some more details in his ApacheCon session: http://www.us.apachecon.com/html/sessions.html#FR26

Another question asked of me was: Will Solr ever connect with datasources directly?

As far as where Solr fits into our architecture, it's a back-end component in the generation of dynamic content... sort of the same place that a database would occupy. I don't know much about content generation in CNET, and specific content management systems, but a lot of it ends up in databases. An indexer piece normally pulls stuff from one or more databases, and puts it into a Solr master, which is replicated out to Solr searchers (or slaves) that the app-servers generating dynamic content hit through a load-balancer. There is a diagram of that in my ApacheCon presentation: http://people.apache.org/~yonik/ApacheConEU2006/

As far as connecting to datasources directly... I think that being able to pull content from a database is a good idea, and it's on the todo list. What specific other data sources did you have in mind? -Yonik
Re: Simple Faceted Searching out of the box
Obvious datasources: MSSQL, MySQL, etc. I'm under the impression that I have to send an XML request to SOLR for every add, update, delete, etc. in my database. I believe there's a way to access MSSQL, MySQL, etc. directly with Lucene, but I'm not sure how to do this with SOLR. Thanks for all your feedback. While I started out way over my head, Solr is actually fun to play around with, even for non-programmers or marginal programmers like myself.

On 9/22/06, Yonik Seeley [EMAIL PROTECTED] wrote: [...]
Re: Simple Faceted Searching out of the box
On Sep 22, 2006, at 2:45 PM, Tim Archambault wrote: I believe there's a way to access MSSQL, MySQL etc. directly with Lucene, but not sure how to do this with SOLR. Nope. Lucene is a pure search engine, with no hooks to databases, or document parsers, etc. Lots of folks have built these kinds of things on top of Lucene, but the Lucene core is purely the text engine. How would you envision communicating with Solr with a database in the picture? How would the entire database be initially indexed? How would changes to the database trigger Solr updates? I'm not quite clear on what it would mean for Solr to work with a database directly so I'm curious. Erik
Re: Simple Faceted Searching out of the box
Okay, I'll use an example. A recruitment (jobs) customer goes onto our website and posts an online job posting to our newspaper website. Upon insert into the database, I need to generate an XML file to be sent to SOLR to ADD as a record to the search engine. Same goes for an edit: my database updates the record, and then I have to send an ADD statement to Solr again to commit my change. 2x the work.

I've been talking with other papers about Solr, and I think what bothers many is that there is a deposit of information in a structured database here [named A], then we have another set of basically the same data over here [named B], and they don't understand why they have to manage two different sets of data [A B] that are virtually the same thing. Many foresee a maintenance nightmare. I've come to the conclusion that there's somewhat of a disconnect between what a database does and what a search engine does. I accept that the redundancy is necessary given the very different tasks that each performs [keep in mind I'm still naive to the programming details here; I understand conceptually].

In writing this to you, another thought came to mind. Maybe there are alternative ways to inject records into Solr outside the bounds of the cygwin and cURL examples I've been using. Maybe that is the question we need to be asking: what are some alternative ways to populate Solr? Enough said, it's Friday afternoon. Have a great weekend. Tim

On 9/22/06, Erik Hatcher [EMAIL PROTECTED] wrote: [...]
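The insert-then-ADD step described above can be automated with a small helper that renders a database row as a Solr add document. A minimal sketch, assuming hypothetical column names; only the `<add><doc><field name="...">` structure is Solr's actual update format.

```python
import xml.etree.ElementTree as ET

def row_to_solr_add(row):
    """Render one database row (a dict of column name -> value; the
    names are hypothetical) as a Solr <add> document, ready to POST
    to Solr's update handler."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in row.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")
```

Note that the same function covers both the insert and the edit case: sending a second ADD with the same unique key replaces the earlier document, which is why an edit is just another ADD.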
Re: Simple Faceted Searching out of the box
I'm really confused. I don't mean store the data figuratively, as in a lucene/solr command. Storing an ID number in a solr index isn't going to help a user find nurse. I think part of this is that some people feel that databases like MSSQL, MySQL should be able to provide a quality search experience, but they just flat out don't. It's a separate utility. Thanks Walter.

On 9/22/06, Walter Underwood [EMAIL PROTECTED] wrote:

On 9/22/06 12:25 PM, Tim Archambault [EMAIL PROTECTED] wrote: A recruitment (jobs) customer goes onto our website and posts an online job posting to our newspaper website. Upon insert into the database, I need to generate an XML file to be sent to SOLR to ADD as a record to the search engine. Same goes for an edit: my database updates the record and then I have to send an ADD statement to Solr again to commit my change. 2x the work. I've been talking with other papers about Solr and I think what bothers many is that there is a deposit of information in a structured database here [named A], then we have another set of basically the same data over here [named B] and they don't understand why they have to manage two different sets of data [A B] that are virtually the same thing.

The work isn't duplicated. Two servers are building two kinds of index: a transactional record index and a text index. That is two kinds of work, not a duplication. Storing the data is the small part of a database or a search engine; the indexes are the real benefit. In fact, the data does not have to be stored in Solr. You can return a database key as the only field, then get the details from the database. That is how our current search works -- the search result is a list of keys in relevance order. Period. wunder -- Walter Underwood Search Guru, Netflix
Re: Simple Faceted Searching out of the box
On 9/22/06, Tim Archambault [EMAIL PROTECTED] wrote: I've been talking with other papers about Solr and I think what bothers many is that there is a deposit of information in a structured database here [named A], then we have another set of basically the same data over here [named B] and they don't understand why they have to manage two different sets of data [A B] that are virtually the same thing.

Yes, I sympathize... if MySQL had a really good full-text search somehow integrated in it, it would simplify things. I think it's probably a lot harder to be both a database and a full-text search server and do both well. We thought about closer integration in the past, but MySQL didn't have triggers or anything, so there was no way to know when something changed and what changed. Databases also can't handle things like Solr's dynamic fields as well.

In writing this to you another thought came to mind. Maybe there are alternative ways to inject records into Solr outside the bounds of the cygwin and cURL examples I've been using.

curl is just used as an example. Hopefully the XML updates are generated programmatically from the database records and automatically sent to Solr? That still requires coding on the user's part though, and I would eventually like to be able to index simple databases with a user-supplied SQL select statement. -Yonik
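The user-supplied-SELECT idea Yonik mentions did not exist in Solr at the time; the core of it can be sketched as mapping each result row to a field-name -> value document. This demo uses an in-memory SQLite table with made-up columns, purely for illustration.

```python
import sqlite3

def docs_from_select(conn, select_sql):
    """Turn each row of a user-supplied SELECT into a dict of
    field name -> value -- the flat shape a generic DB-to-Solr
    loader would index. Column names become field names."""
    cur = conn.execute(select_sql)
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

# tiny demo schema (hypothetical)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO jobs VALUES (?, ?)",
                 [(1, "Registered Nurse"), (2, "Copy Editor")])
docs = docs_from_select(conn, "SELECT id, title FROM jobs ORDER BY id")
```

A single flat SELECT is the easy case; as noted above, multi-valued fields from joined tables are what complicate a generic loader.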
Re: Simple Faceted Searching out of the box
Sorry, I was not being exact with store. Lucene has separate control over whether the value of a field is stored and whether it is indexed. The term nurse might be searchable, but the only value that is stored in the index for retrieval is the database key for each matching job.

It seems like text search should be easy to add to a transactional database, but lots of smart people have tried to make that work and failed. Maybe it is possible, but neither Oracle nor Microsoft nor the open source community has been able to make it happen. The text search in RDBMSs seems to always be slow and lame.

There is one product that does both transactional query and text search: MarkLogic. It does a good job of both, but it is very XML-centric. It might be a good match, if you are into commercial software. It is a rather different style of programming than SQL or Lucene: you write XQuery to define the result XML, with the contents fetched from the database.

wunder (not affiliated with MarkLogic)

On 9/22/06 12:42 PM, Tim Archambault [EMAIL PROTECTED] wrote: [...]
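Walter's key-only pattern (index stores nothing but the database key; details come from the database afterwards) amounts to a two-stage lookup. A minimal sketch; the record shapes and keys are invented for illustration.

```python
def search_results(solr_keys, fetch_from_db):
    """The Solr index returns only database keys in relevance order;
    full records are then loaded from the RDBMS, which remains the
    system of record. fetch_from_db is any callable that loads one
    record by key (a plain dict lookup stands in for the DB here)."""
    return [fetch_from_db(key) for key in solr_keys]

# stand-in for the transactional database (hypothetical records)
jobs_db = {
    "job-7": {"title": "Registered Nurse", "city": "Bangor"},
    "job-9": {"title": "Nurse Manager", "city": "Portland"},
}
hits = search_results(["job-9", "job-7"], jobs_db.__getitem__)
```

The relevance ordering from the search engine is preserved; only the payload comes from the database.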
Re: Simple Faceted Searching out of the box
I think you will find that this architecture is quite common. What commercial packages provide (remember, you are getting this for free!) are the tools for managing the dynamic export of data out of your database into the full-text search engine. Solr provides a very easy way to do this, but yes, you have to do some programming to automate it. There are two common ways of doing this: 1) write a component that periodically checks for new/updated database content and submits it to Solr; 2) write a trigger in the database that immediately posts to Solr (I would use JMS or some other asynchronous messaging system for this). I'm sure there are other solutions. When/if MySQL full-text search is as good as Solr/Lucene, you can cut out one of the steps.

I could see a component added to Solr that did #1 above for you. MG4J has a simple loader that takes a SQL query and indexes the result (JdbcDocumentCollection). For Solr, you'd want to be able to handle multi-valued fields, which complicates things. If this architecture bothers technical folks, they either are accustomed to using very expensive software, or haven't been doing this very long. Of course, I am trying to figure out a way to make Solr more like a database, so there you go... --Joachim

Tim Archambault wrote: Okay, I'll use an example. A recruitment (jobs) customer goes onto our website and posts an online job posting to our newspaper website. [...]
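Option 1 above (periodically check for new/updated content and submit it) is usually built around a last-modified watermark. A sketch under stated assumptions: the `fetch_since` and `post_add` callables, and the `last_modified` column, are all hypothetical stand-ins for a real SQL query and a real POST to Solr.

```python
class SolrSyncPoller:
    """Periodic DB-to-Solr sync: each run fetches rows modified since
    the previous run and submits them. Note the simplification: a
    strict '>' watermark can skip rows sharing the newest timestamp;
    real pollers overlap the window or track row ids as well."""

    def __init__(self, fetch_since, post_add):
        self.fetch_since = fetch_since  # watermark -> list of changed rows
        self.post_add = post_add        # posts one row's <add> doc to Solr
        self.watermark = 0.0            # last_modified of newest row seen

    def run_once(self):
        rows = self.fetch_since(self.watermark)
        for row in rows:
            self.post_add(row)
            self.watermark = max(self.watermark, row["last_modified"])
        return len(rows)
```

A cron job (or a timer thread) would call `run_once` every N minutes; option 2 replaces the polling with a database trigger pushing onto a message queue.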
Re: Simple Faceted Searching out of the box
Okay. We are all on the same page. I just don't express myself as well in programming-speak yet. I'm going to read up on Otis' Lucene in Action tonight. I'd swear he had an example of how to inject records into a Lucene index using Java and SQL. Maybe I'm wrong though.

On 9/22/06, Walter Underwood [EMAIL PROTECTED] wrote: Sorry, I was not being exact with store. Lucene has separate control over whether the value of a field is stored and whether it is indexed. [...]
Re: Simple Faceted Searching out of the box
: I've been talking with other papers about Solr and I think what bothers many : is that there is a deposit of information in a structured database here : [named A], then we have another set of basically the same data over here : [named B] and they don't understand why they have to manage two different : sets of data [A B] that are virtually the same thing. Many foresee a

The big issue is that while SQL schemas may be fairly consistent, uses of those schemas can be very different ... there is no clear-cut way to look at an arbitrary schema and know how far down a chain of foreign key relationships you should go and still consider the data you find relevant to the item you started with (from a search perspective). ORM tools tend to get around this by lazy-loading: if your front-end application starts with a single jobPostId and then asks for the name of the city it's mapped to, or the name of the company it's mapped to, it will dynamically fetch the Company object from the company table, or maybe it will only fetch the single companyName field ... but when building a search index you can't get that lazy evaluation -- you have to proactively fetch that data in advance, which means you have to know in advance how far down the rabbit hole you want to go.

Not all relationships are equal either: you might have a Skills table and a many-to-many relationship between JobPosting and Skills, with a mappingType on the mapping indicating which skills are required and which are just desirable -- those should probably go in separate fields of your index, but some code somewhere needs to know that.

Once you've solved that problem -- once you've got a function that you can point at your DB, give it a primary key, and get back a flattened view of the data that can represent your Solr/Lucene Document -- you're 80% done ... the problem is that the 80% isn't a generically solvable problem ... there aren't simple rules you can apply to any DB schema to drive that function.
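One possible shape for that "80%" flattening function, using Hoss's JobPosting/Company/Skills example. All table, column, and field names here are hypothetical; the point is only that the code must decide in advance which foreign-key hops to follow and how to split the many-to-many skills by mappingType.

```python
def flatten_job_posting(job, company, skill_maps, skills):
    """Flatten one JobPosting row plus its foreign-key neighborhood
    into a single flat, multi-valued document for indexing. Skills are
    routed into separate fields according to the mappingType on the
    many-to-many mapping, as described above."""
    skill_names = {s["skillId"]: s["name"] for s in skills}
    doc = {
        "id": job["jobPostId"],
        "title": job["title"],
        "companyName": company["name"],   # one hop down the FK chain
        "requiredSkill": [],              # multi-valued fields
        "desiredSkill": [],
    }
    for m in skill_maps:
        field = ("requiredSkill" if m["mappingType"] == "required"
                 else "desiredSkill")
        doc[field].append(skill_names[m["skillId"]])
    return doc
```

The hard-coded structure is exactly Hoss's point: nothing in the schema alone tells you to stop at companyName, or that the two mapping types deserve separate fields.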
Even the last 20% isn't really generic; knowing when to re-index a particular document ... the needs of a system where individual people update JobPostings one at a time are very different from a system where JobPostings are bulk imported thousands at a time ... it's hard to write a useful indexer that can function efficiently in both cases. Even in the first case, dealing with individual document updates where the primary JobPosting data changes is only the common problem; there are still the less-common situations where a Company name changes and *all* of the associated Job Postings need to be reindexed ... for small indexes it might be worthwhile to just rebuild the index from scratch, for bigger indexes you might need a more complex solution for dealing with this situation.

The advice i give people at CNET when they need to build a Solr index is:

1) start by deciding what the minimum freshness is for your data ... ie: what is the absolute longest you can live with needing to wait for data to be added/deleted/updated in your Solr index once it's been added/deleted/modified in your DB.

2) write a function that can generate a Solr Document from an instance of your data (be it a bean, a DB row, whatever you've got)

3) write a simple wrapper program that iterates over all of your data, and calls the function from #2

If #3 takes less time to run than #1 - cron it to rebuild the index from scratch over and over again and use snapshooter and snappuller to expose it to the world ... if #3 takes longer than #1, then look at ways to more systematically decide which docs should be updated, and how.

-Hoss
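The three steps above can be sketched roughly as follows (a hedged illustration: the document builder, the submit callback, and the freshness comparison are placeholders, not a real Solr client):

```python
import time


def build_document(row):
    """Step 2: turn one instance of your data (a DB row, a bean,
    whatever you've got) into one flat document."""
    return {"id": row["jobPostId"], "title": row["title"]}


def rebuild_index(all_rows, submit):
    """Step 3: a dumb wrapper that walks every row, builds a document,
    and hands it to submit(). Returns elapsed seconds so the result
    can be compared against the step-1 freshness window."""
    start = time.monotonic()
    for row in all_rows:
        submit(build_document(row))
    return time.monotonic() - start


def rebuild_fits_freshness(elapsed_secs, freshness_secs):
    """Step 1's test: if a full rebuild finishes inside the freshness
    window you can live with, just cron the rebuild from scratch;
    otherwise you need an incremental update strategy."""
    return elapsed_secs < freshness_secs
```

In a real setup, `submit` would post the document to Solr's update handler and the cron job would use snapshooter/snappuller as Hoss describes; the sketch only shows the shape of the decision.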
Re: Simple Faceted Searching out of the box
Amen Hoss. I appreciated you explaining it in terms I can understand, jobs. Makes it easier for me to learn. What you are saying is right on with what I'm trying to understand. Right now I have simple Lucene indexes that are basically re-created once daily, and that simply isn't doing the job for about 30% of my content. I'm learning a framework called Model-Glue Unity that uses Reactor, which is an ORM. I'll have to think about how I might be able to make that work. But as you say, not all relationships are equal. For indexing news articles, for instance, I want the article and all reader comments, photos, links, and multimedia files associated with the article to be indexed together as one entity, so that if Chris Hostetter commented on the "high cost of heating oil in Maine" article, I can find the article by searching on your name, etc. Have a great weekend and thanks for all the help. Tim
Re: Simple Faceted Searching out of the box
For those using PHP to interface with Solr, can you explain to me how your PHP code interacts with it? Does PHP build a query string manually and request a URL like this:

http://localhost:8983/solr/select?q=vertical%3Ajobs+accounting&version=2.1&start=0&rows=10&fl=&qt=standard&stylesheet=&indent=on&explainOther=&hl.fl=

for example, and then use some PHP command to read the page and parse it? I'm not much of a programmer, but I do know ColdFusion, so I'm trying to apply the PHP principles to CF. Thanks for any and all help. Tim

On 9/10/06, Erik Hatcher [EMAIL PROTECTED] wrote:

On Sep 9, 2006, at 9:09 AM, Tim Archambault wrote: I need to understand this then. Thanks. I want to use Solr for our newspaper website and this would be a great way to break out content. Kind of greys the lines between what is search and what is browsing categories, which is a great thing actually. Thanks for the help.

"greys the lines" indeed. there isn't any difference between search and browse in my view now. let's just call it findability :) (by the way, Ambient Findability is a fantastic book) Erik
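That is essentially how any client talks to Solr: build the query string, fetch the URL over HTTP, and parse the XML response. A hedged sketch of the query-string step in Python (host and port are the example defaults from the message above; any language with an HTTP client, including PHP or ColdFusion's CFHTTP, would do the same):

```python
from urllib.parse import urlencode


def solr_select_url(base, params):
    """Build a Solr /select URL from a dict of request parameters.
    urlencode handles the escaping (e.g. ':' becomes %3A)."""
    return base + "/select?" + urlencode(params)


url = solr_select_url("http://localhost:8983/solr", {
    "q": "vertical:jobs accounting",  # the user's query plus a field constraint
    "version": "2.1",
    "start": 0,     # paging offset
    "rows": 10,     # page size
    "qt": "standard",
    "indent": "on",
})
```

The response from that URL is plain XML, so the "some command to read a webpage" half is just an HTTP GET followed by an XML parse in whatever language you use.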
Re: Simple Faceted Searching out of the box
: What is faceted browsing? Maybe an example of a site interface

Whoops! ... sorry about that, I tend to get ahead of myself. The examples Erik pointed out are very representative, but there are more subtle ways faceted searching can come into play -- for example, if you look at these two search results...

http://shopper-search.cnet.com/search?q=gta
http://shopper-search.cnet.com/search?q=ipod

...the categories in the left nav change based on what you search on, because we treat category as a facet, and the individual categories as possible constraints ... we don't show the user the exact count of how many products match in each category, but we use that information to determine the order of the categories (or whether we should include a category in the list at all).

: website and this would be a great way to break out content. Kind of greys
: the lines between what is search and what is browsing categories, which is a
: great thing actually. Thanks for the help.

Even without facets, browsing a set of documents is just a search for all documents (or, depending on who you talk to: searching is just browsing with a special user-entered constraint on the text facet) -Hoss
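The category-ordering trick described above can be sketched in a few lines of Python (the counts and category names are made up; the real numbers would come from the facet counts in the search response):

```python
def order_categories(facet_counts, min_count=1):
    """Use per-category match counts to order the left-nav list and to
    drop categories with too few matches, without ever showing the raw
    counts to the user."""
    kept = [(cat, n) for cat, n in facet_counts.items() if n >= min_count]
    kept.sort(key=lambda item: item[1], reverse=True)  # busiest category first
    return [cat for cat, _ in kept]
```

So two different queries naturally produce two different left navs, because each query produces its own set of counts.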
Re: Simple Faceted Searching out of the box
On Sep 9, 2006, at 8:15 AM, Tim Archambault wrote: What is faceted browsing? Maybe an example of a site interface that is using it would be good. Dumb question, I know.

Faceted browsing is like this: http://shopper.cnet.com/ and http://www.nines.org/collex

In Collex, the "constrain further" boxes are the facets. Clicking on them adds them to your constraints. The idea is to divide the documents in the index into distinct buckets (or sets) and show the counts of how many results are in each set.

Erik
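The bucket-and-count idea Erik describes is simple enough to sketch directly (an illustrative Python toy, not how Solr implements it internally; field and document values are invented):

```python
from collections import Counter


def facet_counts(docs, field):
    """Divide documents into buckets by the value(s) of one field and
    count how many results land in each bucket."""
    counts = Counter()
    for doc in docs:
        values = doc.get(field, [])
        if not isinstance(values, list):  # allow single- or multi-valued fields
            values = [values]
        counts.update(values)
    return dict(counts)
```

Clicking a facet value then just means adding a constraint (e.g. `category = "Games"`) and re-running the search over the smaller set.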
Re: Simple Faceted Searching out of the box
I need to understand this then. Thanks. I want to use Solr for our newspaper website and this would be a great way to break out content. Kind of greys the lines between what is search and what is browsing categories, which is a great thing actually. Thanks for the help. Tim
Simple Faceted Searching out of the box
Hey everybody, I just wanted to officially announce that as of the solr-2006-09-08.zip nightly build, Solr supports some simple Faceted Searching options right out of the box. Both the StandardRequestHandler and DisMaxRequestHandler now support query params for specifying simple queries to use as facet constraints, or fields in your index you wish to use as facets - generating a constraint count for each term in the field. All of these params can be configured as defaults when registering the RequestHandler in your solrconfig.xml.

Information on what the new facet parameters are, how to use them, and what types of results they generate can be found in the wiki...

http://wiki.apache.org/solr/SimpleFacetParameters
http://wiki.apache.org/solr/StandardRequestHandler
http://wiki.apache.org/solr/DisMaxRequestHandler

...as always: feedback, comments, suggestions, and general discussion are strongly encouraged :)

-Hoss
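To make the announcement concrete, here is a hedged sketch of what a request using the new params might look like as a query string (the `facet`, `facet.field`, and `facet.query` parameter names are from the SimpleFacetParameters wiki page; the field name and query values are invented examples):

```python
from urllib.parse import urlencode

# A list of pairs, since facet params can legitimately repeat.
params = [
    ("q", "title:nurse"),
    ("facet", "true"),
    ("facet.field", "category"),          # one constraint count per term in this field
    ("facet.query", "price:[0 TO 100]"),  # an arbitrary query used as a facet constraint
]
query_string = urlencode(params)
```

Appending that query string to `http://localhost:8983/solr/select?` would return the normal search results plus the facet counts.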