Re: Solr Benchmarks
Hi Walter, Thunderbird shows that there is an attachment to this message in the message list, but when I view the message, no attachment is available. Could you try sending this attachment again? Thanks --Joachim Walter Underwood wrote: I've done some testing using JMeter. I followed the instructions in the JMeter FAQ for "How do I use external data files in my test scripts?" http://wiki.apache.org/jakarta-jmeter/JMeterFAQ I'm attaching the script I built with this. A few notes:
Re: [Newbie] Solr Setup
If you have deployed solr as a root application, tomcat may be getting confused with the /admin/ url, thinking that it is the tomcat administration app. If you have it deployed, I would rename the /admin/ app to be /tadmin/ or something to distinguish from the solr /admin/ directory. --Joachim Panayiotis Papadopoulos wrote: It prompts for HTTP authorization asking for password for Admin Realm
Re: Fixed first hits -> custom RequestHandler?
How about a sortOrder field? Then you can sort by "sortOrder, score". If you want to promote a book that might not be in the result set, you'd OR the featured books in with the query. --Joachim Otis Gospodnetic wrote: Hello, I have a situation where I want certain documents to appear at the top of the hit list for certain searches, regardless of their score. One can think of it as the ads right on top of Google's search results (but I'm not dealing with ads). Example: If I'm searching books in a bookstore, and a person is searching for "lucene", the owner of the bookstore may want to promote the recently published "Lucene in Action" instead of some other book about Lucene, so he wants any search for "lucene" or "java search" to put the link to "Lucene in Action" on top. Is there a good way to accomplish this in Solr? My initial thoughts are that it would be best to have an external store, maybe even a Lucene index. This store would host the data to display on top of hits, as well as keywords/phrases that would have to match user's search terms. A custom RequestHandler would then perform a regular search (a la any of the existing RequestHandlers), plus pull the data from this side store, and stick those in the response. Is this a good candidate for a custom RequestHandler? Thanks, Otis
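The sortOrder suggestion above can be illustrated without a running Solr instance. The following is a minimal sketch in plain Python over in-memory hits. The field names `sortOrder` and `score` come from the message; the sample documents, the convention that a lower `sortOrder` means "featured", the default value, and the helper name `rank_hits` are all assumptions made for illustration.

```python
DEFAULT_SORT_ORDER = 1000  # assumed value for documents that are not featured


def rank_hits(hits):
    """Order hits by sortOrder ascending, then by score descending."""
    return sorted(
        hits,
        key=lambda h: (h.get("sortOrder", DEFAULT_SORT_ORDER), -h["score"]),
    )


# Hypothetical result set for a search on "lucene".
hits = [
    {"id": "book-17", "title": "Some Other Lucene Book", "score": 9.1},
    {"id": "book-42", "title": "Lucene in Action", "score": 4.3, "sortOrder": 1},
    {"id": "book-08", "title": "Java Search Basics", "score": 6.5},
]

ranked_ids = [h["id"] for h in rank_hits(hits)]
print(ranked_ids)
```

The featured book comes first despite its lower score, and the remaining hits keep their relevance order. In Solr itself the equivalent would be a sort on sortOrder ascending followed by score descending; the exact request syntax depends on the Solr version.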
Re: Simple Faceted Searching out of the box
I think you will find that this architecture is quite common. What commercial packages provide (remember you are getting this for free!) are the tools for managing the dynamic export of data out of your database into the full-text search engine. Solr provides a very easy way to do this, but yes, you have to do some programming to automate it. There are two common ways of doing this: 1) write a component that periodically checks for new/updated database content and submits it to solr. 2) write a trigger in the database that immediately posts to solr (I would use JMS or some other asynchronous messaging system for this). I'm sure there are other solutions. When/if MySQL full text search is as good as solr/lucene, you can cut out one of the steps. I could see a component added to solr that did #1 above for you. MG4j has a simple loader that takes a SQL query and indexes the result (JdbcDocumentCollection). For Solr, you'd want to be able to handle multi-valued fields, which complicates things. If this architecture bothers technical folks, they either are accustomed to using very expensive software, or haven't been doing this very long. Of course, I am trying to figure out a way to make Solr more like a database, so there you go... --Joachim Tim Archambault wrote: Okay, I'll use an example. A recruitment (jobs) customer goes onto our website and posts an online job posting to our newspaper website. Upon insert into the database, I need to generate an xml file to be sent to SOLR to ADD as a record to the search engine. Same goes for an edit, my database updates the record and then I have to send an ADD statement to Solr again to commit my change. 2x the work. 
I've been talking with other papers about Solr and I think what bothers many is that there is a deposit of information in a structured database here [named A], then we have another set of basically the same data over here [named B] and they don't understand why they have to manage two different sets of data [A & B] that are virtually the same thing. Many foresee a maintenance nightmare. I've come to the conclusion that there's somewhat of a disconnect between what a database does and what a search engine does. I accept that the redundancy is necessary given the very different tasks that each performs [keep in mind I'm still naive to the programming details here, I understand conceptually]. In writing this to you another thought came to mind. Maybe there are alternative ways to inject records into Solr outside the bounds of the cygwin and CURL examples I've been using. Maybe that is the question we need to be asking. What are some alternative ways to populate Solr? Enough said, it's Friday afternoon. Have a great weekend. Tim On 9/22/06, Erik Hatcher <[EMAIL PROTECTED]> wrote: On Sep 22, 2006, at 2:45 PM, Tim Archambault wrote: > I believe there's a way to access MSSQL, MySQL etc. directly with Lucene, but not sure how to do this with SOLR. Nope. Lucene is a pure search engine, with no hooks to databases, or document parsers, etc. Lots of folks have built these kinds of things on top of Lucene, but the Lucene core is purely the text engine. How would you envision communicating with Solr with a database in the picture? How would the entire database be initially indexed? How would changes to the database trigger Solr updates? I'm not quite clear on what it would mean for Solr to work with a database directly so I'm curious. Erik
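Option 1 from the reply in this thread (a component that periodically checks for new or updated database content and submits it to Solr) might be sketched roughly as follows. This is only an illustration under stated assumptions: the `jobs` table, its columns, the integer `updated_at` timestamp, and the function names are hypothetical, and an in-memory SQLite database stands in for the real one. The `<add><doc><field name="...">` layout is Solr's XML update format; the HTTP POST and the commit are deliberately left out.

```python
import sqlite3
import xml.etree.ElementTree as ET


def fetch_changed_rows(conn, since):
    """Return rows modified after the `since` timestamp (hypothetical schema)."""
    cur = conn.execute(
        "SELECT id, title, body, updated_at FROM jobs "
        "WHERE updated_at > ? ORDER BY updated_at",
        (since,),
    )
    return cur.fetchall()


def build_add_xml(rows):
    """Build a Solr <add> message with one <doc> per database row."""
    add = ET.Element("add")
    for job_id, title, body, _updated_at in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in (("id", job_id), ("title", title), ("body", body)):
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Demo: an in-memory database standing in for the real job-posting database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER, title TEXT, body TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?)",
    [(1, "Editor", "Copy desk opening", 100), (2, "Reporter", "City beat", 250)],
)

last_run = 200  # assumed timestamp of the previous polling pass
changed = fetch_changed_rows(conn, last_run)
payload = build_add_xml(changed)
print(payload)
```

A scheduler would call these two functions on each pass, post the payload to the Solr update URL, send a commit, and then record the newest `updated_at` value it saw as the next `last_run`.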
Re: relational design in solr?
Chris, I think what I am trying to do is actually much simpler than what you are talking about here. I do plan on returning document ids and retrieving full entity data from the database- solr would just be used for the search, not for results display. The problem is that some data cannot be "flattened", for example when a document has repeating fields that are complex types, such as address. The best example I can think of is a resume database. You could certainly just put the whole resume document into the text index and do full text searches. But to answer the question of what people received a Harvard MBA in the last 10 years and have worked at Intel in the last 5 years, you have to correlate the years of attendance with the schoolName entry. Otherwise you might be getting years for some other education/work history entry. By adding an objType field and combining search results, you can be sure that the year/schoolName query matched a unique education record. The tricky bit is in getting a list of field values (e.g. foreign keys, which are essentially facets) for a result set very quickly. If this can be done, figuring out a generic way of specifying multiple searches and relationships between result sets (without reinventing SQL) becomes the challenge. We'll see. I have my doubts that it will work for any but the smallest of collections, which ours certainly isn't. Thanks --Joachim Chris Hostetter wrote: While it's certainly possible to "join" the results of multiple indexes, i would do so only when absolutely necessary -- in my experience the only time i've found that it makes sense, is when one aspect of the data changes extremely rapidly compared to everything else, making complex reindexing a pain, but reindexing just the changed data in its own index is a lot more feasible. 
As a rule of thumb, when building "paginated" style search applications, I would advise people to try and flatten their index as much as possible, so that the application can do one "user query" (based on the user's input) to get a single page of results, and then use the uniqueKeys from that page of results to lookup ancillary data from any other indexes (or databases that you need) -- the key being that all the data you want to search on, and all the data you need to sort are in the index, but other data you need to return to the user can come from other sources. If you find yourself wanting to "join" two indexes for the purposes of matching or sorting, the amount of work you wind up doing tends to be prohibitive on really large indexes -- and if your indexes aren't that large, it would probably just be easier to put everything in one index and rebuild it frequently. : I am trying to integrate solr search results with results from a rdbms : query. It's working ok, but fairly complicated due to large size of : the results from the database, and many different sort requirements. : : I know that solr/lucene was not designed to intelligently handle : multiple document types in the same collection, i.e. provide join : features, but I'm wondering if anyone on this list has any thoughts on : how to do it in lucene, and how it might be integrated into a custom : solr deployment. I can't see going back to vanilla lucene after solr! : : My basic idea is to add an objType field that would be used to define a : "table". There would be one main objType, any related objTypes would : have a field pointing back to the main objs via id, like a foreign key. : : I'd run multiple parallel searches and merge the results based on : foreign keys, either using a Filter or just using custom code. I'm : anticipating that iterating through the results to retrieve the foreign : key values will be too slow. 
: : Our data is highly textual, temporal and spatial, which pretty much : correspond to the 3 tables I would have. I can de-normalize a lot of : the data, but the combination of times, locations and textual : representations would be way too large to fully flatten. : : I'm about to start experimenting with different strategies, and I would : appreciate any insight anyone can provide. Would the faceting code help : here somehow? -Hoss
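The objType idea discussed in this thread can be made concrete with a small in-memory sketch of the resume example: run one search per record type, collect the foreign keys, and intersect them. Everything here is an assumption for illustration (the field names other than `objType` and `schoolName`, the sample records, the `search` helper, and the fixed `CURRENT_YEAR`); it models the merge logic only and does not use any Lucene or Solr API.

```python
def search(index, obj_type, predicate):
    """Return the set of foreign keys (personId) of matching records of one objType."""
    return {
        doc["personId"]
        for doc in index
        if doc["objType"] == obj_type and predicate(doc)
    }


# Hypothetical index: each education or work entry is its own document.
index = [
    {"objType": "education", "personId": 1, "schoolName": "Harvard", "degree": "MBA", "year": 2001},
    {"objType": "work", "personId": 1, "employer": "Intel", "endYear": 2005},
    {"objType": "education", "personId": 2, "schoolName": "Harvard", "degree": "MBA", "year": 1985},
    {"objType": "education", "personId": 2, "schoolName": "MIT", "degree": "BS", "year": 2001},
    {"objType": "work", "personId": 2, "employer": "Intel", "endYear": 2006},
]

CURRENT_YEAR = 2006  # assumed reference year for "last N years"

harvard_mba = search(
    index,
    "education",
    lambda d: d["schoolName"] == "Harvard"
    and d["degree"] == "MBA"
    and d["year"] >= CURRENT_YEAR - 10,
)
intel_recent = search(
    index,
    "work",
    lambda d: d["employer"] == "Intel" and d["endYear"] >= CURRENT_YEAR - 5,
)

# Merge the two result sets on the foreign key.
matches = harvard_mba & intel_recent
print(sorted(matches))
```

Person 2 illustrates the flattening problem: a single flattened document would contain "Harvard", "MBA" and the year 2001 (from the unrelated MIT entry) and would match incorrectly, while the per-record search excludes it. The cost concern raised in the thread remains, since each search must yield its full set of foreign keys before the intersection.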
relational design in solr?
I am trying to integrate solr search results with results from a rdbms query. It's working ok, but fairly complicated due to large size of the results from the database, and many different sort requirements. I know that solr/lucene was not designed to intelligently handle multiple document types in the same collection, i.e. provide join features, but I'm wondering if anyone on this list has any thoughts on how to do it in lucene, and how it might be integrated into a custom solr deployment. I can't see going back to vanilla lucene after solr! My basic idea is to add an objType field that would be used to define a "table". There would be one main objType, any related objTypes would have a field pointing back to the main objs via id, like a foreign key. I'd run multiple parallel searches and merge the results based on foreign keys, either using a Filter or just using custom code. I'm anticipating that iterating through the results to retrieve the foreign key values will be too slow. Our data is highly textual, temporal and spatial, which pretty much correspond to the 3 tables I would have. I can de-normalize a lot of the data, but the combination of times, locations and textual representations would be way too large to fully flatten. I'm about to start experimenting with different strategies, and I would appreciate any insight anyone can provide. Would the faceting code help here somehow? Thanks --Joachim
Re: Facet performance with heterogeneous 'facets'?
Michael Imbeault wrote: Also, are there any plans to add an option not to run a facet search if the result set is too big? To avoid 40-second queries if the docset is too large... You could run one query with facet=false, check the result size and then run it again (should be fast because it is cached) with facet=true&rows=0 to get facet results only. I would think that the decision to run/not run facets would be highly custom to your collection and not easily developed as a configurable feature. --Joachim
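The two-pass approach described above could look roughly like this on the client side. It is a sketch under assumptions: `run_query` is a hypothetical callable that sends parameters to Solr and returns a parsed response, the response keys (`numFound`, `docs`, `facets`) are a simplified stand-in rather than Solr's actual response layout, and the threshold is arbitrary. Only the `facet` and `rows` parameters come from the message; any `facet.field` parameters are omitted for brevity.

```python
def faceted_search(run_query, q, max_docs_for_facets=10000):
    """Run q without facets first; request facets only if the result set is small enough."""
    first = run_query({"q": q, "facet": "false"})
    facets = None
    if first["numFound"] <= max_docs_for_facets:
        # Second pass: same query, no rows, facet counts only.
        second = run_query({"q": q, "facet": "true", "rows": "0"})
        facets = second["facets"]
    return first["docs"], facets


# Fake backend for demonstration; it records every call it receives.
calls = []


def fake_run_query(params):
    calls.append(params)
    num_found = 50 if params["q"] == "cancer" else 500000
    response = {"numFound": num_found, "docs": ["doc1", "doc2"]}
    if params.get("facet") == "true":
        response["facets"] = {"year": {"2005": 30, "2006": 20}}
    return response


docs, facets = faceted_search(fake_run_query, "cancer")
big_docs, big_facets = faceted_search(fake_run_query, "the")
print(facets, big_facets, len(calls))
```

The small result set triggers a second, facet-only request, while the large one stops after the first query. Where the threshold belongs is exactly the collection-specific decision mentioned in the reply.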
SolrCore as Singleton?
Is there a good reason for implementing SolrCore as a Singleton? We are experimenting with running Solr as a Spring service embedded in our app. Since it is a Singleton we cannot have more than one index (not currently a problem, but could be). I note the comment: // Singleton for now... If there is no specific reason for making it a Singleton, I'd vote for removing this so that the SolrCore(dataDir, schema) constructor could be used to instantiate multiple cores. Seems to me that since the primary usage scenario of solr is access via REST (i.e. no Solr jar/API), the Singleton pattern is not necessary here. --Joachim
Re: Solr now used on Discogs.com
Can you expand on this a bit? "Main search engine" would be the search feature, but not browsing/category listing? Are you using Solr for all data storage and search? Or a RDBMS? If so, what is the split? Cool site! --Joachim Kevin Lewandowski wrote: I just wanted to say thanks to the Solr developers. I'm now using Solr for the main search engine on Discogs.com. I've been through five revisions of the search engine and this was definitely the least painful. Solr gives me the power of Lucene without having to deal with the guts. It made for a much faster implementation than all other search packages I've worked with. Some stats: there are now 1.1 million documents in the index and it handles 200,000 searches per day (on a single-cpu P4 server with 1 gig ram). Kevin
Re: Embarrasing compilation errors with solr-nightly/example
Sounds to me like you are using the JRE and not a JDK. Make sure $JAVA_HOME/lib/tools.jar is in your classpath. --Joachim James Pine wrote: I am trying to walk through the Solr tutorial at: http://incubator.apache.org/solr/tutorial.html and can't seem to get: http://localhost:8983/solr/admin/index.jsp to compile. Here's the top of the error, I've included the rest at the end of the message: HTTP ERROR: 500 Unable to compile class for JSP Generated servlet error: Jun 28, 2006 9:52:42 AM org.apache.jasper.compiler.Compiler generateClass SEVERE: Javac exception Unable to find a javac compiler; com.sun.tools.javac.Main is not on the classpath. Perhaps JAVA_HOME does not point to the JDK I believe my workstation is set up to run/compile java applications because it's part of my job ;o) but apparently something is amiss. I'm running on a windows box, using cygwin. My JAVA_HOME and CLASSPATH environment variables are set up properly AFAIK and I'm running the 1.5.0_04 JDK. The rest of the stacktrace appears below. Thanx for your help. 
JAMES
at org.apache.tools.ant.taskdefs.compilers.CompilerAdapterFactory.getCompiler(CompilerAdapterFactory.java:105)
at org.apache.tools.ant.taskdefs.Javac.compile(Javac.java:929)
at org.apache.tools.ant.taskdefs.Javac.execute(Javac.java:758)
at org.apache.jasper.compiler.Compiler.generateClass(Compiler.java:382)
at org.apache.jasper.compiler.Compiler.compile(Compiler.java:472)
at org.apache.jasper.compiler.Compiler.compile(Compiler.java:451)
at org.apache.jasper.compiler.Compiler.compile(Compiler.java:439)
at org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:511)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:295)
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1530)
at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:633)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1482)
at org.mortbay.http.HttpServer.service(HttpServer.java:909)
at org.mortbay.http.HttpConnection.service(HttpConnection.java:820)
at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:986)
at org.mortbay.http.HttpConnection.handle(HttpConnection.java:837)
at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:245)
at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
Generated servlet error: Jun 28, 2006 9:52:42 AM org.apache.jasper.compiler.Compiler generateClass
Generated servlet error: SEVERE: Env: Compile:
javaFileName=/C:/DOCUME~1/user/LOCALS~1/Temp/Jetty__8983__solr//org/apache/jsp/admin\index_jsp.java Generated servlet error: classpath=/C:/Documents%20and%20Settings/user/Local%20Settings/Temp/Jetty__8983__solr/webapp/WEB-INF/classes/;/C:/Documents%20and%20Settings/user/Local%20Settings/Temp/Jetty__8983__solr/webapp/WEB-INF/lib/lucene-core-nightly.jar;/C:/Documents%20and%20Settings/user/Local%20Settings/Temp/Jetty__8983__solr/webapp/WEB-INF/lib/lucene-highlighter-nightly.jar;/C:/Documents%20and%20Settings/user/Local%20Settings/Temp/Jetty__8983__solr/webapp/WEB-INF/lib/lucene-snowball-nightly.jar;/C:/Documents%20and%20Settings/user/Local%20Settings/Temp/Jetty__8983__solr/webapp/WEB-INF/lib/xpp3-1.1.3.4.O.jar;C:\DOCUME~1\user\LOCALS~1\Temp\Jetty__8983__solr;C:\Documents and Settings\user\Local Settings\Temp\Jetty__8983__solr\webapp\WEB-INF\classes;C:\Documents and Settings\user\Local Settings\Temp\Jetty__8983__solr\webapp\WEB-INF\lib\lucene-core-nightly.jar;C:\Documents and Settings\user\Local Settings\Temp\Jetty__8983__solr\webapp\WEB-INF\lib\lucene-highlighter-nightly.jar;C:\Documents and Settings\user\Local Settings\Temp\Jetty__8983__solr\webapp\WEB-INF\lib\lucene-snowball-nightly.jar;C:\Documents and Settings\user\Local Settings\Temp\Jetty__8983__solr\webapp\WEB-INF\lib\xpp3-1.1.3.4.O.jar;C:\Program Files\Java\jre1.5.0_06\lib\ext\dnsns.jar;C:\Program Files\Java\jre1.5.0_06\lib\ext\jai_codec.jar;C:\Program Files\Java\jre1.5.0_06\lib\ext\jai_core.jar;C:\Program Files\Java\jre1.5.0_06\lib\ext\mlibwrapper_jai.jar;C:\Program Files\Java\jre1.5.0_06\lib\ext\sunjce_provider.jar;C:\Program Files\Java\jre1.5.0_06\lib\ext\sunpkcs11.jar;C:\solr-nightly\example\start.jar;C:\solr-nightly\example\lib\org.mortbay.jetty.jar;C:\solr-nightly\example\lib\jav
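The JRE-versus-JDK diagnosis in the reply can be checked mechanically. The sketch below tests whether a directory contains `lib/tools.jar`, the file named in the reply; JDKs of that era ship it and JREs do not. The helper name `looks_like_jdk` is hypothetical, and the demo uses throwaway temporary directories rather than real Java installs.

```python
import os
import tempfile


def looks_like_jdk(java_home):
    """Return True if java_home contains lib/tools.jar, as JDKs before Java 9 do."""
    return os.path.isfile(os.path.join(java_home, "lib", "tools.jar"))


# Demo with temporary directories standing in for a JDK and a JRE install.
with tempfile.TemporaryDirectory() as fake_jdk, tempfile.TemporaryDirectory() as fake_jre:
    os.makedirs(os.path.join(fake_jdk, "lib"))
    open(os.path.join(fake_jdk, "lib", "tools.jar"), "w").close()
    jdk_ok = looks_like_jdk(fake_jdk)
    jre_ok = looks_like_jdk(fake_jre)

print(jdk_ok, jre_ok)
```

In practice one would call `looks_like_jdk(os.environ["JAVA_HOME"])` in the same shell that starts Jetty, since the classpath in the error output is built from whichever Java installation that process actually picked up.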
Re: embedding solr in a webapp?
Certainly running a load balanced solr cluster will be our first approach, I was just wondering if there were any glaring problems with running solr embedded in each webapp node. Sounds like there are not. As for the secondary db lookup, those will be cached, and are necessary to filter results further based on time (schedule) restrictions. We will probably also implement a custom ResponseWriter that just returns a comma separated list of ids- the IPC time is just one component of the overhead, xml parsing is another. Thanks --Joachim Yonik Seeley wrote: On 6/7/06, Joachim Martin <[EMAIL PROTECTED]> wrote: We are looking at running read-only solr nodes embedded in our webapp nodes. This would give us the additional features of solr over lucene, but would keep it in memory and reduce the overhead of http/xml transport of results. Looks like we would just create a request handler and call handleRequest(req,rsp), and deal with the search results DocList ourselves. Yes, that should work fine. Would there be any reason why this sort of setup would prohibit the use of index replication in a master/slave setup? No, that should still work fine. Does this make sense? As you might guess, speed is more important than flexibility. It can make sense in certain cases... but it does cut down on your flexibility to size the search tier independently of the appserver tier. Eliminating the IPC might get you 5% more performance, but at what development & flexibility cost? It's easier to buy a slightly faster box, or simply add another server if you are running behind a load-balancer. You know your situation best of course :-) We are using solr for a content search, returning ids, and doing a secondary db lookup for extended entity information. You go through the trouble of avoiding one IPC call, but you add it back in with the DB lookup... are the fields too large to store in Lucene? -Yonik
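The comma-separated id response mentioned in the reply above keeps the client side trivial, which is the point of avoiding XML parsing. The sketch below shows only the wire format and its round trip; `write_ids` and `read_ids` are hypothetical helpers, and nothing here implements or mirrors Solr's actual ResponseWriter interface.

```python
def write_ids(doc_ids):
    """Serialize document ids as a single comma-separated line."""
    return ",".join(str(doc_id) for doc_id in doc_ids)


def read_ids(body):
    """Parse a comma-separated id line back into a list of ints."""
    return [int(part) for part in body.split(",")] if body else []


body = write_ids([101, 7, 42])
ids = read_ids(body)
print(body, ids)
```

The ids keep their ranked order, so the secondary database lookup can fetch the entities and present them in the same sequence.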
embedding solr in a webapp?
Hi, We are looking at running read-only solr nodes embedded in our webapp nodes. This would give us the additional features of solr over lucene, but would keep it in memory and reduce the overhead of http/xml transport of results. Looks like we would just create a request handler and call handleRequest(req,rsp), and deal with the search results DocList ourselves. Would there be any reason why this sort of setup would prohibit the use of index replication in a master/slave setup? Does this make sense? As you might guess, speed is more important than flexibility. We are using solr for a content search, returning ids, and doing a secondary db lookup for extended entity information. Thanks --Joachim