corrupt solr index on ec2
Hi, I've been running Solr 1.3 on an EC2 instance for a couple of weeks and I've had some stability issues. It seems like I need to bounce the app once a day. That I could live with and ultimately maybe troubleshoot, but what's more disturbing is that three times in the last two weeks my index has been corrupted when FileNotFoundExceptions started to appear. I'm running in Jetty and had my index on the local file system until I lost the index the first time. Then I moved it to my mounted EBS volume so I could restore from a snapshot if needed. I'm wondering if perhaps there are issues with the locking mechanism on either the local directory (which is really a virtual instance) or the mounted XFS volume. Has anyone seen this, or have suggestions re the cause? I'm using the single lockType. I'm running a single Solr instance that gets frequent updates from multiple threads, and commits about every hour. A few things I see in the logs:

- From time to time I see write lock timeouts:

SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SingleInstanceLock: write.lock

- I've seen OOM exceptions during warming. I've changed maxWarmingSearchers=1, which I suspect will do the trick.

- Then finally, this is what I found in the logs today when the index got corrupt:

Oct 29, 2008 12:18:39 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true)
Oct 29, 2008 12:18:41 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: java.io.FileNotFoundException: /var/local/solr/data/production/index/_2rv.fdt (No such file or directory)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:960)
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:368)
        at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:77)
        at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:226)
        at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: java.io.FileNotFoundException: /var/local/solr/data/production/index/_2rv.fdt (No such file or directory)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
        at org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor.<init>(FSDirectory.java:552)
        at org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:582)
        at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:488)
        at org.apache.lucene.index.FieldsReader.<init>(FieldsReader.java:77)
        at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:355)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:304)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:226)
        at org.apache.lucene.index.MultiSegmentReader.<init>(MultiSegmentReader.java:56)
        at org.apache.lucene.index.ReadOnlyMultiSegmentReader.<init>(ReadOnlyMultiSegmentReader.java:27)
        at
Re: Highlighting and fields
Hi Lars, thanks for that: it works great. BR, Christophe Lars Kotthoff wrote: I'm doing the following query: q=text:abc AND type:typeA And I ask it to return highlighting (query.setHighlight(true);). The search term for the type field (typeA) is also highlighted in the text field. Any way to avoid this? Use setHighlightRequireFieldMatch(true) on the query object [1]. Lars [1] http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/SolrQuery.html#setHighlightRequireFieldMatch(boolean)
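As a concrete illustration of that fix, a minimal SolrJ sketch; the server URL and the highlighted field are assumptions for the example, not details from the thread:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class HighlightMatchExample {
      public static void main(String[] args) throws Exception {
        // Hypothetical core URL; adjust for your deployment.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("text:abc AND type:typeA");
        query.setHighlight(true);
        query.addHighlightField("text");
        // Only mark terms that matched in the highlighted field itself,
        // so type:typeA no longer lights up inside the text field.
        query.setHighlightRequireFieldMatch(true);
        QueryResponse rsp = server.query(query);
        System.out.println(rsp.getHighlighting());
      }
    }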
How to get the min and max values from facets?
Hello, I'm using Solr 1.3. I would like to get only the minimum and maximum values from a facet. In fact I'm using a range to get the results: [value TO value], and I don't need the full facet list in my XML results (which could be more than a hundred thousand values)... I just have to display the range (minimum and maximum values) from a facet. Is there any way to do that? I found the new statistics component; see the link: http://wiki.apache.org/solr/StatsComponent But it's for Solr 1.4. Does anyone have any idea? Thank you! Vincent
Performance Lucene / Solr
Hello, I have been validating Solr 1.3 for about 3 weeks now... My goal is to migrate from Lucene to Solr because of the much better plugins and search functions. Right now I am stress testing the performance and sending 2500 search requests via JSON protocol from my PHPUnit test case. All search requests are different, so caching doesn't do it for me. Right now our old Lucene JSPs are about 4 times faster than my Solr solution :-( Any chance I can tweak my solrconfig.xml? Greets -Ralf-
Re: Performance Lucene / Solr
On Thu, Oct 30, 2008 at 4:12 PM, Kraus, Ralf | pixelhouse GmbH [EMAIL PROTECTED] wrote: I have been validating Solr 1.3 for about 3 weeks now... My goal is to migrate from Lucene to Solr because of the much better plugins and search functions. Very nice! Right now I am stress testing the performance and sending 2500 search requests via JSON protocol from my PHPUnit test case. All search requests are different, so caching doesn't do it for me. Right now our old Lucene JSPs are about 4 times faster than my Solr solution :-( Well, with Lucene it is an API call in the same JVM in the same web application. With Solr, you are making HTTP calls across the network, serializing requests and de-serializing responses. So the comparison is not exactly apples to apples. Look at what Solr offers -- replication, caching, plugins etc. Will you really need to go over 2500 requests per second? Do you need to be concerned with performance above and beyond that? Will it be easier to scale out to more boxes? -- Regards, Shalin Shekhar Mangar.
Max Number of Facets
Is there a limit on the number of facets that I can create in Solr? (Dynamically generated facets.) -- Jeryl Cook /^\ Pharaoh /^\ http://pharaohofkush.blogspot.com/ Whether we bring our enemies to justice, or bring justice to our enemies, justice will be done. --George W. Bush, Address to a Joint Session of Congress and the American People, September 20, 2001
Re: Performance Lucene / Solr
Mark Miller schrieb: Right now I am stress testing the performance and sending 2500 search requests via JSON protocol from my PHPUnit test case. All search requests are different, so caching doesn't do it for me. Right now our old Lucene JSPs are about 4 times faster than my Solr solution :-( Well, with Lucene it is an API call in the same JVM in the same web application. With Solr, you are making HTTP calls across the network, serializing requests and de-serializing responses. So the comparison is not exactly apples to apples. Look at what Solr offers -- replication, caching, plugins etc. Will you really need to go over 2500 requests per second? Do you need to be concerned with performance above and beyond that? Will it be easier to scale out to more boxes? And have you tried SolrJ without HTTP? Right now I am using these PHP classes to send and receive my requests: - Apache_Solr_Service.php - Responce.php It has the advantage that I don't need to write extra JSP or Java code... Greets -Ralf-
Re: Performance Lucene / Solr
Kraus, Ralf | pixelhouse GmbH wrote: Mark Miller schrieb: Right now I am stress testing the performance and sending 2500 search requests via JSON protocol and from my PHPUnit test case. All search requests are different, so caching doesn't do it for me. Right now our old Lucene JSPs are about 4 times faster than my Solr solution :-( Well, with Lucene it is an API call in the same JVM in the same web application. With Solr, you are making HTTP calls across the network, serializing requests and de-serializing responses. So the comparison is not exactly apples to apples. Look at what Solr offers -- replication, caching, plugins etc. Will you really need to go over 2500 requests per second? Do you need to be concerned with performance above and beyond that? Will it be easier to scale out to more boxes? And have you tried SolrJ without HTTP? Right now I am using these PHP classes to send and receive my requests: - Apache_Solr_Service.php - Responce.php It has the advantage that I don't need to write extra JSP or Java code... Greets -Ralf- I think it will have the disadvantage of being a lot slower though... How were you handling things with Lucene? You must have used Java then? If you even want to get close to that performance I think you need to use non-HTTP embedded Solr.
Re: Performance Lucene / Solr
Shalin Shekhar Mangar wrote: On Thu, Oct 30, 2008 at 4:12 PM, Kraus, Ralf | pixelhouse GmbH [EMAIL PROTECTED] wrote: I have been validating Solr 1.3 for about 3 weeks now... My goal is to migrate from Lucene to Solr because of the much better plugins and search functions. Very nice! Right now I am stress testing the performance and sending 2500 search requests via JSON protocol from my PHPUnit test case. All search requests are different, so caching doesn't do it for me. Right now our old Lucene JSPs are about 4 times faster than my Solr solution :-( Well, with Lucene it is an API call in the same JVM in the same web application. With Solr, you are making HTTP calls across the network, serializing requests and de-serializing responses. So the comparison is not exactly apples to apples. Look at what Solr offers -- replication, caching, plugins etc. Will you really need to go over 2500 requests per second? Do you need to be concerned with performance above and beyond that? Will it be easier to scale out to more boxes? And have you tried SolrJ without HTTP?
Using Solrj
Hi, I am trying to use SolrJ for my web application. I am indexing a table using the @Field annotation tag. Now I need to index or query multiple tables. Like, get all the employees who are managers in the Finance department (interacting with 3 entities). How do I do that? Does anyone have any idea? Thanks
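For context, the @Field annotation the poster mentions maps a Java bean onto a Solr document. A minimal sketch, assuming the SolrJ 1.3 bean-binding API; all bean field names here are hypothetical and must match fields declared in schema.xml:

    import org.apache.solr.client.solrj.beans.Field;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class Employee {
      // Hypothetical fields; each must exist in the Solr schema.
      @Field public String id;
      @Field public String name;
      @Field public String department;
      @Field public String title;

      public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        Employee e = new Employee();
        e.id = "42"; e.name = "Jane Doe";
        e.department = "Finance"; e.title = "Manager";
        server.addBean(e);   // maps the annotated fields to Solr fields
        server.commit();
      }
    }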
Re: Performance Lucene / Solr
On Thu, Oct 30, 2008 at 5:22 PM, Kraus, Ralf | pixelhouse GmbH [EMAIL PROTECTED] wrote: Right now I am using these PHP classes to send and receive my requests: - Apache_Solr_Service.php - Responce.php It has the advantage that I don't need to write extra JSP or Java code... Unfortunately, the PHP client in Solr does not take advantage of the binary response format. It is supported only by SolrJ (the Java client). It can reduce a lot of the overhead associated with JSON/XML parsing. -- Regards, Shalin Shekhar Mangar.
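A sketch of selecting that binary (javabin) response format explicitly in SolrJ; the server URL is an assumption, and depending on the SolrJ version the binary parser may already be the default:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.BinaryResponseParser;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class JavabinExample {
      public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        // Request wt=javabin and parse the compact binary stream,
        // avoiding XML/JSON serialization overhead on both ends.
        server.setParser(new BinaryResponseParser());
        System.out.println(
            server.query(new SolrQuery("*:*")).getResults().getNumFound());
      }
    }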
Re: where's the bottleneck
On Thu, Oct 30, 2008 at 1:02 AM, Barnett, Jeffrey [EMAIL PROTECTED] wrote: I thought it was turned off already. (Lucene vs Solr?) Where do I make this change? Comment out this part in your solrconfig.xml:

    <autoCommit>
      <maxDocs>2</maxDocs>
      <maxTime>4</maxTime>
    </autoCommit>

-Yonik -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Yonik Seeley Sent: Wednesday, October 29, 2008 11:28 PM To: solr-user@lucene.apache.org Subject: Re: where's the bottleneck On Wed, Oct 29, 2008 at 9:48 PM, Barnett, Jeffrey [EMAIL PROTECTED] wrote: Reported import rates start at 70 docs per second and decrease as more records are added. It might just be segment merges (that takes more time as segments grow in size). From the solrconfig.xml I see you have autocommit turned on... try with it off and see if it helps. -Yonik
Re: Performance Lucene / Solr
Mark Miller schrieb: Kraus, Ralf | pixelhouse GmbH wrote: Mark Miller schrieb: Right now I am stress testing the performance and sending 2500 search requests via JSON protocol from my PHPUnit test case. All search requests are different, so caching doesn't do it for me. Right now our old Lucene JSPs are about 4 times faster than my Solr solution :-( Well, with Lucene it is an API call in the same JVM in the same web application. With Solr, you are making HTTP calls across the network, serializing requests and de-serializing responses. So the comparison is not exactly apples to apples. Look at what Solr offers -- replication, caching, plugins etc. Will you really need to go over 2500 requests per second? Do you need to be concerned with performance above and beyond that? Will it be easier to scale out to more boxes? And have you tried SolrJ without HTTP? Right now I am using these PHP classes to send and receive my requests: - Apache_Solr_Service.php - Responce.php It has the advantage that I don't need to write extra JSP or Java code... Greets -Ralf- I think it will have the disadvantage of being a lot slower though... How were you handling things with Lucene? You must have used Java then? If you even want to get close to that performance I think you need to use non-HTTP embedded Solr. Okay okay :-) I am writing a new JSP handler for my requests as we speak :-) I really hope performance will be better than with {wt=javabin} Greets -Ralf-
Re: Using Solrj
hi, There are two sides to this. 1. Indexing (getting data into Solr): SolrJ or DataImportHandler can be used for this. 2. Querying (getting data out of Solr): here you do not have the choice of joining multiple tables. There is only one index in Solr. On Thu, Oct 30, 2008 at 5:34 PM, Raghunandan Rao [EMAIL PROTECTED] wrote: Hi, I am trying to use SolrJ for my web application. I am indexing a table using the @Field annotation tag. Now I need to index or query multiple tables. Like, get all the employees who are managers in the Finance department (interacting with 3 entities). How do I do that? Does anyone have any idea? Thanks -- --Noble Paul
Re: Performance Lucene / Solr
On Thu, Oct 30, 2008 at 8:39 AM, Kraus, Ralf | pixelhouse GmbH [EMAIL PROTECTED] wrote: Okay okay :-) I am writing a new JSP handler for my requests as we speak :-) I really hope performance will be better than with {wt=javabin} What are your requirements for requests/sec, and how many are you getting from Solr now? Is the bottleneck Solr, the network, or PHP? If Solr and PHP are on the same box, at least run top to see where the CPU is going. If most of the CPU is going to Solr, post some of the URLs you are using to search - perhaps they can be optimized. -Yonik
RE: Using Solrj
Thanks Noble. So you mean to say that I need to create a view according to my query and then index on the view and fetch? -Original Message- From: Noble Paul നോബിള് नोब्ळ् [mailto:[EMAIL PROTECTED]] Sent: Thursday, October 30, 2008 6:16 PM To: solr-user@lucene.apache.org Subject: Re: Using Solrj hi, There are two sides to this. 1. Indexing (getting data into Solr): SolrJ or DataImportHandler can be used for this. 2. Querying (getting data out of Solr): here you do not have the choice of joining multiple tables. There is only one index in Solr. On Thu, Oct 30, 2008 at 5:34 PM, Raghunandan Rao [EMAIL PROTECTED] wrote: Hi, I am trying to use SolrJ for my web application. I am indexing a table using the @Field annotation tag. Now I need to index or query multiple tables. Like, get all the employees who are managers in the Finance department (interacting with 3 entities). How do I do that? Does anyone have any idea? Thanks -- --Noble Paul
Re: Performance Lucene / Solr
All search requests are different, so caching doesn't do it for me. P.S. If caching is not helping you, turn it off. It costs to populate/maintain the cache, so if it's not helping, it's only hurting.
Re: Performance Lucene / Solr
Have you gone through http://wiki.apache.org/solr/SolrPerformanceFactors ? Can you explain a little more about your test case, maybe even share code? I only know a little PHP, but maybe someone else who is better versed might spot something. On Oct 30, 2008, at 8:39 AM, Kraus, Ralf | pixelhouse GmbH wrote: Mark Miller schrieb: Kraus, Ralf | pixelhouse GmbH wrote: Mark Miller schrieb: Right now I am stress testing the performance and sending 2500 search requests via JSON protocol from my PHPUnit test case. All search requests are different, so caching doesn't do it for me. Right now our old Lucene JSPs are about 4 times faster than my Solr solution :-( Well, with Lucene it is an API call in the same JVM in the same web application. With Solr, you are making HTTP calls across the network, serializing requests and de-serializing responses. So the comparison is not exactly apples to apples. Look at what Solr offers -- replication, caching, plugins etc. Will you really need to go over 2500 requests per second? Do you need to be concerned with performance above and beyond that? Will it be easier to scale out to more boxes? And have you tried SolrJ without HTTP? Right now I am using these PHP classes to send and receive my requests: - Apache_Solr_Service.php - Responce.php It has the advantage that I don't need to write extra JSP or Java code... Greets -Ralf- I think it will have the disadvantage of being a lot slower though... How were you handling things with Lucene? You must have used Java then? If you even want to get close to that performance I think you need to use non-HTTP embedded Solr. Okay okay :-) I am writing a new JSP handler for my requests as we speak :-) I really hope performance will be better than with {wt=javabin} Greets -Ralf- -- Grant Ingersoll Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans. http://www.lucenebootcamp.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Using Solrj
Not really. You can explain your use case and it will become clearer. On Thu, Oct 30, 2008 at 6:20 PM, Raghunandan Rao [EMAIL PROTECTED] wrote: Thanks Noble. So you mean to say that I need to create a view according to my query and then index on the view and fetch? -Original Message- From: Noble Paul നോബിള് नोब्ळ् [mailto:[EMAIL PROTECTED]] Sent: Thursday, October 30, 2008 6:16 PM To: solr-user@lucene.apache.org Subject: Re: Using Solrj hi, There are two sides to this. 1. Indexing (getting data into Solr): SolrJ or DataImportHandler can be used for this. 2. Querying (getting data out of Solr): here you do not have the choice of joining multiple tables. There is only one index in Solr. On Thu, Oct 30, 2008 at 5:34 PM, Raghunandan Rao [EMAIL PROTECTED] wrote: Hi, I am trying to use SolrJ for my web application. I am indexing a table using the @Field annotation tag. Now I need to index or query multiple tables. Like, get all the employees who are managers in the Finance department (interacting with 3 entities). How do I do that? Does anyone have any idea? Thanks -- --Noble Paul -- --Noble Paul
ApacheCon Reminder
For those attending ApacheCon in New Orleans next week, the Lucene Search and Machine Learning Birds of a Feather (BOF) will be held Wednesday night. Please indicate your interest at: http://wiki.apache.org/apachecon/BirdsOfaFeatherUs08 Also, note there are a number of Lucene/Solr/Mahout talks on Wednesday, and, of course, it's not too late to sign up for Lucene or Solr training. See the ApacheCon website for more info: http://us.apachecon.com/c/acus2008/ See you there, Grant
Re: Max Number of Facets
On Thu, Oct 30, 2008 at 7:28 AM, Jeryl Cook [EMAIL PROTECTED] wrote: Is there a limit on the number of facets that I can create in Solr? (Dynamically generated facets.) Not really; it's practically limited by CPU and memory, which can vary widely with what the facet fields look like (number of unique terms, whether it's multi-valued, etc.). -Yonik
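For readers new to faceting, a minimal SolrJ sketch of requesting facet counts; the field name and limits are illustrative assumptions:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;

    public class FacetExample {
      public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("category");   // hypothetical facet field
        q.setFacetLimit(100);          // cap the values returned per field
        q.setFacetMinCount(1);         // skip zero-count values
        for (FacetField ff : server.query(q).getFacetFields()) {
          System.out.println(ff.getName() + ": " + ff.getValueCount() + " values");
        }
      }
    }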
Re: Performance Lucene / Solr
Grant Ingersoll schrieb: Have you gone through http://wiki.apache.org/solr/SolrPerformanceFactors ? Can you explain a little more about your test case, maybe even share code? I only know a little PHP, but maybe someone else who is better versed might spot something. I just wrote my JSP script to use SolrJ instead; performance is much, much better now! Greets -Ralf-
Re: Using Solrj
Generally, you need to get your head out of the database world and into the search world to be successful with Lucene. For instance, one of the cardinal tenets of database design is to normalize your data. It goes against every instinct to *denormalize* your data when creating a Lucene index, explicitly so you do NOT have to think in terms of joins or sub-queries. Whenever I start thinking this way, I try to back up and think again. Both your posts indicate to me that you're thinking in database terms. There are no views in Lucene, for instance. You refer to tables. There are no tables in Lucene; there are only documents with various numbers of fields. You could conceivably make your index look like a database by creatively naming your document fields, but that doesn't play to the strengths of Lucene *or* the database. In fact, there is NO requirement that documents have the *same* fields, which is really difficult to get used to when thinking like a DBA. Lucene is designed to search text. Fast and well. It is NOT intended to efficiently manipulate relationships *between* documents. There are various hybrid solutions that people have used. That is, put the data you really need to do text searching on in a Lucene index, along with enough data to be able to get the *rest* of what you need from your database. But it all depends upon the problem you're trying to solve. But as Noble says, all this is too general to be really useful; you need to provide quite a bit more detail about the problem you're trying to solve to get useful recommendations. Best Erick On Thu, Oct 30, 2008 at 8:50 AM, Raghunandan Rao [EMAIL PROTECTED] wrote: Thanks Noble. So you mean to say that I need to create a view according to my query and then index on the view and fetch? -Original Message- From: Noble Paul നോബിള് नोब्ळ् [mailto:[EMAIL PROTECTED]] Sent: Thursday, October 30, 2008 6:16 PM To: solr-user@lucene.apache.org Subject: Re: Using Solrj hi, There are two sides to this. 1. Indexing (getting data into Solr): SolrJ or DataImportHandler can be used for this. 2. Querying (getting data out of Solr): here you do not have the choice of joining multiple tables. There is only one index in Solr. On Thu, Oct 30, 2008 at 5:34 PM, Raghunandan Rao [EMAIL PROTECTED] wrote: Hi, I am trying to use SolrJ for my web application. I am indexing a table using the @Field annotation tag. Now I need to index or query multiple tables. Like, get all the employees who are managers in the Finance department (interacting with 3 entities). How do I do that? Does anyone have any idea? Thanks -- --Noble Paul
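To make the denormalization point concrete, a sketch that flattens what would be a multi-table join into a single Solr document; all field names here are hypothetical:

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class DenormalizeExample {
      public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        // One document carries what would be a 3-table join in SQL:
        // employee + title + department, flattened at index time.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "emp-42");
        doc.addField("name", "Jane Doe");
        doc.addField("title", "Manager");
        doc.addField("department", "Finance");
        server.add(doc);
        server.commit();
        // The join-style question then becomes a flat query:
        //   q=title:Manager AND department:Finance
      }
    }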
Re: replication handler - compression
+1 - the GzipServletFilter is the way to go. Regarding request handlers reading HTTP headers, yeah,... this will improve, for sure. Erik On Oct 30, 2008, at 12:18 AM, Chris Hostetter wrote: : You are partially right. Instead of the HTTP header, we use a request : parameter. (RequestHandlers cannot read HTTP headers). If the param is hmmm, i'm with walter: we shouldn't invent new mechanisms for clients to request compression over HTTP from servers. replication is both special enough and important enough that if we had to add special support to make that information available to the handler on the master we could. but frankly i don't think that's necessary: the logic to turn on compression if the client requests it using Accept-Encoding: gzip is generic enough that there is no reason for it to be in a handler. we could easily put it in the SolrDispatchFilter, or even in a new ServletFilter (i'm guessing i've seen about 74 different implementations of a GzipServletFilter in the wild that could be used as is). then we'd have double wins: compression for replication, and compression of all responses generated by Solr if the client requests it. -Hoss
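Since the thread references such filters without showing one, a bare-bones sketch of the idea, assuming the Servlet 2.x API; a production filter would also wrap getWriter(), manage Content-Length, and guard against double-wrapping:

    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;
    import javax.servlet.*;
    import javax.servlet.http.*;

    public class GzipFilter implements Filter {
      public void init(FilterConfig cfg) {}
      public void destroy() {}

      public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
          throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;
        String enc = request.getHeader("Accept-Encoding");
        if (enc == null || enc.indexOf("gzip") < 0) {
          chain.doFilter(req, res);     // client did not ask for gzip
          return;
        }
        response.setHeader("Content-Encoding", "gzip");
        final GZIPOutputStream gzip = new GZIPOutputStream(response.getOutputStream());
        final ServletOutputStream out = new ServletOutputStream() {
          public void write(int b) throws IOException { gzip.write(b); }
          public void write(byte[] b, int off, int len) throws IOException {
            gzip.write(b, off, len);
          }
        };
        // Hand downstream code a response whose output stream compresses.
        HttpServletResponse wrapped = new HttpServletResponseWrapper(response) {
          public ServletOutputStream getOutputStream() { return out; }
        };
        chain.doFilter(req, wrapped);
        gzip.finish();                  // write the gzip trailer
      }
    }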
RE: Performance Lucene / Solr
I realize you said caching won't help because the searches are different, but what about Document caching? Is every document returned different? What's your hit rate on the Document cache? Can you throw memory at the problem by increasing the Document cache size? I ask all this, as the Document cache was the biggest win for my application when it came to increasing performance. Hit rates of 50% resulted in 30% GC time. Hit rates of 95% had GC rates below 2%. -Todd -Original Message- From: Kraus, Ralf | pixelhouse GmbH [mailto:[EMAIL PROTECTED]] Sent: Thursday, October 30, 2008 6:18 AM To: solr-user@lucene.apache.org Subject: Re: Performance Lucene / Solr Grant Ingersoll schrieb: Have you gone through http://wiki.apache.org/solr/SolrPerformanceFactors ? Can you explain a little more about your test case, maybe even share code? I only know a little PHP, but maybe someone else who is better versed might spot something. I just wrote my JSP script to use SolrJ instead; performance is much, much better now! Greets -Ralf-
Re: replication handler - compression
Yeah. I'm just not sure how much benefit in terms of data transfer this will save. Has anyone tested this to see if this is even worth it? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch ----- Original Message ---- From: Erik Hatcher [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Thursday, October 30, 2008 9:54:28 AM Subject: Re: replication handler - compression +1 - the GzipServletFilter is the way to go. Regarding request handlers reading HTTP headers, yeah,... this will improve, for sure. Erik On Oct 30, 2008, at 12:18 AM, Chris Hostetter wrote: : You are partially right. Instead of the HTTP header, we use a request : parameter. (RequestHandlers cannot read HTTP headers). If the param is hmmm, i'm with walter: we shouldn't invent new mechanisms for clients to request compression over HTTP from servers. replication is both special enough and important enough that if we had to add special support to make that information available to the handler on the master we could. but frankly i don't think that's necessary: the logic to turn on compression if the client requests it using Accept-Encoding: gzip is generic enough that there is no reason for it to be in a handler. we could easily put it in the SolrDispatchFilter, or even in a new ServletFilter (i'm guessing i've seen about 74 different implementations of a GzipServletFilter in the wild that could be used as is). then we'd have double wins: compression for replication, and compression of all responses generated by Solr if the client requests it. -Hoss
Re: corrupt solr index on ec2
On Thu, Oct 30, 2008 at 2:06 AM, Bill Graham [EMAIL PROTECTED] wrote: I've been running Solr 1.3 on an EC2 instance for a couple of weeks and I've had some stability issues. It seems like I need to bounce the app once a day. That I could live with and ultimately maybe troubleshoot, but what's more disturbing is that three times in the last two weeks my index has been corrupted when FileNotFoundExceptions started to appear. I'm running in Jetty and had my index on the local file system until I lost the index the first time. Then I moved it to my mounted EBS volume so I could restore from a snapshot if needed. I'm wondering if perhaps there are issues with the locking mechanism on either the local directory (which is really a virtual instance) or the mounted XFS volume. Has anyone seen this, or have suggestions re the cause? I'm using the single lockType. I'm running a single Solr instance that gets frequent updates from multiple threads, and commits about every hour. A few things I see in the logs: - From time to time I see write lock timeouts: SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SingleInstanceLock: write.lock This is really strange. It suggests that there is another in-process writer that is holding the lock. That should be impossible, unless it's caused by a previous exception trying to open an IndexWriter and the lock is simply stale. What seems to be the first exception that occurs? Also, you might try changing the lock type from single to simple to make it visible cross-process. That would rule out trying to start another Solr instance on the same index directory; opening two writers on the same directory is one way to get missing files like you appear to have. - I've seen OOM exceptions during warming. I've changed maxWarmingSearchers=1, which I suspect will do the trick. OOM errors are really tricky - if they happen in the wrong place, it's hard to recover gracefully from them. Correctly cleaning up after an OOM error in the IndexWriter recently had some little fixes in Lucene trunk - you might want to try the latest dev version of Lucene and see if it helps.
-Yonik
Re: replication handler - compression
About a factor of 2 on a small, optimized index. Gzipping took 20 seconds, so it isn't free.

    $ cd index-copy
    $ du -sk
    134336  .
    $ gzip *
    $ du -sk
    62084   .

wunder On 10/30/08 8:20 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Yeah. I'm just not sure how much benefit in terms of data transfer this will save. Has anyone tested this to see if this is even worth it? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch ----- Original Message ---- From: Erik Hatcher [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Thursday, October 30, 2008 9:54:28 AM Subject: Re: replication handler - compression +1 - the GzipServletFilter is the way to go. Regarding request handlers reading HTTP headers, yeah,... this will improve, for sure. Erik On Oct 30, 2008, at 12:18 AM, Chris Hostetter wrote: : You are partially right. Instead of the HTTP header, we use a request : parameter. (RequestHandlers cannot read HTTP headers). If the param is hmmm, i'm with walter: we shouldn't invent new mechanisms for clients to request compression over HTTP from servers. replication is both special enough and important enough that if we had to add special support to make that information available to the handler on the master we could. but frankly i don't think that's necessary: the logic to turn on compression if the client requests it using Accept-Encoding: gzip is generic enough that there is no reason for it to be in a handler. we could easily put it in the SolrDispatchFilter, or even in a new ServletFilter (i'm guessing i've seen about 74 different implementations of a GzipServletFilter in the wild that could be used as is). then we'd have double wins: compression for replication, and compression of all responses generated by Solr if the client requests it. -Hoss
Solr Searching on other fields which are not in query
Hi, I have a data set with the following schema: PersonName:Text AnimalName:Text PlantName:Text (plus a lot more attributes about each of them, like nick name, animal nick name, plant generic name etc., which are mutually exclusive) UniqueId:long For each document in the data set, there will be only one value of the above three. In my Solr query from the client I am using AnimalName:German Shepard. The returned result contains a PersonName with 'Shepard' in it, even though I am querying on the AnimalName field. Can anyone point me to what's happening and how to prevent scanning other columns/fields? I appreciate your help. Thanks Ravi
Re: replication handler - compression
Gzipping on disk requires quite some I/O. I guess that on-the-fly zipping should be faster. C. Walter Underwood wrote: About a factor of 2 on a small, optimized index. Gzipping took 20 seconds, so it isn't free.

    $ cd index-copy
    $ du -sk
    134336  .
    $ gzip *
    $ du -sk
    62084   .

wunder On 10/30/08 8:20 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Yeah. I'm just not sure how much benefit in terms of data transfer this will save. Has anyone tested this to see if this is even worth it? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch ----- Original Message ---- From: Erik Hatcher [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Thursday, October 30, 2008 9:54:28 AM Subject: Re: replication handler - compression +1 - the GzipServletFilter is the way to go. Regarding request handlers reading HTTP headers, yeah,... this will improve, for sure. Erik On Oct 30, 2008, at 12:18 AM, Chris Hostetter wrote: : You are partially right. Instead of the HTTP header, we use a request : parameter. (RequestHandlers cannot read HTTP headers). If the param is hmmm, i'm with walter: we shouldn't invent new mechanisms for clients to request compression over HTTP from servers. replication is both special enough and important enough that if we had to add special support to make that information available to the handler on the master we could. but frankly i don't think that's necessary: the logic to turn on compression if the client requests it using Accept-Encoding: gzip is generic enough that there is no reason for it to be in a handler. we could easily put it in the SolrDispatchFilter, or even in a new ServletFilter (i'm guessing i've seen about 74 different implementations of a GzipServletFilter in the wild that could be used as is). then we'd have double wins: compression for replication, and compression of all responses generated by Solr if the client requests it. -Hoss
Re: replication handler - compression
CPU was at 100%; it was not I/O bound. --wunder On 10/30/08 8:58 AM, christophe [EMAIL PROTECTED] wrote: Gzipping on disk requires quite some I/O. I guess that on-the-fly zipping should be faster. C. Walter Underwood wrote: About a factor of 2 on a small, optimized index. Gzipping took 20 seconds, so it isn't free.

    $ cd index-copy
    $ du -sk
    134336  .
    $ gzip *
    $ du -sk
    62084   .

wunder On 10/30/08 8:20 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Yeah. I'm just not sure how much benefit in terms of data transfer this will save. Has anyone tested this to see if this is even worth it? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch ----- Original Message ---- From: Erik Hatcher [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Thursday, October 30, 2008 9:54:28 AM Subject: Re: replication handler - compression +1 - the GzipServletFilter is the way to go. Regarding request handlers reading HTTP headers, yeah,... this will improve, for sure. Erik On Oct 30, 2008, at 12:18 AM, Chris Hostetter wrote: : You are partially right. Instead of the HTTP header, we use a request : parameter. (RequestHandlers cannot read HTTP headers). If the param is hmmm, i'm with walter: we shouldn't invent new mechanisms for clients to request compression over HTTP from servers. replication is both special enough and important enough that if we had to add special support to make that information available to the handler on the master we could. but frankly i don't think that's necessary: the logic to turn on compression if the client requests it using Accept-Encoding: gzip is generic enough that there is no reason for it to be in a handler. we could easily put it in the SolrDispatchFilter, or even in a new ServletFilter (i'm guessing i've seen about 74 different implementations of a GzipServletFilter in the wild that could be used as is). then we'd have double wins: compression for replication, and compression of all responses generated by Solr if the client requests it. -Hoss
Re: corrupt solr index on ec2
One small correction below: Yonik Seeley wrote: - I've seen OOM exceptions during warming. I've changed maxWarmingSearchers=1, which I suspect will do the trick. OOM errors are really tricky - if they happen in the wrong place, it's hard to recover gracefully from them. Correctly cleaning up after an OOM error in the IndexWriter recently had some little fixes in Lucene trunk - you might want to try the latest dev version of Lucene and see if it helps. This change (to not commit index changes after IndexWriter hits OOME) went in back in February 2008, so Solr 1.3 should already have it. (I'm working now on adding javadocs to IW explaining this.) Mike
Re: Solr Searching on other fields which are not in query
Your query AnimalName:German Shepard means AnimalName:German defaultField:Shepard, whichever the default field is. Try AnimalName:"German Shepard" or AnimalName:German AND AnimalName:Shepard. On Thu, Oct 30, 2008 at 12:58 PM, Yerraguntla [EMAIL PROTECTED] wrote: Hi, I have a data set with the following schema: PersonName:Text AnimalName:Text PlantName:Text (plus a lot more attributes about each of them, like nick name, animal nick name, plant generic name etc., which are mutually exclusive) UniqueId:long For each document in the data set, there will be only one value of the above three. In my Solr query from the client I am using AnimalName:German Shepard. The returned result contains a PersonName with 'Shepard' in it, even though I am querying on the AnimalName field. Can anyone point me to what's happening and how to prevent scanning other columns/fields? I appreciate your help. Thanks Ravi
Re: corrupt solr index on ec2
Thanks Yonik, I'll try changing the lock type to simple and see how that works. Looking closer at the logs I see the app was started at Oct 28, 2008 9:49:38, but not long afterwards it got its first exception when warming the index:

INFO: [] webapp=/solr path=/update params={} status=0 QTime=3
Oct 28, 2008 9:49:47 PM org.apache.solr.common.SolrException log
SEVERE: Error during auto-warming of key:[EMAIL PROTECTED]:java.lang.OutOfMemoryError: Java heap space
2008-10-28 21:49:47.674::INFO: Shutdown hook executing
Oct 28, 2008 9:49:47 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:915)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1217)
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
        at java.util.concurrent.FutureTask.get(FutureTask.java:83)
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:389)
        at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:77)
        at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:226)
        at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
        ...
Oct 28, 2008 9:49:47 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {commit=} 0 81896
Oct 28, 2008 9:49:47 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={} status=0 QTime=81896
Oct 28, 2008 9:49:47 PM org.apache.solr.core.SolrCore close
INFO: [] CLOSING SolrCore [EMAIL PROTECTED]
Oct 28, 2008 9:49:47 PM org.apache.solr.core.SolrCore closeSearcher
INFO: [] Closing main searcher on request.
Oct 28, 2008 9:49:47 PM org.apache.solr.update.DirectUpdateHandler2 close
INFO: closing DirectUpdateHandler2{commits=7,autocommits=0,optimizes=1,docsPending=10,adds=10,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=11511,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=1}
2008-10-28 21:49:48.744::INFO: Shutdown hook complete
Oct 28, 2008 9:49:52 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[Inta:2113254]} 0 3
Oct 28, 2008 9:49:52 PM org.apache.solr.core.SolrCore execute

Then it seemed to run well for about an hour and I saw this:

Oct 28, 2008 10:38:51 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true)
Oct 28, 2008 10:38:51 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: after flush: fdx size mismatch: 1156 docs vs 0 length in bytes of _2rv.fdx
        at org.apache.lucene.index.StoredFieldsWriter.closeDocStore(StoredFieldsWriter.java:94)
        at org.apache.lucene.index.DocFieldConsumers.closeDocStore(DocFieldConsumers.java:83)
        at org.apache.lucene.index.DocFieldProcessor.closeDocStore(DocFieldProcessor.java:47)
        at org.apache.lucene.index.DocumentsWriter.closeDocStore(DocumentsWriter.java:367)
        at org.apache.lucene.index.IndexWriter.flushDocStores(IndexWriter.java:1774)
        ...
Oct 28, 2008 10:38:53 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SingleInstanceLock: write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:85)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1140)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:938)
        at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:116)
        at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:122)
        at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:167)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:221)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:59)
        at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:196)

These lock errors continued for about an hour, before what appears to be a successful commit/warm occurred. Then things appeared normal for about another half hour until the missing index file exceptions below appeared. There really should be only one Solr process with access to this index in my environment, so I'm puzzled how two processes could mess up the index. The only thing that comes to mind is that I'm running monit to monitor my key processes, including Solr. It's set to bounce the port if it seems to be struggling. Maybe that bounce isn't happening cleanly and I somehow get overlapping processes. Seems unlikely, but who knows. I'll look into that too. Also, I'm in the process of
Re: Distributed search, standard request handler and more like this
: I'm doing some experiments with the morelikethis functionality using the : standard request handler to see if it also works with distributed search (I : saw that it will not yet work with the MoreLikeThis handler, : https://issues.apache.org/jira/browse/SOLR-788). As far as I can see, this : also does not work when using the standard request handler, i.e.: i think perhaps you are confused about that issue .. SOLR-788 isn't about the MLT Handler -- it's about the MLT Component (as mentioned in the description of the issue), which provides the exact functionality you seem to be trying to test (as a component of SearchHandler) ... : http://localhost:8080/solr/select?q=ID:*documentID*&mlt=true&mlt.fl=Text&mlt.mindf=1&mlt.mintf=1&shards=shard1,shard2 the MLT Handler also doesn't support distributed searching, but as far as i know there aren't any plans to add it (distributed searching is a feature of SearchHandler; as a separate handler, MoreLikeThisHandler doesn't take advantage of that at all). -Hoss
Re: replication handler - compression
: Yeah. I'm just not sure how much benefit in terms of data transfer this : will save. Has anyone tested this to see if this is even worth it? one man's trash is another man's treasure ... if you're replicating snapshots very frequently within a single datacenter, speed is critical and bandwidth is free -- if you're replicating once a day from one data center to another over a very expensive, very small pipe, spending some time+CPU to compress may be worth it. either way: it should be almost trivial to implement if people want to supply a patch, and with a simple new requestDispatcher config option, easy to disable completely on the server for people who might have clients sending Accept-Encoding: gzip willy nilly -Hoss
Re: How to get the min and max values from facets?
: hundred thousands)... so, I have to display the range (minimum and maximum : values) from a facet. Is there any way to do that? : I found the new statistics component, follow the link: : http://wiki.apache.org/solr/StatsComponent : But it's for Solr 1.4. there haven't been many changes on the trunk since 1.3 that StatsComponent would depend on; you can probably use it as is. : Does anyone have any idea? assuming the field you want the min/max for is stored, you can do multiple hits sorted on that field to get the highest and lowest values. -Hoss
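A sketch of that two-query approach with SolrJ; the price field name is a hypothetical example, and note that sorting additionally requires the field to be indexed and single-valued:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class MinMaxExample {
      public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        // Two one-row queries: ascending sort gives the min, descending the max.
        System.out.println("min = " + extreme(server, SolrQuery.ORDER.asc));
        System.out.println("max = " + extreme(server, SolrQuery.ORDER.desc));
      }

      static Object extreme(CommonsHttpSolrServer server, SolrQuery.ORDER order)
          throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(1);                       // only the extreme value is needed
        q.addSortField("price", order);     // hypothetical sortable field
        // Assumes at least one matching document exists.
        SolrDocument doc = server.query(q).getResults().get(0);
        return doc.getFieldValue("price");
      }
    }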
Re: Solr Searching on other fields which are not in query
Hmm, I don't have any defaultField defined in the schema.xml. Can you give the exact syntax of how it looks in schema.xml? I have <defaultSearchField>text</defaultSearchField>. Does it mean that if a sufficient requested count is not available, it looks for the search string in any of the text fields that are indexed? Thanks Ravi Jorge Solari wrote: Your query AnimalName:German Shepard means AnimalName:German defaultField:Shepard, whichever the default field is. Try AnimalName:"German Shepard" or AnimalName:German AND AnimalName:Shepard. On Thu, Oct 30, 2008 at 12:58 PM, Yerraguntla [EMAIL PROTECTED] wrote: Hi, I have a data set with the following schema: PersonName:Text AnimalName:Text PlantName:Text (plus a lot more attributes about each of them, like nick name, animal nick name, plant generic name etc., which are mutually exclusive) UniqueId:long For each document in the data set, there will be only one value of the above three. In my Solr query from the client I am using AnimalName:German Shepard. The returned result contains a PersonName with 'Shepard' in it, even though I am querying on the AnimalName field. Can anyone point me to what's happening and how to prevent scanning other columns/fields? I appreciate your help. Thanks Ravi
Re: Solr Searching on other fields which are not in query
Never mind, I understand now. I have <defaultSearchField>text</defaultSearchField>. I was searching on a string field with a space in it and with no quotes. This causes the unqualified term to be searched in the text field (since the default search field is text) in the schema. Also, in my schema there is an indexed field (AnimalNameText), a text field, which was not populated. After I populate the text field, I get only the results I am expecting. Thanks for the pointer, Jorge! Yerraguntla wrote: Hmm, I don't have any defaultField defined in the schema.xml. Can you give the exact syntax of how it looks in schema.xml? I have <defaultSearchField>text</defaultSearchField>. Does it mean that if a sufficient requested count is not available, it looks for the search string in any of the text fields that are indexed? Thanks Ravi Jorge Solari wrote: Your query AnimalName:German Shepard means AnimalName:German defaultField:Shepard, whichever the default field is. Try AnimalName:"German Shepard" or AnimalName:German AND AnimalName:Shepard. On Thu, Oct 30, 2008 at 12:58 PM, Yerraguntla [EMAIL PROTECTED] wrote: Hi, I have a data set with the following schema: PersonName:Text AnimalName:Text PlantName:Text (plus a lot more attributes about each of them, like nick name, animal nick name, plant generic name etc., which are mutually exclusive) UniqueId:long For each document in the data set, there will be only one value of the above three. In my Solr query from the client I am using AnimalName:German Shepard. The returned result contains a PersonName with 'Shepard' in it, even though I am querying on the AnimalName field. Can anyone point me to what's happening and how to prevent scanning other columns/fields? I appreciate your help. Thanks Ravi
Re: Solr Searching on other fields which are not in query
I didn't mean with defaultField that it was the way to define the default field in the schema; it was only a generic way to say "default field name". The default field name seems to be text in your case. If the search query doesn't say which field to search on, the word will be searched in that field. In the query AnimalName:German Shepard you are saying: search the word German in the field AnimalName and the word Shepard in the default field. I think that the search you want to do is AnimalName:German AND AnimalName:Shepard. I don't know if there is a way to tell Solr to search on all fields. You may be copying the content of other fields to the text field. See if there is something like <copyField source="*" dest="text"/> in the schema file. Jorge On Thu, Oct 30, 2008 at 3:13 PM, Yerraguntla [EMAIL PROTECTED] wrote: Hmm, I don't have any defaultField defined in the schema.xml. Can you give the exact syntax of how it looks in schema.xml? I have <defaultSearchField>text</defaultSearchField>. Does it mean that if a sufficient requested count is not available, it looks for the search string in any of the text fields that are indexed? Thanks Ravi Jorge Solari wrote: Your query AnimalName:German Shepard means AnimalName:German defaultField:Shepard, whichever the default field is. Try AnimalName:"German Shepard" or AnimalName:German AND AnimalName:Shepard. On Thu, Oct 30, 2008 at 12:58 PM, Yerraguntla [EMAIL PROTECTED] wrote: Hi, I have a data set with the following schema: PersonName:Text AnimalName:Text PlantName:Text (plus a lot more attributes about each of them, like nick name, animal nick name, plant generic name etc., which are mutually exclusive) UniqueId:long For each document in the data set, there will be only one value of the above three. In my Solr query from the client I am using AnimalName:German Shepard. The returned result contains a PersonName with 'Shepard' in it, even though I am querying on the AnimalName field. Can anyone point me to what's happening and how to prevent scanning other columns/fields? I appreciate your help. Thanks Ravi
Changing mergeFactor in mid-stream?
The http://wiki.apache.org/lucene-java/ImproveIndexingSpeed page suggests that indexing will be sped up by using higher values of mergeFactor, while search speed improves with lower values. I need to create an index using multiple batches of documents. My question is: can I begin building with a high mergeFactor for the bulk of the load and then switch to a lower value for the final batch? I build the indices offline and only swap them online when complete. The online index is never updated.
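At the raw Lucene level this kind of mid-build switch appears straightforward; a sketch, assuming a Lucene 2.x-era IndexWriter API (in Solr, the analogous move would be editing mergeFactor in solrconfig.xml between batches):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class MergeFactorExample {
      public static void main(String[] args) throws Exception {
        // Hypothetical index path; true = create a new index.
        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        writer.setMergeFactor(100);      // favor indexing speed for the bulk load
        for (int i = 0; i < 1000000; i++) {
          Document doc = new Document();
          doc.add(new Field("id", Integer.toString(i),
                            Field.Store.YES, Field.Index.UN_TOKENIZED));
          writer.addDocument(doc);
        }
        writer.setMergeFactor(10);       // back to a search-friendly value
        // ... index the final batch here ...
        writer.optimize();               // collapse segments for fast searching
        writer.close();
      }
    }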
Re: How to get the min and max values from facets?
: myfacet, ASC, limit 1 : myfacet, DESC, limit 1 : So I can get the first value and the last one. : : Do you think I will get more performance this way than using stats? I'm guessing that by all measurable metrics, the StatsComponent will blow that out of the water -- i was just putting it out there as a possible alternative if you didn't feel comfortable enough with java to compile the StatsComponent and use it with Solr 1.3. -Hoss
Re: Max Number of Facets
The only 'limit' is the effect on your query times... you could have 1000+ facets if you are OK with the response time. Sorry to give the "it depends" answer, but it totally depends on your data and your needs. On Oct 30, 2008, at 7:28 AM, Jeryl Cook wrote: Is there a limit on the number of facets that I can create in Solr? (Dynamically generated facets.) -- Jeryl Cook /^\ Pharaoh /^\ http://pharaohofkush.blogspot.com/ Whether we bring our enemies to justice, or bring justice to our enemies, justice will be done. --George W. Bush, Address to a Joint Session of Congress and the American People, September 20, 2001
Re: Max Number of Facets
I've actually seen cases on our site where it's possible to bring up over 30,000 facets for one query. And they actually come up quickly - like, 3 seconds. It takes longer for the browser to render them. -- Steve On Oct 30, 2008, at 6:04 PM, Ryan McKinley wrote: The only 'limit' is the effect on your query times... you could have 1000+ facets if you are OK with the response time. Sorry to give the "it depends" answer, but it totally depends on your data and your needs. On Oct 30, 2008, at 7:28 AM, Jeryl Cook wrote: Is there a limit on the number of facets that I can create in Solr? (Dynamically generated facets.) -- Jeryl Cook /^\ Pharaoh /^\ http://pharaohofkush.blogspot.com/ Whether we bring our enemies to justice, or bring justice to our enemies, justice will be done. --George W. Bush, Address to a Joint Session of Congress and the American People, September 20, 2001
Re: Max Number of Facets
I understand what you mean. I am building a system that will dynamically generate facets, which could possibly be thousands, but at most about 6 or 7 facets will be returned using a facet ranking algorithm. So I get what you mean: if I request 1000 facets back in my query compared to just 6 or 7, I could take a performance hit.

On 10/30/08, Ryan McKinley [EMAIL PROTECTED] wrote:

the only 'limit' is the effect on your query times... you could have 1000+ facets if you are ok with the response time. Sorry to give the "it depends" answer, but it totally depends on your data and your needs.

--
Jeryl Cook
/^\ Pharaoh /^\
http://pharaohofkush.blogspot.com/
"Whether we bring our enemies to justice, or bring justice to our enemies, justice will be done."
--George W. Bush, Address to a Joint Session of Congress and the American People, September 20, 2001
Re: Max Number of Facets
Wow, 30k in under 3 seconds.

On 10/30/08, Stephen Weiss [EMAIL PROTECTED] wrote:

I've actually seen cases on our site where it's possible to bring up over 30,000 facets for one query. And they actually come up quickly - like, 3 seconds. It takes longer for the browser to render them.

--
Jeryl Cook
/^\ Pharaoh /^\
http://pharaohofkush.blogspot.com/
"Whether we bring our enemies to justice, or bring justice to our enemies, justice will be done."
--George W. Bush, Address to a Joint Session of Congress and the American People, September 20, 2001
Re: Solr Searching on other fields which are not in query
On Thu, 30 Oct 2008 15:50:58 -0300 Jorge Solari [EMAIL PROTECTED] wrote:

<copyField source="*" dest="text"/> in the schema file.

or use the DisMax query handler.

b

_
{Beto|Norberto|Numard} Meijome

"Windows: Where do you want to go today? Linux: Where do you want to go tomorrow? FreeBSD: Are you guys coming, or what?"

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
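A minimal sketch of that DisMax route, assuming a Solr 1.3 solrconfig.xml -- the handler name and field list are illustrative, taken from the schema discussed in this thread:

    <requestHandler name="dismax" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">dismax</str>
        <!-- query across all three name fields without the client naming them -->
        <str name="qf">PersonName AnimalName PlantName</str>
      </lst>
    </requestHandler>

A request like /select?qt=dismax&q=German+Shepard then searches all three fields at once, with no field prefixes and no catch-all copyField required.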
DIH and rss feeds
I have a DataImportHandler configured to index from an RSS feed. It is a "latest stuff" feed. It reads the feed and indexes the 100 documents harvested from the feed. So far, works great. Now: a few hours later there are a different 100 latest documents. How do I add those to the index so I will have 200 documents? 'full-import' throws away the first 100. 'delta-import' is not implemented. What is the special trick here? I'm using the Solr 1.3.0 release.

Thanks,
Lance Norskog
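For reference, a 1.3-era RSS setup of the kind Lance describes looks roughly like this -- a sketch patterned on the rss-data-config.xml example that ships with DIH, with the URL and field list as placeholders:

    <dataConfig>
      <dataSource type="HttpDataSource" />
      <document>
        <entity name="item"
                processor="XPathEntityProcessor"
                url="http://example.com/latest.rss"
                forEach="/rss/channel/item">
          <!-- the link doubles as the uniqueKey so re-imports de-dupe -->
          <field column="link"  xpath="/rss/channel/item/link" />
          <field column="title" xpath="/rss/channel/item/title" />
        </entity>
      </document>
    </dataConfig>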
RE: Using Solrj
Thank you so much. Here is my use case: I need to search the database for a collection of input parameters, which touches 'n' number of tables. The data is very huge. The search query itself is very dynamic. I use a lot of views for the same search. How do I make use of Solr in this case?

-----Original Message-----
From: Erick Erickson [mailto:[EMAIL PROTECTED]
Sent: Thursday, October 30, 2008 7:01 PM
To: solr-user@lucene.apache.org
Subject: Re: Using Solrj

Generally, you need to get your head out of the database world and into the search world to be successful with Lucene. For instance, one of the cardinal tenets of database design is to normalize your data. It goes against every instinct to *denormalize* your data when creating a Lucene index, explicitly so you do NOT have to think in terms of joins or sub-queries. Whenever I start thinking this way, I try to back up and think again.

Both your posts indicate to me that you're thinking in database terms. There are no views in Lucene, for instance. You refer to tables. There are no tables in Lucene, there are only documents with various numbers of fields. You could conceivably make your index look like a database by creatively naming your document fields. But that doesn't play to the strengths of Lucene *or* the database. In fact, there is NO requirement that documents have the *same* fields. Which is really difficult to get into when thinking like a DBA.

Lucene is designed to search text. Fast and well. It is NOT intended to efficiently manipulate relationships *between* documents. There are various hybrid solutions that people have used: that is, put the data you really need to do text searching on in a Lucene index, along with enough data to be able to get the *rest* of what you need from your database. But it all depends upon the problem you're trying to solve.

But as Noble says, all this is too general to be really useful; you need to provide quite a bit more detail about the problem you're trying to solve to get useful recommendations.

Best
Erick

On Thu, Oct 30, 2008 at 8:50 AM, Raghunandan Rao [EMAIL PROTECTED] wrote:

Thanks Noble. So you mean to say that I need to create a view according to my query and then index on the view and fetch?

-----Original Message-----
From: Noble Paul നോബിള് नोब्ळ् [mailto:[EMAIL PROTECTED]
Sent: Thursday, October 30, 2008 6:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Using Solrj

hi,
There are two sides to this:
1. indexing (getting data into Solr). SolrJ or DataImportHandler can be used for this.
2. querying (getting data out of Solr). Here you do not have the choice of joining multiple tables. There is only one index for Solr.

On Thu, Oct 30, 2008 at 5:34 PM, Raghunandan Rao [EMAIL PROTECTED] wrote:

Hi, I am trying to use Solrj for my web application. I am indexing a table using the @Field annotation tag. Now I need to index or query multiple tables. Like, get all the employees who are managers in the Finance department (interacting with 3 entities). How do I do that? Does anyone have any idea?

Thanks

--
--Noble Paul
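To make Erick's denormalization point concrete, here is a hedged SolrJ sketch for the employees-and-departments example -- one flattened document per employee, with class, field, and URL names invented for illustration, not taken from Raghunandan's actual schema:

    import java.util.Arrays;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.beans.Field;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    // One denormalized document per employee: department data that would live
    // in its own table is copied into each employee document, so no join is needed.
    public class EmployeeDoc {
        @Field public String id;
        @Field public String name;
        @Field public String department;  // flattened from the department table
        @Field public boolean manager;    // flattened role flag

        public static void main(String[] args) throws Exception {
            EmployeeDoc e = new EmployeeDoc();
            e.id = "42";
            e.name = "Jane Roe";
            e.department = "Finance";
            e.manager = true;

            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            server.addBean(e);   // binds via the @Field annotations
            server.commit();
            // "managers in Finance" is now one flat query, no joins:
            //   q=department:Finance AND manager:true
        }
    }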
Re: DIH and rss feeds
On Thu, 30 Oct 2008 20:46:16 -0700 Lance Norskog [EMAIL PROTECTED] wrote:

Now: a few hours later there are a different 100 latest documents. How do I add those to the index so I will have 200 documents? 'full-import' throws away the first 100. 'delta-import' is not implemented. What is the special trick here? I'm using the Solr 1.3.0 release.

Lance,

1) DIH has a clean parameter that, when set to true (the default, I think), will delete all existing docs in the index.
2) Ensure your new documents have different values in the field defined as your key (schema.xml).

let us know how it goes,
B

_
{Beto|Norberto|Numard} Meijome

"Lack of planning on your part does not constitute an emergency on ours."

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
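In other words, the trick is to re-run the import with clean turned off -- a sketch, assuming the handler is registered at /dataimport as in the stock config (host and path are placeholders):

    # Re-import the feed without deleting the documents already in the index:
    http://localhost:8983/solr/dataimport?command=full-import&clean=false

Combined with a uniqueKey that differs per feed item, each run adds the new documents and overwrites only genuine re-fetches.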
Re: Changing mergeFactor in mid-stream?
Yes, you can change the mergeFactor. More important than the mergeFactor is this:

<ramBufferSizeMB>32</ramBufferSizeMB>

Pump it up as much as your hardware/JVM allows. And use an appropriate -Xmx, of course.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Barnett, Jeffrey [EMAIL PROTECTED]
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Thursday, October 30, 2008 3:49:30 PM
Subject: Changing mergeFactor in mid-stream?

The http://wiki.apache.org/lucene-java/ImproveIndexingSpeed page suggests that indexing will be sped up by using higher values of mergeFactor, while search speed improves with lower values. My question is, can I begin building with a high mergeFactor for the bulk of the load and then switch to a lower value for the final batch?
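Put together, the relevant solrconfig.xml block looks something like this -- a sketch with illustrative values; in Solr 1.3 these settings live in the mainIndex section:

    <mainIndex>
      <!-- high for the offline bulk load; lower it (or optimize) for the final batch -->
      <mergeFactor>25</mergeFactor>
      <!-- buffer documents in RAM before flushing; raise as far as -Xmx allows -->
      <ramBufferSizeMB>256</ramBufferSizeMB>
    </mainIndex>

Since the index is built offline and never updated once swapped online, an optimize at the end of the last batch makes the initial mergeFactor largely irrelevant to search speed.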
Re: replication handler - compression
man gzip:

    -# --fast --best
        Regulate the speed of compression using the specified digit #, where -1
        or --fast indicates the fastest compression method (less compression)
        and -9 or --best indicates the slowest compression method (best
        compression). The default compression level is -6 (that is, biased
        towards high compression at expense of speed).

So it could be better than the factor of 2, but also take longer. :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Walter Underwood [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, October 30, 2008 11:52:47 AM
Subject: Re: replication handler - compression

About a factor of 2 on a small, optimized index. Gzipping took 20 seconds, so it isn't free.

$ cd index-copy
$ du -sk
134336  .
$ gzip *
$ du -sk
62084   .

wunder

On 10/30/08 8:20 AM, Otis Gospodnetic wrote:

Yeah. I'm just not sure how much benefit in terms of data transfer this will save. Has anyone tested this to see if this is even worth it?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Erik Hatcher
To: solr-user@lucene.apache.org
Sent: Thursday, October 30, 2008 9:54:28 AM
Subject: Re: replication handler - compression

+1 - the GzipServletFilter is the way to go.

Regarding request handlers reading HTTP headers, yeah,... this will improve, for sure.

Erik

On Oct 30, 2008, at 12:18 AM, Chris Hostetter wrote:

: You are partially right. Instead of the HTTP header, we use a request
: parameter. (RequestHandlers cannot read HTTP headers). If the param is

hmmm, i'm with walter: we shouldn't invent new mechanisms for clients to request compression over HTTP from servers. Replication is both special enough and important enough that if we had to add special support to make that information available to the handler on the master we could. But frankly I don't think that's necessary: the logic to turn on compression if the client requests it using "Accept-Encoding: gzip" is generic enough that there is no reason for it to be in a handler. We could easily put it in the SolrDispatchFilter, or even in a new ServletFilter (I'm guessing I've seen about 74 different implementations of a GzipServletFilter in the wild that could be used as is). Then we'd have double wins: compression for replication, and compression of all responses generated by Solr if the client requests it.

-Hoss
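Since the thread keeps pointing at a generic gzip servlet filter, here is a minimal sketch of one, assuming the javax.servlet API of that era. The class name is invented, and it only wraps getOutputStream() (a production filter would also handle getWriter() and Content-Length); it is not the filter Hoss had in mind, just an illustration of the pattern:

    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;
    import javax.servlet.*;
    import javax.servlet.http.*;

    public class GzipFilter implements Filter {
        public void init(FilterConfig cfg) {}
        public void destroy() {}

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest request = (HttpServletRequest) req;
            HttpServletResponse response = (HttpServletResponse) res;

            String accept = request.getHeader("Accept-Encoding");
            if (accept == null || accept.indexOf("gzip") == -1) {
                chain.doFilter(req, res);   // client didn't ask for gzip
                return;
            }

            response.setHeader("Content-Encoding", "gzip");
            final GZIPOutputStream gzip = new GZIPOutputStream(response.getOutputStream());
            final ServletOutputStream out = new ServletOutputStream() {
                public void write(int b) throws IOException { gzip.write(b); }
                public void write(byte[] b, int off, int len) throws IOException {
                    gzip.write(b, off, len);
                }
            };

            // Wrap the response so everything the handler writes passes through gzip.
            HttpServletResponseWrapper wrapped = new HttpServletResponseWrapper(response) {
                public ServletOutputStream getOutputStream() { return out; }
            };
            chain.doFilter(req, wrapped);
            gzip.finish();   // flush the trailing gzip block
        }
    }

Mapped in web.xml in front of the Solr servlet, this compresses replication traffic and any other response whenever the client sends Accept-Encoding: gzip.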
Re: replication handler - compression
It could also be that the C version is a lot more efficient than the Java version, and it could take longer regardless. I could not find a benchmark on that, but C is usually better for bit twiddling.

wunder

On 10/30/08 10:36 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote:

So it could be better than the factor of 2, but also take longer. :)