Re: lucene-java version mismatches
Could I suggest that the maven repositories are populated next time a release of the Solr-specific Lucene jars is made? But they are? They are inside the org.apache.solr group, since those Lucene jars are released by Solr -- http://repo2.maven.org/maven2/org/apache/solr/ Nope, http://repo1.maven.org/maven2/org/apache/solr/solr-lucene-core/1.3.0/ has no sources. Only the solr-specific ones have. paul
Status of an update request
Hello, When I send an update or a commit to solr via curl, the response I get is formatted in HTML; I can't find a way to have a machine-readable response file. Here is what is said on the subject in the solr config file: The response format differs from solr1.1 formatting and returns a standard error code. To enable solr1.1 behavior, remove the /update handler or change its path. What I want, however, is an accurate description of the error and not just a standard Apache error code. Is there a way to obtain an XML response file from solr? Thanks, Kind regards, P-YL
Re: lucene-java version mismatches
On Wed, Mar 25, 2009 at 12:30 PM, Paul Libbrecht p...@activemath.org wrote: Nope, http://repo1.maven.org/maven2/org/apache/solr/solr-lucene-core/1.3.0/ has no sources. Only the solr-specific ones have. Ah, I see. Solr's build uses the lucene binaries which are checked into SVN, so sources are a little more difficult to bundle. Either we'd need to check in the lucene source jars as well, or the ant build would need to check out the lucene code with the same revision number and make a source jar. Please open an issue in Jira. It might be difficult for me to find time for this right now, but we can decide on an acceptable approach. Also note that lucene's revision number is mentioned in the CHANGES.txt -- Regards, Shalin Shekhar Mangar.
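For context: the Solr-released Lucene binaries resolve with an ordinary pom dependency. A minimal sketch using the 1.3.0 coordinates visible in the repo1 URL above (only the binary jar is published there -- no sources classifier, which is exactly Paul's complaint):

    <dependency>
      <groupId>org.apache.solr</groupId>
      <artifactId>solr-lucene-core</artifactId>
      <version>1.3.0</version>
    </dependency>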
Anyone use solr admin and Opera?
Hello, I'm a happy Solr user. Thanks for the excellent software!! Hopefully this is a good question; I have indeed looked around the FAQ and google and such first. I have just switched from Firefox to Opera for web browsing. (Another story) When I use the solr/admin the home page and stats work fine, but searches return unformatted results all run together. If I get source, I see it is XML, and in fact, the source is more readable than the page itself. Perhaps I need a stylesheet, or something. Are there any other Opera users that have gotten past this problem? Thanks gene
numeric range facets
Similar to getting range facets for dates, where we specify start, end and gap: can we do the same thing for numeric facets, where we specify start, end and gap?
Re: get all facets
On Wed, Mar 25, 2009 at 7:30 AM, Ashish P ashish.ping...@gmail.com wrote: Can I get all the facets in QueryResponse?? You can get all the facets that are returned by the server. Set facet.limit to the number of facets you want to retrieve. See http://lucene.apache.org/solr/api/solrj/org/apache/solr/client/solrj/SolrQuery.html#setFacetLimit(int) -- Regards, Shalin Shekhar Mangar.
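A minimal SolrJ sketch of what Shalin describes (the field name "category" and the server variable are illustrative assumptions; facet.limit=-1 is the conventional "no limit" value):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    SolrQuery query = new SolrQuery("*:*");
    query.setFacet(true);
    query.addFacetField("category");  // hypothetical facet field
    query.setFacetLimit(-1);          // -1 = unlimited: return every facet value
    QueryResponse rsp = server.query(query);  // server: an existing SolrServer instance
    for (FacetField.Count c : rsp.getFacetField("category").getValues()) {
        System.out.println(c.getName() + ": " + c.getCount());
    }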
Re: numeric range facets
On Wed, Mar 25, 2009 at 3:26 PM, Ashish P ashish.ping...@gmail.com wrote: Can we do the same thing for numeric facets where we specify start, end and gap? No. But you can do this with multiple queries by using facet.field with fq parameters. If you are using the trunk, then it should be possible to do this with one query using the new multi-select facet feature. See http://wiki.apache.org/solr/SimpleFacetParameters#head-f277d409b221b407d9c5430f552bf40ee6185c4c -- Regards, Shalin Shekhar Mangar.
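A related illustration (not from Shalin's reply, and the field name price and the bucket edges are made up): a set of facet.query parameters can emulate fixed numeric buckets in a single request, each one coming back as its own count under facet_queries in the response:

    q=*:*&facet=true
      &facet.query=price:[0 TO 99]
      &facet.query=price:[100 TO 199]
      &facet.query=price:[200 TO *]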
Re: Anyone use solr admin and Opera?
On Wed, Mar 25, 2009 at 1:33 PM, ristretto.rb ristretto...@gmail.com wrote: When I use the solr/admin the home page and stats work fine, but searches return unformatted results all run together. I'd be interested in this too. Safari/Chrome also have the same problem; they don't render raw xml nicely. -- Regards, Shalin Shekhar Mangar.
Re: Status of an update request
On Wed, Mar 25, 2009 at 12:42 PM, Pierre-Yves LANDRON pland...@hotmail.com wrote: Is there a way to obtain an XML response file from solr? If the update command executes successfully, then the response is XML. In case of error, the error page is generated by the servlet container, which is HTML I guess. Not sure what can be done about that. Perhaps Solr could have its own error pages which output XML with the stack trace information and the correct HTTP return codes? -- Regards, Shalin Shekhar Mangar.
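To make the success path concrete, this is the usual way to post a commit with curl against a 1.3 /update handler (host and port are assumptions):

    curl 'http://localhost:8983/solr/update' \
         -H 'Content-Type: text/xml' \
         --data-binary '<commit/>'

A successful POST here answers with the standard <response> XML; only failures fall through to the servlet container's HTML error page, which is the behavior Shalin describes.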
Deleting documents
I'm trying to delete documents based on the following type of update request:
<delete><query>topologyid:3140</query><query>topologyid:3142</query></delete>
This doesn't cause any changes in the index, and if I try to read the response, the following error occurs:
13:32:35,196 ERROR [STDERR] 25/Mar/2009 13:32:35 org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {} 0 16
13:32:35,196 ERROR [STDERR] 25/Mar/2009 13:32:35 org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: missing content stream
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:49)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:182)
at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:446)
at java.lang.Thread.run(Unknown Source)
13:32:35,196 ERROR [STDERR] 25/Mar/2009 13:32:35 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/apache-solr-nightly path=/update params={<delete><query>topologyid:3142</query></delete>=} status=400 QTime=16
Thanks in advance, Rui Pereira
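A hint sits in the last log line: the delete XML appears inside params={...=}, meaning it arrived as a URL parameter name rather than as the request body, which is precisely when ContentStreamHandlerBase raises "missing content stream". A hedged guess at the fix (host and port are assumptions) is to send the XML as the POST body with an explicit content type:

    curl 'http://localhost:8983/solr/update' \
         -H 'Content-Type: text/xml' \
         --data-binary '<delete><query>topologyid:3140</query><query>topologyid:3142</query></delete>'

followed by a separate <commit/> request so the deletes become visible to searchers.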
Copy solr indexes from 2 solr instance
Hi, Issue 1: I have 2 solr instances; I need to copy indexes from the solr1 instance to solr2 without restarting solr. Please suggest how this will work. Both solrs are on a multicore setup. Issue 2: I deleted all indexes from solr and reloaded my core; solr admin returns 0 results. Why does the index folder under the core's data directory still have a number of files? Issue 3: Can I copy/paste the data folder into a running core of solr? Thanks, Prerna
speeding up indexing with a LOT of indexed fields
hi, I'm having difficulty indexing a collection of documents in a reasonable time. It's now going at 20 docs/sec on a c1.xlarge instance of amazon ec2, which just isn't enough. This box has 8GB ram and the equivalent of 20 xeon processors. These documents have a couple of stored, indexed, multi- and single-valued fields, but the main problem lies in them having about 1500 indexed fields of type sint with range [0,1]. (Yes, I know this is a lot.) I'm looking for some guidance as to what strategies to try out to improve throughput in indexing. I could slam in some more servers (I will) but my feeling tells me I can get more out of this.
some additional info:
- I'm indexing to 10 cores in parallel. This is done because:
  - at query time, 1 particular index will always fulfill all requests, so we can prune the search space to 1/10th of its original size.
  - each document as represented in a core is actually 1/10th of a 'conceptual' document (which would contain up to 15,000 indexed fields if I indexed to 1 core). Indexing as 1 doc containing 15,000 indexed fields proved to give far worse results in searching and indexing than the solution I'm going with now.
  - the alternative of simply putting all docs with 1500 indexed fields each in the same core isn't really possible either, because this quickly results in OOM errors when sorting on a couple of fields. (Even though 9/10ths of all docs in this case would not have the field sorted on, they would still end up in a lucene fieldCache for this field.)
- to be clear: the 20 docs/second means 2 docs/second/core, or 2 'conceptual' docs/second overall.
- each core has maxBufferedDocs ~20 and mergeFactor ~10. (I actually set them differently for each partition so that merges of different partitions don't all happen together. This seemed to help a bit.)
- running jvm with -server -Xmx6000M -Xms6000M -XX:+UseParallelGC -XX:+CMSPermGenSweepingEnabled -XX:MaxPermSize=128M to leave room for disk caching.
- I'm spreading the 10 indices over 2 physical disks: 5 to /dev/sda1, 5 to /dev/sdb.
observations:
- within minutes after feeding, the server reaches its max ram.
- until then, the processors are running at ~70%.
- although I throw in a commit at random intervals (between 600 and 800 secs, again so as not to commit all partitions at the same time) the jvm just stays eating all the ram.
- not a lot seems to be happening on disk (using dstat) when the ram hasn't maxed out. Obviously, afterwards the disk is flooded with swapping.
questions:
- is there a good reason why all ram stays occupied even though I commit regularly? Perhaps fieldcaches get populated when indexing? I guess not, but I'm not sure what else could explain this.
- would splitting the 'conceptual docs' into even more partitions help at indexing time? From an application standpoint it's possible, it just requires some work, and it's hard to compare figures, so I'd like to know if it's worth it.
- how is a flush different from a commit, and would it help in getting the ram usage down?
- because all 15,000 indexed fields look very similar in structure (they are all sints [0,1] to start with), I was looking for more efficient ways to get them into an index using some low-level indexing operations. For example: for given documents X and Y, and indexed fields 1, 2, ..., i, ..., N: if X.a > Y.a then this ordering in a lot of cases holds for fields 2, ..., N. Because of these special properties I could possibly create a sorting algorithm that takes advantage of this and thus would make indexing faster. Would even considering this path be something that may be useful? Because obviously it would involve some work to make it work, and presumably a lot more work to get it to go faster than out of the box.
- lastly: should I be able to get more out of this box or am I just complaining ;-)
Thanks for making it to here, and hoping to receive some valuable info, Cheers, Britske
Re: speeding up indexing with a LOT of indexed fields
Britske, Here are a few quick ones:
- Does that machine really have 10 CPU cores? If it has significantly fewer, you may be beyond the indexing sweet spot in terms of indexer threads vs. CPU cores.
- Your maxBufferedDocs is super small. Comment that out anyway; use ramBufferSizeMB and set it as high as you can afford. No need to commit very often, and certainly no need to flush or optimize until the end.
There is a page about indexing performance on either the Solr or Lucene Wiki that will help. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
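For reference, the buffer setting Otis recommends lives in solrconfig.xml; a hedged sketch (256 is an arbitrary example, to be sized against the 6GB heap mentioned above):

    <indexDefaults>
      <ramBufferSizeMB>256</ramBufferSizeMB>
      <mergeFactor>10</mergeFactor>
    </indexDefaults>

With ramBufferSizeMB set and maxBufferedDocs commented out, Lucene flushes segments by accumulated RAM rather than by document count, which usually gives fewer, larger flushes.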
Re: Copy solr indexes from 2 solr instance
Prerna, You could create an index snapshot with the snapshooter script and then copy the index. You should do that while the source index is not getting modified. Re issue #2: run optimize. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Snapinstaller + Overlapping onDeckSearchers Problems
Hm, I can't quite tell from here, but that is just a warning, so it's not super problematic at this point. Could it be that one of your other caches (query cache) is large and lots of items are copied on searcher flip? Could it be that your JVM doesn't have a large or free enough heap? Can you tell if lots of GCing happens during the searcher flip? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Cloude Porteus clo...@instructables.com To: solr-user@lucene.apache.org Sent: Wednesday, March 25, 2009 1:06:51 AM Subject: Snapinstaller + Overlapping onDeckSearchers Problems
We have been running our solr slaves without autowarming our new searchers for a long time, but that was causing us 50-75 requests in the 20+ seconds timeframe after every update on the slaves. I have turned on autowarming and that has fixed our slow response times, but I'm running into occasional Overlapping onDeckSearchers. We have replication set up and are using the snapinstaller script every 10 minutes:
/home/solr/bin/snappuller -M util01 -P 18984 -D /home/solr/write/data -S /home/solr/logs -d /home/solr/read/data -u instruct; /home/solr/bin/snapinstaller -M util01 -S /home/solr/write/logs -d /home/solr/read/data -u instruct
Here's what a successful update/commit log looks like:
[14:13:02.510] start commit(optimize=false,waitFlush=false,waitSearcher=true)
[14:13:02.522] Opening Searcher@e9b4bb main
[14:13:02.524] end_commit_flush
[14:13:02.525] autowarming Searcher@e9b4bb main from Searcher@159e6e8 main
[14:13:02.525] filterCache{lookups=1809739,hits=1766607,hitratio=0.97,inserts=43211,evictions=0,size=43154,cumulative_lookups=1809739,cumulative_hits=1766607,cumulative_hitratio=0.97,cumulative_inserts=43211,cumulative_evictions=0}
--
[14:15:42.372] {commit=} 0 159964
[14:15:42.373] /update 0 159964
Here's what an unsuccessful update/commit log looks like, where the /update took too long and we started another commit:
[21:03:03.829] start commit(optimize=false,waitFlush=false,waitSearcher=true)
[21:03:03.836] Opening Searcher@b2f2d6 main
[21:03:03.836] end_commit_flush
[21:03:03.836] autowarming Searcher@b2f2d6 main from Searcher@103c520 main
[21:03:03.836] filterCache{lookups=1062196,hits=1062160,hitratio=0.99,inserts=49144,evictions=0,size=48353,cumulative_lookups=259485564,cumulative_hits=259426904,cumulative_hitratio=0.99,cumulative_inserts=68467,cumulative_evictions=0}
--
[21:23:04.794] start commit(optimize=false,waitFlush=false,waitSearcher=true)
[21:23:04.794] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
[21:23:04.802] Opening Searcher@f11bc main
[21:23:04.802] end_commit_flush
--
[21:24:55.987] {commit=} 0 1312158
[21:24:55.987] /update 0 1312158
I don't understand why this sometimes takes two minutes between the start commit and the /update, and sometimes takes 20 minutes. One of our caches has about ~40,000 items, but I can't imagine it taking 20 minutes to autowarm a searcher. It would be super handy if the snapinstaller script would wait until the previous one was done before starting a new one, but I'm not sure how to make that happen. Thanks for any help with this. best, cloude -- VP of Product Development Instructables.com http://www.instructables.com/member/lebowski
Re: Not able to configure multicore
Hm, where does that /solr2 come from? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: mitulpatel mitulpa...@greymatterindia.com To: solr-user@lucene.apache.org Sent: Wednesday, March 25, 2009 12:30:11 AM Subject: Re: Not able to configure multicore hossman wrote: : I am facing a problem related to multiple cores configuration. I have placed : a solr.xml file in the solr.home directory. Even so, when I am trying to : access http://localhost:8983/solr/admin/cores it gives me a tomcat error. : : Can anyone tell me what the possible issue with this can be?? not without knowing exactly what the tomcat error message is, what your solr.xml file looks like, what log messages you see on startup, etc... -Hoss Hello Hoss, Thanks for the reply. Here is the error message shown in the browser:
HTTP Status 404 - /solr2/admin/cores
type Status report
message /solr2/admin/cores
description The requested resource (/solr2/admin/cores) is not available.
and here is the solr.xml file.
Re: Hardware Questions...
Ah, it's hard to tell. I look at index size on disk, number of docs, query rate, types of queries, etc. Are you actually seeing problems with your existing servers? Or seeing specific performance movement in one of the aspects? (e.g. increasing latency, increased GC or memory usage, increased disk IO) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: solr s...@highbeam.com To: solr-user@lucene.apache.org Sent: Tuesday, March 24, 2009 4:51:50 PM Subject: Hardware Questions... We have three Solr servers (several two-processor Dell PowerEdge servers). I'd like to get three newer servers and I wanted to see what we should be getting. I'm thinking the following:
Dell PowerEdge 2950 III 2x2.33GHz/12M 1333MHz Quad Core
16GB RAM
6 x 146GB 15K RPM RAID-5 drives
How do people spec out servers, especially CPU, memory and disk? Is this all based on the number of docs, indexes, etc.? Also, what are people using for benchmarking and monitoring Solr? Thanks - Mike
Re: Snapinstaller + Overlapping onDeckSearchers Problems
I don't understand why this sometimes takes two minutes between the start commit /update and sometimes takes 20 minutes? One of our caches has about ~40,000 items, but I can't imagine it taking 20 minutes to autowarm a searcher. What do your cache configs look like? How big is the autowarm count? If you have:
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
that will run 32 queries when solr starts. Are you running 40K queries when it starts? ryan
Re: Snapinstaller + Overlapping onDeckSearchers Problems
Yes, I guess I'm running 40k queries when it starts :) I didn't know that each count was equal to a query. I thought it was just copying the cache entries from the previous searcher, but I guess that wouldn't include new entries. I set it to the size of our filterCache. What should I set the autowarmCount to if I want to try and fill up the caches?
lookups : 8720372
hits : 8676170
hitratio : 0.99
inserts : 44551
evictions : 0
size : 44417
cumulative_lookups : 8720372
cumulative_hits : 8676170
cumulative_hitratio : 0.99
cumulative_inserts : 44551
cumulative_evictions : 0
best, cloude -- VP of Product Development Instructables.com http://www.instructables.com/member/lebowski
Strange anomaly(?) with string matching in query
Hello, We've encountered a strange issue in our Solr install regarding a particular string that just doesn't seem to want to return results, despite the exact same string being in the index. What makes it even stranger is that we had the same data in a previous install of Solr, and it worked there, but doesn't here. The string that's been showing the trouble is "Abilene Christian College -- Students -- Yearbooks". The field, in this case, is of type text. Strangely enough, when we search for "Abilene Christian College -- Students --", the relevant documents are returned. It just fails when the full string is specified. At this point, I'm a little bit stymied. Any suggestions or ideas would be highly appreciated. In order to possibly help with diagnosis, I'm including links to, hopefully, relevant outputs and configurations. We're using Solr version 1.3. This is the output of a search for the string, with debugQuery turned on: http://pastebin.com/f72c017c1 This is the output of a document containing the string in question (the field is dc_subject): http://pastebin.com/f17a2e722 Here is our current schema: http://pastebin.com/f2768bece If there's any more information or diagnostics that I can post or run, please let me know. Thanks for your help and suggestions. -Kurt
Re: speeding up indexing with a LOT of indexed fields
Thanks for the quick reply. The box has 8 real cpus. Perhaps a good idea then to reduce the nr of cores to 8 as well. I'm testing out a different scenario with multiple boxes as well, where clients persist docs to multiple cores on multiple boxes (which is what multicore was invented for, after all). I set maxBufferedDocs this low (instead of ramBufferSizeMB) because I was worried about the impact on ram and wanted to get a grip on when docs were persisted to disk. I'm still not sure if it matters much to the big amounts of ram consumed. This can't all be coming from buffering docs, can it? On the other hand, maxBufferedDocs (20) is set for each core, so in total the nr of buffered docs is at max 200. Of course still on the low side, but I've got some draconian docs here.. ;-) I will try to use ramBufferSizeMB and set it higher, but I first have to get a grip on why ram usage is maxed all the time before this will make any difference, I guess. Thanks and please keep the suggestions coming. Britske.
Re: Strange anomaly(?) with string matching in query
Hi, Take the whole string to your Solr Admin - Analysis page and analyze it. Does it get analyzed the way you'd expect it to be analyzed? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
REST interface for Query
Greetings, I am a new subscriber. I'm Curtis Olson and I work for CACI under contract at the U.S. Department of State, where we deal with massive quantities of documents, so Solr is ideal for us. We have a good-sized index that we are starting to build up in development. Some of the filter constraints can get reasonably complex (based upon individual users' access), and I find myself creating long query strings for selection. I like the REST interfaces for adding to the index, and wish I could create an XML document for querying. I haven't found a request handler that can do this; does one exist? Cheers, Curtis Olson, S/ES-IRM, CACI Contractor
Re: Snapinstaller + Overlapping onDeckSearchers Problems
It looks like the cache is configured big enough, but the autowarm count is too big to have good performance. Try something smaller and see if that fixes both problems. I imagine even just warming the most recent 100 queries would precache the most important ones, but try some higher numbers and see if the performance is acceptable. For the filterCache and queryCache, autowarm queries the new index and caches the results.
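A hedged sketch of that smaller autowarm in solrconfig.xml (the size echoes the ~44K cache reported above; 100 is Ryan's suggested starting point, not a tested value):

    <filterCache class="solr.LRUCache" size="50000" initialSize="50000" autowarmCount="100"/>

Warmup time scales with autowarmCount, since each warmed entry is re-executed against the new searcher, so this single number largely decides how long the on-deck searcher stays open.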
How do I accomplish this (semi-)complicated setup?
Hi list, I've finally settled on Solr, seeing as it has almost everything I could want out of the box. My setup is a complicated one. It will serve as the search backend on Bitbucket.org, a mercurial hosting site. We have literally thousands of code repositories, as well as users and other data. All this needs to be indexed. The complication comes in when we have private repositories. Only select users have access to these, but we still need to index them. How would I go about accomplishing this? I can't think of a clean way to do it. Any pointers much appreciated. Jesper
Re: REST interface for Query
Curtis, Like this? https://issues.apache.org/jira/browse/SOLR-839 Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: How do I accomplish this (semi-)complicated setup?
You could index the user name or ID, and then in your application add the username as a filter as you pass the query back to Solr. Maybe have an access_type that is Public or Private, and then for public searches only include the ones that have an access_type of Public. Eric - Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com Free/Busy: http://tinyurl.com/eric-cal
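As an illustration of Eric's suggestion (the field names allowed_users and access_type are hypothetical, not something Solr ships with): give each repository document a multi-valued allowed_users field, then have the application append a filter query per request, e.g. for user u1

    fq=access_type:Public OR allowed_users:u1

and plain fq=access_type:Public for anonymous visitors. fq restricts the result set without affecting relevance scoring, which is why it suits access filtering.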
Re: How do I accomplish this (semi-)complicated setup?
On Wed, Mar 25, 2009 at 5:57 PM, Eric Pugh ep...@opensourceconnections.com wrote: You could index the user name or ID, and then in your application add the username as a filter as you pass the query back to Solr. That makes sense. Two questions on that: 1. More than one user can have access to a repository, so how would that work? Also, if a user is added/removed, what's the best way to keep that in sync? 2. In the event that a repository that is private is made public, how easy would it be to run an UPDATE, so to speak? Jesper
Re: How do I accomplish this (semi-)complicated setup?
You can even create separate indexes for private or public access if you need to (and place them on separate machines), but I think Eric's suggestion is the best and easiest.
Re: Strange anomaly(?) with string matching in query
Otis: Okay, I'm not sure whether I should be including the quotes in the query when using the analyzer, so I've run it both ways (no quotes on the index value). I'll try to approximate the final tables returned for each term. The field is dc_subject in both cases, being of type text.
***
Version 1 (With Quotes)
Index Value: Abilene Christian College -- Students -- Yearbooks
Query Value: "Abilene Christian College -- Students -- Yearbooks"
Index final table:
  1: abilene  2: christian  3: college  4: students  5: yearbooks
Query final table:
  1: abilene  2: christian  3: college  4: students  6: yearbooks
Version 2 (Without Quotes)
Index Value: Abilene Christian College -- Students -- Yearbooks
Query Value: Abilene Christian College -- Students -- Yearbooks
Index final table:
  1: abilene  2: christian  3: college  4: students  5: yearbooks
Query final table:
  1: abilene  2: christian  3: college  4: students  5: yearbooks
***
The main difference seems to be that there is no position 5 for yearbooks when I surround the string with quotes; instead it skips to 6. This happens at the WordDelimiterFilterFactory step. It seems to me like those tokens should be returning a match, but either way, apparently they're not? Any suggestions at this point?
Re: How do I accomplish this (semi-)complicated setup?
I can't see the problem with that. You can manage your users using a DB, keep the permissions they have there, and create or erase users without problems. You just have to maintain a working index field for each user with the repositories' ids he can access. Or you can create several indexes plus a users solr index with a multi-valued field listing the indexes the user can access. If you then want to turn a private repository into a public one, you just have to change the permissions field in your DB or users' index.
RE: REST interface for Query
Otis, that very much looks like what I'm after. Curtis -----Original Message----- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Wednesday, March 25, 2009 12:53 PM To: solr-user@lucene.apache.org Subject: Re: REST interface for Query Curtis, Like this? https://issues.apache.org/jira/browse/SOLR-839 Otis
getting started
Hi, some of the Getting Started links don't work. Can you please fix them?
Re: How do I accomplish this (semi-)complicated setup?
Hm, I must be missing something, then. Consider this. There are three repositories: A, B and C. There are two users, U1 and U2. Repository A is public, while B and C are private. Only U1 can access B. No one can access C. I index this data, such that Is_Private is true for B. Now, when U2 searches, he will only see data for repo A. This is correct. When U1 searches, what happens? AFAIK, he will also only see data for A, unless we specify Is_Private:True, but then he will only see data for B (and C, which he doesn't have access to). Secondly, say we grant U2 access to B. How do we tell Solr that he can see it, then? Sorry if I'm not making much sense here, but I'm quite confused. Jesper
Re: How do I accomplish this (semi-)complicated setup?
OK, so you can create a table in a DB where you have a row for each user and a field with the reps he/she can access. Then you just have to take a look at the db and include the repository name in the index, so you just have to control (using query parameters) that the query is done against the right reps for that user. Is that good for you?
Re: Strange anomaly(?) with string matching in query
Otis, Absolutely. Here are the tokenizers and filters for the text fieldtype in the schema: http://pastebin.com/f2bb249f3 Thanks! Otis Gospodnetic wrote: That's what I suspected. Want to paste the relevant tokenizer+filters sections of your schema? The index-time and query-time analysis has to be the same or compatible enough, and that's not the case here. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: getting started
Which links? Please be as specific as possible. Erick On Wed, Mar 25, 2009 at 1:20 PM, nga pham nga.p...@gmail.com wrote: Hi Some of the getting started link dont work. Can you please enable it?
Re: getting started
Oops, my mistake. Sorry for the trouble. On Wed, Mar 25, 2009 at 10:42 AM, Erick Erickson erickerick...@gmail.com wrote: Which links? Please be as specific as possible. Erick
Can TermIndexInterval be set in Solr?
Hello all, We are experimenting with the ShingleFilter with a very large document set (1 million full-text books). Because the ShingleFilter indexes every word pair as a token, the number of unique terms increases tremendously. In our experiments so far the tii and tis files are getting very large and the tii file will eventually be too large to fit into memory. If we set the TermIndexInterval to a larger number than the default 128, the tii file size should go down. Is it possible to set this somehow through Solr configuration or do we need to modify the code somewhere and call IndexWriter.setTermIndexInterval? Tom Tom Burton-West Digital Library Production Services University of Michigan Library
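At the Lucene level, the knob Tom mentions is set per writer; a hedged sketch against the Lucene 2.4 API (the index path and the 1024 value are arbitrary examples, and wiring this into Solr's own writer creation is the part that would need the code change he anticipates):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/path/to/index"),
            new StandardAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setTermIndexInterval(1024); // default is 128; larger = smaller .tii in RAM, slower term lookups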
Re: getting started
On the page http://lucene.apache.org/solr/tutorial.html#Getting+Started, the "lucene QueryParser syntax" link is not working.
Re: Realtime Searching..
Hi Jon: We are running various LinkedIn search systems on Zoie in production. -John On Thu, Feb 19, 2009 at 9:11 AM, Jon Baer jonb...@gmail.com wrote: This part: The part of Zoie that enables real-time searchability is the fact that ZoieSystem contains three IndexDataLoader objects:
- a RAMLuceneIndexDataLoader, which is a simple wrapper around a RAMDirectory,
- a DiskLuceneIndexDataLoader, which can index directly to the FSDirectory (followed by an optimize() call if a specified optimizeDuration has been exceeded) in batches via an intermediary
- BatchedIndexDataLoader, whose primary job is to queue up and batch DataEvents that need to be flushed to disk
Sounds like it might be / can be layered into Solr somehow; has anyone been using this project or testing it? - Jon On Feb 19, 2009, at 9:44 AM, Genta Kaneyama wrote: Michael, I think you might be interested in zoie. zoie: real-time search and indexing system built on Apache Lucene http://code.google.com/p/zoie/ Zoie is a realtime search project for lucene by LinkedIn. Basically, I think it is a similar technique to Otis's trick: In the mean time you can use the trick of one large and less frequently updated core and one small and more frequently updated core + distributed search across them. Otis Genta On Sat, Feb 7, 2009 at 3:02 AM, Michael Austin mausti...@gmail.com wrote: I need to find a solution for our current social application. It's low traffic now because we are early on. However I'm expecting, and want to be prepared, to grow. We have messages of different types that are aggregated into one stream. Each of these message types has much different data, so our main queries have a few unions and many joins. I know that Solr would work great for searching, but we need a realtime system (twitter-like) to view user updates. I'm not interested in a few minutes' delay; I need something that will be fast updating and searchable and have n columns per record/document. Can solr do this? What is Ocean? Thanks
Re: getting started
OK, now I'll turn it over to the folks who actually maintain that site <G>. Meanwhile, here's the link to the 2.4.1 query syntax: http://lucene.apache.org/java/2_4_1/queryparsersyntax.html Best Erick On Wed, Mar 25, 2009 at 2:00 PM, nga pham nga.p...@gmail.com wrote: On http://lucene.apache.org/solr/tutorial.html#Getting+Started, the 'lucene QueryParser syntax' link is not working. On Wed, Mar 25, 2009 at 10:48 AM, nga pham nga.p...@gmail.com wrote: Oops, my mistake. Sorry for the trouble. On Wed, Mar 25, 2009 at 10:42 AM, Erick Erickson erickerick...@gmail.com wrote: Which links? Please be as specific as possible. Erick On Wed, Mar 25, 2009 at 1:20 PM, nga pham nga.p...@gmail.com wrote: Hi Some of the getting started links don't work. Can you please fix them?
Solr OpenBitSet OutofMemory Error
Hello, After running a nightly release of Solr from around January for about 4 weeks without any problems, I'm starting to see OutOfMemory errors: Mar 24, 2009 1:35:36 AM org.apache.solr.common.SolrException log SEVERE: java.lang.OutOfMemoryError: Java heap space at org.apache.lucene.util.OpenBitSet.clone(OpenBitSet.java:640) Is this a common error to see? I'm running a lot of faceted queries on an index with about 7.5 million documents. I'm giving Solr about 8 GB of memory. While I do update the index frequently, I also optimize frequently - it's a little strange to me that this problem is showing up now after four weeks of zero problems. Any suggestions/ideas would be very much appreciated! Thanks, Harish -- View this message in context: http://www.nabble.com/Solr-OpenBitSet-OutofMemory-Error-tp22707576p22707576.html Sent from the Solr - User mailing list archive at Nabble.com.
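As a rough back-of-the-envelope on where the memory goes (assuming each cached filter is a full OpenBitSet over the index, the worst case):

    public class FilterMemory {
        public static void main(String[] args) {
            long maxDoc = 7500000L;                    // documents in the index
            long bytesPerFilter = maxDoc / 8;          // one bit per doc: ~937,500 bytes (~0.9 MB)
            long heapBytes = 8L * 1024 * 1024 * 1024;  // 8 GB heap
            System.out.println("bytes per cached filter: " + bytesPerFilter);
            System.out.println("filters that fit in heap: " + heapBytes / bytesPerFilter); // ~9,100
        }
    }

So a large filterCache, plus overlapping searchers each holding (and, during warming, cloning) their own bitsets, can plausibly consume the whole heap even when a single searcher fits comfortably.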
Re: How do I accomplish this (semi-)complicated setup?
OK, we're getting closer. I just have two final questions regarding this then: 1. This would also include all the public repositories, right? If so, how would such a query look? Some kind of is_public:true AND ...? 2. When a repository is made public, the is_public property in the Solr index needs to reflect this. How can such an update be made without having to purge and re-index? Jesper On Wed, Mar 25, 2009 at 6:29 PM, Alejandro Gonzalez alejandrogonzalezd...@gmail.com wrote: OK, so you can create a table in a DB where you have a row for each user and a field with the reps he/she can access. Then you just have to take a look at the DB and include the repository name in the index. So you just have to control (using query parameters) that the query is done for the right reps for that user. Is it good for you? On Wed, Mar 25, 2009 at 6:20 PM, Jesper Nøhr jes...@noehr.org wrote: Hm, I must be missing something, then. Consider this. There are three repositories: A, B and C. There are two users, U1 and U2. Repository A is public, while B and C are private. Only U1 can access B. No one can access C. I index this data, such that Is_Private is true for B. Now, when U2 searches, he will only see data for repo A. This is correct. When U1 searches, what happens? AFAIK, he will also only see data for A, unless we specify Is_Private:True, but then he will only see data for B (and C, which he doesn't have access to). Secondly, say we grant U2 access to B. How do we tell Solr that he can see it, then? Sorry if I'm not making much sense here, but I'm quite confused. Jesper On Wed, Mar 25, 2009 at 6:13 PM, Alejandro Gonzalez alejandrogonzalezd...@gmail.com wrote: I can't see the problem with that. You can manage your users using a DB and keep there the permissions they have, and create or erase users without problems. You just have to maintain a working index field for each user with the ids of the repositories he can access. Or you can create several indexes and a users Solr index with a multi-valued field with the indexes the user can access. If you then want to turn a private repository into a public one, you just have to change the permissions field in your DB or users' index. On Wed, Mar 25, 2009 at 6:02 PM, Jesper Nøhr jes...@noehr.org wrote: On Wed, Mar 25, 2009 at 5:57 PM, Eric Pugh ep...@opensourceconnections.com wrote: You could index the user name or ID, and then in your application add as a filter the username as you pass the query back to Solr. Maybe have an access_type that is Public or Private, and then for public searches only include the ones that meet the access_type of Public. That makes sense. Two questions on that: 1. More than one user can have access to a repository, so how would that work? Also, if a user is added/removed, what's the best way to keep that in sync? 2. In the event that a private repository is made public, how easy would it be to run an UPDATE, so to speak? Jesper On Mar 25, 2009, at 12:52 PM, Jesper Nøhr wrote: Hi list, I've finally settled on Solr, seeing as it has almost everything I could want out of the box. My setup is a complicated one. It will serve as the search backend on Bitbucket.org, a mercurial hosting site. We have literally thousands of code repositories, as well as users and other data. All this needs to be indexed. The complication comes in when we have private repositories. Only select users have access to these, but we still need to index them. How would I go about accomplishing this? I can't think of a clean way to do it. Any pointers much appreciated. 
Jesper - Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com Free/Busy: http://tinyurl.com/eric-cal
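One way to combine Eric's access_type idea with per-user access in a single query (a sketch; is_public is from the thread above, while the multi-valued allowed_users field name is hypothetical): index both fields on every document and filter each search by the requesting user:

    q=<search terms>&fq=is_public:true OR allowed_users:U1

With that, U1 sees A (public) plus B (he is listed in allowed_users), U2 sees only A, and granting U2 access to B means re-indexing just B's documents with allowed_users set to U1 and U2 rather than purging the whole index.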
Re: Can TermIndexInterval be set in Solr?
I think it's the latter. I don't think the term interval is exposed anywhere. If you expose it through the config and provide a patch, I think we can add this to the core quickly. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Burton-West, Tom tburt...@umich.edu To: solr-user@lucene.apache.org solr-user@lucene.apache.org Cc: Farber, Phillip pfar...@umich.edu; Dueber, William dueb...@umich.edu Sent: Wednesday, March 25, 2009 1:50:17 PM Subject: Can TermIndexInterval be set in Solr? Hello all, We are experimenting with the ShingleFilter with a very large document set (1 million full-text books). Because the ShingleFilter indexes every word pair as a token, the number of unique terms increases tremendously. In our experiments so far the tii and tis files are getting very large and the tii file will eventually be too large to fit into memory. If we set the TermIndexInterval to a larger number than the default 128, the tii file size should go down. Is it possible to set this somehow through Solr configuration or do we need to modify the code somewhere and call IndexWriter.setTermIndexInterval? Tom Tom Burton-West Digital Library Production Services University of Michigan Library
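As a sketch of what exposing it could look like (this element does not exist as of this thread, so the name and placement are hypothetical), in the indexDefaults section of solrconfig.xml:

    <indexDefaults>
      <!-- hypothetical: index every 256th term instead of every 128th -->
      <termIndexInterval>256</termIndexInterval>
    </indexDefaults>

Solr would then just pass the value through to IndexWriter.setTermIndexInterval() when it opens a writer.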
Re: Realtime Searching..
Would it not make more sense to wait for Lucene's IW+IR marriage and other things happening in core Lucene that will make near-real-time search possible? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: John Wang john.w...@gmail.com To: solr-user@lucene.apache.org Sent: Wednesday, March 25, 2009 2:34:04 PM Subject: Re: Realtime Searching.. Hi Jon: We are running various LinkedIn search systems on Zoie in production. -John On Thu, Feb 19, 2009 at 9:11 AM, Jon Baer wrote: This part: The part of Zoie that enables real-time searchability is the fact that ZoieSystem contains three IndexDataLoader objects: * a RAMLuceneIndexDataLoader, which is a simple wrapper around a RAMDirectory, * a DiskLuceneIndexDataLoader, which can index directly to the FSDirectory (followed by an optimize() call if a specified optimizeDuration has been exceeded) in batches via an intermediary * BatchedIndexDataLoader, whose primary job is to queue up and batch DataEvents that need to be flushed to disk Sounds like it might/can be layered into Solr somehow; has anyone been using this project or testing it? - Jon On Feb 19, 2009, at 9:44 AM, Genta Kaneyama wrote: Michael, I think you might be interested in Zoie. zoie: real-time search and indexing system built on Apache Lucene http://code.google.com/p/zoie/ Zoie is a realtime search project for Lucene by LinkedIn. Basically, I think it is a similar technique to Otis's trick: In the meantime you can use the trick of one large and less frequently updated core and one small and more frequently updated core + distributed search across them. Otis Genta On Sat, Feb 7, 2009 at 3:02 AM, Michael Austin wrote: I need to find a solution for our current social application. It's low traffic now because we are early on.. However I'm expecting and want to be prepared to grow. We have messages of different types that are aggregated into one stream. Each of these message types has much different data, so that our main queries have a few unions and many joins. I know that Solr would work great for searching, but we need a realtime system (twitter-like) to view user updates. I'm not interested in a few minutes' delay; I need something that will be fast updating and searchable and have n columns per record/document. Can Solr do this? What is Ocean? Thanks
SRW/U and OAI-PMH servers over solr
Hello there, I'm looking for a way to implement SRW/U and OAI-PMH servers over Solr, similar to what I found here: http://marc.info/?l=solr-dev&m=116405019011211&w=2 . Well, actually, decoupled (not a plugin) would be OK, if not better =). I wanted to know if anyone knows of something available out there that accomplishes this. From what I have found so far, OCLC has both server implementations available. I haven't looked too deeply into the SRW/U one, but the OAI-PMH one can be configured to work with Solr (by implementing a class that does the actual calls to the data provider). Any information you guys can provide is welcome =). -- All the best, Miguel Coxo.
Partition index by time using Solr
Hi, I've used Lucene before, but I'm new to Solr. I've gone through the mailing list, but unable to find any clear idea on how to partition Solr indexes. Here is what we want: 1) Be able to partition indexes by timestamp - basically one partition per day (create a new index directory every day) 2) Be able to search partitions based on timestamp. All our queries are time based, so instead of looking into all the partitions I want to go directly to the partitions where the data might be. 3) Be able to purge any data older than 6 months without bringing down the application. Since partitions would be marked by timestamp, we would just have to delete the old partitions. This is going to be a distributed system with 2 boxes, each running an instance of Solr. I don't want to replicate data, but each box may have the same timestamp partition with different data. We would be indexing on average 20 million documents a day (each document = 500 bytes) with an estimated 10 GB in index size - evenly distributed across machines (each machine would get roughly 5 GB of index every day). My questions: 1) Is this all possible using Solr? If not, should I just do this using Lucene, or is there any other out-of-the-box alternative? 2) If it's possible in Solr, how do we do this - configuration, setup etc.? 3) How would I optimize the partitions - would it be required when using Solr? Thanks, -vivek
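A sketch of how the per-day partitions could work with multicore plus distributed search (core names, hosts and paths are hypothetical; check which CoreAdmin actions your Solr version supports):

    # create a new partition core each day
    curl 'http://host1:8983/solr/admin/cores?action=CREATE&name=day-20090325&instanceDir=day-20090325'

    # query only the partitions covering the time range, across both boxes
    http://host1:8983/solr/day-20090325/select?q=...&shards=host1:8983/solr/day-20090325,host2:8983/solr/day-20090325,host1:8983/solr/day-20090324,host2:8983/solr/day-20090324

Purging data older than 6 months then becomes dropping the old cores and deleting their index directories rather than deleting documents, with no downtime for the remaining partitions.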
Re: How do I accomplish this (semi-)complicated setup?
Try using the DB for permission management: when you want to make a rep public, you just have to add its id or name to every user's permissions field. I think you don't need to add any is_public field to the index, just an id or name field saying which rep the indexed doc is in. So you can pre-filter the reps by querying the DB to obtain the reps the user has permissions for, and adding these restrictions to the Solr query. This way you can change reps' permissions without re-indexing. So the query for Solr, if the current user is allowed to search in reps 1 and 2, should be something like ...rep_id:(1 OR 2)... Alex On Wed, Mar 25, 2009 at 8:06 PM, Jesper Nøhr jes...@noehr.org wrote: OK, we're getting closer. I just have two final questions regarding this then: 1. This would also include all the public repositories, right? If so, how would such a query look? Some kind of is_public:true AND ...? 2. When a repository is made public, the is_public property in the Solr index needs to reflect this. How can such an update be made without having to purge and re-index? Jesper On Wed, Mar 25, 2009 at 6:29 PM, Alejandro Gonzalez alejandrogonzalezd...@gmail.com wrote: OK, so you can create a table in a DB where you have a row for each user and a field with the reps he/she can access. Then you just have to take a look at the DB and include the repository name in the index. So you just have to control (using query parameters) that the query is done for the right reps for that user. Is it good for you? On Wed, Mar 25, 2009 at 6:20 PM, Jesper Nøhr jes...@noehr.org wrote: Hm, I must be missing something, then. Consider this. There are three repositories: A, B and C. There are two users, U1 and U2. Repository A is public, while B and C are private. Only U1 can access B. No one can access C. I index this data, such that Is_Private is true for B. Now, when U2 searches, he will only see data for repo A. This is correct. When U1 searches, what happens? AFAIK, he will also only see data for A, unless we specify Is_Private:True, but then he will only see data for B (and C, which he doesn't have access to). Secondly, say we grant U2 access to B. How do we tell Solr that he can see it, then? Sorry if I'm not making much sense here, but I'm quite confused. Jesper On Wed, Mar 25, 2009 at 6:13 PM, Alejandro Gonzalez alejandrogonzalezd...@gmail.com wrote: I can't see the problem with that. You can manage your users using a DB and keep there the permissions they have, and create or erase users without problems. You just have to maintain a working index field for each user with the ids of the repositories he can access. Or you can create several indexes and a users Solr index with a multi-valued field with the indexes the user can access. If you then want to turn a private repository into a public one, you just have to change the permissions field in your DB or users' index. On Wed, Mar 25, 2009 at 6:02 PM, Jesper Nøhr jes...@noehr.org wrote: On Wed, Mar 25, 2009 at 5:57 PM, Eric Pugh ep...@opensourceconnections.com wrote: You could index the user name or ID, and then in your application add as a filter the username as you pass the query back to Solr. Maybe have an access_type that is Public or Private, and then for public searches only include the ones that meet the access_type of Public. That makes sense. Two questions on that: 1. More than one user can have access to a repository, so how would that work? Also, if a user is added/removed, what's the best way to keep that in sync? 2. 
In the event that a private repository is made public, how easy would it be to run an UPDATE, so to speak? Jesper On Mar 25, 2009, at 12:52 PM, Jesper Nøhr wrote: Hi list, I've finally settled on Solr, seeing as it has almost everything I could want out of the box. My setup is a complicated one. It will serve as the search backend on Bitbucket.org, a mercurial hosting site. We have literally thousands of code repositories, as well as users and other data. All this needs to be indexed. The complication comes in when we have private repositories. Only select users have access to these, but we still need to index them. How would I go about accomplishing this? I can't think of a clean way to do it. Any pointers much appreciated. Jesper - Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com Free/Busy: http://tinyurl.com/eric-cal
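A SolrJ sketch of what Alejandro describes (the field name rep_id matches his example; the rep ids are assumed to have been fetched from the permissions table in the DB beforehand):

    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;

    public class AclFilter {
        // Restrict a user's search to the reps the DB says he may access.
        static SolrQuery restrict(String userInput, List<Integer> allowedReps) {
            StringBuilder fq = new StringBuilder("rep_id:(");
            for (int i = 0; i < allowedReps.size(); i++) {
                if (i > 0) fq.append(" OR ");
                fq.append(allowedReps.get(i));
            }
            fq.append(")");
            SolrQuery query = new SolrQuery(userInput);
            query.addFilterQuery(fq.toString()); // e.g. rep_id:(1 OR 2)
            return query;
        }
    }

Changing permissions is then purely a DB update; the index never has to know.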
Re: Delta import
Yes, my database is remote, MySQL 5, and I'm using Connector/J 5.1.7. My index has 2 documents. When I try to do, let's say, 14 updates, it takes about 18 sec total. Here's the resulting log of the operation: 2009-03-25 15:53:57 org.apache.solr.handler.dataimport.JdbcDataSource$1 call INFO: Time taken for getConnection(): 411 2009-03-25 15:53:59 org.apache.solr.handler.dataimport.DocBuilder collectDelta INFO: Completed ModifiedRowKey for Entity: profil rows obtained : 14 2009-03-25 15:53:59 org.apache.solr.handler.dataimport.DocBuilder collectDelta INFO: Completed DeletedRowKey for Entity: profil rows obtained : 0 2009-03-25 15:53:59 org.apache.solr.handler.dataimport.DocBuilder collectDelta INFO: Completed parentDeltaQuery for Entity: profil 2009-03-25 15:54:00 org.apache.solr.core.SolrDeletionPolicy onInit INFO: SolrDeletionPolicy.onInit: commits:num=1 commit{dir=/home/solr-tomcat/solr/data/index,segFN=segments_sb,version=1237322897338,generation=1019,filenames=[_uj.frq, _uj.fdx, _uj.tii, _uj.nrm, _uj.tis, _uj.fnm, _uj.prx, segments_sb, _uj.fdt] 2009-03-25 15:54:00 org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: last commit = 1237322897338 2009-03-25 15:54:13 org.apache.solr.handler.dataimport.DocBuilder doDelta INFO: Delta Import completed successfully BOTTLE NECK 2009-03-25 15:54:13 org.apache.solr.handler.dataimport.DocBuilder commit INFO: Full Import completed successfully 2009-03-25 15:54:13 org.apache.solr.update.DirectUpdateHandler2 commit INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true) 2009-03-25 15:54:15 org.apache.solr.core.SolrDeletionPolicy onCommit INFO: SolrDeletionPolicy.onCommit: commits:num=2 commit{dir=/home/solr-tomcat/solr/data/index,segFN=segments_sb,version=1237322897338,generation=1019,filenames=[_uj.frq, _uj.fdx, _uj.tii, _uj.nrm, _uj.tis, _uj.fnm, _uj.prx, segments_sb, _uj.fdt] commit{dir=/home/solr-tomcat/solr/data/index,segFN=segments_sc,version=1237322897339,generation=1020,filenames=[_ul.prx, _ul.fnm, _ul.tii, _ul.fdt, _ul.nrm, _ul.fdx, _ul.tis, _ul.frq, segments_sc] 2009-03-25 15:54:15 org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: last commit = 1237322897339 2009-03-25 15:54:15 org.apache.solr.search.SolrIndexSearcher init INFO: Opening searc...@3da850 main When I do a full-import it is much faster. Takes about 1 min to index 2 documents. I tried to play a bit with the config but nothing seems to work for the moment. What I want to do is pretty interactive: my production DB has 1.2M documents and must be able to delta-import around 2k updates every 5 min. Is it possible to reach those kinds of numbers with the DataImportHandler? Shalin Shekhar Mangar wrote: On Wed, Mar 25, 2009 at 2:25 AM, AlexxelA alexandre.boudrea...@canoe.ca wrote: OK, I'm OK with the fact that Solr is going to do X requests to the database for X updates.. but when I try to run the delta-import command with 2 rows to update, is it normal that it's really slow, ~1 document fetched/sec? Not really, I've seen 1000x faster. Try firing a few of those queries on the database directly. Are they slow? Is the database remote? -- Regards, Shalin Shekhar Mangar. -- View this message in context: http://www.nabble.com/Delta-import-tp22663196p22710222.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Snapinstaller + Overlapping onDeckSearchers Problems
I set the autowarm to 2000, which only takes about two minutes and resolves my issues. Thanks for your help! best, cloude On Wed, Mar 25, 2009 at 9:34 AM, Ryan McKinley ryan...@gmail.com wrote: It looks like the cache is configured big enough, but the autowarm count is too big to have good performance. Try something smaller and see if that fixes both problems. I imagine even just warming the most recent 100 queries would precache the most important ones, but try some higher numbers and see if the performance is acceptable. For the filterCache and queryCache, autowarm queries the new index and caches the results. On Mar 25, 2009, at 11:48 AM, Cloude Porteus wrote: Yes, I guess I'm running 40k queries when it starts :) I didn't know that each count was equal to a query. I thought it was just copying the cache entries from the previous searcher, but I guess that wouldn't include new entries. I set it to the size of our filterCache. What should I set the autowarmCount to if I want to try and fill up the caches? lookups : 8720372 hits : 8676170 hitratio : 0.99 inserts : 44551 evictions : 0 size : 44417 cumulative_lookups : 8720372 cumulative_hits : 8676170 cumulative_hitratio : 0.99 cumulative_inserts : 44551 cumulative_evictions : 0 best, cloude On Wed, Mar 25, 2009 at 8:38 AM, Ryan McKinley ryan...@gmail.com wrote: I don't understand why this sometimes takes two minutes between the start commit /update and sometimes takes 20 minutes? One of our caches has about ~40,000 items, but I can't imagine it taking 20 minutes to autowarm a searcher. What do your cache configs look like? How big is the autowarm count? If you have: <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/> that will run 32 queries when Solr starts. Are you running 40K queries when it starts? ryan -- VP of Product Development Instructables.com http://www.instructables.com/member/lebowski -- VP of Product Development Instructables.com http://www.instructables.com/member/lebowski
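For anyone tuning the same thing, the change boils down to one attribute in solrconfig.xml (the size values here are illustrative, not a recommendation):

    <filterCache class="solr.LRUCache" size="45000" initialSize="45000" autowarmCount="2000"/>

Since autowarming re-executes each warmed entry against the new searcher, autowarmCount is effectively 'how many queries run at every commit', so it trades commit/warm-up latency against cold-cache misses.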
Re: SRW/U and OAI-PMH servers over solr
I implemented OAI-PMH for Solr a few years back for the Massachusetts library system... it appears not to be running right now, but check... http://www.digitalcommonwealth.org/ It would be great to get that code revived and live as open source somewhere. As is, it uses a pre-1.3 release that was patched to support modifiable fields. (If I did it again, I would suggest keeping a parallel SQL database for some of this stuff.) ryan On Mar 25, 2009, at 3:30 PM, Miguel Coxo wrote: Hello there, I'm looking for a way to implement SRW/U and OAI-PMH servers over Solr, similar to what I found here: http://marc.info/?l=solr-dev&m=116405019011211&w=2 . Well, actually, decoupled (not a plugin) would be OK, if not better =). I wanted to know if anyone knows of something available out there that accomplishes this. From what I have found so far, OCLC has both server implementations available. I haven't looked too deeply into the SRW/U one, but the OAI-PMH one can be configured to work with Solr (by implementing a class that does the actual calls to the data provider). Any information you guys can provide is welcome =). -- All the best, Miguel Coxo.
large index vs multicore
Hi All, In my project I have one primary core containing all the basic information for a product. Now I need to add additional information which will be searched and displayed in conjunction with the product results. My question is: from a design and query speed point of view, should I add a new core to handle the additional data, or should I add the data to the existing core? The data size is not very large, around 150,000 - 200,000 documents. Any insights into this will be helpful. Thanks, Kalyan Manepalli
solr_hostname in scripts.conf
I've a question. Is it safe to use 'localhost' as solr_hostname in scripts.conf? -- -Tim
Re: get all facets
Actually what I meant was: if there are 100 indexed fields, then there are 100 facet fields, right? So whenever I create a SolrQuery, I have to call addFacetField(fieldName) for each one. Can I avoid this and just get all facet fields? Sorry for the confusion. Thanks again, Ashish Shalin Shekhar Mangar wrote: On Wed, Mar 25, 2009 at 7:30 AM, Ashish P ashish.ping...@gmail.com wrote: Can I get all the facets in QueryResponse?? You can get all the facets that are returned by the server. Set facet.limit to the number of facets you want to retrieve. See http://lucene.apache.org/solr/api/solrj/org/apache/solr/client/solrj/SolrQuery.html#setFacetLimit(int) -- Regards, Shalin Shekhar Mangar. -- View this message in context: http://www.nabble.com/get-all-facets-tp22693809p22714256.html Sent from the Solr - User mailing list archive at Nabble.com.
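One way to avoid hard-coding 100 addFacetField calls (a sketch with SolrJ; this assumes discovering the field names via a Luke request is acceptable, and that you really do want to facet on every field):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.request.LukeRequest;
    import org.apache.solr.client.solrj.response.LukeResponse;

    public class AllFacets {
        static SolrQuery facetOnEverything(SolrServer server)
                throws SolrServerException, java.io.IOException {
            // Ask the server which fields exist instead of listing them by hand.
            LukeResponse luke = new LukeRequest().process(server);
            SolrQuery q = new SolrQuery("*:*");
            q.setFacet(true);
            for (String field : luke.getFieldInfo().keySet()) {
                q.addFacetField(field);
            }
            return q;
        }
    }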
Re: large index vs multicore
My question is: from a design and query speed point of view, should I add a new core to handle the additional data or should I add the data to the existing core? Do you ever need to get results from both sets of data in the same query? If so, putting them in the same index will be faster. If every query is always limited to results within one set or the other -- and the doc count is not huge, then the choice of single core vs multi core is more about what you are more comfortable managing than it is about query speeds. Advantages of multicore: - the distinct data is in different indexes, so you can maintain them independently (perhaps one data set never changes and the other changes often) Advantages of single core (with multiple data sets): - everything is in one place - replicate / load balance a single index rather than multiple. ryan
Re: large index vs multicore
Hi, Without knowing the details, I'd say keep it in the same index if the additional information shares some/enough fields with the main product data, and keep it separate if it's sufficiently distinct (this also means 2 queries and manual merging/joining). Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Manepalli, Kalyan kalyan.manepa...@orbitz.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Wednesday, March 25, 2009 5:46:40 PM Subject: large index vs multicore Hi All, In my project I have one primary core containing all the basic information for a product. Now I need to add additional information which will be searched and displayed in conjunction with the product results. My question is: from a design and query speed point of view, should I add a new core to handle the additional data, or should I add the data to the existing core? The data size is not very large, around 150,000 - 200,000 documents. Any insights into this will be helpful. Thanks, Kalyan Manepalli
Re: Solr OpenBitSet OutofMemory Error
Hi, I'm not sure if anyone will be able to help without more detail. First suggestion would be to look at Solr with a debugger/profiler to see where memory is used up. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: smock harish.agar...@gmail.com To: solr-user@lucene.apache.org Sent: Wednesday, March 25, 2009 2:37:26 PM Subject: Solr OpenBitSet OutofMemory Error Hello, After running a nightly release of Solr from around January for about 4 weeks without any problems, I'm starting to see OutOfMemory errors: Mar 24, 2009 1:35:36 AM org.apache.solr.common.SolrException log SEVERE: java.lang.OutOfMemoryError: Java heap space at org.apache.lucene.util.OpenBitSet.clone(OpenBitSet.java:640) Is this a common error to see? I'm running a lot of faceted queries on an index with about 7.5 million documents. I'm giving Solr about 8 GB of memory. While I do update the index frequently, I also optimize frequently - it's a little strange to me that this problem is showing up now after four weeks of zero problems. Any suggestions/ideas would be very much appreciated! Thanks, Harish -- View this message in context: http://www.nabble.com/Solr-OpenBitSet-OutofMemory-Error-tp22707576p22707576.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Delta import
Hi Alex, you may be able to use CachedSqlEntityProcessor. You can do a delta-import using full-import: http://wiki.apache.org/solr/DataImportHandlerFaq#fullimportdelta The inner entity can use a CachedSqlEntityProcessor. On Thu, Mar 26, 2009 at 1:45 AM, AlexxelA alexandre.boudrea...@canoe.ca wrote: Yes, my database is remote, MySQL 5, and I'm using Connector/J 5.1.7. My index has 2 documents. When I try to do, let's say, 14 updates, it takes about 18 sec total. Here's the resulting log of the operation: 2009-03-25 15:53:57 org.apache.solr.handler.dataimport.JdbcDataSource$1 call INFO: Time taken for getConnection(): 411 2009-03-25 15:53:59 org.apache.solr.handler.dataimport.DocBuilder collectDelta INFO: Completed ModifiedRowKey for Entity: profil rows obtained : 14 2009-03-25 15:53:59 org.apache.solr.handler.dataimport.DocBuilder collectDelta INFO: Completed DeletedRowKey for Entity: profil rows obtained : 0 2009-03-25 15:53:59 org.apache.solr.handler.dataimport.DocBuilder collectDelta INFO: Completed parentDeltaQuery for Entity: profil 2009-03-25 15:54:00 org.apache.solr.core.SolrDeletionPolicy onInit INFO: SolrDeletionPolicy.onInit: commits:num=1 commit{dir=/home/solr-tomcat/solr/data/index,segFN=segments_sb,version=1237322897338,generation=1019,filenames=[_uj.frq, _uj.fdx, _uj.tii, _uj.nrm, _uj.tis, _uj.fnm, _uj.prx, segments_sb, _uj.fdt] 2009-03-25 15:54:00 org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: last commit = 1237322897338 2009-03-25 15:54:13 org.apache.solr.handler.dataimport.DocBuilder doDelta INFO: Delta Import completed successfully BOTTLE NECK 2009-03-25 15:54:13 org.apache.solr.handler.dataimport.DocBuilder commit INFO: Full Import completed successfully 2009-03-25 15:54:13 org.apache.solr.update.DirectUpdateHandler2 commit INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true) 2009-03-25 15:54:15 org.apache.solr.core.SolrDeletionPolicy onCommit INFO: SolrDeletionPolicy.onCommit: commits:num=2 commit{dir=/home/solr-tomcat/solr/data/index,segFN=segments_sb,version=1237322897338,generation=1019,filenames=[_uj.frq, _uj.fdx, _uj.tii, _uj.nrm, _uj.tis, _uj.fnm, _uj.prx, segments_sb, _uj.fdt] commit{dir=/home/solr-tomcat/solr/data/index,segFN=segments_sc,version=1237322897339,generation=1020,filenames=[_ul.prx, _ul.fnm, _ul.tii, _ul.fdt, _ul.nrm, _ul.fdx, _ul.tis, _ul.frq, segments_sc] 2009-03-25 15:54:15 org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: last commit = 1237322897339 2009-03-25 15:54:15 org.apache.solr.search.SolrIndexSearcher init INFO: Opening searc...@3da850 main When I do a full-import it is much faster. Takes about 1 min to index 2 documents. I tried to play a bit with the config but nothing seems to work for the moment. What I want to do is pretty interactive: my production DB has 1.2M documents and must be able to delta-import around 2k updates every 5 min. Is it possible to reach those kinds of numbers with the DataImportHandler? Shalin Shekhar Mangar wrote: On Wed, Mar 25, 2009 at 2:25 AM, AlexxelA alexandre.boudrea...@canoe.ca wrote: OK, I'm OK with the fact that Solr is going to do X requests to the database for X updates.. but when I try to run the delta-import command with 2 rows to update, is it normal that it's really slow, ~1 document fetched/sec? Not really, I've seen 1000x faster. Try firing a few of those queries on the database directly. Are they slow? Is the database remote? -- Regards, Shalin Shekhar Mangar. 
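A sketch of that FAQ trick (table, column and entity names are hypothetical): run command=full-import&clean=false, let the outer query select only the rows changed since the last run, and serve the inner entity out of an in-memory cache instead of one DB round-trip per row:

    <entity name="profil" pk="id"
            query="select id, name from profil
                   where last_modified &gt; '${dataimporter.last_index_time}'">
      <!-- loaded once, then joined in memory for each outer row -->
      <entity name="detail" processor="CachedSqlEntityProcessor"
              query="select profil_id, data from profil_detail"
              where="profil_id=profil.id"/>
    </entity>

Cutting out the per-row queries to the remote database is what usually removes the kind of stall marked BOTTLE NECK in the log above.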
-- View this message in context: http://www.nabble.com/Delta-import-tp22663196p22710222.html Sent from the Solr - User mailing list archive at Nabble.com. -- --Noble Paul
Re: Not able to configure multicore
Actually solr2 is an application other than the default one (example) on which I have configured my application. Let me explain things in more detail: my application path is http://localhost:8983/solr2/admin and I would like to configure it for multiple cores, so I have placed solr.xml in the config directory, containing the following: <solr persistent="true" sharedLib="lib"> <cores adminPath="/admin/cores"> <core name="core0" instanceDir="core0"/> <core name="core1" instanceDir="core1"/> </cores> </solr> But when I try to access the following: http://localhost:8983/solr2/admin/cores it gives me a Tomcat 404 error. Thanks, Mitul Patel. Otis Gospodnetic wrote: Hm, where does that /solr2 come from? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: mitulpatel mitulpa...@greymatterindia.com To: solr-user@lucene.apache.org Sent: Wednesday, March 25, 2009 12:30:11 AM Subject: Re: Not able to configure multicore hossman wrote: : I am facing a problem related to multiple cores configuration. I have placed : a solr.xml file in the solr.home directory. Even though, when I am trying to : access http://localhost:8983/solr/admin/cores it gives me a Tomcat error. : : Can anyone tell me what the possible issue with this can be?? not without knowing exactly what the tomcat error message is, what your solr.xml file looks like, what log messages you see on startup, etc... -Hoss Hello Hoss, Thanks for the reply. Here is the error message shown in the browser: HTTP Status 404 - /solr2/admin/cores type Status report message /solr2/admin/cores description The requested resource (/solr2/admin/cores) is not available. and here is the solr.xml file. -- View this message in context: http://www.nabble.com/Not-able-to-configure-multicore-tp22682691p22695098.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://www.nabble.com/Not-able-to-configure-multicore-tp22682691p22715876.html Sent from the Solr - User mailing list archive at Nabble.com.
Scheduling DIH
Hello, Is there a best way to schedule the DataImportHandler? The idea being to schedule a delta-import every Sunday morning at 7am, or perhaps every hour, without human intervention. Writing a cron job to do this wouldn't be difficult; I'm just wondering: is this a built-in feature? Tricia
Re: Scheduling DIH
Right now a cron job is the only option. Building this into DIH has been a common request. What do others think about this? On Thu, Mar 26, 2009 at 10:11 AM, Tricia Williams williams.tri...@gmail.com wrote: Hello, Is there a best way to schedule the DataImportHandler? The idea being to schedule a delta-import every Sunday morning at 7am, or perhaps every hour, without human intervention. Writing a cron job to do this wouldn't be difficult; I'm just wondering: is this a built-in feature? Tricia -- --Noble Paul
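In the meantime, the cron job really is a one-liner (hostname, port and handler path depend on your setup; this assumes DIH is registered at /dataimport):

    # delta-import every Sunday morning at 7am
    0 7 * * 0 curl -s 'http://localhost:8983/solr/dataimport?command=delta-import' > /dev/null
    # or every hour, on the hour
    0 * * * * curl -s 'http://localhost:8983/solr/dataimport?command=delta-import' > /dev/null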