Where is NGramFilter?
Hi. On the Sunspot (a Ruby Solr client) Wiki (https://github.com/outoftime/sunspot/wiki/Matching-substrings-in-fulltext-search) it says that the NGramFilter should allow substring indexing. As I never got it working, I searched a bit and found this site: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory There is only EdgeNGramFilterFactory listed (which I got working for prefix indexing), but no NGramFilterFactory. Is that filter not supported anymore, or is that list not up to date? Is there an alternative filter for getting substring searching working? Best regards, Kai
High CPU usage
Hello, We have been running read-only Solr instances for a few months now. Yesterday I noticed high CPU usage coming from the JVM; it is simply using 100% of the CPU for no apparent reason. Nothing was changed, and we are using Jetty as the servlet container for Solr. Where can I start looking for the cause? It has been using 100% CPU for almost 24 hours now. Thanks, Erez.
Re: General question about Solr Caches
Hi Hoss, Ok, that makes much more sense now. I was under the impression that values were copied as well, which seemed a bit odd... unless you have to deal with a use case similar to yours. :) Cheers, - Savvas

On 9 February 2011 02:25, Chris Hostetter hossman_luc...@fucit.org wrote:

: In my understanding, the Current Index Searcher uses a cache instance and when a New Index Searcher is registered a new cache instance is used which is also auto-warmed. However, what happens when the New Index Searcher is a view of an index which has been modified? If the entries contained in the old cache are copied during auto-warming to the new cache, wouldn't that new cache contain invalid entries?

a) I'm not sure what you mean by "view of an index which has been modified" ... except for the first time an index is created, an IndexSearcher always contains a view of an index which has been modified -- the view that the IndexSearcher represents is entirely consistent and doesn't change as documents are added/removed - that's why a new Searcher needs to be opened.

b) Entries are not copied during autowarming. The *keys* of the entries in the old cache are used to warm the new cache -- using the new searcher to generate new values. (Caveat: if you have a custom cache, you could write a custom cache regenerator that did copy the values from the old cache verbatim -- I have done that in special cases where the type of object I was caching didn't vary based on the IndexSearcher -- or did vary, but in such a way that I could use the new Searcher to determine a cheap piece of information and, based on the result, either reuse an old value that was expensive to compute or recompute it using the new Searcher ... but none of the default cache regenerators for the stock Solr caches work this way.)

-Hoss
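For reference, autowarming is configured per cache in solrconfig.xml via the autowarmCount attribute; a minimal sketch (the sizes and counts below are illustrative, not recommendations):

  <!-- autowarmCount = how many keys from the old cache are replayed against the
       newly opened searcher to regenerate fresh values -->
  <query>
    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>
    <!-- the document cache cannot be autowarmed, since internal doc ids change between searchers -->
    <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  </query>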
IndexOutOfBoundsException
Hi, we have a problem with our Solr test instance. This instance is running 90 cores with about 2 GB of index data per core. This worked fine for a few weeks. Now we get an exception when querying data from one core:

java.lang.IndexOutOfBoundsException: Index: 104, Size: 11
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:288)
at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:277)
at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:86)
at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:129)
at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:160)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:211)
at org.apache.lucene.index.TermInfosReader.terms(TermInfosReader.java:277)
at org.apache.lucene.index.SegmentReader.terms(SegmentReader.java:961)
at org.apache.lucene.index.DirectoryReader$MultiTermEnum.<init>(DirectoryReader.java:989)
at org.apache.lucene.index.DirectoryReader.terms(DirectoryReader.java:626)
at org.apache.solr.search.SolrIndexReader.terms(SolrIndexReader.java:302)
at org.apache.lucene.search.PrefixTermEnum.<init>(PrefixTermEnum.java:41)
at org.apache.lucene.search.PrefixQuery.getEnum(PrefixQuery.java:45)
at org.apache.lucene.search.MultiTermQuery$ConstantScoreAutoRewrite.rewrite(MultiTermQuery.java:227)
at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:382)
at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:438)
at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:311)
at org.apache.lucene.search.Query.weight(Query.java:98)
at org.apache.lucene.search.Searcher.createWeight(Searcher.java:230)
at org.apache.lucene.search.Searcher.search(Searcher.java:171)
at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:988)
at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884)
at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
...

All other cores are working fine with the same schema. This problem only occurs when querying for specific data, like q=fieldA:valueA%20AND%20fieldB:valueB. With the following query, data is returned: q=*:*. Has anybody any suggestions on what is causing this problem? Are 90 cores too many for a single Solr instance? Thanks in advance, Dominik
Maintain stopwords.txt and synonyms.txt
Hello everyone, I am currently developing a search solution based on Apache Solr. I have the problem that I want to offer the user the possibility to maintain synonyms and stopwords in a user-friendly tool, but so far I could not find any way to write the stopwords.txt or synonyms.txt. Are there any other solutions? Currently I have some ideas on how to handle it:
1. Implement another SynonymFilterFactory that allows other data sources like databases. I have already seen approaches for that, but no solutions yet.
2. Implement a file-writer request handler to write the stopwords.txt.
Are there other solutions which are maybe already implemented? Thanks and best regards, Timo

Timo Schmidt, Developer (Diplom Informatiker FH), AOE media GmbH, Borsigstr. 3, 65205 Wiesbaden, Germany, e-Mail: timo.schm...@aoemedia.de, Web: http://www.aoemedia.de/
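For context, both files are plain text files referenced from the analyzer chains in schema.xml, which is why an external tool that rewrites them also has to trigger a core reload before the changes take effect; a minimal sketch (field type name and filter order are illustrative):

  <!-- stopwords.txt and synonyms.txt live in the core's conf/ directory and are
       only re-read when the core is reloaded or restarted -->
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>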
Re: Maintain stopwords.txt and synonyms.txt
Timo, On Wed, Feb 9, 2011 at 11:07 AM, Timo Schmidt timo.schm...@aoemedia.de wrote:

: But currently I could not find any possibility to write the stopwords.txt or synonyms.txt.

What about writing the files from an external application and reloading your Solr core? That seems to be the simplest way to solve your problem, no? Regards Stefan
AW: Maintain stopwords.txt and synonyms.txt
Hi Stefan, I already thought about that - maybe some PHP service or something like that. But this would mean that I need additional software on that server, like a normal Apache installation, which needs to be maintained. That's why I thought a solution that is built into Solr would be nice. Thanks, Timo Schmidt (AOE media GmbH)
Re: Maintain stopwords.txt and synonyms.txt
Hi Timo, of course - that's right. You could write some JSP (I guess) which could be integrated into the already existing Jetty/Tomcat server? Just wondering: how do you perform search requests to Solr? Normally there is already some other service running which acts as a 'proxy' to the outside world? ;) Regards Stefan
Re: Nutch and Solr search on the fly
The parsed data is only sent to the Solr index if you tell a segment to be indexed: solrindex crawldb linkdb segment. If you did this only once after injecting, and then ran the subsequent fetch, parse, update, index sequence, then you of course only see those URLs. If you don't index a segment after it has been parsed, you need to do it later on.

On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:

Hi all, I am a newbie to Nutch and Solr. Well, relatively much newer to Solr than Nutch :) I have been using Nutch for the past two weeks, and I wanted to know if I can query or search my Nutch crawls on the fly (before they complete). I am asking this because the websites I am crawling are really huge and it takes around 3-4 days for a crawl to complete. I want to analyze some quick results while the Nutch crawler is still crawling the URLs. Someone suggested that Solr would make this possible. I followed the steps in http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. With this process, I see only the injected URLs in the Solr search. I know I did something really foolish and the crawl never happened; I feel I am missing some information here. I think somewhere in the process there should be a crawl happening and I missed it. Just wanted to see if someone could help me point this out and where I went wrong in the process. Forgive my foolishness and thanks for your patience. Cheers, Abi

-- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: [WKT] Spatial Searching
The show stopper for JTS is its license, unfortunately. Otherwise, I think it would be done already! We could, since it's LGPL, make it an optional dependency, assuming someone can stub it out.

On Feb 8, 2011, at 11:18 PM, Adam Estrada wrote:

I just came across a ~nudge post over in the SIS list on what the status is for that project. This got me looking more into spatial mods with Solr 4.0. I found this enhancement in Jira: https://issues.apache.org/jira/browse/SOLR-2155. In this issue, David mentions that he's already integrated JTS into Solr 4.0 for querying on polygons stored as WKT. It's relatively easy to get WKT strings into Solr, but does the field type exist yet? Is there a patch or something that I can test out? Here's how I would do it using GDAL/OGR and the already existing CSV update handler: http://www.gdal.org/ogr/drv_csv.html

ogr2ogr -f CSV output.csv input.shp -lco GEOMETRY=AS_WKT

This converts a shapefile to a CSV with the geometries intact in the form of WKT. You can then get the data into Solr by running the following command:

curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,attr1,attr2,attr3,geom&stream.file=C:\tmp\output.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"

There are lots of flavors of geometries, so I suspect that this will be a daunting task, but because JTS recognizes each geometry type it should be possible to work with them. Does anyone know of a patch or even when this functionality might be included in Solr 4.0? I need to query for polygons ;-) Thanks, Adam

-- Grant Ingersoll http://www.lucidimagination.com/
Re: [WKT] Spatial Searching
How could I stub this out, not being a Java guy? What is needed in order to do this? Licensing is always going to be an issue with JTS, which is why I am interested in the project SIS sitting in incubation right now. I'm willing to put forth the effort if I had a little direction from the peanut gallery ;-) Adam
AW: Maintain stopwords.txt and synonyms.txt
Yes we have something, but on another machine. Timo Schmidt (AOE media GmbH)
Re: [WKT] Spatial Searching
Thought I would share this on web mapping...it's a great write up and something to consider when talking about working with spatial data. http://www.tokumine.com/2010/09/20/gis-data-payload-sizes/ Adam
Re: Where is NGramFilter?
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory There is only EdgeNGramFilterFactory listed (which I got working for prefix indexing), but no NGramFilterFactory. Is that filter not supported anymore, or is that list not up to date?

It should be there. Here is the javadoc for it: https://hudson.apache.org/hudson/job/Solr-trunk/javadoc/org/apache/solr/analysis/NGramFilterFactory.html Anyone who has an account can update the wiki. Contributions are welcome! Koji -- http://www.rondhuit.com/en/
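For reference, a field type that uses NGramFilterFactory for substring matching might look like the following sketch in schema.xml (the type name and gram sizes are just examples; indexing all n-grams can grow the index considerably, so keep maxGramSize modest):

  <!-- n-grams are generated at index time only, so a query term is matched
       as-is against the indexed grams -->
  <fieldType name="text_substring" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>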
Re: [WKT] Spatial Searching
Grant, How could I stub this out, not being a Java guy? What is needed in order to do this? Licensing is always going to be an issue with JTS, which is why I am interested in the project SIS sitting in incubation right now. I'm willing to put forth the effort if I had a little direction on how to implement it from the peanut gallery ;-) Adam
Re: Maintain stopwords.txt and synonyms.txt
Timo, then use cronjobs on your Solr machine to fetch the generated synonyms file, put it in the correct location, and reload the core configuration (which is required for the synonyms file to be picked up)? :) Regards Stefan
AW: IndexOutOfBoundsException
I think we had a similar exception recently when attempting to sort on a multi-valued field ... could that be possible in your case? André
Re: high cpu usage
You can try attaching jConsole to the process to see what it shows. If you're on a *nix box you can get a gross idea what's going on with top. Best Erick
AW: IndexOutOfBoundsException
No, we do not have multi-valued fields and we do not sort (in this case). We re-indexed the CSV file and the error disappeared, but it would be interesting to know why this error occurred... Thank you for your suggestion. Dominik
Re: Where is NGramFilter?
In addition to Koji's note, see the bold comment at the top of that page that says that this is not a complete list; the definitive list is always the javadocs... Best Erick
Re: TermVector query using Solr Tutorial
Hello,

On Tue, Feb 8, 2011 at 11:12 PM, Grant Ingersoll gsing...@apache.org wrote:

> It's a little hard to read due to the indentation, but AFAICT you have two terms, usb and cabl. USB appears at position 0 and cabl at position 1. Those are the relative positions to each other. Perhaps you can explain a bit more what you are trying to do?

I am searching for the keyword "25" in the field

<field name="features">30" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast</field>

I want to know the character position of the matched keyword in the corresponding field; "usb" or "cabl" is not what I want.
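If character offsets are what is needed, the TermVectorComponent can return start/end offsets per term, provided the field is defined with term vectors and offsets enabled and the documents are re-indexed afterwards; a sketch of the field definition (the field and type names follow the example schema), queried with the request parameters tv=true and tv.offsets=true:

  <!-- termOffsets stores the start/end character offsets of each term, which the
       TermVectorComponent returns alongside positions and frequencies -->
  <field name="features" type="text" indexed="true" stored="true"
         termVectors="true" termPositions="true" termOffsets="true"/>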
Re: Nutch and Solr search on the fly
Hi Markus, I am sorry for not being clear; what I meant to say is this: suppose a URL, say www.somehost.com/gifts/greetingcard.html (which in turn contains links to a.html, b.html, c.html, d.html), is injected into seed.txt. After the whole process I was expecting a bunch of other pages crawled from this seed URL. However, at the end of it, all I see is the content from only this page, namely www.somehost.com/gifts/greetingcard.html, and I do not see any other pages (here a.html, b.html, c.html, d.html) crawled from this one. The crawling happens only for the URLs mentioned in seed.txt and does not proceed further from there. So I am just a bit confused: why is it not crawling the linked pages (a.html, b.html, c.html and d.html)? I get the feeling that I am missing something that the author of the blog (http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed everyone would know. Thanks, Abi
Re: Nutch and Solr search on the fly
WARNING: I don't do Nutch much, but could it be that your crawl depth is 1? See http://wiki.apache.org/nutch/NutchTutorial and search for "depth". Best Erick
Re: Nutch and Solr search on the fly
Are you using the depth parameter with the crawl command or are you using the separate generate, fetch etc. commands? What's $ nutch readdb crawldb -stats returning?

-- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: Does Distributed Search support {!boost }?
On Tue, Feb 8, 2011 at 9:02 PM, Andy angelf...@yahoo.com wrote: Is it possible to do a query like {!boost b=log(popularity)}foo over sharded indexes? Yep, that should work fine. -Yonik http://lucidimagination.com
Re: Nutch and Solr search on the fly
Hi Erick, Thanks a bunch for the response. Could be - but what I am wondering is where to specify the depth in the whole process described at http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/. I tried specifying it during the fetcher phase but it was just ignored :( Thanks, Abi
Solr 1.4.1 using more memory than Solr 1.3
Hi Solr Users, We are in the process of upgrading from Solr 1.3 to Solr 1.4.1. While performing stress tests on Solr 1.4.1 to measure the performance improvement in query times (QTime) and the absence of blocked threads, we ran into memory issues with Solr 1.4.1.

Test setup details:
- 2 identical hosts running Solr 1.3 and Solr 1.4.1 individually.
- 3 cores with index sizes: 10 GB, 2 GB, 1 GB.
- JVM max heap: 3 GB (-Xmx3072m), total RAM: 4 GB.
- No other application/service running on the servers.
- For querying the Solr servers, we are using wget queries from a standalone host.

For the same index data and the same set of queries, Solr 1.3 is hovering between 1.5 and 2.2 GB, whereas with about 20K requests Solr 1.4.1 is reaching its 3 GB limit and performing a full GC after almost every query. The full GC is also not freeing up any memory. Has anyone else faced similar issues with Solr 1.4.1? Also, why is Solr 1.4.1 using more memory for the same amount of processing compared to Solr 1.3? Is there any particular configuration that needs to be done to avoid this high memory usage? Thanks, Rachita
Re: Solr 1.4.1 using more memory than Solr 1.3
Searching and sorting is now done on a per-segment basis, meaning that the FieldCache entries used for sorting and for function queries are created and used per-segment and can be reused for segments that don't change between index updates. While generally beneficial, this can lead to increased memory usage over 1.3 in certain scenarios:

1) A single valued field that was used for both sorting and faceting in 1.3 would have used the same top level FieldCache entry. In 1.4, sorting will use entries at the segment level while faceting will still use entries at the top reader level, leading to increased memory usage.

2) Certain function queries such as ord() and rord() require a top level FieldCache instance and can thus lead to increased memory usage. Consider replacing ord() and rord() with alternatives, such as function queries based on ms() for date boosting.

http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4/CHANGES.txt

-- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
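Point 2) is worth checking first if date boosting is involved; a sketch of the change in a request handler's boost function in solrconfig.xml (the field name and constants are illustrative, following the common date-boosting recipe):

  <!-- rord() needs a top-level FieldCache entry in 1.4, while ms() works
       per-segment and avoids the extra memory -->
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <!-- 1.3-style boost, memory-hungry on 1.4: recip(rord(publish_date),1,1000,1000) -->
      <!-- 1.4-style replacement: -->
      <str name="bf">recip(ms(NOW,publish_date),3.16e-11,1,1)</str>
    </lst>
  </requestHandler>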
RE: Concurrent updates/commits
Solr does handle concurrency fine. But there is NOT transaction isolation like you'll get from an RDBMS. All 'pending' changes are (conceptually, anyway) held in a single queue, and any commit will commit ALL of them. There aren't going to be any data corruption issues or anything from concurrent adds (unless there's a bug in Solr; there isn't supposed to be) -- but there is no kind of transactions or isolation between different concurrent adders. So, sure, everyone can add concurrently -- but any time any of those actors issues a commit, all pending adds are committed.

In addition, there are problems with Solr's basic architecture and _too frequent_ commits (whether made by different processes or not doesn't matter). When a new commit happens, Solr fires up a new index searcher and warms it up on the new version of the index. Until the new index searcher is fully warmed, the old index searcher is still serving queries. Which can also mean that there are, for this period, TWO versions of all your caches in RAM and such. So let's say it takes 5 minutes for the new index to be fully warmed. But if you have commits happening every 1 minute -- then you'll end up with FIVE 'new indexes' being warmed -- meaning potentially 5 times the RAM usage (quickly running into a JVM out-of-memory error), and lots of CPU activity going on warming indexes that will never actually be used (because even though they aren't even done being warmed and ready to use, they've already been superseded by a later commit).

I don't know of any good way to deal with this except less frequent commits. One way to get less frequent commits is to use Solr replication, and 'stage' all your commits in a 'master' index, but only replicate to the 'slave' at a frequency slow enough that the new index is fully warmed before the next commit happens. Some new features in trunk (both Lucene and Solr) for 'near real time' search ameliorate this problem somewhat, depending on the nature of your commits.

Jonathan

From: Savvas-Andreas Moysidis [savvas.andreas.moysi...@googlemail.com] Sent: Wednesday, February 09, 2011 10:34 AM To: solr-user@lucene.apache.org Subject: Concurrent updates/commits

Hello, This topic has probably been covered before here, but we're still not very clear about how multiple commits work in Solr. We currently have a requirement to make our domain objects searchable immediately after they get updated in the database by some user action. This could potentially cause multiple updates/commits to be fired at Solr, and we are trying to investigate how Solr handles those multiple requests.

This thread: http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search suggests that Solr will handle all of the lower level details and that "Before a *COMMIT* is done, a lock is obtained and it is released after the operation", which in my understanding means that Solr will serialise all update/commit requests?

However, the Solr book, in the Commit, Optimise, Rollback section, reads: "if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit", which suggests that requests are *not* serialised.

Our questions are:
- Does Solr handle concurrent requests or do we need to add synchronisation logic around our code?
- If Solr *does* handle concurrent requests, does it serialise each request or does it have some other strategy for processing them?

Thanks, - Savvas
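As a guard against the pile-up of warming searchers described above, solrconfig.xml can cap concurrent warming and batch many client adds into fewer commits on the server side; a sketch (the numbers are illustrative):

  <!-- maxWarmingSearchers makes Solr reject commits that would stack up more
       simultaneously warming searchers than the limit; autoCommit turns many
       small client adds into fewer, less frequent commits -->
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10000</maxDocs>  <!-- commit after this many pending docs -->
      <maxTime>60000</maxTime>  <!-- or after 60 seconds, whichever comes first -->
    </autoCommit>
  </updateHandler>

  <query>
    <maxWarmingSearchers>2</maxWarmingSearchers>
  </query>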
RE: Concurrent updates/commits
> However, the Solr book, in the Commit, Optimise, Rollback section, reads: "if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit", which suggests that requests are *not* serialised.

I read this as: if two clients submit modifications and commits every couple of minutes, it could happen that the modifications of client1 get committed by client2's commit before client1 asks for a commit. As far as I understand Solr commits, they are serialized by design. And committing too often can get you into trouble if you have many warm-up queries (?). Hope this helps, Pierre
Re: Concurrent updates/commits
Hello, Thanks very much for your quick replies. So, according to Pierre, all updates will be immediately posted to Solr, but all commits will be serialised. But doesn't that contradict Jonathan's example where you can end up with FIVE 'new indexes' being warmed? If commits are serialised, then there can only ever be one Index Searcher being auto-warmed at a time or have I got this wrong? The reason we are investigating commit serialisation, is because we want to know whether the commit requests will be blocked until the previous ones finish. Cheers, - Savvas On 9 February 2011 15:44, Pierre GOSSE pierre.go...@arisem.com wrote: However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. I read this as If two client submit modifications and commits every couple of minutes, it could happen that modifications of client1 got committed by client2's commit before client1 asks for a commit. As far as I understand Solr commit, they are serialized by design. And committing too often could lead you to trouble if you have many warm-up queries (?). Hope this helps, Pierre -Message d'origine- De : Savvas-Andreas Moysidis [mailto: savvas.andreas.moysi...@googlemail.com] Envoyé : mercredi 9 février 2011 16:34 À : solr-user@lucene.apache.org Objet : Concurrent updates/commits Hello, This topic has probably been covered before here, but we're still not very clear about how multiple commits work in Solr. We currently have a requirement to make our domain objects searchable immediately after the get updated in the database by some user action. This could potentially cause multiple updates/commits to be fired to Solr and we are trying to investigate how Solr handles those multiple requests. This thread: http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search suggests that Solr will handle all of the lower level details and that Before a *COMMIT* is done , lock is obtained and its released after the operation which in my understanding means that Solr will serialise all update/commit requests? However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. Our questions are: - Does Solr handle concurrent requests or do we need to add synchronisation logic around our code? - If Solr *does* handle concurrent requests, does it serialise each request or has some other strategy for processing those? Thanks, - Savvas
Re: Concurrent updates/commits
Don't think commit, that is confusing. Solr is not a database. In particular, it does not have the isolation property from ACID. Solr indexes new documents as a batch, then installs a new version of the entire index. Installing a new index isn't instant, especially with warming queries. Solr creates the index, then warms it, then makes it available for regular queries. If you are creating indexes frequently, don't bother warming. wunder == Walter Underwood Lead Engineer, MarkLogic On Feb 9, 2011, at 8:03 AM, Savvas-Andreas Moysidis wrote: Hello, Thanks very much for your quick replies. So, according to Pierre, all updates will be immediately posted to Solr, but all commits will be serialised. But doesn't that contradict Jonathan's example where you can end up with FIVE 'new indexes' being warmed? If commits are serialised, then there can only ever be one Index Searcher being auto-warmed at a time or have I got this wrong? The reason we are investigating commit serialisation, is because we want to know whether the commit requests will be blocked until the previous ones finish. Cheers, - Savvas On 9 February 2011 15:44, Pierre GOSSE pierre.go...@arisem.com wrote: However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. I read this as If two client submit modifications and commits every couple of minutes, it could happen that modifications of client1 got committed by client2's commit before client1 asks for a commit. As far as I understand Solr commit, they are serialized by design. And committing too often could lead you to trouble if you have many warm-up queries (?). Hope this helps, Pierre -Message d'origine- De : Savvas-Andreas Moysidis [mailto: savvas.andreas.moysi...@googlemail.com] Envoyé : mercredi 9 février 2011 16:34 À : solr-user@lucene.apache.org Objet : Concurrent updates/commits Hello, This topic has probably been covered before here, but we're still not very clear about how multiple commits work in Solr. We currently have a requirement to make our domain objects searchable immediately after the get updated in the database by some user action. This could potentially cause multiple updates/commits to be fired to Solr and we are trying to investigate how Solr handles those multiple requests. This thread: http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search suggests that Solr will handle all of the lower level details and that Before a *COMMIT* is done , lock is obtained and its released after the operation which in my understanding means that Solr will serialise all update/commit requests? However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. Our questions are: - Does Solr handle concurrent requests or do we need to add synchronisation logic around our code? - If Solr *does* handle concurrent requests, does it serialise each request or has some other strategy for processing those? Thanks, - Savvas
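A sketch of what "don't bother warming" can look like in solrconfig.xml, assuming the stock cache declarations; the sizes are placeholders, and the point is only the autowarmCount="0" and the absence of newSearcher/firstSearcher warming queries:

  <query>
    <filterCache      class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache"     size="512" initialSize="512" autowarmCount="0"/>
    <documentCache    class="solr.LRUCache"     size="512" initialSize="512" autowarmCount="0"/>
    <!-- no <listener event="newSearcher"> or <listener event="firstSearcher"> warming queries -->
  </query>

New searchers then become available almost immediately after a commit, at the cost of the first queries against them being slower.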
Re: Concurrent updates/commits
Hi Savvas, well, although it sounds strange: when a commit happens, a new Index Searcher starts warming. If another commit happens while that 'new' Index Searcher is still warming, yet another Index Searcher starts warming. So, at that point in time, you have 3 Index Searchers: the old one, the 'new' one and the newest one. I don't know whether the old one will be replaced by the new one before the newest one has finished warming, but it seems a good guess, since you can still search while the new index is being committed. You should know that Lucene is built on a segment architecture. This means every time you commit you write a completely new index segment. Example: you have one segment in your index and a searcher for it, and now you commit. After the commit finishes you have two segments and one searcher for both segments; internally your IndexSearcher consists of at least two SegmentReaders. If you commit three times at roughly the same moment, you will warm 3 new SolrIndexSearchers that contain 3, 4 and 5 SegmentReaders. Your old SolrIndexSearcher contains 2 SegmentReaders and stays valid until the newer SolrIndexSearcher based on 3 SegmentReaders is warmed. Regards, Em -- View this message in context: http://lucene.472066.n3.nabble.com/Concurrent-updates-commits-tp2459222p2459522.html Sent from the Solr - User mailing list archive at Nabble.com.
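Related to the several-searchers-warming-at-once scenario: solrconfig.xml caps how many searchers may warm concurrently, and commits that would push past the cap are rejected. A sketch (2 is the value shipped in the example config, used here only as an illustration):

  <!-- solrconfig.xml -->
  <maxWarmingSearchers>2</maxWarmingSearchers>

Errors about this limit are usually the first visible symptom of committing faster than warming can keep up.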
RE: Concurrent updates/commits
Well, Jonathan explanations are much more accurate than mine. :) I took the word serialization as meaning kind of isolation between commits, which is not very smart. Sorry to have introduce more confusion in this. Pierre -Message d'origine- De : Savvas-Andreas Moysidis [mailto:savvas.andreas.moysi...@googlemail.com] Envoyé : mercredi 9 février 2011 17:04 À : solr-user@lucene.apache.org Objet : Re: Concurrent updates/commits Hello, Thanks very much for your quick replies. So, according to Pierre, all updates will be immediately posted to Solr, but all commits will be serialised. But doesn't that contradict Jonathan's example where you can end up with FIVE 'new indexes' being warmed? If commits are serialised, then there can only ever be one Index Searcher being auto-warmed at a time or have I got this wrong? The reason we are investigating commit serialisation, is because we want to know whether the commit requests will be blocked until the previous ones finish. Cheers, - Savvas On 9 February 2011 15:44, Pierre GOSSE pierre.go...@arisem.com wrote: However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. I read this as If two client submit modifications and commits every couple of minutes, it could happen that modifications of client1 got committed by client2's commit before client1 asks for a commit. As far as I understand Solr commit, they are serialized by design. And committing too often could lead you to trouble if you have many warm-up queries (?). Hope this helps, Pierre -Message d'origine- De : Savvas-Andreas Moysidis [mailto: savvas.andreas.moysi...@googlemail.com] Envoyé : mercredi 9 février 2011 16:34 À : solr-user@lucene.apache.org Objet : Concurrent updates/commits Hello, This topic has probably been covered before here, but we're still not very clear about how multiple commits work in Solr. We currently have a requirement to make our domain objects searchable immediately after the get updated in the database by some user action. This could potentially cause multiple updates/commits to be fired to Solr and we are trying to investigate how Solr handles those multiple requests. This thread: http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search suggests that Solr will handle all of the lower level details and that Before a *COMMIT* is done , lock is obtained and its released after the operation which in my understanding means that Solr will serialise all update/commit requests? However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. Our questions are: - Does Solr handle concurrent requests or do we need to add synchronisation logic around our code? - If Solr *does* handle concurrent requests, does it serialise each request or has some other strategy for processing those? Thanks, - Savvas
Re: Concurrent updates/commits
Yes, we'll probably go towards that path as our index files are relatively small, so auto warming might not be extremely useful in our case.. Yep, we do realise the difference between a db and a Solr commit. :) Thanks. On 9 February 2011 16:15, Walter Underwood wun...@wunderwood.org wrote: Don't think commit, that is confusing. Solr is not a database. In particular, it does not have the isolation property from ACID. Solr indexes new documents as a batch, then installs a new version of the entire index. Installing a new index isn't instant, especially with warming queries. Solr creates the index, then warms it, then makes it available for regular queries. If you are creating indexes frequently, don't bother warming. wunder == Walter Underwood Lead Engineer, MarkLogic On Feb 9, 2011, at 8:03 AM, Savvas-Andreas Moysidis wrote: Hello, Thanks very much for your quick replies. So, according to Pierre, all updates will be immediately posted to Solr, but all commits will be serialised. But doesn't that contradict Jonathan's example where you can end up with FIVE 'new indexes' being warmed? If commits are serialised, then there can only ever be one Index Searcher being auto-warmed at a time or have I got this wrong? The reason we are investigating commit serialisation, is because we want to know whether the commit requests will be blocked until the previous ones finish. Cheers, - Savvas On 9 February 2011 15:44, Pierre GOSSE pierre.go...@arisem.com wrote: However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. I read this as If two client submit modifications and commits every couple of minutes, it could happen that modifications of client1 got committed by client2's commit before client1 asks for a commit. As far as I understand Solr commit, they are serialized by design. And committing too often could lead you to trouble if you have many warm-up queries (?). Hope this helps, Pierre -Message d'origine- De : Savvas-Andreas Moysidis [mailto: savvas.andreas.moysi...@googlemail.com] Envoyé : mercredi 9 février 2011 16:34 À : solr-user@lucene.apache.org Objet : Concurrent updates/commits Hello, This topic has probably been covered before here, but we're still not very clear about how multiple commits work in Solr. We currently have a requirement to make our domain objects searchable immediately after the get updated in the database by some user action. This could potentially cause multiple updates/commits to be fired to Solr and we are trying to investigate how Solr handles those multiple requests. This thread: http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search suggests that Solr will handle all of the lower level details and that Before a *COMMIT* is done , lock is obtained and its released after the operation which in my understanding means that Solr will serialise all update/commit requests? However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. 
Our questions are: - Does Solr handle concurrent requests or do we need to add synchronisation logic around our code? - If Solr *does* handle concurrent requests, does it serialise each request or has some other strategy for processing those? Thanks, - Savvas
Re: Concurrent updates/commits
Thanks very much Em. - Savvas On 9 February 2011 16:22, Savvas-Andreas Moysidis savvas.andreas.moysi...@googlemail.com wrote: Yes, we'll probably go towards that path as our index files are relatively small, so auto warming might not be extremely useful in our case.. Yep, we do realise the difference between a db and a Solr commit. :) Thanks. On 9 February 2011 16:15, Walter Underwood wun...@wunderwood.org wrote: Don't think commit, that is confusing. Solr is not a database. In particular, it does not have the isolation property from ACID. Solr indexes new documents as a batch, then installs a new version of the entire index. Installing a new index isn't instant, especially with warming queries. Solr creates the index, then warms it, then makes it available for regular queries. If you are creating indexes frequently, don't bother warming. wunder == Walter Underwood Lead Engineer, MarkLogic On Feb 9, 2011, at 8:03 AM, Savvas-Andreas Moysidis wrote: Hello, Thanks very much for your quick replies. So, according to Pierre, all updates will be immediately posted to Solr, but all commits will be serialised. But doesn't that contradict Jonathan's example where you can end up with FIVE 'new indexes' being warmed? If commits are serialised, then there can only ever be one Index Searcher being auto-warmed at a time or have I got this wrong? The reason we are investigating commit serialisation, is because we want to know whether the commit requests will be blocked until the previous ones finish. Cheers, - Savvas On 9 February 2011 15:44, Pierre GOSSE pierre.go...@arisem.com wrote: However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. I read this as If two client submit modifications and commits every couple of minutes, it could happen that modifications of client1 got committed by client2's commit before client1 asks for a commit. As far as I understand Solr commit, they are serialized by design. And committing too often could lead you to trouble if you have many warm-up queries (?). Hope this helps, Pierre -Message d'origine- De : Savvas-Andreas Moysidis [mailto: savvas.andreas.moysi...@googlemail.com] Envoyé : mercredi 9 février 2011 16:34 À : solr-user@lucene.apache.org Objet : Concurrent updates/commits Hello, This topic has probably been covered before here, but we're still not very clear about how multiple commits work in Solr. We currently have a requirement to make our domain objects searchable immediately after the get updated in the database by some user action. This could potentially cause multiple updates/commits to be fired to Solr and we are trying to investigate how Solr handles those multiple requests. This thread: http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search suggests that Solr will handle all of the lower level details and that Before a *COMMIT* is done , lock is obtained and its released after the operation which in my understanding means that Solr will serialise all update/commit requests? 
However, the Solr book, in the Commit, Optimise, Rollback section reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. Our questions are: - Does Solr handle concurrent requests or do we need to add synchronisation logic around our code? - If Solr *does* handle concurrent requests, does it serialise each request or has some other strategy for processing those? Thanks, - Savvas
Query regarding search term count in Solr
Hi All, This is Rahul and I am using Solr for one of my upcoming projects. I have a question regarding search term counts in Solr. We have a requirement in one of our search-based projects to filter results based on search term counts per document. For example, if a user searches for something like solr[4:9], this query should return only documents in which solr appears between 4 and 9 times (inclusive). If a user searches for something like solr lucene[4:9], this query should return only documents in which the phrase solr lucene appears between 4 and 9 times (inclusive). Is there any way in Solr to return results based on search term and phrase counts? If not, can it be customized by extending the existing Solr/Lucene libraries? -- Thanks and Regards Rahul A. Warawdekar
Re: Nutch and Solr search on the fly
Hi Abishek, depth is a param of the crawl command, not the fetch command. If you are using a custom script calling the individual stages of a nutch crawl, then depth N means running that script N times. You can put a loop in the script. Thanks, Charan On Wed, Feb 9, 2011 at 6:26 AM, .: Abhishek :. ab1s...@gmail.com wrote: Hi Erick, Thanks a bunch for the response. Could be the case, but all I am wondering is where to specify the depth in the whole process described in the URL http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/? I tried specifying it during the fetcher phase but it was just ignored :( Thanks, Abi On Wed, Feb 9, 2011 at 10:11 PM, Erick Erickson erickerick...@gmail.com wrote: WARNING: I don't do Nutch much, but could it be that your crawl depth is 1? See: http://wiki.apache.org/nutch/NutchTutorial and search for depth. Best Erick On Wed, Feb 9, 2011 at 9:06 AM, .: Abhishek :. ab1s...@gmail.com wrote: Hi Markus, I am sorry for not being clear, I meant to say that... Suppose a url, namely www.somehost.com/gifts/greetingcard.html (which in turn contains links to a.html, b.html, c.html, d.html), is injected into the seed.txt; after the whole process I was expecting a bunch of other pages crawled from this seed url. However, at the end of it all I see is the content from only this page, namely www.somehost.com/gifts/greetingcard.html, and I do not see any other pages (here a.html, b.html, c.html, d.html) crawled from this one. The crawling happens only for the URLs mentioned in the seed.txt and does not proceed further from there. So I am just a bit confused. Why is it not crawling the linked pages (a.html, b.html, c.html and d.html)? I get a feeling that I am missing something that the author of the blog ( http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ ) assumed everyone would know. Thanks, Abi On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma markus.jel...@openindex.io wrote: The parsed data is only sent to the Solr index if you tell a segment to be indexed; solrindex crawldb linkdb segment If you did this only once after injecting and then the consequent fetch, parse, update, index sequence, then you, of course, only see those URL's. If you don't index a segment after it's been parsed, you need to do it later on. On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote: Hi all, I am a newbie to nutch and solr. Well, relatively much newer to Solr than Nutch :) I have been using nutch for the past two weeks, and I wanted to know if I can query or search my nutch crawls on the fly (before a crawl completes). I am asking this because the websites I am crawling are really huge and it takes around 3-4 days for a crawl to complete. I want to analyze some quick results while the nutch crawler is still crawling the URLs. Someone suggested that Solr would make it possible. I followed the steps in http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. By this process, I see only the injected URLs in the Solr search. I know I did something really foolish and the crawl never happened; I feel I am missing some information here. I think somewhere in the process there should be a crawl happening and I missed it. Just wanted to see if someone could help me point this out and where I went wrong in the process. Forgive my foolishness and thanks for your patience.
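If it helps, the one-shot crawl command takes the depth directly; a sketch with placeholder seed directory and counts (not values from this thread):

  bin/nutch crawl urls -dir crawl -depth 3 -topN 50

With the step-by-step approach from the blog post, "depth 3" simply means running the generate/fetch/parse/updatedb cycle three times before the final solrindex step.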
Cheers, Abi -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
QueryWeight for Solr
Hello folks, I got a question regarding a custom QueryWeight implementation for a special use case. For this use case we want to experiment with different values for the idf, based on different algorithms, and see how they affect the scoring. Is there a way to plug in a custom weight implementation without rewriting the full query class? Let's say we extend the DismaxQParser to create an extended boolean query (let's call it EBooleanQuery, E for extended) and we implement a QueryWeight for this query class that takes into account some values that are not part of the current approach. Is this the way we have to go? Or what would you suggest? Regards -- View this message in context: http://lucene.472066.n3.nabble.com/QueryWeight-for-Solr-tp2459933p2459933.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: QueryWeight for Solr
On Wed, Feb 9, 2011 at 12:16 PM, Em mailformailingli...@yahoo.de wrote: For the current usecase we want to experiment with different values for the idf based on different algorithms and how they affect the scoring. For tf, idf, lengthNorm, coord, etc, see Similarity. Solr already allows you to specify one in the schema, and work is underway to make it per-field: https://issues.apache.org/jira/browse/SOLR-2338 -Yonik http://lucidimagination.com
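A minimal sketch of the Similarity route Yonik points at, against the Lucene 2.9 / Solr 1.4-era API; the class name and the square-root idf formula are purely illustrative, not something from the thread:

  import org.apache.lucene.search.DefaultSimilarity;

  public class ExperimentalSimilarity extends DefaultSimilarity {
      // Swap Lucene's default log-based idf for an experimental variant.
      @Override
      public float idf(int docFreq, int numDocs) {
          return (float) Math.sqrt((double) numDocs / (double) (docFreq + 1));
      }
  }

It is then registered globally at the bottom of schema.xml with <similarity class="com.example.ExperimentalSimilarity"/>; SOLR-2338 is what would make this selectable per field.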
Re: Query regarding search term count in Solr
I suspect it's worthwhile to back up and ask whether this is a reasonable requirement. What is the use-case? Because unless the input is very uniform, I wouldn't be surprised if this produces poor results. For instance, if solr appears once in a field 5 words long and 5 times in another document where the same field is 1,000,000 words long, which is preferable? This requirement can make sense if the fields being searched are uniform in length, but even then I'm not sure it would be good for the user. That said, you know your problem domain best, but before going through the effort of making this all work I'd step back and ask this question. There is no way that I know of to do this out of the box with Solr. I can imagine you could set up a custom scorer that accessed the underlying TermDocs (see TermDocs in the Lucene API), but you'd also have to provide your own query parser... I'll reiterate, though, that it might be best to see if there are already ways in Solr to get close enough behavior to satisfy the underlying requirement rather than go down this route. Best Erick On Wed, Feb 9, 2011 at 11:55 AM, Rahul Warawdekar rahul.warawde...@gmail.com wrote: Hi All, This is Rahul and am using Solr for one of my upcoming projects. I had a query regarding search term count using Solr. We have a requirement in one of our search based projects to search the results based on search term counts per document. For eg, if a user searches for something like solr[4:9], this query should return only documents in which solr appears between 4 and 9 times (inclusively). if a user searches for something like solr lucene[4:9], this query should return only documents in which the phrase solr lucene appears between 4 and 9 times (inclusively). Is there any way from Solr to return results based on the search term and phrase counts ? If not, can it be customized by extending existing Solr/Lucene libraries ? -- Thanks and Regards Rahul A. Warawdekar
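A rough sketch of the TermDocs idea against the Lucene 2.9/3.x API; the field, bounds and class name are placeholders, and in practice this logic would sit inside a custom query/scorer rather than a standalone helper:

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermDocs;

  public class TermCountFilter {
      /** True if term occurs in the given doc between min and max times (inclusive). */
      public static boolean inRange(IndexReader reader, int doc, Term term, int min, int max)
              throws IOException {
          TermDocs td = reader.termDocs(term);
          try {
              if (td.skipTo(doc) && td.doc() == doc) {
                  int freq = td.freq();   // within-document term frequency
                  return freq >= min && freq <= max;
              }
              return false;
          } finally {
              td.close();
          }
      }
  }

Phrase counts ("solr lucene"[4:9]) are harder: they need TermPositions or SpanNearQuery match counting rather than a single TermDocs, which is part of why re-examining the requirement first is worthwhile.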
Re: Solr Out of Memory Error
Dear Adam, I also got the OutOfMemory exception. I changed the JAVA_OPTS in catalina.sh as follows. ... if [ -z $LOGGING_MANAGER ]; then JAVA_OPTS=$JAVA_OPTS -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager else JAVA_OPTS=$JAVA_OPTS -server -Xms8096m -Xmx8096m fi ... Is this change correct? After that, I still got the same exception. The index is updated and searched frequently. I am trying to change the code to avoid the frequent updates. I guess only changing JAVA_OPTS does not work. Could you give me some help? Thanks, LB On Wed, Jan 19, 2011 at 10:05 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Is anyone familiar with the environment variable, JAVA_OPTS? I set mine to a much larger heap size and never had any of these issues again. JAVA_OPTS = -server -Xms4048m -Xmx4048m Adam On Wed, Jan 19, 2011 at 3:29 AM, Isan Fulia isan.fu...@germinait.com wrote: Hi all, By adding more servers do u mean sharding of index.And after sharding , how my query performance will be affected . Will the query execution time increase. Thanks, Isan Fulia. On 19 January 2011 12:52, Grijesh pintu.grij...@gmail.com wrote: Hi Isan, It seems your index size 25GB si much more compared to you have total Ram size is 4GB. You have to do 2 things to avoid Out Of Memory Problem. 1-Buy more Ram ,add at least 12 GB of more ram. 2-Increase the Memory allocated to solr by setting XMX values.at least 12 GB allocate to solr. But if your all index will fit into the Cache memory it will give you the better result. Also add more servers to load balance as your QPS is high. Your 7 Laks data makes 25 GB of index its looking quite high.Try to lower the index size What are you indexing in your 25GB of index? - Thanx: Grijesh -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Out-of-Memory-Error-tp2280037p2285779.html Sent from the Solr - User mailing list archive at Nabble.com. -- Thanks Regards, Isan Fulia.
Re: QueryWeight for Solr
Hi Yonik, thanks for the fast feedback. Well, as far as I can see there is no possibility to get the original query from the similarity class... Let me ask differently: I know there are some distributed idf implementations out there. One approach is to ask every shard for its idf for a term and then aggregate those values at the master who queried them all. Afterwards they use it for their similarity etc. How do they store these idfs for the current request so that the similarity is aware of them? I do not want to reimplement distributed idf, but I want to figure out how they make it accessible to the similarity that is in use. Thank you! Regards Yonik Seeley-2-2 wrote: On Wed, Feb 9, 2011 at 12:16 PM, Em mailformailingli...@yahoo.de wrote: For the current usecase we want to experiment with different values for the idf based on different algorithms and how they affect the scoring. For tf, idf, lengthNorm, coord, etc, see Similarity. Solr already allows you to specify one in the schema, and work is underway to make it per-field: https://issues.apache.org/jira/browse/SOLR-2338 -Yonik http://lucidimagination.com -- View this message in context: http://lucene.472066.n3.nabble.com/QueryWeight-for-Solr-tp2459933p2460386.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Out of Memory Error
Bing Li, One should be conservative when setting Xmx. Also, just setting Xmx might not do the trick at all because the garbage collector might also be the issue here. Configure the JVM to output debug logs of the garbage collector and monitor the heap usage (especially the tenured generation) with a good tool like JConsole. You might also want to take a look at your cache settings and autowarm parameters. In some scenario's with very frequent updates, a large corpus and a high load of heterogenous queries you might want to dump the documentCache and queryResultCache, the cache hitratio tends to be very low and the caches will just consume a lot of memory and CPU time. One of my projects i finally decided to only use the filterCache. Using the other caches took too much RAM and CPU while running and had a lot of evictions and still a lot hitratio. I could, of course, make the caches a lot bigger and increase autowarming but that would take a lot of time before a cache is autowarmed and a very, very, large amount of RAM. I choose to rely on the OS-cache instead. Cheers, Dear Adam, I also got the OutOfMemory exception. I changed the JAVA_OPTS in catalina.sh as follows. ... if [ -z $LOGGING_MANAGER ]; then JAVA_OPTS=$JAVA_OPTS -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager else JAVA_OPTS=$JAVA_OPTS -server -Xms8096m -Xmx8096m fi ... Is this change correct? After that, I still got the same exception. The index is updated and searched frequently. I am trying to change the code to avoid the frequent updates. I guess only changing JAVA_OPTS does not work. Could you give me some help? Thanks, LB On Wed, Jan 19, 2011 at 10:05 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Is anyone familiar with the environment variable, JAVA_OPTS? I set mine to a much larger heap size and never had any of these issues again. JAVA_OPTS = -server -Xms4048m -Xmx4048m Adam On Wed, Jan 19, 2011 at 3:29 AM, Isan Fulia isan.fu...@germinait.com wrote: Hi all, By adding more servers do u mean sharding of index.And after sharding , how my query performance will be affected . Will the query execution time increase. Thanks, Isan Fulia. On 19 January 2011 12:52, Grijesh pintu.grij...@gmail.com wrote: Hi Isan, It seems your index size 25GB si much more compared to you have total Ram size is 4GB. You have to do 2 things to avoid Out Of Memory Problem. 1-Buy more Ram ,add at least 12 GB of more ram. 2-Increase the Memory allocated to solr by setting XMX values.at least 12 GB allocate to solr. But if your all index will fit into the Cache memory it will give you the better result. Also add more servers to load balance as your QPS is high. Your 7 Laks data makes 25 GB of index its looking quite high.Try to lower the index size What are you indexing in your 25GB of index? - Thanx: Grijesh -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Out-of-Memory-Error-tp2280037p228 5779.html Sent from the Solr - User mailing list archive at Nabble.com. -- Thanks Regards, Isan Fulia.
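For the GC-logging part of that advice, something along these lines in JAVA_OPTS is a common starting point; the log path is a placeholder and the flags are standard HotSpot options rather than anything Solr-specific:

  JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/solr/gc.log"

Frequent full GCs that barely shrink the tenured generation point at a heap (or caches) that really is too small for the working set; a mostly idle heap points the investigation elsewhere.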
Re: Solr Out of Memory Error
I should also add that reducing the caches and autowarm sizes (or not using them at all) drastically reduces memory consumption when a new searcher is being prepares after a commit. The memory usage will spike at these events. Again, use a monitoring tool to get more information on your specific scenario. Bing Li, One should be conservative when setting Xmx. Also, just setting Xmx might not do the trick at all because the garbage collector might also be the issue here. Configure the JVM to output debug logs of the garbage collector and monitor the heap usage (especially the tenured generation) with a good tool like JConsole. You might also want to take a look at your cache settings and autowarm parameters. In some scenario's with very frequent updates, a large corpus and a high load of heterogenous queries you might want to dump the documentCache and queryResultCache, the cache hitratio tends to be very low and the caches will just consume a lot of memory and CPU time. One of my projects i finally decided to only use the filterCache. Using the other caches took too much RAM and CPU while running and had a lot of evictions and still a lot hitratio. I could, of course, make the caches a lot bigger and increase autowarming but that would take a lot of time before a cache is autowarmed and a very, very, large amount of RAM. I choose to rely on the OS-cache instead. Cheers, Dear Adam, I also got the OutOfMemory exception. I changed the JAVA_OPTS in catalina.sh as follows. ... if [ -z $LOGGING_MANAGER ]; then JAVA_OPTS=$JAVA_OPTS -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager else JAVA_OPTS=$JAVA_OPTS -server -Xms8096m -Xmx8096m fi ... Is this change correct? After that, I still got the same exception. The index is updated and searched frequently. I am trying to change the code to avoid the frequent updates. I guess only changing JAVA_OPTS does not work. Could you give me some help? Thanks, LB On Wed, Jan 19, 2011 at 10:05 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Is anyone familiar with the environment variable, JAVA_OPTS? I set mine to a much larger heap size and never had any of these issues again. JAVA_OPTS = -server -Xms4048m -Xmx4048m Adam On Wed, Jan 19, 2011 at 3:29 AM, Isan Fulia isan.fu...@germinait.com wrote: Hi all, By adding more servers do u mean sharding of index.And after sharding , how my query performance will be affected . Will the query execution time increase. Thanks, Isan Fulia. On 19 January 2011 12:52, Grijesh pintu.grij...@gmail.com wrote: Hi Isan, It seems your index size 25GB si much more compared to you have total Ram size is 4GB. You have to do 2 things to avoid Out Of Memory Problem. 1-Buy more Ram ,add at least 12 GB of more ram. 2-Increase the Memory allocated to solr by setting XMX values.at least 12 GB allocate to solr. But if your all index will fit into the Cache memory it will give you the better result. Also add more servers to load balance as your QPS is high. Your 7 Laks data makes 25 GB of index its looking quite high.Try to lower the index size What are you indexing in your 25GB of index? - Thanx: Grijesh -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Out-of-Memory-Error-tp2280037p2 28 5779.html Sent from the Solr - User mailing list archive at Nabble.com. -- Thanks Regards, Isan Fulia.
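To watch the tenured generation around commits without attaching JConsole, jstat against the servlet container's pid works as well; the pid and sampling interval below are placeholders:

  jstat -gcutil <pid> 5000    # old-gen occupancy (O column) and GC counts/times, sampled every 5 seconds

The telling pattern is the O column climbing each time a new searcher warms, followed by a full GC that reclaims very little.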
Changing value of start parameter affects numFound?
I have a data set indexed over two irons, with M docs per Solr core for a total of N cores. If I perform a query across all N cores with start=0 and rows=30, I get, say, numFound=27521. If I simply change the start param to start=27510 (simulating being on the last page of data), I get a smaller result set (say, numFound=21415). I had expected numFound to be the same in either case, since no other aspect of the query had changed. Am I mistaken? I'm using Solr 1.4.1.955763M. Faceting is enabled on the query. All cores have the same schema. Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Changing-value-of-start-parameter-affects-numFound-tp2460645p2460645.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: QueryWeight for Solr
On Wed, Feb 9, 2011 at 1:18 PM, Em mailformailingli...@yahoo.de wrote: How do they store these idfs for the current request so that the similarity is aware of them? The df (as opposed to idf) is requested from the searcher by the weight, which then uses the similarity to produce the idf. See TermWeight as an example. There's no out-of-the-box plugin to provide alternate df values though, other than the Searcher interface. If you're doing custom enough scoring, then just implementing your own query class is probably the way to go, but people might have other ideas depending on the specifics of what you're trying to do. -Yonik http://lucidimagination.com
Re: QueryWeight for Solr
Thanks, again. :) Okay, so if one wants a distributed idf one should extend a searcher instead of the query class. But it doesn't seem to be pluggable, right? Well, for our purposes extending the query class is enough, but just out of curiosity: where should one start if one wants to make some components pluggable? Since real-time search is an area where I read about the idea of making things like the searcher pluggable, this could be beneficial to the community. Regards Yonik Seeley-2-2 wrote: On Wed, Feb 9, 2011 at 1:18 PM, Em mailformailingli...@yahoo.de wrote: How do they store these idfs for the current request so that the similarity is aware of them? The df (as opposed to idf) is requested from the searcher by the weight, which then uses the similarity to produce the idf. See TermWeight as an example. There's no out-of-the-box plugin to provide alternate df values though, other than the Searcher interface. If you're doing custom enough scoring, then just implementing your own query class is probably the way to go, but people might have other ideas depending on the specifics of what you're trying to do. -Yonik http://lucidimagination.com -- View this message in context: http://lucene.472066.n3.nabble.com/QueryWeight-for-Solr-tp2459933p2460718.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Changing value of start parameter affects numFound?
mrw wrote: I have a data set indexed over two irons, with M docs per Solr core for a total of N cores. If I perform a query across all N cores with start=0 and rows=30, I get, say, numFound=27521). If I simply change the start param to start=27510 (simulating being on the last page of data), I get a smaller result set (say, numFound=21415). I had expected numFound to be the same in either case, since no other aspect of the query had changed. Am I mistaken? I'm using Solr 1.4.1.955763M. Faceting is enabled on the query. All cores have the same schema. Thanks! More detail: numFound seems to vary unpredictably based on start value.

  start    numFound
  0-46     27521
  47-59    27520
  60       27519
  61-91    27518
  62       27517

Any ideas? -- View this message in context: http://lucene.472066.n3.nabble.com/Changing-value-of-start-parameter-affects-numFound-tp2460645p2460795.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: QueryWeight for Solr
On Wed, Feb 9, 2011 at 1:51 PM, Em mailformailingli...@yahoo.de wrote: Okay, so if one wants a distributed idf one should extend a searcher instead of the query-class. Yes. If you're interested in distributed search for Solr, there is a patch in progress: https://issues.apache.org/jira/browse/SOLR-1632 But it doesn't seem to be pluggable, right? Since you can weight with a different searcher than you query with, a searcher works fine as an extension point (but it's a lucene-level extension point, not a solr-level one). -Yonik http://lucidimagination.com
Re: Changing value of start parameter affects numFound?
On Wed, Feb 9, 2011 at 1:42 PM, mrw mikerobertsw...@gmail.com wrote: I have a data set indexed over two irons, with M docs per Solr core for a total of N cores. If I perform a query across all N cores with start=0 and rows=30, I get, say, numFound=27521). If I simply change the start param to start=27510 (simulating being on the last page of data), I get a smaller result set (say, numFound=21415). I had expected numFound to be the same in either case, since no other aspect of the query had changed. Am I mistaken? You probably have some duplicate docs in your shards (those with the same id). Solr doesn't know they are dups until it retrieves the ids of the docs to merge, and then it only takes one of the dups and decrements numFound. -Yonik http://lucidimagination.com
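One way to confirm that diagnosis is to facet on the unique key across the same shards and look for counts above 1; a sketch, assuming the key field is literally named id and with placeholder shard addresses (this is expensive on a large index, so run it off-peak):

  http://host:8983/solr/select?q=*:*&rows=0&shards=shard1:8983/solr,shard2:8983/solr&facet=true&facet.field=id&facet.limit=-1&facet.mincount=2

Any id returned by this request exists on more than one shard; when such a document lands in the merged page, Solr keeps one copy and decrements numFound, which is why the total shifts as start changes.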
Architecture decisions with Solr
Hello all, I am looking into an enterprise search solution for our architecture and I am very pleased to see all the features Solr provides. In our case, we need a highly scalable application for multiple clients. This application will be built to serve many users, each of whom will have a client account. Each client will have a multitude of documents to index (from zero to thousands of documents). After discussion, we were talking about going multicore and having one index per client account. The reasoning is that security is achieved by having a separate index for each client. Is this the best approach? How feasible is it (dynamically creating indexes on client account creation)? Or is it better to go the faceted search capabilities route? Thanks for your help Greg
Re: Architecture decisions with Solr
What about standing up a VM (search appliance that you would make) for each client? If there's no data sharing across clients, then using the same solr server/index doesn't seem necessary. Solr will easily meet your needs though, its the best there is. On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote: Hello all, I am looking into an enterprise search solution for our architecture and I am very pleased to see all the features Solr provides. In our case, we will have a need for a highly scalable application for multiple clients. This application will be built to serve many users who each will have a client account. Each client will have a multitude of documents to index (0-1000s of documents). After discussion we were talking about going multicore and to have one index file per client account. The reason for this is that security is achieved by having a separate index for each client etc.. Is this the best approach? How feasible is it (dynamically create indexes on client account creation. Is it better to go the faceted search capabilities route? Thanks for your help Greg
RE: Architecture decisions with Solr
From what I understand about multicore, each of the indexes are independant from each other right? Or would one index have access to the info of the other? My requirement is like you mention, a client has access only to his or her search data based in their documents. Other clients have no access to the index of other clients. Greg -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: 9 février 2011 14:28 To: solr-user@lucene.apache.org Subject: Re: Architecture decisions with Solr What about standing up a VM (search appliance that you would make) for each client? If there's no data sharing across clients, then using the same solr server/index doesn't seem necessary. Solr will easily meet your needs though, its the best there is. On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote: Hello all, I am looking into an enterprise search solution for our architecture and I am very pleased to see all the features Solr provides. In our case, we will have a need for a highly scalable application for multiple clients. This application will be built to serve many users who each will have a client account. Each client will have a multitude of documents to index (0-1000s of documents). After discussion we were talking about going multicore and to have one index file per client account. The reason for this is that security is achieved by having a separate index for each client etc.. Is this the best approach? How feasible is it (dynamically create indexes on client account creation. Is it better to go the faceted search capabilities route? Thanks for your help Greg
Re: Architecture decisions with Solr
This application will be built to serve many users If this means that you have thousands of users, 1000s of VMs and/or 1000s of cores is not going to scale. Have an ID in the index for each user, and filter using it. Then they can see only their own documents. Assuming that you are building an app that through which they authenticate talks to solr . (i.e. all requests are filtered using their ID) -Glen On Wed, Feb 9, 2011 at 2:31 PM, Greg Georges greg.geor...@biztree.com wrote: From what I understand about multicore, each of the indexes are independant from each other right? Or would one index have access to the info of the other? My requirement is like you mention, a client has access only to his or her search data based in their documents. Other clients have no access to the index of other clients. Greg -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: 9 février 2011 14:28 To: solr-user@lucene.apache.org Subject: Re: Architecture decisions with Solr What about standing up a VM (search appliance that you would make) for each client? If there's no data sharing across clients, then using the same solr server/index doesn't seem necessary. Solr will easily meet your needs though, its the best there is. On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote: Hello all, I am looking into an enterprise search solution for our architecture and I am very pleased to see all the features Solr provides. In our case, we will have a need for a highly scalable application for multiple clients. This application will be built to serve many users who each will have a client account. Each client will have a multitude of documents to index (0-1000s of documents). After discussion we were talking about going multicore and to have one index file per client account. The reason for this is that security is achieved by having a separate index for each client etc.. Is this the best approach? How feasible is it (dynamically create indexes on client account creation. Is it better to go the faceted search capabilities route? Thanks for your help Greg -- -
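A sketch of the single-index, per-client filtering Glen describes, using the SolrJ API; the client_id field name and the method shape are assumptions for illustration, and clientId should come from your authentication layer, never from user input:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.SolrServerException;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class ClientScopedSearch {
      /** Restricts every request to the authenticated client's own documents. */
      public static QueryResponse search(SolrServer server, String clientId, String userQuery)
              throws SolrServerException {
          SolrQuery query = new SolrQuery(userQuery);
          // clientId comes from the session, so users cannot widen the filter themselves.
          query.addFilterQuery("client_id:" + clientId);
          return server.query(query);
      }
  }

Because the fq is applied on every request and cached in the filterCache, the isolation costs very little per query, and you avoid maintaining thousands of cores or VMs.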
solr render biased search result
Hi, I have been asked whether solr can render biased search results. For example, for a search that queries all movie titles in the Comedy genre: for a user who indicates a preference for 1950's movies, can solr rank the 1950's movies higher (top of the list)? Or, if the user is a kid, can the results put G/PG rated movies at the top of the list and all the R rated movies at the bottom? I know that solr can boost the score based on a match on a particular field, but it can't favor some values over other values in the same field. Is that right? -- View this message in context: http://lucene.472066.n3.nabble.com/solr-render-biased-search-result-tp2461155p2461155.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Architecture decisions with Solr
Another option (assuming the case where a user can be granted access to a certain class of documents, and more than one user would be able to access certain documents) would be to store the access filter (as an OR query of content types) in an external cache (perhaps a database or an eternal cache that the database changes are published to periodically), then using this access filter as a facet on the base query. -sujit On Wed, 2011-02-09 at 14:38 -0500, Glen Newton wrote: This application will be built to serve many users If this means that you have thousands of users, 1000s of VMs and/or 1000s of cores is not going to scale. Have an ID in the index for each user, and filter using it. Then they can see only their own documents. Assuming that you are building an app that through which they authenticate talks to solr . (i.e. all requests are filtered using their ID) -Glen On Wed, Feb 9, 2011 at 2:31 PM, Greg Georges greg.geor...@biztree.com wrote: From what I understand about multicore, each of the indexes are independant from each other right? Or would one index have access to the info of the other? My requirement is like you mention, a client has access only to his or her search data based in their documents. Other clients have no access to the index of other clients. Greg -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: 9 février 2011 14:28 To: solr-user@lucene.apache.org Subject: Re: Architecture decisions with Solr What about standing up a VM (search appliance that you would make) for each client? If there's no data sharing across clients, then using the same solr server/index doesn't seem necessary. Solr will easily meet your needs though, its the best there is. On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote: Hello all, I am looking into an enterprise search solution for our architecture and I am very pleased to see all the features Solr provides. In our case, we will have a need for a highly scalable application for multiple clients. This application will be built to serve many users who each will have a client account. Each client will have a multitude of documents to index (0-1000s of documents). After discussion we were talking about going multicore and to have one index file per client account. The reason for this is that security is achieved by having a separate index for each client etc.. Is this the best approach? How feasible is it (dynamically create indexes on client account creation. Is it better to go the faceted search capabilities route? Thanks for your help Greg
Re: solr render biased search result
Cyang, why can't you, for a kid, add a boosting query genre:kid^2.0 alongside the rest? That would double the score of a match if the user is a kid. But note that you'd better calibrate the coefficient with a battery of test queries. This is part of the fine art, I think. paul Le 9 févr. 2011 à 20:44, cyang2010 a écrit : Hi, I am asked that whether solr renders biased search result? For example, for this search (query all movie title by this Comedy genre), for user who indicates a preference to 1950's movies, solr renders the 1950's movies with higher score (top in the list)?Or if user is a kid, then the result will render G/PG rated movie top in the list, and render all the R rated movie bottom in the list? I know that solr can boost score based on match on a particular field. But it can't favor some value over other value in the same field. is that right? -- View this message in context: http://lucene.472066.n3.nabble.com/solr-render-biased-search-result-tp2461155p2461155.html Sent from the Solr - User mailing list archive at Nabble.com.
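In dismax terms that suggestion is just a boost query (bq) added to the user's request; a sketch with made-up field names and boost values:

  q=comedy&defType=dismax&qf=title+genre&bq=rating:(G+OR+PG)^2.0

The bq does not filter anything out -- R-rated titles still match -- it only adds score to documents whose rating field matches, pushing them toward the top of the list. The ^2.0 factor is the knob Paul suggests calibrating with test queries.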
solr current workding directory or reading config files
Hi, I have a class (in a jar) that reads from properties (text) files. I have these files in the same jar file as the class. However, when my class reads those properties files, those files cannot be found since solr reads from tomcat's bin directory. I don't really want to put the config files in tomcat's bin directory. How do I reconcile this? Tri
pre and post processing when building index
Hi, I'm scheduling solr to build every hour or so. I'd like to do some pre and post processing for each index build. The preprocessing would do some checks and perhaps will skip the build. For post processing, I will do some checks and either commit or rollback the build. Can I write some class and plugin into solr for this? Thanks, Tri
DataImportHandler: regex debugging
I am trying to use the regex transformer but it's not returning anything. Either my regex is wrong, or I've done something else wrong in the setup of the entity. Is there any way to debug this? Making a change and waiting 7 minutes to reindex the entity sucks. entity name=boxshot query=SELECT GROUP_CONCAT(i.url, ',') boxshot_url, GROUP_CONCAT(i2.url, ',') boxshot_url_small FROM games g left join image_sizes i ON g.box_image_id = i.id AND i.size_type = 39 left join image_sizes i2 on g.box_image_id = i2.id AND i2.size_type = 40 WHERE g.game_seo_title = '${game.game_seo_title}' GROUP BY g.game_seo_title field name=main_image regex=^(.*?), sourceColName=boxshot_url / field name=small_image regex=^(.*?), sourceColName=boxshot_url_small / /entity This returns columns that are either null, or have some comma-separated strings. I want the bit up to the first comma, if it exists. Ideally I could have it log the query and the input/output of the field statements.
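A couple of things that may shorten the debug loop here, hedged since they come from the DataImportHandler wiki rather than this thread: the RegexTransformer only runs if it is declared in the entity's transformer attribute, DIH field declarations normally use column= rather than name=, and the LogTransformer can log the resolved values per row. A sketch (the log template and field names are illustrative):

  <entity name="boxshot"
          transformer="RegexTransformer,LogTransformer"
          logTemplate="boxshot_url=${boxshot.boxshot_url} main_image=${boxshot.main_image}"
          logLevel="info"
          query="SELECT ...">
    <field column="main_image"  regex="^(.*?)," sourceColName="boxshot_url"/>
    <field column="small_image" regex="^(.*?)," sourceColName="boxshot_url_small"/>
  </entity>

Note that ^(.*?), only matches when a comma is present; a pattern like ^([^,]*) also captures the value when there is no comma at all. DIH's interactive development mode (command=full-import with the debug and verbose parameters, plus rows= to limit how many rows are processed) shows per-row output without waiting for a full reindex.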
Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?
Hello Andy, did you get a final answer to your question? I am also trying to do something similar; please give me pointers if you have any. Basically I also need to use NGram with WhitespaceTokenizer -- any help will be appreciated. -- View this message in context: http://lucene.472066.n3.nabble.com/NGramFilterFactory-for-auto-complete-that-matches-the-middle-of-multi-lingual-tags-tp1619234p2459466.html Sent from the Solr - User mailing list archive at Nabble.com.
Why does the StatsComponent only work with indexed fields?
Is there a reason why the StatsComponent only deals with indexed fields? I just updated the wiki: http://wiki.apache.org/solr/StatsComponent to call this fact out since it was not apparent previously. I've briefly skimmed the source of StatsComponent, but am not familiar enough with the code or Solr yet to understand if it was omitted for performance reasons or some other reason. Any information would be appreciated. Thanks, Travis
Re: solr render biased search result
That makes sense. It is a little bit indirect: you have to translate the user preference/profile into a search field value and then have the search boost documents matching that preference value. -- View this message in context: http://lucene.472066.n3.nabble.com/solr-render-biased-search-result-tp2461155p2461668.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Why does the StatsComponent only work with indexed fields?
What kinds of information would you expect for a stored-only field? I mean, the stored part is just a blob that Solr doesn't peek inside of, so I'm not sure what useful information *could* be returned Best Erick On Wed, Feb 9, 2011 at 3:55 PM, Travis Truman trum...@gmail.com wrote: Is there a reason why the StatsComponent only deals with indexed fields? I just updated the wiki: http://wiki.apache.org/solr/StatsComponent to call this fact out since it was not apparent previously. I've briefly skimmed the source of StatsComponent, but am not familiar enough with the code or Solr yet to understand if it was omitted for performance reasons or some other reason. Any information would be appreciated. Thanks, Travis
Re: solr render biased search result
What *could* solr do for you? You've outlined a domain-specific requirement, I'm not sure how a general-purpose search engine would incorporate that functionality Best Erick On Wed, Feb 9, 2011 at 4:08 PM, cyang2010 ysxsu...@hotmail.com wrote: That makes sense. It is a little bit indirect. You have to translate that user preference/profile into a search field value and then dictate search result boosting the doc with that preference value. -- View this message in context: http://lucene.472066.n3.nabble.com/solr-render-biased-search-result-tp2461155p2461668.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr current workding directory or reading config files
Is your war always deployed to the same location, i.e. /usr/mycomp/myapplication/webapps/myapp.war? If so, then on startup copy the files out of your directory and put them under CATALINA_BASE/solr (/usr/mycomp/myapplication/solr), and in your war file have the META-INF/context.xml JNDI setting point to that: <Context> <Environment name="solr/home" type="java.lang.String" value="/usr/mycomp/myapplication/solr" override="true"/> </Context> If you know of a way to reference CATALINA_BASE in the context.xml, that would make it easier. On Feb 9, 2011, at 12:00 PM, Tri Nguyen wrote: Hi, I have a class (in a jar) that reads from properties (text) files. I have these files in the same jar file as the class. However, when my class reads those properties files, those files cannot be found since solr reads from tomcat's bin directory. I don't really want to put the config files in tomcat's bin directory. How do I reconcile this? Tri
communication between entity processor and solr DataImporter
Hi, I'd like to communicate errors from my entity processor to the DataImporter. Should there be an error in my entity processor, I'd like the index build to roll back. How can I do this? I want to throw an exception of some sort. The only thing I can think of is to force a runtime exception to be thrown in nextRow() of the entity processor, since runtime exceptions are not checked and do not have to be declared in the nextRow() method signature. How can I request that the nextRow() method signature be updated to throw Exception? Would that even make sense? Tri
Re: solr current working directory or reading config files
Wanted to add some more details to my problem. I have many jars that each have their own config files, so I'd have to copy files for every jar. Can Solr read from the classpath (jar files)? Yes, my war is always deployed to the same location under webapps. I already have solr/home defined in web.xml. I'll try copying my files in there, but I would have to extract every jar file and do this manually. From: Wilkes, Chris cwil...@gmail.com To: solr-user@lucene.apache.org Sent: Wed, February 9, 2011 3:44:03 PM Subject: Re: solr current working directory or reading config files Is your war always deployed to the same location, i.e. /usr/mycomp/myapplication/webapps/myapp.war? If so, then on startup copy the files out of your directory and put them under CATALINA_BASE/solr (/usr/mycomp/myapplication/solr), and in your war file have the META-INF/context.xml JNDI setting point to that: <Context> <Environment name="solr/home" type="java.lang.String" value="/usr/mycomp/myapplication/solr" override="true" /> </Context> If you know of a way to reference CATALINA_BASE in the context.xml that would make it easier. On Feb 9, 2011, at 12:00 PM, Tri Nguyen wrote: Hi, I have a class (in a jar) that reads from properties (text) files. I have these files in the same jar file as the class. However, when my class reads those properties files, those files cannot be found since Solr reads from Tomcat's bin directory. I don't really want to put the config files in Tomcat's bin directory. How do I reconcile this? Tri
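One way around copying the files at all, sketched here only as a suggestion with made-up class and file names, is to have the class load its properties from the classpath instead of from the working directory, so the files can stay inside the jar:

  import java.io.InputStream;
  import java.util.Properties;

  public class MyConfigLoader {
    public static Properties load() throws Exception {
      // Looks the file up on the classpath (including inside jars),
      // not relative to Tomcat's bin directory
      InputStream in = MyConfigLoader.class.getClassLoader()
          .getResourceAsStream("myconfig.properties");
      Properties props = new Properties();
      props.load(in);
      in.close();
      return props;
    }
  }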
Re: communication between entity processor and solr DataImporter
I can throw DataImportHandlerException (a runtime exception) from my entity processor, which will force a rollback. Tri From: Tri Nguyen tringuye...@yahoo.com To: solr-user@lucene.apache.org Sent: Wed, February 9, 2011 3:50:05 PM Subject: communication between entity processor and solr DataImporter Hi, I'd like to communicate errors from my entity processor to the DataImporter. Should there be an error in my entity processor, I'd like the index build to roll back. How can I do this? I want to throw an exception of some sort. The only thing I can think of is to force a runtime exception to be thrown in nextRow() of the entity processor, since runtime exceptions are not checked and do not have to be declared in the nextRow() method signature. How can I request that the nextRow() method signature be updated to throw Exception? Would that even make sense? Tri
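A rough sketch of what that looks like in a custom entity processor (the class and helper names are invented for illustration, and the exact DIH API can differ slightly between Solr versions):

  import java.util.Map;
  import org.apache.solr.handler.dataimport.DataImportHandlerException;
  import org.apache.solr.handler.dataimport.EntityProcessorBase;

  public class MyEntityProcessor extends EntityProcessorBase {
    @Override
    public Map<String, Object> nextRow() {
      try {
        return fetchNextRowFromSource();  // hypothetical helper that reads the real data source
      } catch (Exception e) {
        // SEVERE aborts the import, and the DataImporter rolls the index back
        throw new DataImportHandlerException(DataImportHandlerException.SEVERE,
            "Failed to read next row: " + e.getMessage());
      }
    }

    private Map<String, Object> fetchNextRowFromSource() {
      // hypothetical: pull the next record; returning null tells DIH there is no more data
      return null;
    }
  }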
Re: communication between entity processor and solr DataImporter
Tri: You might want to consider, rather than going through DIH with your own entity processor, just using SolrJ in a separate process. That allows you much finer control over the behavior of your indexing process. Making a connection to Solr via SolrJ and adding a one-field document is maybe a 20-line program. Of course the complexity will come in your database-access code and error handling, and your documents will be much larger than one field; I just included that estimate so you can gauge whether a pilot would be worthwhile... Just a thought. Erick On Wed, Feb 9, 2011 at 7:32 PM, Tri Nguyen tringuye...@yahoo.com wrote: I can throw DataImportHandlerException (a runtime exception) from my entity processor, which will force a rollback. Tri From: Tri Nguyen tringuye...@yahoo.com To: solr-user@lucene.apache.org Sent: Wed, February 9, 2011 3:50:05 PM Subject: communication between entity processor and solr DataImporter Hi, I'd like to communicate errors from my entity processor to the DataImporter. Should there be an error in my entity processor, I'd like the index build to roll back. How can I do this? I want to throw an exception of some sort. The only thing I can think of is to force a runtime exception to be thrown in nextRow() of the entity processor, since runtime exceptions are not checked and do not have to be declared in the nextRow() method signature. How can I request that the nextRow() method signature be updated to throw Exception? Would that even make sense? Tri
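To give a sense of the size Erick is describing, a minimal SolrJ indexer from that era looks roughly like this (the URL and field values are placeholders, and a real program would add the database access and error handling he mentions):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class MinimalIndexer {
    public static void main(String[] args) throws Exception {
      // Point at the Solr instance/core to index into
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

      // A one-field document, as in the estimate above
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "example-1");

      server.add(doc);
      server.commit();
    }
  }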
Re: Nutch and Solr search on the fly
Hi Charan, Thanks for the clarifications. The link I have been referring to (http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) does not say anything about using the crawl command. Do I have to do it after the last step mentioned? Thanks, Abi On Thu, Feb 10, 2011 at 12:58 AM, charan kumar charan.ku...@gmail.com wrote: Hi Abishek, depth is a param of the crawl command, not the fetch command. If you are using a custom script calling the individual stages of the nutch crawl, then depth N means running that script N times. You can put a loop in the script. Thanks, Charan On Wed, Feb 9, 2011 at 6:26 AM, .: Abhishek :. ab1s...@gmail.com wrote: Hi Erick, Thanks a bunch for the response. Could be a chance.. but all I am wondering is where to specify the depth in the whole entire process in the URL http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/? I tried specifying it during the fetcher phase but it was just ignored :( Thanks, Abi On Wed, Feb 9, 2011 at 10:11 PM, Erick Erickson erickerick...@gmail.com wrote: WARNING: I don't do Nutch much, but could it be that your crawl depth is 1? See: http://wiki.apache.org/nutch/NutchTutorial and search for depth. Best, Erick On Wed, Feb 9, 2011 at 9:06 AM, .: Abhishek :. ab1s...@gmail.com wrote: Hi Markus, I am sorry for not being clear. I meant to say that... Suppose a url, namely www.somehost.com/gifts/greetingcard.html (which in turn contains links to a.html, b.html, c.html, d.html), is injected into the seed.txt. After the whole process I was expecting a bunch of other pages crawled from this seed url. However, at the end all I see is the contents from only this page, namely www.somehost.com/gifts/greetingcard.html, and I do not see any other pages (here a.html, b.html, c.html, d.html) crawled from this one. The crawling happens only for the URLs mentioned in the seed.txt and does not proceed further from there. So I am just a bit confused: why is it not crawling the linked pages (a.html, b.html, c.html and d.html)? I get a feeling that I am missing something that the author of the blog (http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed everyone would know. Thanks, Abi On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma markus.jel...@openindex.io wrote: The parsed data is only sent to the Solr index if you tell a segment to be indexed: solrindex crawldb linkdb segment If you did this only once after injecting and then the consequent fetch, parse, update, index sequence, then you, of course, only see those URLs. If you don't index a segment after it has been parsed, you need to do it later on. On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote: Hi all, I am a newbie to nutch and solr. Well, relatively much newer to Solr than Nutch :) I have been using nutch for the past two weeks, and I wanted to know if I can query or search on my nutch crawls on the fly (before they complete). I am asking this because the websites I am crawling are really huge and it takes around 3-4 days for a crawl to complete. I want to analyze some quick results while the nutch crawler is still crawling the URLs. Someone suggested that Solr would make it possible. I followed the steps in http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. By this process, I see only the injected URLs shown in the Solr search.
I know I did something really foolish and the crawl never happened; I feel I am missing some information here. I think somewhere in the process there should be a crawl happening and I missed it. Just wanted to see if someone could help me point out where I went wrong in the process. Forgive my foolishness and thanks for your patience. Cheers, Abi -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
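For readers following along: with the Nutch 1.x tooling discussed in this thread, the depth is normally given to the all-in-one crawl command rather than to the individual fetch step, roughly like this (the paths and numbers are only illustrative, and exact options vary by Nutch version):

  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

followed by pushing the parsed data to Solr with the solrindex step Markus mentions, e.g. bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*. With a depth of 1, only the seed URLs themselves are fetched, which matches the behaviour Abi describes.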
Re: Architecture decisions with Solr
I tried the multi-core route and it gets too complicated and cumbersome to maintain. That is just from my own personal testing... It was suggested that each user have their own ID in a single index that you can query against accordingly. In the example schema.xml I believe there is a field type called textTight, or something like that, that is meant for SKU numbers. Give each user their own GUID or MD5 hash and add that as part of all your queries. That way, only their data are returned. It would be the equivalent of something like this... SELECT * FROM mytable WHERE userid = '3F2504E04F8911D39A0C0305E82C3301' AND ... Grant Ingersoll gave a presentation at the Lucene Revolution conference that demonstrated that you can build a query to be as easy or as complicated as any SQL statement. Maybe he can share that PPT? Adam On Feb 9, 2011, at 2:47 PM, Sujit Pal wrote: Another option (assuming the case where a user can be granted access to a certain class of documents, and more than one user would be able to access certain documents) would be to store the access filter (as an OR query of content types) in an external cache (perhaps a database or an external cache that the database changes are published to periodically), then use this access filter as a facet on the base query. -sujit On Wed, 2011-02-09 at 14:38 -0500, Glen Newton wrote: "This application will be built to serve many users" If this means that you have thousands of users, 1000s of VMs and/or 1000s of cores are not going to scale. Have an ID in the index for each user, and filter using it. Then they can see only their own documents. Assuming that you are building an app through which they authenticate and which talks to Solr (i.e. all requests are filtered using their ID). -Glen On Wed, Feb 9, 2011 at 2:31 PM, Greg Georges greg.geor...@biztree.com wrote: From what I understand about multicore, each of the indexes is independent of the others, right? Or would one index have access to the info of the other? My requirement is like you mention: a client has access only to his or her own search data based on their documents. Other clients have no access to the index of other clients. Greg -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: 9 février 2011 14:28 To: solr-user@lucene.apache.org Subject: Re: Architecture decisions with Solr What about standing up a VM (a search appliance that you would make) for each client? If there's no data sharing across clients, then using the same Solr server/index doesn't seem necessary. Solr will easily meet your needs though; it's the best there is. On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote: Hello all, I am looking into an enterprise search solution for our architecture and I am very pleased to see all the features Solr provides. In our case, we will have a need for a highly scalable application for multiple clients. This application will be built to serve many users who each will have a client account. Each client will have a multitude of documents to index (0-1000s of documents). After discussion we were talking about going multicore and having one index per client account. The reason for this is that security is achieved by having a separate index for each client, etc. Is this the best approach? How feasible is it (dynamically creating indexes on client account creation)? Is it better to go the faceted search capabilities route? Thanks for your help Greg
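To make the per-user filtering suggested above concrete, a request could simply attach the user's identifier as a filter query (the field name is a placeholder; the hash is the one from Adam's SQL example):

  http://localhost:8983/solr/select?q=report&fq=userid:3F2504E04F8911D39A0C0305E82C3301

The fq clause restricts every result to that user's documents and is cached separately from the main query, so it is cheap to apply to all requests.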
Faceting Query
Hi, What is the significance of copyField when used for faceting? Please explain with an example. Thanks! Isha
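A common pattern, sketched here with made-up field names rather than anything from the original mail, is to search on an analyzed text field but facet on an untokenized copy of it, so that the facet values stay whole instead of being split into terms. In schema.xml that looks roughly like:

  <field name="category" type="text" indexed="true" stored="true"/>
  <field name="category_facet" type="string" indexed="true" stored="false"/>
  <copyField source="category" dest="category_facet"/>

Queries then facet on the copy, e.g. q=shoes&facet=true&facet.field=category_facet, while full-text search still runs against the analyzed category field.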
Faceting Query
What is the facet.pivot parameter? Please explain with an example.
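Briefly, and with the caveat that facet.pivot was only available in development builds at the time of this thread: facet.pivot takes a comma-separated list of fields and returns nested, decision-tree style counts, i.e. for each value of the first field, the counts of the second field within it, and so on. A request might look roughly like q=*:*&facet=true&facet.pivot=category,manufacturer (field names are illustrative), returning manufacturer counts broken down per category.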