Re: How do I make sure the resulting documents contain the query terms?
Sorry for being unclear, and thank you for answering. Consider the following documents A(k0,k1,k2), B(k1,k2,k3), and C(k0,k2,k3), where A, B, C are document identifiers and the ks in brackets are the terms each contains. The Solr inverted index should then look something like:

k0 -- A | C
k1 -- A | B
k2 -- A | B | C
k3 -- B | C

Now let q=k1. How do I make sure C doesn't appear as a result, since it doesn't contain any occurrence of k1?

On Tue, Jun 7, 2011 at 12:21 AM, Erick Erickson erickerick...@gmail.com wrote:

I'm having a hard time understanding what you're driving at, can you provide some examples? This *looks* like filter queries, but I think you already know about those...

Best
Erick

On Mon, Jun 6, 2011 at 4:00 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote:

Hello, I've seen that through boosting it's possible to influence the scoring function, but what I would like is a sort of boolean property: a way to search only the documents indexed under a given keyword (or their intersection/union) rather than the whole set. Is this supported in any way?

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).
If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
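The boolean-retrieval behavior being asked about can be illustrated with a toy inverted index in plain Python (a sketch of the concept only, not Solr's actual data structures): only documents on the queried term's posting list are ever candidates, so C cannot show up for q=k1.

```python
# Toy inverted index for A(k0,k1,k2), B(k1,k2,k3), C(k0,k2,k3).
index = {
    "k0": {"A", "C"},
    "k1": {"A", "B"},
    "k2": {"A", "B", "C"},
    "k3": {"B", "C"},
}

def search(term):
    """Only documents on the term's posting list are candidates."""
    return index.get(term, set())

# q=k1: C is never even considered, because retrieval walks the
# posting list for k1, which contains only A and B.
print(sorted(search("k1")))  # ['A', 'B']
```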
Re: How do I make sure the resulting documents contain the query terms?
k0 -- A | C
k1 -- A | B
k2 -- A | B | C
k3 -- B | C

Now let q=k1, how do I make sure C doesn't appear as a result since it doesn't contain any occurrence of k1?

Do we bother to do that? That's what Lucene does :)

--
View this message in context: http://lucene.472066.n3.nabble.com/How-do-I-make-sure-the-resulting-documents-contain-the-query-terms-tp3031637p3033451.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: How do I make sure the resulting documents contain the query terms?
On Tue, Jun 7, 2011 at 8:43 AM, pravesh suyalprav...@yahoo.com wrote:

k0 -- A | C
k1 -- A | B
k2 -- A | B | C
k3 -- B | C

Now let q=k1, how do I make sure C doesn't appear as a result since it doesn't contain any occurrence of k1?

Do we bother to do that? That's what Lucene does :)

Lucene/Solr doesn't do that: it ranks documents based on a scoring function, and with that it lacks the possibility of specifying that a particular term must appear (the closest way I know of is boosting it). The solution would be a way to tell Solr/Lucene which documents/indices to query, i.e. to query only the union/intersection of the documents in which k1,...,kn appear, instead of querying all indexed documents and applying the ranking function (which merely gives more weight to documents that contain k1,...,kn).

--
View this message in context: http://lucene.472066.n3.nabble.com/How-do-I-make-sure-the-resulting-documents-contain-the-query-terms-tp3031637p3033451.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Regards,
K. Gabriele
Re: Master Slave help
thanks Jayendra..

From: Jayendra Patil jayendra.patil@gmail.com
To: solr-user@lucene.apache.org
Sent: Tue, 7 June, 2011 6:55:58 AM
Subject: Re: Master Slave help

Do you mean the replication happens every time you restart the server? If so, you would need to modify the events on which you want replication to happen. Check for the replicateAfter tag and remove the startup option if you don't need it.

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- Replicate on 'startup' and 'commit'. 'optimize' is also a valid value for replicateAfter. -->
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
    <!-- Create a backup after 'optimize'. Other values can be 'commit', 'startup'. It is possible to have multiple entries of this config string. Note that this is just for backup; replication does not require this. -->
    <!-- <str name="backupAfter">optimize</str> -->
    <!-- If configuration files need to be replicated, give the names here, separated by commas. -->
    <str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
    <!-- The default value of reservation is 10 secs. See the documentation below. Normally you should not need to specify this. -->
    <str name="commitReserveDuration">00:00:10</str>
  </lst>
</requestHandler>

Regards,
Jayendra

On Mon, Jun 6, 2011 at 11:24 AM, Rohit Gupta ro...@in-rev.com wrote:

Hi, I have configured my master-slave servers and everything seems to be running fine; the replication completed the first time it ran. But every time I go to the replication link in the admin panel after restarting the server, or on server startup, I notice the replication starting from scratch, or at least the stats show that. What could be wrong?

Thanks,
Rohit
Commit taking very long
Hi,

My commit seems to be taking too much time; if you look at the DataImport status given below, committing 1000 docs is taking longer than 24 minutes:

<str name="status">busy</str>
<str name="importResponse">A command is still running...</str>
<lst name="statusMessages">
  <str name="Time Elapsed">0:24:43.156</str>
  <str name="Total Requests made to DataSource">1001</str>
  <str name="Total Rows Fetched">1658</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2011-06-07 09:15:17</str>
  <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
</lst>

What can be causing this? I have tried looking for a reason or a way to improve this, but am just not able to find one. At this rate my documents would never get indexed, given that I have more than 100,000 records coming into the database every hour.

Regards,
Rohit
getting numberformat exception while using tika
Hi,

We are using ExtractingRequestHandler and are getting the following error when giving a Microsoft .docx file for indexing.

1. I think this is something to do with the date field definition, but I'm not very sure. What field type should we use?
2. We are trying to index a jpg; when we search over the name of the jpg, it does not come back (though I am passing an id).
3. What about zip or rar files? Does Tika with Solr handle those?

java.lang.NumberFormatException: For input string: "2011-01-27T07:18:00Z"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Long.parseLong(Long.java:412)
    at java.lang.Long.parseLong(Long.java:461)
    at org.apache.solr.schema.TrieField.createField(TrieField.java:434)
    at org.apache.solr.schema.SchemaField.createField(SchemaField.java:98)
    at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:204)
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:277)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:198)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:238)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:619)

Thanks
Naveen
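The stack trace shows Solr's TrieField calling Long.parseLong on the value "2011-01-27T07:18:00Z": a numeric (trie long/int) field is being fed an ISO-8601 date, so the field should be declared with a date type in the schema instead. A small Python sketch of the type mismatch (illustrative only, not Solr code):

```python
from datetime import datetime, timezone

value = "2011-01-27T07:18:00Z"  # the value from the exception message

# What a numeric field effectively attempts (the Long.parseLong step):
try:
    int(value)
    parse_ok = True
except ValueError:
    parse_ok = False  # fails, exactly like the NumberFormatException

# The same string parses cleanly as an ISO-8601 timestamp, confirming
# it is a date value, not a long.
parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
print(parse_ok, parsed.isoformat())
```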
How many fields can SOLR handle?
Hello,

I have a SOLR implementation with 1M products. Every product has some information; let's say a television has information about pixels and inches, while a computer has information about harddisk, cpu, gpu. When a user searches for "computer" I want to show the correct facets. An example:

User searches for: Computer
Facets:
  CPU: AMD (10), Intel (300)
  GPU: Nvidia (20), Ati (290)

Every product has different facets. I have something like this in my schema:

<dynamicField name="*_FACET" type="facetType" indexed="true" stored="true" multiValued="true"/>

So in SOLR I now have a lot of fields: CPU_FACET, GPU_FACET, etc. How many fields can SOLR handle?

Another question: is it possible to add the FACET fields automatically to my query, e.g. facet.field=*_FACET? Right now I first do a request to a DB to get the FACET titles and add them to the request: facet.field=cpu_FACET,gpu_FACET. I'm afraid that *_FACET is an overkill solution.

--
View this message in context: http://lucene.472066.n3.nabble.com/How-many-fields-can-SOLR-handle-tp3033910p3033910.html
Sent from the Solr - User mailing list archive at Nabble.com.
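On the second question: one common workaround is to fetch the list of *_FACET field names once (from the schema or a DB, rather than per request) and expand them yourself into repeated facet.field parameters, since each facet field must be named explicitly in the request. A hedged Python sketch (field names hypothetical):

```python
from urllib.parse import urlencode

# Hypothetical list of dynamic facet fields, fetched once and cached,
# e.g. from the DB or from the schema via Solr's admin/luke handler.
facet_fields = ["cpu_FACET", "gpu_FACET"]

# urlencode with a list of tuples produces repeated facet.field params.
params = [("q", "computer"), ("facet", "true")] + [
    ("facet.field", f) for f in facet_fields
]
query_string = urlencode(params)
print(query_string)
# q=computer&facet=true&facet.field=cpu_FACET&facet.field=gpu_FACET
```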
function queries scope
Hi,

I need to use function query operations with the score of a given query, but only on the docset that I get from the query, and I don't know if this is possible. Example:

q=shops in madrid

returns docs with a specific score for each one, but now I need to do some stuff like:

q=sum(product(2,query(shops in madrid)),productValueField)

but this will return all the docs in my index. I know that I can do it via filter queries, e.g.:

q=sum(product(2,query(shops in madrid)),productValueField)&fq=shops in madrid

but this will execute the query two times, and I don't want that because performance is important to our application. Is there another approach to accomplish this?

Thanks in advance,

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42
Indexing Mediawiki
I have a need to index an internal instance of Mediawiki. I'd like to use DIH if I can since I have access to the database but the example provided on the Solr wiki uses a Mediawiki dump XML file. Does anyone have any experience using DIH in this manner? Am I barking up the wrong tree and would be better off dumping and indexing the wiki instead? Thanks - Tod
solr 3.1 java.lang.NoClassDefFoundError org/carrot2/core/ControllerFactory
As per the subject, I am getting java.lang.NoClassDefFoundError: org/carrot2/core/ControllerFactory when I try to run clustering. I am using Solr 3.1 and get the following error:

java.lang.NoClassDefFoundError: org/carrot2/core/ControllerFactory
    at org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.init(CarrotClusteringEngine.java:74)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
    at java.lang.reflect.Constructor.newInstance(Unknown Source)
    at java.lang.Class.newInstance0(Unknown Source)
    at java.lang.Class.newInstance(Unknown Source)
    at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:412)
    at org.apache.solr.handler.clustering.ClusteringComponent.inform(ClusteringComponent.java:203)
    at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:522)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:594)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:458)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:130)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:94)
    at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
    at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
    at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
    at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
    at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
    at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
    at org.mortbay.jetty.Server.doStart(Server.java:224)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.mortbay.start.Main.invokeMain(Main.java:194)
    at org.mortbay.start.Main.start(Main.java:534)
    at org.mortbay.start.Main.start(Main.java:441)
    at org.mortbay.start.Main.main(Main.java:119)
Caused by: java.lang.ClassNotFoundException: org.carrot2.core.ControllerFactory
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.net.FactoryURLClassLoader.loadClass(Unknown Source)

using the following configuration:

<searchComponent class="org.apache.solr.handler.clustering.ClusteringComponent" name="clustering">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
    <!-- Engine-specific parameters -->
    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
  </lst>
</searchComponent>

<requestHandler name="/search" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
  <!-- By default, this will register the following components:
  <arr name="components">
    <str>query</str>
    <str>facet</str>
    <str>mlt</str>
    <str>highlight</str>
    <str>debug</str>
  </arr>
  -->
</requestHandler>

<requestHandler name="clusty" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <bool name="clustering">true</bool>
    <str name="clustering.engine">default</str>
    <bool name="clustering.results">true</bool>
    <!-- Fields to cluster on -->
    <str name="carrot.title">title</str>
    <str name="carrot.snippet">all_text</str>
Re: Documents update
Created the file and reloaded Solr - ExternalFileField works fine. But if I change the external files and do

curl http://127.0.0.1:4900/solr/site/update -H 'Content-Type: text/xml' --data-binary '<commit/>'

then no changes are made. If I start Solr without the external files and then create them, they are not picked up. What is wrong?

PS: Solr 3.2
http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html

On Tuesday 31 May 2011 15:41:32 Denis Kuzmenok wrote:

Flags are stored to filter results and it's pretty heavily loaded; it's working fine, but I can't update the index very often just to keep the flags up to date =\ Where can I read about using external fields / files?

And it wouldn't work unless all the data is stored anyway. Currently there's no way to update a single field in a document, although there's work being done in that direction (see the column stride JIRA). What do you want to do with these fields? If it's to influence scoring, you could look at external fields. If the flags are a selection criterion, it's... harder. What are the flags used for? Could you consider essentially storing a map of the uniqueKeys and flags in a special document and having your app read that document and merge the results with the output? If this seems irrelevant, a more complete statement of the use-case would be helpful.

Best
Erick
Re: How do I make sure the resulting documents contain the query terms?
Gabriele,

Lucene uses a combination of boolean retrieval and the VSM for its IR. A straightforward query for a keyword will only match docs containing that keyword. Things quickly get subtle and complex the more sugar you add (more complicated queries across fields, more complex analysis chains), but I think the short answer to your question is that C will not be returned, and it will not be scored either.

lee c

On 7 June 2011 08:30, Gabriele Kahlout gabri...@mysimpatico.com wrote:

On Tue, Jun 7, 2011 at 8:43 AM, pravesh suyalprav...@yahoo.com wrote:

k0 -- A | C
k1 -- A | B
k2 -- A | B | C
k3 -- B | C

Now let q=k1, how do I make sure C doesn't appear as a result since it doesn't contain any occurrence of k1?

Do we bother to do that? That's what Lucene does :)

Lucene/Solr doesn't do that: it ranks documents based on a scoring function, and with that it lacks the possibility of specifying that a particular term must appear (the closest way I know of is boosting it). The solution would be a way to tell Solr/Lucene which documents/indices to query, i.e. to query only the union/intersection of the documents in which k1,...,kn appear, instead of querying all indexed documents and applying the ranking function (which merely gives more weight to documents that contain k1,...,kn).

--
View this message in context: http://lucene.472066.n3.nabble.com/How-do-I-make-sure-the-resulting-documents-contain-the-query-terms-tp3031637p3033451.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Regards,
K. Gabriele
clustering problems on 3.1
I added the following to my configuration:

<lib dir="c:/projects/solrtest/dist/" regex="apache-solr-clustering-.*\.jar" />

<requestHandler name="clusty" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <bool name="clustering">true</bool>
    <str name="clustering.engine">default</str>
    <bool name="clustering.results">true</bool>
    <!-- Fields to cluster on -->
    <str name="carrot.title">title</str>
    <str name="carrot.snippet">all_text</str>
    <str name="hl.fl">all_text title</str>
    <!-- for this field, we want no fragmenting, just highlighting -->
    <str name="f.name.hl.fragsize">150</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>

<searchComponent class="org.apache.solr.handler.clustering.ClusteringComponent" name="clustering">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
    <!-- Engine-specific parameters -->
    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
  </lst>
</searchComponent>

which ended up with the message:

java.lang.NoClassDefFoundError: org/carrot2/core/ControllerFactory

Whenever I did a request I got a 404 response back, and

SEVERE: REFCOUNT ERROR: unreferenced org.apache.solr.SolrCore@14db38a4 (core1) has a reference count of 1

appeared in my console. Any suggestions?

Thanks,
Bryan Rasmussen
Re: Commit taking very long
Are you optimizing? That is unnecessary when committing, and is often the culprit.

Best
Erick

On Tue, Jun 7, 2011 at 5:42 AM, Rohit Gupta ro...@in-rev.com wrote:

Hi, my commit seems to be taking too much time; if you look at the DataImport status given below, committing 1000 docs is taking longer than 24 minutes:

<str name="status">busy</str>
<str name="importResponse">A command is still running...</str>
<lst name="statusMessages">
  <str name="Time Elapsed">0:24:43.156</str>
  <str name="Total Requests made to DataSource">1001</str>
  <str name="Total Rows Fetched">1658</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2011-06-07 09:15:17</str>
  <str name="">Indexing completed. Added/Updated: 1000 documents. Deleted 0 documents.</str>
</lst>

What can be causing this? I have tried looking for a reason or a way to improve this, but am just not able to find one. At this rate my documents would never get indexed, given that I have more than 100,000 records coming into the database every hour.

Regards,
Rohit
Re: problem: zooKeeper Integration with solr
How is this method (http://localhost:8983/solr/select?shards=Machine:Port/Solr Path,Machine:Port/Solr Path&indent=true&q=query) better than ZooKeeper? Could you please point me to any performance doc?

On 7 June 2011 08:18, bmdakshinamur...@gmail.com bmdakshinamur...@gmail.com wrote:

Instead of integrating ZooKeeper, you could create shards over multiple machines and specify the shards while you are querying Solr. Eg:

http://localhost:8983/solr/select?shards=Machine:Port/Solr Path,Machine:Port/Solr Path&indent=true&q=query

On Mon, Jun 6, 2011 at 5:59 PM, Mohammad Shariq shariqn...@gmail.com wrote:

Hi folks,

I am using Solr to index around 100mn docs. Now I am planning to move to cluster-based Solr so that I can scale the indexing and searching process. Since SolrCloud is in the development stage, I am trying to index in a shard-based environment using ZooKeeper. I followed the steps from http://wiki.apache.org/solr/ZooKeeperIntegration but I am still not able to do distributed search. Once I index the docs in one shard, I am not able to query them from the other shard and vice versa (using the query http://localhost:8180/solr/select/?q=itunes&version=2.2&start=0&rows=10&indent=on). I am running Solr 3.1 on Ubuntu 10.10. Please help me.

--
Thanks and Regards
Mohammad Shariq

--
Thanks and Regards,
DakshinaMurthy BM

--
Thanks and Regards
Mohammad Shariq
RE: SpellCheckComponent performance
As I may have mentioned before, VuFind is actually doing two Solr queries for every search: a base query that gets basic spelling suggestions, and a supplemental spelling-only query that gets shingled spelling suggestions. If there's a way to get two different spelling responses in a single query, I'd love to hear about it... but the double-querying doesn't seem to be a huge problem; the delays I'm talking about are in the spelling portion of the initial query.

Just for the sake of completeness, here are both of my spelling field types:

<!-- Basic Text Field for use with Spell Correction -->
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- More advanced spell checking field. -->
<fieldType name="textSpellShingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

...and here are the fields:

<field name="spelling" type="textSpell" indexed="true" stored="true"/>
<field name="spellingShingle" type="textSpellShingle" indexed="true" stored="true" multiValued="true"/>

As you can probably guess, I'm using spelling in my main query and spellingShingle in my supplemental query. Here are stats on the spelling field:

{field=spelling,memSize=107830314,tindexSize=249184,time=25747,phase1=25150,nTerms=1343061,bigTerms=231,termInstances=40960454,uses=1}

(I obtained these numbers by temporarily adding the spelling field as a facet to my warming query -- probably not a very smart way to do it, but it was the only way I could figure out! If there's a more elegant and accurate approach, I'd be interested to know what it is.)

I should also note that my basic spelling index is 114MB and my shingled spelling index is 931MB -- not outrageously large. Is there a way to persuade Solr to load these into memory for faster performance?

thanks,
Demian

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, June 06, 2011 6:23 PM
To: solr-user@lucene.apache.org
Subject: Re: SpellCheckComponent performance

Hmmm, how are you configuring your spell checker? The first-time slowdown is probably due to cache warming, but subsequent 500 ms slowdowns seem odd. How many unique terms are there in your spellcheck index? It'd probably be best if you showed us your fieldtype and field definitions...

Best
Erick

On Mon, Jun 6, 2011 at 4:04 PM, Demian Katz demian.k...@villanova.edu wrote:

I'm continuing to work on tuning my Solr server, and now I'm noticing that my biggest bottleneck is the SpellCheckComponent. This is eating multiple seconds on most first-time searches, and still taking around 500ms even on cached searches. Here is my configuration:

<searchComponent name="spellcheck" class="org.apache.solr.handler.component.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">basicSpell</str>
    <str name="field">spelling</str>
    <str name="accuracy">0.75</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="queryAnalyzerFieldType">textSpell</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

I've done a bit of searching, but the best advice I could find for making the search component go faster involved reducing spellcheck.maxCollationTries, which doesn't even seem to apply to my settings. Does anyone have any advice on tuning this aspect of my configuration? Are there any extra debug settings that might give deeper insight into how the component is spending its time?

thanks,
Demian
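For reference, a shingle filter configured with maxShingleSize=2 and outputUnigrams=false (as in the textSpellShingle type above) emits adjacent word pairs instead of single terms; roughly like this simplified Python imitation (a sketch of the idea, not Lucene's implementation):

```python
def shingles(tokens, max_shingle_size=2, output_unigrams=False):
    """Rough imitation of ShingleFilter: emit adjacent word n-grams."""
    out = list(tokens) if output_unigrams else []
    for n in range(2, max_shingle_size + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

# maxShingleSize=2, outputUnigrams=false: only adjacent pairs come out.
print(shingles(["please", "spell", "this"]))  # ['please spell', 'spell this']
```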
Re: [ANNOUNCEMENT] PHP Solr Extension 1.0.1 Stable Has Been Released
Hello,

I have some problems with the installation of the new PECL package solr-1.0.1. I ran these commands:

pecl uninstall solr-beta   (to uninstall the old version, 0.9.11)
pecl install solr

The install runs but then gives the following error message:

/tmp/tmpKUExET/solr-1.0.1/solr_functions_helpers.c: In function 'solr_json_to_php_native':
/tmp/tmpKUExET/solr-1.0.1/solr_functions_helpers.c:1123: error: too many arguments to function 'php_json_decode'
make: *** [solr_functions_helpers.lo] Error 1
ERROR: `make' failed

I have PHP version 5.2.17. How can I fix this?

--
View this message in context: http://lucene.472066.n3.nabble.com/ANNOUNCEMENT-PHP-Solr-Extension-1-0-1-Stable-Has-Been-Released-tp3024040p3034350.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: java.lang.AbstractMethodError at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
Finally figured out the problem. -- View this message in context: http://lucene.472066.n3.nabble.com/java-lang-AbstractMethodError-at-org-apache-solr-handler-ContentStreamHandlerBase-handleRequestBody--tp3026470p3034456.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr Cloud Query Question
I am currently experimenting with the Solr Cloud code on trunk and just had a quick question. Let's say my setup has 3 nodes: a, b, and c. Node a has 1000 results which match a particular query, b has 2000, and c has 3000. When executing this query and asking for row 900, what specifically happens? From reading the Distributed Search wiki I would expect that nodes a, b, and c each respond with 900 results, and the coordinating node is responsible for taking the top-scored items and throwing away the rest. Is this correct, or is there some additional coordination, where nodes a, b, and c return only an id and a score, and the coordinating node makes an additional request to get back the documents for the ids which make up the top list?
Re: Solr Cloud Query Question
On Tue, Jun 7, 2011 at 9:35 AM, Jamie Johnson jej2...@gmail.com wrote: I am currently experimenting with the Solr Cloud code on trunk and just had a quick question. Lets say my setup had 3 nodes a, b and c. Node a has 1000 results which meet a particular query, b has 2000 and c has 3000. When executing this query and asking for row 900 what specifically happens? From reading the Distributed Search Wiki I would expect that node a responds with 900, node b responds with 900 and c responds with 900 and the coordinating node is responsible for taking the top scored items and throwing away the rest, is this correct or is there some additional coordination that happens where nodes a, b and c return back an id and a score and the coordinating node makes an additional request to get back the documents for the ids which make up the top list? The latter is correct - the first phase only collects enough information to merge ids from the shards, and then a second phase requests the stored fields, highlighting, etc for the specific docs that will be returned. -Yonik http://www.lucidimagination.com
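The two-phase flow Yonik describes can be sketched schematically in Python (toy in-memory data matching the question's shard sizes; not actual SolrCloud code):

```python
import heapq
from collections import defaultdict

# Hypothetical shards: doc_id -> (score, stored_fields).
shards = {
    "a": {f"a{i}": (1000 - i, {"title": f"doc a{i}"}) for i in range(1000)},
    "b": {f"b{i}": (2000 - i, {"title": f"doc b{i}"}) for i in range(2000)},
    "c": {f"c{i}": (3000 - i, {"title": f"doc c{i}"}) for i in range(3000)},
}
rows = 900

# Phase 1: each shard returns only (score, id) for its own top `rows` docs.
candidates = []
for name, docs in shards.items():
    top_local = heapq.nlargest(
        rows, ((score, doc_id) for doc_id, (score, _) in docs.items())
    )
    candidates.extend((score, doc_id, name) for score, doc_id in top_local)

# Coordinator merges by score and keeps the global top `rows` ids.
winners = heapq.nlargest(rows, candidates)

# Phase 2: a second request fetches stored fields only for the winning ids.
wanted = defaultdict(set)
for _, doc_id, name in winners:
    wanted[name].add(doc_id)
results = {
    doc_id: shards[name][doc_id][1]
    for name, ids in wanted.items()
    for doc_id in ids
}
print(len(results))  # 900
```

With these toy scores, shard c's 900 highest scores (3000 down to 2101) beat everything on a and b, so all winners come from c; the point is that full documents are fetched only once, for the merged top list.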
Re: function queries scope
One way is to use the boost qparser:
http://search-lucene.com/jd/solr/org/apache/solr/search/BoostQParserPlugin.html

q={!boost b=productValueField}shops in madrid

Or you can use the edismax parser, which has a boost parameter that does the same thing:

defType=edismax&q=shops in madrid&boost=productValueField

-Yonik
http://www.lucidimagination.com

On Tue, Jun 7, 2011 at 6:53 AM, Marco Martinez mmarti...@paradigmatecnologico.com wrote:

Hi, I need to use function query operations with the score of a given query, but only on the docset that I get from the query, and I don't know if this is possible. Example: q=shops in madrid returns docs with a specific score for each one, but now I need to do some stuff like q=sum(product(2,query(shops in madrid)),productValueField), but this will return all the docs in my index. I know that I can do it via filter queries, e.g. q=sum(product(2,query(shops in madrid)),productValueField)&fq=shops in madrid, but this will execute the query two times, and I don't want that because performance is important to our application. Is there another approach to accomplish this?

Thanks in advance,

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42
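Numerically, the {!boost} parser multiplies each matching document's relevancy score by the function value, and it never adds documents that don't match the wrapped query; a toy Python sketch (made-up scores and field values):

```python
# Toy docs: relevancy score for q="shops in madrid" (None = no match)
# and the per-doc productValueField value.
docs = {
    "d1": {"score": 1.2, "productValueField": 3.0},
    "d2": {"score": 0.8, "productValueField": 10.0},
    "d3": {"score": None, "productValueField": 99.0},  # doesn't match q
}

# {!boost b=productValueField}q : score * b, restricted to matches of q.
boosted = {
    doc_id: v["score"] * v["productValueField"]
    for doc_id, v in docs.items()
    if v["score"] is not None
}
print(boosted)  # d3 is absent: boosting never widens the result set
```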
Re: function queries scope
Thanks, but it's not what I'm looking for, because the BoostQParserPlugin multiplies the score of the query with the function queries defined in the b param of the BoostQParserPlugin, and I can't use edismax because we have our own qparser. It seems that I have to code another qparser. Thanks anyway, Yonik.

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42

2011/6/7 Yonik Seeley yo...@lucidimagination.com

One way is to use the boost qparser: http://search-lucene.com/jd/solr/org/apache/solr/search/BoostQParserPlugin.html q={!boost b=productValueField}shops in madrid Or you can use the edismax parser, which has a boost parameter that does the same thing: defType=edismax&q=shops in madrid&boost=productValueField -Yonik http://www.lucidimagination.com On Tue, Jun 7, 2011 at 6:53 AM, Marco Martinez mmarti...@paradigmatecnologico.com wrote: Hi, I need to use the function query operations with the score of a given query, but only in the docset that I get from the query, and I don't know if this is possible. Example: q=shops in madrid returns 1 docs with a specific score for each doc, but now I need to do some stuff like q=sum(product(2,query(shops in madrid)),productValueField), but this will return all the docs in my index. I know that I can do it via filter queries, e.g. q=sum(product(2,query(shops in madrid)),productValueField)&fq=shops in madrid, but this will do the query twice and I don't want that because performance is important to our application. Is there another approach to accomplish that? Thanks in advance, Marco Martínez Bautista http://www.paradigmatecnologico.com Avenida de Europa, 26. Ática 5. 3ª Planta 28224 Pozuelo de Alarcón Tel.: 91 352 59 42
RE: SpellCheckComponent performance
Demian, If you omit spellcheckIndexDir from the configuration, it will create an in-memory spelling dictionary.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Demian Katz [mailto:demian.k...@villanova.edu]
Sent: Tuesday, June 07, 2011 7:59 AM
To: solr-user@lucene.apache.org
Subject: RE: SpellCheckComponent performance

As I may have mentioned before, VuFind is actually doing two Solr queries for every search -- a base query that gets basic spelling suggestions, and a supplemental spelling-only query that gets shingled spelling suggestions. If there's a way to get two different spelling responses in a single query, I'd love to hear about it... but the double-querying doesn't seem to be a huge problem -- the delays I'm talking about are in the spelling portion of the initial query. Just for the sake of completeness, here are both of my spelling field types:

<!-- Basic Text Field for use with Spell Correction -->
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- More advanced spell checking field. -->
<fieldType name="textSpellShingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

...and here are the fields:

<field name="spelling" type="textSpell" indexed="true" stored="true"/>
<field name="spellingShingle" type="textSpellShingle" indexed="true" stored="true" multiValued="true"/>

As you can probably guess, I'm using spelling in my main query and spellingShingle in my supplemental query. Here are stats on the spelling field:

{field=spelling,memSize=107830314,tindexSize=249184,time=25747,phase1=25150,nTerms=1343061,bigTerms=231,termInstances=40960454,uses=1}

(I obtained these numbers by temporarily adding the spelling field as a facet to my warming query -- probably not a very smart way to do it, but it was the only way I could figure out! If there's a more elegant and accurate approach, I'd be interested to know what it is.) I should also note that my basic spelling index is 114MB and my shingled spelling index is 931MB -- not outrageously large. Is there a way to persuade Solr to load these into memory for faster performance?

thanks,
Demian

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, June 06, 2011 6:23 PM
To: solr-user@lucene.apache.org
Subject: Re: SpellCheckComponent performance

Hmmm, how are you configuring your spell checker?
The first-time slowdown is probably due to cache warming, but subsequent 500 ms slowdowns seem odd. How many unique terms are there in your spellcheck index? It'd probably be best if you showed us your fieldtype and field definition...

Best
Erick

On Mon, Jun 6, 2011 at 4:04 PM, Demian Katz demian.k...@villanova.edu wrote:

I'm continuing to work on tuning my Solr server, and now I'm noticing that my biggest bottleneck is the SpellCheckComponent. This is eating multiple seconds on most first-time searches, and still taking around 500ms even on cached searches. Here is my configuration:

<searchComponent name="spellcheck" class="org.apache.solr.handler.component.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">basicSpell</str>
    <str name="field">spelling</str>
    <str name="accuracy">0.75</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="queryAnalyzerFieldType">textSpell</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

I've done a bit of searching, but the best advice I could find for making the search component go faster involved reducing
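Following James Dyer's suggestion earlier in the thread, the same spellchecker configured without spellcheckIndexDir (so the dictionary is held in RAM instead of on disk) might look like this. This is an untested sketch based on the configuration quoted above, not a verified config:

```xml
<searchComponent name="spellcheck" class="org.apache.solr.handler.component.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">basicSpell</str>
    <str name="field">spelling</str>
    <str name="accuracy">0.75</str>
    <!-- no spellcheckIndexDir: the spelling dictionary is built in memory -->
    <str name="queryAnalyzerFieldType">textSpell</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>
```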
Re: Nullpointer Exception in Solr 4.x in DebugComponent when using wildcard in facet value
Hi Yonik, thanks, it's working in trunk now again... I had to re-index though because of exceptions at startup; did the index format change again between trunk of beginning/mid May and current trunk?

best regards,
Stefan

On 03.06.2011 15:32, Yonik Seeley wrote:

This bug was introduced during the cutover from strings to BytesRef on TermRangeQuery. I just committed a fix.

-Yonik
http://www.lucidimagination.com

On Fri, Jun 3, 2011 at 5:42 AM, Stefan Moises moi...@shoptimax.de wrote:

Hi, in Solr 4.x (trunk version of mid May) I have noticed a null pointer exception if I activate debugging (debug=true) and use a wildcard to filter by facet value, e.g. if I have a price field:

...&debug=true&facet.field=price&fq=price:[500+TO+*]

I get:

SEVERE: java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.solr.search.QueryParsing.toString(QueryParsing.java:538)
at org.apache.solr.handler.component.DebugComponent.process(DebugComponent.java:77)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:239)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1298)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:465)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:555)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.NullPointerException
at org.apache.solr.search.QueryParsing.toString(QueryParsing.java:402)
at org.apache.solr.search.QueryParsing.toString(QueryParsing.java:535)

This used to work in Solr 1.4 and I was wondering if it's a bug or a new feature, and if there is a trick to get this working again?

Best regards,
Stefan

--
With best regards from Nürnberg,
Stefan Moises

***
Stefan Moises
Senior Software Developer

shoptimax GmbH
Guntherstraße 45 a
90461 Nürnberg
Amtsgericht Nürnberg HRB 21703
GF Friedrich Schreieck

Tel.: 0911/25566-25
Fax: 0911/25566-29
moi...@shoptimax.de
http://www.shoptimax.de
***
Re: Debugging a Solr/Jetty Hung Process
OK... The fix I thought would fix it didn't fix it (which was to use the commitWithin feature). What I can gather from `ps` is that the thread has pages locked in memory. Currently I'm using native locking for Solr. Would switching to simple help alleviate this problem?

Chris

On Jun 4, 2011, at 2:48 PM, Chris Cowan wrote:

I found this thread that looks similar to what's happening on my system. I think what happens is there are multiple commits happening at once from the clients and it's causing the same issue. I'm going to use the commitWithin argument to the updates to see if that fixes the problem. I will report back with any findings.

Chris

On Jun 1, 2011, at 12:42 PM, Jonathan Rochkind wrote:

First guess (and it really is just a guess) would be Java garbage collection taking over. There are some JVM parameters you can use to tune the GC process; especially if the machine is multi-core, making sure GC happens in a separate thread is helpful. But figuring out exactly what's going on requires confusing JVM debugging, of which I am no expert either.

On 6/1/2011 3:04 PM, Chris Cowan wrote:

About once a day a Solr/Jetty process gets hung on my server, consuming 100% of one of the CPUs. Once this happens the server no longer responds to requests. I've looked through the logs to try and see if anything stands out, but so far I've found nothing out of the ordinary. My current remedy is to log in and just kill the single process that's hung. Once that happens everything goes back to normal and I'm good for a day or so. I'm currently running the following: solr-jetty-1.4.0+ds1-1ubuntu1, which is comprised of Solr 1.4.0 and Jetty 6.1.22 on Ubuntu 10.10. I'm pretty new to managing a Jetty/Solr instance, so at this point I'm just looking for advice on how I should go about troubleshooting this problem.

Chris
Re: Default query parser operator
I feel like this should be fairly easy to do but I just don't see anywhere in the documentation how to do this. Perhaps I am using the wrong search parameters.

On Mon, Jun 6, 2011 at 12:19 PM, Brian Lamb brian.l...@journalexperts.com wrote:

Hi all, Is it possible to change the query parser operator for a specific field without having to explicitly type it in the search field? For example, I'd like to use:

http://localhost:8983/solr/search/?q=field1:word token field2:parser syntax

instead of

http://localhost:8983/solr/search/?q=field1:word AND token field2:parser syntax

But I only want it to be applied to field1, not field2, and I want the operator to always be AND unless the user explicitly types in OR.

Thanks,
Brian Lamb
Solr Custom Installation
Hey there. I was wondering if Solr can be embedded into my Java web app. As far as I know, Solr comes as a war, or bundled with Jetty if you don't have a container. I've opened the war's web.xml and found out that it only has a couple of servlets, filters and that's it. So, would it be possible to declare those servlets in *my* web.xml, and include the appropriate jars in my classpath, instead of having another webapp deployed in the container? Does Solr have the jars mavenized?

Thank you,
Fede.
Re: How do I make sure the resulting documents contain the query terms?
Um, normally that would never happen, because, well, like you say, the inverted index doesn't have doc C for term k1, because doc C didn't include term k1. If you search on q=k1, then how/why would doc C ever be in your result set? Are you seeing it in your result set? The question then would be _why_, what weird thing is going on to make that happen; that's not expected. The result set _starts_ from only the documents that actually include the term. Boosting/relevancy ranking only affects what order these documents appear in, but there's no reason document C should be in the result set at all in your case of q=k1, where doc C is not indexed under k1.

On 6/7/2011 2:35 AM, Gabriele Kahlout wrote:

Sorry for being unclear, and thank you for answering. Consider the following documents A(k0,k1,k2), B(k1,k2,k3), and C(k0,k2,k3), where A, B, C are document identifiers and the ks in brackets with each are the terms each contains. So the Solr inverted index should be something like:

k0 -- A | C
k1 -- A | B
k2 -- A | B | C
k3 -- B | C

Now let q=k1; how do I make sure C doesn't appear as a result, since it doesn't contain any occurrence of k1?
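Jonathan's point, that the result set starts from the posting list for the queried term, can be illustrated with a toy inverted index over the A/B/C example from the thread (a sketch, nothing Solr-specific):

```java
import java.util.*;

// Toy inverted index over the documents from the example:
// A(k0,k1,k2), B(k1,k2,k3), C(k0,k2,k3).
class InvertedIndexDemo {
    // Build term -> posting list (sorted doc ids) from docId -> terms.
    static Map<String, Set<String>> build(Map<String, List<String>> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        docs.forEach((docId, terms) ->
            terms.forEach(t ->
                index.computeIfAbsent(t, k -> new TreeSet<>()).add(docId)));
        return index;
    }

    // A single-term query simply reads the posting list, so only documents
    // that actually contain the term can ever appear in the result set.
    static Set<String> search(Map<String, Set<String>> index, String term) {
        return index.getOrDefault(term, Collections.emptySet());
    }
}
```

For q=k1 the posting list is {A, B}; C never enters the candidate set, so no amount of boosting can bring it in.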
Re: Default query parser operator
Nope, not possible. I'm not even sure what it would mean semantically. If you had default operator OR ordinarily, but default operator AND just for field2, then what would happen if you entered: field1:foo field2:bar field1:baz field2:bom -- where the heck would the ANDs and ORs go? The operators are BETWEEN the clauses that specify fields; they don't belong to a field. In general, the operators are part of the query as a whole, not any specific field.

In fact, I'd be careful of your example query: q=field1:foo bar field2:baz -- I don't think that means what you think it means; I don't think the field1 applies to the bar in that case. Although I could be wrong, you definitely want to check it. You need field1:foo field1:bar, or set the default field for the query to field1, or use parens (although that will change the execution strategy and ranking): q=field1:(foo bar)

At any rate, even if there's a way to specify this so it makes sense, no, Solr/Lucene doesn't support any such thing.

On 6/7/2011 10:56 AM, Brian Lamb wrote:

I feel like this should be fairly easy to do but I just don't see anywhere in the documentation how to do this. Perhaps I am using the wrong search parameters.

On Mon, Jun 6, 2011 at 12:19 PM, Brian Lamb brian.l...@journalexperts.com wrote:

Hi all, Is it possible to change the query parser operator for a specific field without having to explicitly type it in the search field? For example, I'd like to use:

http://localhost:8983/solr/search/?q=field1:word token field2:parser syntax

instead of

http://localhost:8983/solr/search/?q=field1:word AND token field2:parser syntax

But I only want it to be applied to field1, not field2, and I want the operator to always be AND unless the user explicitly types in OR.

Thanks,
Brian Lamb
Re: Solr Custom Installation
Hi Federico, you can take a look at this wiki page: http://wiki.apache.org/solr/EmbeddedSolr

Solr also has some maven support; see the ant target generate-maven-artifacts. Don't know if that's what you need.

Regards,
Tomás

On Tue, Jun 7, 2011 at 12:17 PM, Federico Czerwinski fed...@gmail.com wrote:

Hey there. I was wondering if Solr can be embedded into my Java web app. As far as I know, Solr comes as a war, or bundled with Jetty if you don't have a container. I've opened the war's web.xml and found out that it only has a couple of servlets, filters and that's it. So, would it be possible to declare those servlets in *my* web.xml, and include the appropriate jars in my classpath, instead of having another webapp deployed in the container? Does Solr have the jars mavenized? Thank you Fede.
Re: How do I make sure the resulting documents contain the query terms?
You are right, Lucene will return based on my scoring function implementation (the Similarity class, http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/Similarity.html):

score(q,d) = coord(q,d) · queryNorm(q) · Σ_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

It can be seen that whenever tf(t in d) = 0 the whole score will be 0, so as you say C will never be returned. My issue is when the query has multiple terms (my example was too simple!), and some are 'mandatory' while others are not. In that case I should make a query that uses the + operator (http://lucene.apache.org/java/2_9_1/queryparsersyntax.html), e.g. q=+k1. I'm unsure I'll get the syntax right, but let's say k1 is mandatory and k2 and k3 are optional; then q=k2 k3 +k1. I see that queries made through SolrJ are received with + in place of the space (terms default to OR), so q=k2+k3++k1.

On Tue, Jun 7, 2011 at 5:23 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

Um, normally that would never happen, because, well, like you say, the inverted index doesn't have doc C for term k1, because doc C didn't include term k1. If you search on q=k1, then how/why would doc C ever be in your result set? Are you seeing it in your result set? The question then would be _why_, what weird thing is going on to make that happen, that's not expected. The result set _starts_ from only the documents that actually include the term.
Boosting/relevancy ranking only affects what order these documents appear in, but there's no reason document C should be in the result set at all in your case of q=k1, where doc C is not indexed under k1.

On 6/7/2011 2:35 AM, Gabriele Kahlout wrote:

Sorry for being unclear, and thank you for answering. Consider the following documents A(k0,k1,k2), B(k1,k2,k3), and C(k0,k2,k3), where A, B, C are document identifiers and the ks in brackets with each are the terms each contains. So the Solr inverted index should be something like:

k0 -- A | C
k1 -- A | B
k2 -- A | B | C
k3 -- B | C

Now let q=k1; how do I make sure C doesn't appear as a result, since it doesn't contain any occurrence of k1?

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
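The +k1 semantics Gabriele wants (k1 mandatory, k2 and k3 optional) can be sketched as a boolean filter over document term sets. This is a toy model of Lucene's MUST/SHOULD clause behavior, not its real query classes:

```java
import java.util.*;

// Toy MUST/SHOULD matching for a query like "q=k2 k3 +k1": a document
// matches only if it contains every required (+) term; when nothing is
// required, it must contain at least one optional term. Optional terms
// otherwise only influence ranking, not whether the document matches.
class RequiredTermsDemo {
    static boolean matches(Set<String> docTerms,
                           Set<String> required, Set<String> optional) {
        if (!docTerms.containsAll(required)) return false;
        if (required.isEmpty())
            return optional.stream().anyMatch(docTerms::contains);
        return true;
    }
}
```

With required = {k1}, document C(k0,k2,k3) is rejected no matter how many optional terms it contains, which is exactly the behavior the + operator is meant to provide.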
Re: How do I make sure the resulting documents contain the query terms?
Okay, if you're using a custom similarity, I'm not sure what's going on; I'm not familiar with that. But ordinarily, you are right, you would require k1 with +k1. What you say about the + being lost suggests something is going wrong. Either you are not sending your query to Solr properly escaped, or there's a bug in your custom similarity or query parser, or (not too likely) there's a bug in Solr.

My experience is using the standard query parser, standard similarity class, and contacting Solr via HTTP. (Are you using SolrJ or HTTP?) In that case, when you send the q to Solr, you are responsible for URI-encoding it when you send it. So if you want to send a query like k2 k3 +k1, you need to URI-escape it first, and this is what you'd send:

q=k2+k3+%2Bk1

or, escaping spaces as %20 instead, which is actually more 'correct' with current standards:

q=k2%20k3%20%2Bk1

The important thing is that + escapes as %2B. You need to escape it before sending it to Solr via an HTTP URI query string or HTTP form post data. Yes, if you send a raw +, Solr will understand that as representing a space, not an actual +. This is because the + character is not 'safe'; it needs to be escaped. The programming language of your choice probably already has a library function for URI-escaping values.
On 6/7/2011 11:36 AM, Gabriele Kahlout wrote:

You are right, Lucene will return based on my scoring function implementation (the Similarity class, http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/Similarity.html):

score(q,d) = coord(q,d) · queryNorm(q) · Σ_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

It can be seen that whenever tf(t in d) = 0 the whole score will be 0, so as you say C will never be returned. My issue is when the query has multiple terms (my example was too simple!), and some are 'mandatory' while others are not. In that case I should make a query that uses the + operator (http://lucene.apache.org/java/2_9_1/queryparsersyntax.html), e.g. q=+k1. I'm unsure I'll get the syntax right, but let's say k1 is mandatory and k2 and k3 are optional; then q=k2 k3 +k1. I see that queries made through SolrJ are received with + in place of the space (terms default to OR), so q=k2+k3++k1.

On Tue, Jun 7, 2011 at 5:23 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

Um, normally that would never happen, because, well, like you say, the inverted index doesn't have doc C for term k1, because doc C didn't include term k1. If you search on q=k1, then how/why would doc C ever be in your result set? Are you seeing it in your result set? The question then would be _why_, what weird thing is going on to make that happen, that's not expected.
The result set _starts_ from only the documents that actually include the term. Boosting/relevancy ranking only affects what order these documents appear in, but there's no reason document C should be in the result set at all in your case of q=k1, where doc C is not indexed under k1.

On 6/7/2011 2:35 AM, Gabriele Kahlout wrote:

Sorry for being unclear, and thank you for answering. Consider the following documents A(k0,k1,k2), B(k1,k2,k3), and C(k0,k2,k3), where A, B, C are document identifiers and the ks in brackets with each are the terms each contains. So the Solr inverted index should be something like:

k0 -- A | C
k1 -- A | B
k2 -- A | B | C
k3 -- B | C

Now let q=k1; how do I make sure C doesn't appear as a result, since it doesn't contain any occurrence of k1?
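The escaping Jonathan describes can be checked with the JDK's own URLEncoder, which applies form encoding (space becomes +, and a literal + becomes %2B):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Form-encode a query parameter value: a literal '+' must become %2B,
// or Solr will decode it back to a space instead of the mandatory-term
// operator.
class EncodeQuery {
    static String encode(String q) {
        try {
            return URLEncoder.encode(q, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError(e); // UTF-8 is always supported
        }
    }
}
```

For example, encode("k2 k3 +k1") produces "k2+k3+%2Bk1", exactly the escaped form given above.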
Data not always returned
Hi all, I have a problem with my index. Even though I always index the same data over and over again, whenever I try a couple of searches (they are always the same, as they are issued by a unit test suite) I do not get the same results: sometimes I get 3 successes and 2 failures, and sometimes it is the other way around; it is unpredictable.

Here is what I am trying to do: I created a new Solr core with its specific solrconfig.xml and schema.xml. This core stores a list of towns which I plan to use with an auto-suggestion system, using ngrams (no Suggester). The indexing process is always the same:

1. the import script deletes all documents in the core: <delete><query>*:*</query></delete> and <commit/>
2. the import script fetches data from postgres, 100 rows at a time
3. the import script adds these 100 documents and sends a <commit/>
4. once all the rows (around 40 000) have been imported, the script sends an <optimize/> query

Here is what happens: I run the indexer once and search for 'foo': I get the results I expect, but if I search for 'bar' I get nothing. I reindex once again and search for 'foo': I get nothing, but if I search for 'bar' I get results. The search is made on the name field, which is a pretty common TextField with ngrams. I tried to physically remove the index (rm -rf path/to/index) and reindex everything as well, and still not all searches work; sometimes the 'foo' search works, sometimes the 'bar' one. I tried a lot of different things but now I am running out of ideas. This is why I am asking for help.

Some useful information:
Solr version: 3.1.0
Solr Implementation Version: 3.1.0 1085815 - grantingersoll - 2011-03-26 18:00:07
Lucene Implementation Version: 3.1.0 1085809 - 2011-03-26 18:06:58
Java 1.5.0_24 on Mac OS X

solrconfig.xml and schema.xml are attached. Thanks in advance for your help.

schema.xml.gz Description: GNU Zip compressed data
solrconfig.xml.gz Description: GNU Zip compressed data
Question about tokenizing, searching and retrieving results.
Hello! My problem is as follows: I've got a field (indexed and stored set to true) tokenized by whitespace and other patterns, with a gap value of 100. For example, if I index the following expression for the field that I mentioned:

*Expression*: A B C D E -> *Index*: tokenA tokenB tokenC tokenD tokenE

This behaviour is replicated in the search context, so any content associated to this field during a search will be tokenized as I explained. If I search the whole expression, the indexed document is returned correctly as expected, but if I search something like:

*Expression*: A B C D E F G H I

it doesn't retrieve the document. What's happening? The expression matches partially, and I thought that the document would be returned too. I tried modifying the gap value but it doesn't work. Thank you very much.
Re: Solr Cloud Query Question
Thanks Yonik. I have a follow-on question now: how does Solr ensure consistent results across pages? For example, if we had my 3 theoretical Solr instances again, and a, b and c each returned 100 documents with the same score and the user only requested 100 documents, how are those 100 documents chosen from the set available from a, b and c if the documents have the same score?

On Tue, Jun 7, 2011 at 9:38 AM, Yonik Seeley yo...@lucidimagination.com wrote:

On Tue, Jun 7, 2011 at 9:35 AM, Jamie Johnson jej2...@gmail.com wrote:

I am currently experimenting with the Solr Cloud code on trunk and just had a quick question. Let's say my setup had 3 nodes a, b and c. Node a has 1000 results which meet a particular query, b has 2000 and c has 3000. When executing this query and asking for row 900, what specifically happens? From reading the Distributed Search wiki I would expect that node a responds with 900, node b responds with 900 and c responds with 900, and the coordinating node is responsible for taking the top-scored items and throwing away the rest. Is this correct, or is there some additional coordination that happens where nodes a, b and c return back an id and a score and the coordinating node makes an additional request to get back the documents for the ids which make up the top list?

The latter is correct - the first phase only collects enough information to merge ids from the shards, and then a second phase requests the stored fields, highlighting, etc. for the specific docs that will be returned.

-Yonik
http://www.lucidimagination.com
Re: Default query parser operator
Hi Jonathan, Thank you for your reply. Your point about my example is a good one, so let me try to restate using your example. Suppose I want to apply AND to any search terms within field1. Then field1:foo field2:bar field1:baz field2:bom would be written as

http://localhost:8983/solr/?q=field1:foo OR field2:bar OR field1:baz OR field2:bom

But if they were written together like:

http://localhost:8983/solr/?q=field1:(foo baz) field2:(bar bom)

I would want it to be

http://localhost:8983/solr/?q=field1:(foo AND baz) OR field2:(bar OR bom)

But it sounds like you are saying that would not be possible.

Thanks,
Brian Lamb

On Tue, Jun 7, 2011 at 11:27 AM, Jonathan Rochkind rochk...@jhu.edu wrote:

Nope, not possible. I'm not even sure what it would mean semantically. If you had default operator OR ordinarily, but default operator AND just for field2, then what would happen if you entered: field1:foo field2:bar field1:baz field2:bom Where the heck would the ANDs and ORs go? The operators are BETWEEN the clauses that specify fields, they don't belong to a field. In general, the operators are part of the query as a whole, not any specific field. In fact, I'd be careful of your example query: q=field1:foo bar field2:baz I don't think that means what you think it means, I don't think the field1 applies to the bar in that case. Although I could be wrong, but you definitely want to check it. You need field1:foo field1:bar, or set the default field for the query to field1, or use parens (although that will change the execution strategy and ranking): q=field1:(foo bar) At any rate, even if there's a way to specify this so it makes sense, no, Solr/lucene doesn't support any such thing.

On 6/7/2011 10:56 AM, Brian Lamb wrote:

I feel like this should be fairly easy to do but I just don't see anywhere in the documentation how to do this. Perhaps I am using the wrong search parameters.
On Mon, Jun 6, 2011 at 12:19 PM, Brian Lamb brian.l...@journalexperts.com wrote:

Hi all, Is it possible to change the query parser operator for a specific field without having to explicitly type it in the search field? For example, I'd like to use:

http://localhost:8983/solr/search/?q=field1:word token field2:parser syntax

instead of

http://localhost:8983/solr/search/?q=field1:word AND token field2:parser syntax

But I only want it to be applied to field1, not field2, and I want the operator to always be AND unless the user explicitly types in OR.

Thanks,
Brian Lamb
Re: Question about tokenizing, searching and retrieving results.
My first guess would be that you are using AND as the default operator? You can see the generated query by using the parameter debugQuery=true.

On Tue, Jun 7, 2011 at 1:34 PM, Luis Cappa Banda luisca...@gmail.com wrote:

Hello! My problem is as follows: I've got a field (indexed and stored set to true) tokenized by whitespace and other patterns, with a gap value of 100. For example, if I index the following expression for the field that I mentioned: *Expression*: A B C D E -> *Index*: tokenA tokenB tokenC tokenD tokenE. This behaviour is replicated in the search context, so any content associated to this field during a search will be tokenized as I explained. If I search the whole expression, the indexed document is returned correctly as expected, but if I search something like: *Expression*: A B C D E F G H I, it doesn't retrieve the document. What's happening? The expression matches partially, and I thought that the document would be returned too. I tried modifying the gap value but it doesn't work. Thank you very much.
Re: Solr Cloud Query Question
On Tue, Jun 7, 2011 at 1:01 PM, Jamie Johnson jej2...@gmail.com wrote: Thanks Yonik. I have a follow on now, how does Solr ensure consistent results across pages? So for example if we had my 3 theoretical solr instances again and a, b and c each returned 100 documents with the same score and the user only requested 100 documents, how are those 100 documents chosen from the set available from a, b and c if the documents have the same score? Ties within a shard are broken by docid (just like lucene), and ties across different shards are broken by comparing the shard ids... so yes, it's consistent. -Yonik http://www.lucidimagination.com
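Yonik's tie-breaking rule (score first, then shard id across shards, then docid within a shard) can be illustrated with a small merge sketch. This is a toy model of a distributed coordinator, not Solr's actual code; the function name and data shapes are made up for illustration:

```python
# Toy sketch (not Solr's real merge code) of deterministic tie-breaking
# when merging hits from several shards: equal scores are ordered by
# shard id, then by docid, so repeated queries page consistently.
def merge_shard_results(shard_hits, rows):
    """shard_hits: dict mapping shard_id -> list of (docid, score)."""
    merged = [
        (score, shard_id, docid)
        for shard_id, hits in shard_hits.items()
        for docid, score in hits
    ]
    # Highest score first; ties broken by shard id, then docid.
    merged.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(shard_id, docid) for _, shard_id, docid in merged[:rows]]
```

With every score equal, the returned top-N depends only on (shard id, docid), which is why the same 100 documents come back every time.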
Re: Default query parser operator
There's no feature in Solr to do what you ask, no. I don't think. On 6/7/2011 1:30 PM, Brian Lamb wrote: Hi Jonathan, Thank you for your reply. Your point about my example is a good one. So let me try to restate using your example. Suppose I want to apply AND to any search terms within field1. Then field1:foo field2:bar field1:baz field2:bom would by written as http://localhost:8983/solr/?q=field1:foo OR field2:bar OR field1:baz OR field2:bom But if they were written together like: http://localhost:8983/solr/?q=field1:(foo baz) field2:(bar bom) I would want it to be http://localhost:8983/solr/?q=field1:(foo AND baz) OR field2:(bar OR bom) But it sounds like you are saying that would not be possible. Thanks, Brian Lamb On Tue, Jun 7, 2011 at 11:27 AM, Jonathan Rochkindrochk...@jhu.edu wrote: Nope, not possible. I'm not even sure what it would mean semantically. If you had default operator OR ordinarily, but default operator AND just for field2, then what would happen if you entered: field1:foo field2:bar field1:baz field2:bom Where the heck would the ANDs and ORs go? The operators are BETWEEN the clauses that specify fields, they don't belong to a field. In general, the operators are part of the query as a whole, not any specific field. In fact, I'd be careful of your example query: q=field1:foo bar field2:baz I don't think that means what you think it means, I don't think the field1 applies to the bar in that case. Although I could be wrong, but you definitely want to check it. You need field1:foo field1:bar, or set the default field for the query to field1, or use parens (although that will change the execution strategy and ranking): q=field1:(foo bar) At any rate, even if there's a way to specify this so it makes sense, no, Solr/lucene doesn't support any such thing. On 6/7/2011 10:56 AM, Brian Lamb wrote: I feel like this should be fairly easy to do but I just don't see anywhere in the documentation on how to do this. 
Perhaps I am using the wrong search parameters. On Mon, Jun 6, 2011 at 12:19 PM, Brian Lamb brian.l...@journalexperts.comwrote: Hi all, Is it possible to change the query parser operator for a specific field without having to explicitly type it in the search field? For example, I'd like to use: http://localhost:8983/solr/search/?q=field1:word token field2:parser syntax instead of http://localhost:8983/solr/search/?q=field1:word AND token field2:parser syntax But, I only want it to be applied to field1, not field2 and I want the operator to always be AND unless the user explicitly types in OR. Thanks, Brian Lamb
Re: Question about tokenizing, searching and retrieving results.
On Tue, Jun 7, 2011 at 12:34 PM, Luis Cappa Banda luisca...@gmail.com wrote: *Expression*: A B C D E F G H I As written, this is equivalent to *Expression*:A default_field:B default_field:C default_field:D default_field:E default_field:F default_field:G default_field:H default_field:I Try *Expression*:(A B C D E F G H I) or *Expression*:"A B C D E F G H I" for a phrase query. Oh, and I highly recommend sticking to Java identifiers for field names - it will make your life much easier in the future. -Yonik http://www.lucidimagination.com
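The scoping rule Yonik describes can be shown with a toy parser. This is a deliberately simplified model, not the real Lucene query parser: a `field:` prefix binds only to the clause immediately after the colon, and later bare terms fall back to the default field:

```python
# Toy model of Lucene-style field scoping (not the real parser):
# "Expression:A B C" searches only A against Expression; B and C
# go to the default field.
def assign_fields(query, default_field="text"):
    clauses = []
    for token in query.split():
        if ":" in token:
            field, term = token.split(":", 1)
            clauses.append((field, term))
        else:
            clauses.append((default_field, token))
    return clauses
```

Running it on the expression above makes the surprise visible: only the first term is scoped to the named field.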
Solr Cloud and Range Facets
I have a Solr Cloud setup with 2 servers. When executing a query against them of the form: http://localhost:8983/solr/select/?distrib=true&q=*:*&facet=true&facet.mincount=1&facet.range=dateTime&f.dateTime.facet.range.gap=%2B1MONTH&f.dateTime.facet.range.start=2011-06-01T00%3A00%3A00Z-1YEAR&f.dateTime.facet.range.end=2011-07-01T00%3A00%3A00Z&f.dateTime.facet.mincount=1&start=0&rows=0 I am seeing that sometimes the date facet has a count, and other times it does not. Specifically, I sometimes see:
<lst name="facet_ranges"><lst name="dateTime"><lst name="counts"/><str name="gap">+1MONTH</str><date name="start">2010-06-01T00:00:00Z</date><date name="end">2011-07-01T00:00:00Z</date></lst></lst>
and other times:
<lst name="facet_ranges"><lst name="dateTime"><lst name="counts"><int name="2011-06-01T00:00:00Z">250</int></lst><str name="gap">+1MONTH</str><date name="start">2010-06-01T00:00:00Z</date><date name="end">2011-07-01T00:00:00Z</date></lst></lst>
What could be causing this inconsistency?
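A request with this many per-field facet parameters is easy to get wrong when assembled by hand (the `&` separators, and the `+1MONTH` gap that must arrive as `%2B1MONTH`). A small sketch of building the same URL programmatically, using the parameter names from the message above:

```python
# Sketch: build the facet.range request URL with urlencode, which
# supplies the & separators and the percent-escaping ("+1MONTH" must
# be sent as "%2B1MONTH", since a raw "+" decodes to a space).
from urllib.parse import urlencode

params = {
    "distrib": "true",
    "q": "*:*",
    "facet": "true",
    "facet.mincount": "1",
    "facet.range": "dateTime",
    "f.dateTime.facet.range.gap": "+1MONTH",
    "f.dateTime.facet.range.start": "2011-06-01T00:00:00Z-1YEAR",
    "f.dateTime.facet.range.end": "2011-07-01T00:00:00Z",
    "f.dateTime.facet.mincount": "1",
    "start": "0",
    "rows": "0",
}
url = "http://localhost:8983/solr/select/?" + urlencode(params)
```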
Compound word search not what I expected
I have a field defined as: <field name="content" type="text" indexed="true" stored="false" termVectors="true" multiValued="true"/> where text is unmodified from the schema.xml example that came with Solr 1.4.1. I have documents with some compound words indexed, words like Sandstone. And in several cases words that are camel case like MaxSize. If I query using all lower case, sandstone or maxsize, I get the documents I expect. If I query with proper case, i.e. Sandstone or Maxsize, I get the documents I expect. However, if the query is camel case, MaxSize or SandStone, it doesn't find the documents. In the case of MaxSize it is particularly frustrating because that is the actual case of the word that was indexed. Is this expected behavior? The query analyzer definition for the text field type is: <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" ignoreCase="true" expand="true" synonyms="synonyms.txt"/> <filter class="solr.StopFilterFactory" enablePositionIncrements="true" words="stopwords.txt" ignoreCase="true"/> <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/> </analyzer> Is the order of the filters important? If LowerCaseFilterFactory came before WordDelimiterFilterFactory, would that fix this? Would it break something else? Thanks, Ken -- View this message in context: http://lucene.472066.n3.nabble.com/Compound-word-search-not-what-I-expected-tp3036089p3036089.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Compound word search not what I expected
catenateWords should be set to true. Same goes for the index analyzer. preserveOriginal would also work. I have a field defined as: field name=content type=text indexed=true stored=false termVectors=true multiValued=true / where text is unmodified from the schema.xml example that came with Solr 1.4.1. I have documents with some compound words indexed, words like Sandstone. And in several cases words that are camel case like MaxSize. If I query using all lower case, sandstone or maxsize, I get the documents I expect. If I query with proper case, ie. Sandstone or Maxsize I get the documents I expect. However, if the query is camel case, MaxSize or SandStone, it doesn't find the documents. In the case of MaxSize it is particularly frustrating because that is the actual case of the word that was indexed. Is this expected behavior? The query analyzer definition the the text field type is: analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory ignoreCase=true expand=true synonyms=synonyms.txt/ filter class=solr.StopFilterFactory enablePositionIncrements=true words=stopwords.txt ignoreCase=true/ filter class=solr.WordDelimiterFilterFactory splitOnCaseChange=1 catenateAll=0 catenateNumbers=0 catenateWords=0 generateNumberParts=1 generateWordParts=1/ filter class=solr.LowerCaseFilterFactory/ filter language=English class=solr.SnowballPorterFilterFactory protected=protwords.txt/ /analyzer Is the order by the filters important? If LowerCaseFilterFactory came before WordDelimiterFilterFactory, would that fix this? Would it break something else? Thanks, Ken -- View this message in context: http://lucene.472066.n3.nabble.com/Compound-word-search-not-what-I-expecte d-tp3036089p3036089.html Sent from the Solr - User mailing list archive at Nabble.com.
How to deal with many files using solr external file field
Hi all, we're using Solr 1.4 and external file field ([1]) for sorting our search results. We have about 40,000 terms for which we use this sorting option. Currently we're running into massive OutOfMemory problems and we're not quite sure what the matter is. It seems that the garbage collector stops working or some processes are going wild. However, Solr starts to allocate more and more RAM until we experience this OutOfMemory exception. We noticed the following: for some terms one can see in the Solr log that java.io.FileNotFoundExceptions appear when Solr tries to load an external file for a term for which there is no such file, e.g. Solr tries to load the external score file for trousers but there is none in the /solr/data folder. Question: is it possible that those exceptions are responsible for the OutOfMemory problem, or could it be due to the large(?) number of 40,000 terms for which we want to sort the result via external file field? I'm looking forward to your answers, suggestions and ideas :) Regards Sven [1]: http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
Available Solr Indexing strategies
Hi, I am very new to Solr and my client is trying to add full-text search capabilities to their product by using Solr. They will also have a master storage that will be the authoritative data store, which will also provide metadata searches. Can you please point me in the right direction to some indexing strategies that people are using, for further research? Thank you, Zarni
Re: Data not always returned
Well, this is odd. Several questions:
1. What do your logs show? I'm wondering if somehow some data is getting rejected. I have no idea why that would be, but if you're seeing indexing exceptions that would explain it.
2. On the admin/stats page, are maxDocs and numDocs the same in the success/failure case? And are they equal to 40,000?
3. What does debugQuery=on show in the two cases? I'd expect it to be identical, but...
4. admin/schema browser: look at your three fields and see if things like unique-terms are identical.
5. Are the rows being returned before indexing in the same order? I'm wondering if somehow you're getting documents overwritten by having the same id (uniqueKey).
6. Have you poked around with Luke to see what, if anything, is dissimilar?
These are shots in the dark, but my supposition is that somehow you're not indexing what you expect; the questions above might give us a clue where to look next. Best Erick On Tue, Jun 7, 2011 at 12:02 PM, Jerome Renard jerome.ren...@gmail.com wrote: Hi all, I have a problem with my index. Even though I always index the same data over and over again, whenever I try a couple of searches (they are always the same as they are issued by a unit test suite) I do not get the same results; sometimes I get 3 successes and 2 failures and sometimes it is the other way around - it is unpredictable. Here is what I am trying to do: I created a new Solr core with its specific solrconfig.xml and schema.xml. This core stores a list of towns which I plan to use with an auto-suggestion system, using ngrams (no Suggester). The indexing process is always the same: 1. the import script deletes all documents in the core: <delete><query>*:*</query></delete> and <commit/> 2. the import script fetches data from postgres, 100 rows at a time 3. the import script adds these 100 documents and sends a <commit/> 4. once all the rows (around 40,000) have been imported the script sends an <optimize/> query. Here is what happens: I run the indexer once and search for 'foo' - I get the results I expect - but if I search for 'bar' I get nothing. I reindex once again and search for 'foo' - I get nothing, but if I search for 'bar' I get results. The search is made on the name field, which is a pretty common TextField with ngrams. I tried to physically remove the index (rm -rf path/to/index) and reindex everything as well, and not all searches work; sometimes the 'foo' search works, sometimes the 'bar' one. I tried a lot of different things but now I am running out of ideas. This is why I am asking for help. Some useful information: Solr version: 3.1.0 Solr Implementation Version: 3.1.0 1085815 - grantingersoll - 2011-03-26 18:00:07 Lucene Implementation Version: 3.1.0 1085809 - 2011-03-26 18:06:58 Java 1.5.0_24 on Mac OS X. solrconfig.xml and schema.xml are attached. Thanks in advance for your help.
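The import flow described (delete-all, add in batches of 100 each followed by a commit, then a final optimize) can be sketched as the sequence of XML payloads a client would POST to Solr's /update handler. This is a hypothetical helper for illustration, not the poster's actual script, and the id/name fields are assumptions:

```python
# Rough sketch of the described import flow as /update XML payloads:
# delete-all + commit, then <add> batches of 100 with a commit after
# each, then a final <optimize/>.
from xml.sax.saxutils import escape

def update_payloads(rows, batch_size=100):
    yield "<delete><query>*:*</query></delete>"
    yield "<commit/>"
    for i in range(0, len(rows), batch_size):
        docs = "".join(
            '<doc><field name="id">%s</field>'
            '<field name="name">%s</field></doc>'
            % (escape(str(r["id"])), escape(r["name"]))
            for r in rows[i:i + batch_size]
        )
        yield "<add>" + docs + "</add>"
        yield "<commit/>"
    yield "<optimize/>"
```

One property worth checking in a flow like this: if two rows produce the same id, the later <add> silently overwrites the earlier document, which is exactly the kind of thing that makes "the same data" index differently from run to run.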
Re: Compound word search not what I expected
WordDelimiterFilterFactory is doing this to you. It's not clear to me that you want this in place at all. Look at admin/analysis for that field to see how that filter breaks things up, it's often surprising to people. Best Erick On Tue, Jun 7, 2011 at 3:13 PM, kenf_nc ken.fos...@realestate.com wrote: I have a field defined as: field name=content type=text indexed=true stored=false termVectors=true multiValued=true / where text is unmodified from the schema.xml example that came with Solr 1.4.1. I have documents with some compound words indexed, words like Sandstone. And in several cases words that are camel case like MaxSize. If I query using all lower case, sandstone or maxsize, I get the documents I expect. If I query with proper case, ie. Sandstone or Maxsize I get the documents I expect. However, if the query is camel case, MaxSize or SandStone, it doesn't find the documents. In the case of MaxSize it is particularly frustrating because that is the actual case of the word that was indexed. Is this expected behavior? The query analyzer definition the the text field type is: analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory ignoreCase=true expand=true synonyms=synonyms.txt/ filter class=solr.StopFilterFactory enablePositionIncrements=true words=stopwords.txt ignoreCase=true/ filter class=solr.WordDelimiterFilterFactory splitOnCaseChange=1 catenateAll=0 catenateNumbers=0 catenateWords=0 generateNumberParts=1 generateWordParts=1/ filter class=solr.LowerCaseFilterFactory/ filter language=English class=solr.SnowballPorterFilterFactory protected=protwords.txt/ /analyzer Is the order by the filters important? If LowerCaseFilterFactory came before WordDelimiterFilterFactory, would that fix this? Would it break something else? 
Thanks, Ken -- View this message in context: http://lucene.472066.n3.nabble.com/Compound-word-search-not-what-I-expected-tp3036089p3036089.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Compound word search not what I expected
See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory from the wiki: Example of generateWordParts=1 and catenateWords=1: PowerShot -> 0:Power, 1:Shot, 1:PowerShot (where 0,1,1 are token positions) A's+B's&C's -> 0:A, 1:B, 2:C, 2:ABC Super-Duper-XL500-42-AutoCoder! -> 0:Super, 1:Duper, 2:XL, 2:SuperDuperXL, 3:500, 4:42, 5:Auto, 6:Coder, 6:AutoCoder One use for WordDelimiterFilter is to help match words with different delimiters. One way of doing so is to specify generateWordParts=1 catenateWords=1 in the analyzer used for indexing, and generateWordParts=1 in the analyzer used for querying. Given that the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that leaves them in place (such as WhitespaceTokenizer).
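The splitOnCaseChange / catenateWords / preserveOriginal behaviour quoted from the wiki can be modelled with a few lines of Python. This is a toy approximation of the token output, not the real filter (it ignores positions and several options), but it shows why a camel-case term only matches when the catenated or original token is emitted on at least one side:

```python
# Toy model of WordDelimiterFilterFactory token output (not the real
# filter): split on case changes and digits, optionally append the
# catenated word and/or the original token.
import re

def word_delimiter(token, catenate_words=False, preserve_original=False):
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", token)
    out = [p.lower() for p in parts]
    if catenate_words and len(parts) > 1:
        out.append("".join(parts).lower())  # e.g. MaxSize -> maxsize
    if preserve_original:
        out.append(token.lower())
    return out
```

With the defaults shown in Ken's query analyzer, "MaxSize" becomes only ["max", "size"], while the index may contain the single token "maxsize"; turning on catenateWords (or preserveOriginal) restores the overlap.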
Re: Compound word search not what I expected
I tried setting catenateWords=1 on the Query analyzer and that didn't do anything. I think what I need is to set my Index Analyzer to have preserveOriginal=1 and then re-index everything. That will be a pain, so I'll do a small test to make sure first. I'm really surprised preserveOriginal=1 isn't the default. It's like saying slice and dice this word so I can search on all kinds of partial matches...but do NOT let me search on the actual word itself. I know it's not quite that, but it's close. Anyway, I'm going to try the preserveOriginal parameter on WordDelimiterFilterFactory, on both the Index and Query side and see what happens. Thanks for all the suggestions, Ken -- View this message in context: http://lucene.472066.n3.nabble.com/Compound-word-search-not-what-I-expected-tp3036089p3037068.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Default query parser operator
Hi Brian could your front end app do this field query logic? (assuming you have an app in front of solr) On 7 June 2011 18:53, Jonathan Rochkind rochk...@jhu.edu wrote: There's no feature in Solr to do what you ask, no. I don't think. On 6/7/2011 1:30 PM, Brian Lamb wrote: Hi Jonathan, Thank you for your reply. Your point about my example is a good one. So let me try to restate using your example. Suppose I want to apply AND to any search terms within field1. Then field1:foo field2:bar field1:baz field2:bom would by written as http://localhost:8983/solr/?q=field1:foo OR field2:bar OR field1:baz OR field2:bom But if they were written together like: http://localhost:8983/solr/?q=field1:(foo baz) field2:(bar bom) I would want it to be http://localhost:8983/solr/?q=field1:(foo AND baz) OR field2:(bar OR bom) But it sounds like you are saying that would not be possible. Thanks, Brian Lamb On Tue, Jun 7, 2011 at 11:27 AM, Jonathan Rochkindrochk...@jhu.edu wrote: Nope, not possible. I'm not even sure what it would mean semantically. If you had default operator OR ordinarily, but default operator AND just for field2, then what would happen if you entered: field1:foo field2:bar field1:baz field2:bom Where the heck would the ANDs and ORs go? The operators are BETWEEN the clauses that specify fields, they don't belong to a field. In general, the operators are part of the query as a whole, not any specific field. In fact, I'd be careful of your example query: q=field1:foo bar field2:baz I don't think that means what you think it means, I don't think the field1 applies to the bar in that case. Although I could be wrong, but you definitely want to check it. You need field1:foo field1:bar, or set the default field for the query to field1, or use parens (although that will change the execution strategy and ranking): q=field1:(foo bar) At any rate, even if there's a way to specify this so it makes sense, no, Solr/lucene doesn't support any such thing. 
On 6/7/2011 10:56 AM, Brian Lamb wrote: I feel like this should be fairly easy to do but I just don't see anywhere in the documentation on how to do this. Perhaps I am using the wrong search parameters. On Mon, Jun 6, 2011 at 12:19 PM, Brian Lamb brian.l...@journalexperts.comwrote: Hi all, Is it possible to change the query parser operator for a specific field without having to explicitly type it in the search field? For example, I'd like to use: http://localhost:8983/solr/search/?q=field1:word token field2:parser syntax instead of http://localhost:8983/solr/search/?q=field1:word AND token field2:parser syntax But, I only want it to be applied to field1, not field2 and I want the operator to always be AND unless the user explicitly types in OR. Thanks, Brian Lamb
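Lee's suggestion of doing the per-field operator logic in the front-end app can be sketched as a small query rewriter. This is a hypothetical helper (name and regex are assumptions, and it only handles the `field:(a b)` grouped form from Brian's example), ANDing terms inside the configured fields and ORing everything else before the query is sent to Solr:

```python
# Sketch of app-side preprocessing (hypothetical, handles only the
# grouped field:(a b) form): AND terms inside and_fields, OR the rest.
import re

def rewrite_query(q, and_fields=("field1",)):
    groups = []
    for m in re.finditer(r"(\w+):\(([^)]*)\)", q):
        field, terms = m.group(1), m.group(2).split()
        op = " AND " if field in and_fields else " OR "
        groups.append("%s:(%s)" % (field, op.join(terms)))
    return " OR ".join(groups)
```

On Brian's example input this produces exactly the query he said he wanted: `field1:(foo AND baz) OR field2:(bar OR bom)`.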
Solr Coldfusion Search Issue
Hi, I'm having some trouble using Solr through ColdFusion. The problem right now is that when I search for a term in a custom field, the results sometimes have the value that I sent to the custom field and not to the field that contains the text. This is the cfsearch syntax that I'm using: <cfsearch collection="agenda,bitacoras" criteria='contents:#form.search# AND custom1:#form.tema# AND custom2:#form.dia# AND custom4:#form.anio# AND custom3:#form.mon#' name="result" status="meta" startrow="#url.start#" maxrows="#max#" contextpassages="5" contexthighlightbegin="<B>" contexthighlightend="</B>" suggestions="always"> Every custom field gets its value from a combo box or drop box with a list of options. The thing is that when the user sends a search for CUSTOM1, sometimes the results include the same searched value in CONTENTS... Does anyone have an idea on how to fix this? I'll appreciate all the help I can get. Regards. Alex
Re: Solr Coldfusion Search Issue
Can you see the query actually presented to Solr in the logs? Maybe capture that and then run it with debug true in the admin pages. Sorry I can't help directly with your syntax. On 7 June 2011 23:06, Alejandro Delgadillo adelgadi...@febg.org wrote: Hi, I'm having some troubles using Solr through Coldfusion, the problem right now is that when I search for a term in a Custom field, the results sometimes have the value that I sent to the custom field and not to the field that contains the text, this is the cfsearch syntax that I'm using: cfsearch collection=agenda,bitacoras criteria='contents:#form.search#ANDcustom1:#form.tema#ANDcustom2:# form.dia#ANDcustom4:#form.anio#ANDcustom3:#form.mon#' name=result status=meta startrow=#url.start# maxrows=#max# contextpassages=5 contexthighlightbegin=B contexthighlightend=BE suggestions=always Every custom fields gets the value by a combo box or drop box with a list of option, the thing is that when the users sends a search for CUSTOM1, sometimes the results include the same searched value in CONTENTS... Do anyone have an idea on how to fix this? I'll appreciate all the help I can get. Regards. Alex
Re: Compound word search not what I expected
You must set catenateWords at index time as well. I tried setting catenateWords=1 on the Query analyzer and that didn't do anything. I think what I need is to set my Index Analyzer to have preserveOriginal=1 and then re-index everything. That will be a pain, so I'll do a small test to make sure first. I'm really surprised preserveOriginal=1 isn't the default. It's like saying slice and dice this word so I can search on all kinds of partial matches...but do NOT let me search on the actual word itself. I know it's not quite that, but it's close. Anyway, I'm going to try the preserveOriginal parameter on WordDelimiterFilterFactory, on both the Index and Query side and see what happens.
wildcard search
Hello, I am testing solr 3.2 and have problems with wildcards. I am indexing values like IA 300; IC 330; IA 317; IA 318 in a field GOK, and can't find a way to search with wildcards. I want to use a wild card search to match something like IA 31? but cannot find a way to do so. GOK:IA\ 38* doesn't work with the contents of GOK indexed as text. Is there a way to index and search that would meet my requirements? Thomas
Re: Solr Coldfusion Search Issue
Thanks Lee for the quick response. Let me explain it a little bit better. In the CFSEARCH tag you use the CRITERIA attribute; what it does by default is send to Solr via POST the search query of the user against the field where the text is stored - in this case, since I'm indexing PDF files, the CONTENTS variable in Solr. The problem is that it also sends the custom field criteria to the contents variable, and that's why I have, for example: if I search the value 03 in CUSTOM1, it also searches the same value in CONTENTS. It works, since the results are filtered by the value, but the contents display the same value, in this case 03. Maybe... I'm not sure... there is another way to search for custom fields using the CFSEARCH tag; I've tried changing the order, but I still get the same result... On 6/7/11 4:14 PM, lee carroll lee.a.carr...@googlemail.com wrote: Can you see the query actually presented to solr in the logs ? maybe capture that and then run it with a debug true in the admin pages. sorry i cant help directly with your syntax On 7 June 2011 23:06, Alejandro Delgadillo adelgadi...@febg.org wrote: Hi, I'm having some troubles using Solr through Coldfusion, the problem right now is that when I search for a term in a Custom field, the results sometimes have the value that I sent to the custom field and not to the field that contains the text, this is the cfsearch syntax that I'm using: cfsearch collection=agenda,bitacoras criteria='contents:#form.search#ANDcustom1:#form.tema#ANDcustom2:# form.dia#ANDcustom4:#form.anio#ANDcustom3:#form.mon#' name=result status=meta startrow=#url.start# maxrows=#max# contextpassages=5 contexthighlightbegin=B contexthighlightend=BE suggestions=always Every custom fields gets the value by a combo box or drop box with a list of option, the thing is that when the users sends a search for CUSTOM1, sometimes the results include the same searched value in CONTENTS... Do anyone have an idea on how to fix this? 
I¹ll appreciate all the help I can get. Regards. Alex
Re: wildcard search
Yes there is, but you haven't provided enough information to make a suggestion. What is the fieldType definition? What is the field definition? Two resources that'll help you greatly are: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters and the admin/analysis page... Best Erick On Tue, Jun 7, 2011 at 6:23 PM, Thomas Fischer fischer...@aon.at wrote: Hello, I am testing solr 3.2 and have problems with wildcards. I am indexing values like IA 300; IC 330; IA 317; IA 318 in a field GOK, and can't find a way to search with wildcards. I want to use a wildcard search to match something like IA 31? but cannot find a way to do so. GOK:IA\ 38* doesn't work with the contents of GOK indexed as text. Is there a way to index and search that would meet my requirements? Thomas
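One common approach for code-like values such as "IA 317" is to index the whole value as a single lowercased token rather than as analyzed text, so a prefix wildcard can match across the embedded space. A hypothetical schema sketch (the fieldType name is made up; note that in Solr 3.x wildcard terms are not run through the analyzer, so the query should be lowercased by the client, e.g. GOK:ia\ 31*):

```xml
<!-- Hypothetical fieldType sketch: keeps "IA 317" as one lowercased
     token so an escaped-space wildcard like GOK:ia\ 31* can match.
     Wildcard queries bypass analysis, so lowercase them client-side. -->
<fieldType name="code_keyword" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="GOK" type="code_keyword" indexed="true" stored="true" multiValued="true"/>
```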
400 MB Fields
Hello, What are the biggest document fields that you've ever indexed in Solr or that you've heard of? Ah, it must be Tom's Hathi trust. :) I'm asking because I just heard of a case of an index where some documents have a field that can be around 400 MB in size! I'm curious if anyone has any experience with such monster fields? Crazy? Yes, sure. Doable? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: 400 MB Fields
From older (2.4) Lucene days, I once indexed the 23-volume Encyclopedia of Michigan Civil War Volunteers in a single document/field, so it's probably within the realm of possibility at least <G>... Erick On Tue, Jun 7, 2011 at 6:59 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hello, What are the biggest document fields that you've ever indexed in Solr or that you've heard of? Ah, it must be Tom's Hathi trust. :) I'm asking because I just heard of a case of an index where some documents have a field that can be around 400 MB in size! I'm curious if anyone has any experience with such monster fields? Crazy? Yes, sure. Doable? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: 400 MB Fields
I think the question is strange... Maybe you are wondering about possible OOM exceptions? I think we can pass to Lucene a single document containing a comma-separated list of term, term, ... (a few billion times)... except for stored fields and the TermVectorComponent... I believe thousands of companies have already indexed millions of documents with an average size of a few hundred megabytes... There should not be any limits (except InputSource vs. ByteArray): 100,000 _unique_ terms vs. a single document containing 100,000,000,000,000 non-unique terms (and trying to store offsets). What about the Spell Checker feature? Has anyone tried to index a single terabyte-size document? Personally, I have indexed only small (up to 1000 bytes) document fields, but I believe 500 MB is a very common use case with PDFs (which vendors use Lucene already? Eclipse? To index the Eclipse Help file? Even Microsoft uses Lucene...) Fuad On 11-06-07 7:02 PM, Erick Erickson erickerick...@gmail.com wrote: From older (2.4) Lucene days, I once indexed the 23 volume Encyclopedia of Michigan Civil War Volunteers in a single document/field, so it's probably within the realm of possibility at least G... Erick On Tue, Jun 7, 2011 at 6:59 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hello, What are the biggest document fields that you've ever indexed in Solr or that you've heard of? Ah, it must be Tom's Hathi trust. :) I'm asking because I just heard of a case of an index where some documents having a field that can be around 400 MB in size! I'm curious if anyone has any experience with such monster fields? Crazy? Yes, sure. Doable? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: 400 MB Fields
Hi, I think the question is strange... May be you are wondering about possible OOM exceptions? No, that's an easier one. I was more wondering whether with 400 MB Fields (indexed, not stored) it becomes incredibly slow to: * analyze * commit / write to disk * search I think we can pass to Lucene single document containing comma separated list of term, term, ... (few billion times)... Except stored and TermVectorComponent... Oh, I know it can be done, but I'm wondering how bad things (like the ones above) get. I believe thousands companies already indexed millions documents with average size few hundreds Mbytes... There should not be any limits (except Which ones are you thinking about? What sort of documents? 100,000 _unique_ terms vs. single document containing 100,000,000,000,000 of non-unique terms (and trying to store offsets) Personally, I indexed only small (up to 1000 bytes) documents-fields, but I believe 500Mb is very common use case with PDFs (which vendors use Nah, PDF files may be big, but I think the text in them is often not *that* big, unless those are PDFs of very big books. Thanks, Otis On 11-06-07 7:02 PM, Erick Erickson erickerick...@gmail.com wrote: From older (2.4) Lucene days, I once indexed the 23 volume Encyclopedia of Michigan Civil War Volunteers in a single document/field, so it's probably within the realm of possibility at least G... Erick On Tue, Jun 7, 2011 at 6:59 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hello, What are the biggest document fields that you've ever indexed in Solr or that you've heard of? Ah, it must be Tom's Hathi trust. :) I'm asking because I just heard of a case of an index where some documents having a field that can be around 400 MB in size! I'm curious if anyone has any experience with such monster fields? Crazy? Yes, sure. Doable? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: 400 MB Fields
Hi Otis, I am recalling the pagination feature; it is still unresolved (with the default scoring implementation): even with small documents, searching and retrieving documents 1 to 10 can take 0 milliseconds, but from 100,000 to 100,010 can take a few minutes (I saw it with the trunk version 6 months ago, and with very small documents, 100 million docs total); it is advisable to restrict search results to the top 1000 in any case (as with Google)... I believe things can go wrong; yes, most plain text retrieved from books should be about 2 KB per page, 500 pages, so about 1,000,000 bytes (or double it for UTF-8). Theoretically, it doesn't make any sense to index a BIG document containing all terms from the dictionary without any term frequency calcs, but even with that... I can't imagine we should index 1000s of docs where each is just a (different) version of the whole Wikipedia; that would be wrong design... Ok, use case: index a single HUGE document. What will we do? Create an index with _the_only_ document? And all searches will return the same result (or nothing)? Paginate it; split it into pages. I am pragmatic... Fuad On 11-06-07 8:04 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, I think the question is strange... May be you are wondering about possible OOM exceptions? No, that's an easier one. I was more wondering whether with 400 MB Fields (indexed, not stored) it becomes incredibly slow to: * analyze * commit / write to disk * search I think we can pass to Lucene single document containing comma separated list of term, term, ... (few billion times)... Except stored and TermVectorComponent...
Re: 400 MB Fields
The Salesforce book is 2800 pages of PDF, last I looked. What can you do
with a field that big? Can you get all of the snippets?

On Tue, Jun 7, 2011 at 5:33 PM, Fuad Efendi f...@efendi.ca wrote:
> Hi Otis,
>
> I am recalling the pagination issue; it is still unresolved (with the
> default scoring implementation): even with small documents,
> searching/retrieving documents 1 to 10 can take 0 milliseconds, but
> retrieving 100,000 to 100,010 can take a few minutes. It is advisable to
> restrict search results to the top 1000 in any case (as Google does)...
>
> [rest of quoted thread snipped]

--
Lance Norskog
goks...@gmail.com
RE: 400 MB Fields
Hi Otis,

Our OCR fields average around 800 KB. My guess is that the largest docs we
index (in a single OCR field) are somewhere between 2 and 10 MB.

We have had issues where the in-memory representation of the document (the
in-memory index structures being built) is several times the size of the
text, so I would suspect that even with the largest ramBufferSizeMB you
might run into problems. (This is with the 3.x branch. Trunk might not have
this problem since it's much more memory efficient when indexing.)

Tom Burton-West
www.hathitrust.org/blogs

From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
Sent: Tuesday, June 07, 2011 6:59 PM
To: solr-user@lucene.apache.org
Subject: 400 MB Fields

> Hello,
>
> What are the biggest document fields that you've ever indexed in Solr, or
> that you've heard of? Ah, it must be Tom's HathiTrust. :)
>
> I'm asking because I just heard of a case of an index where some
> documents have a field that can be around 400 MB in size! I'm curious if
> anyone has any experience with such monster fields? Crazy? Yes, sure.
> Doable?
>
> Otis
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
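For reference, the buffer Tom mentions is configured in solrconfig.xml. A
hedged sketch (the 256 value is purely illustrative, and the enclosing
element varies by Solr version — in the 3.x configs it typically sits under
`<indexDefaults>` rather than `<indexConfig>`); as Tom notes, even a large
value may not help when a single field's in-memory structures are several
times the size of its text:

```xml
<!-- solrconfig.xml: RAM used to buffer added documents before a flush -->
<indexDefaults>
  <ramBufferSizeMB>256</ramBufferSizeMB>
</indexDefaults>
```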
tika integration exception and other related queries
Hi,

Can somebody answer this...

3. Can somebody give me an idea how to do indexing for a zip file?

1. While sending a docx file, we are getting the following error:

java.lang.NumberFormatException: For input string: "2011-01-27T07:18:00Z"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Long.parseLong(Long.java:412)
    at java.lang.Long.parseLong(Long.java:461)
    at org.apache.solr.schema.TrieField.createField(TrieField.java:434)
    at org.apache.solr.schema.SchemaField.createField(SchemaField.java:98)
    at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:204)
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:277)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:198)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:238)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:619)

Thanks
Naveen

On Tue, Jun 7, 2011 at 3:33 PM, Naveen Gupta nkgiit...@gmail.com wrote:
> Hi,
>
> We are using the ExtractingRequestHandler and we are getting the
> following error when giving it a Microsoft docx file for indexing. I
> think this is something to do with the date field definition, but I am
> not very sure... What field type should we use?
>
> 2. We are trying to index a jpg (when we search on the name of the jpg,
> it does not come up, though I am passing an id).
>
> 3. What about zip files or rar files? Does Tika with Solr handle those?
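The trace shows an ISO-8601 timestamp ("2011-01-27T07:18:00Z") being fed to
TrieField and parsed with Long.parseLong, i.e. a Tika-extracted date is
landing in a field the schema declares as a trie *long*. A hedged sketch of
a schema.xml fix, assuming the offending field is Tika's last_modified
metadata (the field and type names here are illustrative, not taken from
Naveen's schema):

```xml
<!-- schema.xml: accept Tika's ISO-8601 timestamps as dates, not longs -->
<fieldType name="tdate" class="solr.TrieDateField"
           precisionStep="6" positionIncrementGap="0"/>

<field name="last_modified" type="tdate" indexed="true" stored="true"/>
```

Alternatively, the extract request can remap or ignore the metadata field
(e.g. via the handler's fmap.* parameters) if the date is not needed.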
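On the zip question: one common approach is to unpack the archive yourself
and post each entry to Solr as its own document; whether a given Solr/Tika
version unpacks archives automatically varies, so treat that as an
assumption to verify. A minimal JDK-only sketch of the unpacking half
(`ZipWalker` is a hypothetical helper; the actual post to /update/extract
is left as a comment):

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Walks a zip stream and collects its file entries; each entry's bytes
 *  would then be sent to Solr (e.g. /update/extract) as a separate doc. */
public class ZipWalker {
    static List<String> entryNames(InputStream in) throws Exception {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                if (!e.isDirectory()) {
                    // here you would read the entry's bytes and post them,
                    // using e.getName() as (part of) the document id
                    names.add(e.getName());
                }
            }
        }
        return names;
    }
}
```

Giving each entry its own id (for example zipname/entryname) also answers
the "search over the name" point: the file name becomes a searchable field
instead of being buried inside the archive.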