Running Solr 4 on Sun vs OpenJDK JVM?
Hi, do you have any advice on operating a Solr 4.0 read-only instance with regards to the underlying JVM? In particular I'm wondering about stability and memory usage, but anything else you might add is welcome, when it comes to OpenJDK vs Sun/Oracle Hotspot, v6 vs v7. What are you running, what would you suggest and why? I tried searching for some information, and the old(?) run-on-sun-java6-jre tip is all I get back. This morning I also read that Oracle JDK v7 is apparently based on OpenJDK sources... so? Thanks, -- Cosimo
Re: adding date column to the index
On 23 July 2013 11:13, Mysurf Mail stammail...@gmail.com wrote: To clarify: I did delete the data in the index and reloaded it (+ commit). (As I said, I have seen it loaded in the sb profiler.) [...] Please share your DIH configuration file, and Solr's schema.xml. It must be that somehow the column is not getting indexed. Regards, Gora
Referring SolrCore properties in ZooKeeper
Hi, I have uploaded the solrconfig.xml, db-data-config.xml and solrcore.properties (ABC.properties) files into ZooKeeper. Below is my solr.xml file:

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores defaultCoreName="ABC" adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:15000}" host="${host:}" hostPort="${port}" hostContext="${hostContext:}">
    <core loadOnStartup="true" instanceDir="ABC" transient="false" name="ABC" properties="$propfile">
      <property name="propfile" value="ABC.properties"/>
    </core>
  </cores>
</solr>

While starting Solr, it was not able to recognize ABC.properties. Am I doing this correctly? Thanks, Sathish
Re: adding date column to the index
Ahaa, I deleted the data folder and now I get Invalid Date String:'2010-01-01 00:00:00 +02:00'. I need to cast it to a Solr date, as I read it into the schema using <field name="LastModificationTime" type="date" indexed="false" stored="true" required="true"/> On Tue, Jul 23, 2013 at 10:50 AM, Gora Mohanty g...@mimirtech.com wrote: On 23 July 2013 11:13, Mysurf Mail stammail...@gmail.com wrote: To clarify: I did delete the data in the index and reloaded it (+ commit). (As I said, I have seen it loaded in the sb profiler.) [...] Please share your DIH configuration file, and Solr's schema.xml. It must be that somehow the column is not getting indexed. Regards, Gora
Re: adding date column to the index
How do I cast datetimeoffset(7) to a Solr date? On Tue, Jul 23, 2013 at 11:11 AM, Mysurf Mail stammail...@gmail.com wrote: Ahaa, I deleted the data folder and now I get Invalid Date String:'2010-01-01 00:00:00 +02:00'. I need to cast it to a Solr date, as I read it into the schema using <field name="LastModificationTime" type="date" indexed="false" stored="true" required="true"/> On Tue, Jul 23, 2013 at 10:50 AM, Gora Mohanty g...@mimirtech.com wrote: On 23 July 2013 11:13, Mysurf Mail stammail...@gmail.com wrote: To clarify: I did delete the data in the index and reloaded it (+ commit). (As I said, I have seen it loaded in the sb profiler.) [...] Please share your DIH configuration file, and Solr's schema.xml. It must be that somehow the column is not getting indexed. Regards, Gora
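Not from the thread, but as a sketch of the conversion being asked for: Solr's date type expects UTC ISO-8601 (e.g. 2010-01-01T00:00:00Z), so a SQL Server datetimeoffset value can be normalized in client code or a DIH transformer before indexing. In plain Java (the XXX pattern assumes Java 7):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SolrDateConvert {
    public static void main(String[] args) throws Exception {
        // Parse the offset form the database hands back...
        SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss XXX");
        Date d = in.parse("2010-01-01 00:00:00 +02:00");
        // ...and re-emit it as the UTC form Solr's date type accepts.
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        out.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(out.format(d)); // prints 2009-12-31T22:00:00Z
    }
}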
Indexing Oracle Database in Solr using Data Import Handler
I'm trying to index an Oracle Database 10g XE using Solr's Data Import Handler. My data-config.xml looks like this:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="oracle.jdbc.OracleDriver" url="jdbc:oracle:thin:@XXX.XXX.XXX.XXX::xe" user="XX" password="XX"/>
  <document name="product_info">
    <entity name="product" query="select * from product">
      <field column="pid" name="id"/>
      <field column="pname" name="itemName"/>
      <field column="initqty" name="itemQuantity"/>
      <field column="remQty" name="remQuantity"/>
      <field column="price" name="itemPrice"/>
      <field column="specification" name="specifications"/>
    </entity>
  </document>
</dataConfig>

My schema.xml looks like this:

<field name="id" type="text_general" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="itemName" type="text_general" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true"/>
<field name="itemQuantity" type="text_general" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true"/>
<field name="remQuantity" type="text_general" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true"/>
<field name="itemPrice" type="text_general" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true"/>
<field name="specifications" type="text_general" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true"/>
<field name="brand" type="text_general" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true"/>
<field name="itemCategory" type="text_general" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true"/>

Now when I try to index it, Solr is not able to read the columns of the table and therefore indexing fails. It says that the document is missing the unique key id which, as you can see, is clearly present in the document. Also, generally when such an exception is thrown, the log clearly shows which fields were picked up for the document. However, in this case no fields are being read. But if I change my query then everything works perfectly. The modified data-config.xml:

<dataConfig>
  <dataSource name="db1" type="JdbcDataSource" driver="oracle.jdbc.OracleDriver" url="jdbc:oracle:thin:@XXX.XXX.XX.XX::xe" user="" password="X"/>
  <document name="product_info">
    <entity name="products" dataSource="db1" query="select pid as id, pname as itemName, initqty as itemQuantity, remqty as remQuantity, price as itemPrice, specification as specifications from product">
      <field column="id" name="id"/>
      <field column="itemName" name="itemName"/>
      <field column="itemQuantity" name="itemQuantity"/>
      <field column="remQuantity" name="remQuantity"/>
      <field column="itemPrice" name="itemPrice"/>
      <field column="specifications" name="specifications"/>
    </entity>
  </document>
</dataConfig>

Why is this happening? How do I solve it? How does giving an alias affect the indexing process? Thanks in advance
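One plausible explanation, offered as an assumption rather than a confirmed diagnosis: Oracle's JDBC driver reports unquoted column identifiers in upper case (PID, PNAME, ...), and DIH's column matching can be case-sensitive, so column="pid" never matches, while aliasing makes the result set carry exactly the names written in the query. A quick way to check what the driver actually returns (connection details are placeholders; requires the Oracle JDBC jar on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;

public class ColumnCaseCheck {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@localhost:1521:xe", "user", "pass");
        ResultSet rs = conn.createStatement().executeQuery("select * from product");
        ResultSetMetaData md = rs.getMetaData();
        for (int i = 1; i <= md.getColumnCount(); i++) {
            // If this prints PID rather than pid, then column="pid" in
            // data-config.xml will not match without an alias.
            System.out.println(md.getColumnLabel(i));
        }
        conn.close();
    }
}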
Document Similarity Algorithm at Solr/Lucene
Hi; sometimes a huge part of a document may exist in another document, as in student plagiarism or the quotation of one blog post in another. Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class to detect this?
Re: Document Similarity Algorithm at Solr/Lucene
Hi, you may leverage and/or improve the MLT component [1]. HTH, Tommaso [1]: http://wiki.apache.org/solr/MoreLikeThis 2013/7/23 Furkan KAMACI furkankam...@gmail.com Hi; sometimes a huge part of a document may exist in another document, as in student plagiarism or the quotation of one blog post in another. Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class to detect this?
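For reference, a minimal SolrJ sketch of driving the MLT component; the seed document id, field name, and URL are hypothetical, and the thresholds are starting points rather than recommendations:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MltExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("id:blog-post-42"); // hypothetical seed document
        q.set("mlt", true);
        q.set("mlt.fl", "content"); // field to mine for "interesting" terms
        q.set("mlt.mintf", 2);      // term must appear at least twice in the seed doc
        q.set("mlt.mindf", 5);      // ...and in at least five documents overall
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getResponse().get("moreLikeThis"));
    }
}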
RE: Solr 4.1.0 not using solrcore.properties ?
Hi, can anyone help on how to refer to the solrcore.properties uploaded into ZooKeeper?
Re: Document Similarity Algorithm at Solr/Lucene
Actually I need a specialized algorithm. I want to use that algorithm to detect duplicate blog posts. 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com Hi, you may leverage and/or improve the MLT component [1]. HTH, Tommaso [1]: http://wiki.apache.org/solr/MoreLikeThis 2013/7/23 Furkan KAMACI furkankam...@gmail.com Hi; sometimes a huge part of a document may exist in another document, as in student plagiarism or the quotation of one blog post in another. Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class to detect this?
problems about solr replication in 4.3
Hi all, I have two Solr instances, one master and one slave (replication). Under version 3.5 they worked fine. When I upgraded to 4.3, I found that while the slave is copying the index from the master, it cleans its current index and copies the new version into its own folder, and the slave can't search during this process! I am new to Solr 4; is this normal? Any ideas? Thanks!
facet.maxcount ?
Hi all happy Solr users! I was wondering if it's possible to have some sort of facet.maxcount equivalent? In short, it would exclude from the facet any term (or query) that matches at least facet.maxcount times. That facet.maxcount would probably significantly improve the performance of requests of the type: I want the facet values with zero results. Is it a mad idea or does it make some sort of sense? Cheers, Jerome. -- Jerome Eteve +44(0)7738864546 http://www.eteve.net/
RE: facet.maxcount ?
Hi - No but there are two unresolved issues about this topic: https://issues.apache.org/jira/browse/SOLR-4411 https://issues.apache.org/jira/browse/SOLR-4411 Cheers -Original message- From:Jérôme Étévé jerome.et...@gmail.com Sent: Tuesday 23rd July 2013 12:58 To: solr-user@lucene.apache.org Subject: facet.maxcount ? Hi all happy Solr users! I was wondering if it's possible to have some sort of facet.maxcount equivalent? In short, that would exclude from the facet any term (or query) that matches at least facet.maxcount times. That facet.maxcount would probably significantly improve the performance of request of the type: I want the facet values with zero result. Is it a mad idea or does it make some sort of sense? Cheers, Jerome. -- Jerome Eteve +44(0)7738864546 http://www.eteve.net/
RE: facet.maxcount ?
Eeh, here's the other one: https://issues.apache.org/jira/browse/SOLR-1712 -Original message- From:Markus Jelsma markus.jel...@openindex.io Sent: Tuesday 23rd July 2013 13:18 To: solr-user@lucene.apache.org Subject: RE: facet.maxcount ? Hi - No but there are two unresolved issues about this topic: https://issues.apache.org/jira/browse/SOLR-4411 https://issues.apache.org/jira/browse/SOLR-4411 Cheers -Original message- From:Jérôme Étévé jerome.et...@gmail.com Sent: Tuesday 23rd July 2013 12:58 To: solr-user@lucene.apache.org Subject: facet.maxcount ? Hi all happy Solr users! I was wondering if it's possible to have some sort of facet.maxcount equivalent? In short, that would exclude from the facet any term (or query) that matches at least facet.maxcount times. That facet.maxcount would probably significantly improve the performance of request of the type: I want the facet values with zero result. Is it a mad idea or does it make some sort of sense? Cheers, Jerome. -- Jerome Eteve +44(0)7738864546 http://www.eteve.net/
Appending *-wildcard suffix on all terms for querying: move logic from client to server side
My client has an installation with 3 different client applications using the same Solr index. These clients all append a * wildcard suffix in the query: the user enters abc def while the search is performed against (abc* def*). We'd like to move the clients away from this wildcard search at the moment we implement a new index. However, at that time the client apps will still need to use this wildcard-suffix search. So the goal is to make the wildcard behaviour (append a * suffix when not already present) configurable on the server side. I thought a tokenizer would do the work, but as wildcard searches are detected before analyzers do their work, this is not an option. Can I enable this without coding? Or should I use a (custom) function query or a custom search handler? Any thought is appreciated. - Kind regards, Paul Blanchaert
Re: highlighting required in document
You just need to specify the emphasizing tag in the hl params by adding something like this to your query: hl.fl=content&hl.simple.pre=<b>&hl.simple.post=</b> Check the Solr admin page, the querying item; it shows the constructed query, so you don't need to guess! Regards, Dmitry On Mon, Jul 22, 2013 at 10:31 AM, Jamshaid Ashraf jamshaid...@gmail.com wrote: Hi, I'm using Solr 4.3.0; the following is the response to a hit-highlighting request: Request: http://localhost:8080/solr/collection2/select?q=content:ps4&hl=true Response: <doc> <arr name="content"><str>This post is regarding ps4 accuracy and qulaity which is smooth and factastic</str></arr> </doc> <lst name="highlighting"> <lst name="1"> <arr name="content"><str>This post is regarding <b>ps4</b> accuracy and qulaity which is smooth and factastic</str></arr> </lst> I wanted a result like this: <doc> <arr name="content"><str>This post is regarding <b>ps4</b> accuracy and qulaity which is smooth and factastic</str></arr> </doc> <lst name="highlighting"> <lst name="1"> <arr name="content"><str>This post is regarding <b>ps4</b> accuracy and qulaity which is smooth and factastic</str></arr> </lst> Thanks in advance! Regards, Jamshaid
Re: highlighting required in document
Ah, I think I misread your question. So your question is actually: how to make Solr embed the highlighting into the doc response itself. I'm not aware of such functionality; this is why you have the highlighting section in your response. On Tue, Jul 23, 2013 at 2:30 PM, Dmitry Kan solrexp...@gmail.com wrote: You just need to specify the emphasizing tag in the hl params by adding something like this to your query: hl.fl=content&hl.simple.pre=<b>&hl.simple.post=</b> Check the Solr admin page, the querying item; it shows the constructed query, so you don't need to guess! Regards, Dmitry On Mon, Jul 22, 2013 at 10:31 AM, Jamshaid Ashraf jamshaid...@gmail.com wrote: Hi, I'm using Solr 4.3.0; the following is the response to a hit-highlighting request: Request: http://localhost:8080/solr/collection2/select?q=content:ps4&hl=true Response: <doc> <arr name="content"><str>This post is regarding ps4 accuracy and qulaity which is smooth and factastic</str></arr> </doc> <lst name="highlighting"> <lst name="1"> <arr name="content"><str>This post is regarding <b>ps4</b> accuracy and qulaity which is smooth and factastic</str></arr> </lst> I wanted a result like this: <doc> <arr name="content"><str>This post is regarding <b>ps4</b> accuracy and qulaity which is smooth and factastic</str></arr> </doc> <lst name="highlighting"> <lst name="1"> <arr name="content"><str>This post is regarding <b>ps4</b> accuracy and qulaity which is smooth and factastic</str></arr> </lst> Thanks in advance! Regards, Jamshaid
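Merging the snippets back into the documents client-side is only a few lines of SolrJ; a sketch using the field and core names from the thread (everything else hypothetical):

import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class MergeHighlights {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr/collection2");
        SolrQuery q = new SolrQuery("content:ps4");
        q.setHighlight(true);
        q.addHighlightField("content");
        q.setHighlightSimplePre("<b>");
        q.setHighlightSimplePost("</b>");
        QueryResponse rsp = server.query(q);
        for (SolrDocument doc : rsp.getResults()) {
            String id = (String) doc.getFieldValue("id");
            Map<String, List<String>> hl = rsp.getHighlighting().get(id);
            if (hl != null && hl.get("content") != null) {
                // Overwrite the stored value with the highlighted snippet.
                doc.setField("content", hl.get("content").get(0));
            }
        }
    }
}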
Re: facet.maxcount ?
Thanks! On 23 July 2013 12:19, Markus Jelsma markus.jel...@openindex.io wrote: Eeh, here's the other one: https://issues.apache.org/jira/browse/SOLR-1712 -Original message- From:Markus Jelsma markus.jel...@openindex.io Sent: Tuesday 23rd July 2013 13:18 To: solr-user@lucene.apache.org Subject: RE: facet.maxcount ? Hi - No but there are two unresolved issues about this topic: https://issues.apache.org/jira/browse/SOLR-4411 https://issues.apache.org/jira/browse/SOLR-4411 Cheers -Original message- From:Jérôme Étévé jerome.et...@gmail.com Sent: Tuesday 23rd July 2013 12:58 To: solr-user@lucene.apache.org Subject: facet.maxcount ? Hi all happy Solr users! I was wondering if it's possible to have some sort of facet.maxcount equivalent? In short, that would exclude from the facet any term (or query) that matches at least facet.maxcount times. That facet.maxcount would probably significantly improve the performance of request of the type: I want the facet values with zero result. Is it a mad idea or does it make some sort of sense? Cheers, Jerome. -- Jerome Eteve +44(0)7738864546 http://www.eteve.net/ -- Jerome Eteve +44(0)7738864546 http://www.eteve.net/
Re: how to improve (keyword) relevance?
Another thing I've seen people do is something like text:(test AND pdf)^10 text:(test pdf), so docs with both terms in the text field get boosted a lot, but docs with either one will still get found. But as Jack says, you have to demonstrate a problem before you propose a solution. You say a lot of people are concerned about improving relevance. Just get them to define a good set of search results. Bet they can't, except by looking at specific result lists and saying I like that one more than this one. You gotta quantify this somehow, do A/B testing, whatever, or you'll go mad. Erick On Mon, Jul 22, 2013 at 12:47 PM, Jack Krupansky j...@basetechnology.com wrote: Again, you haven't indicated what the problem is. I mean, have you actually confirmed that a problem exists? Add debugQuery=true to your query and examine the explain section if you believe that Solr has improperly computed any document scores. If you simply want to boost a term in a query, use the ^ operator, which applies to the preceding term. A boost of 1.0 means no change, 2.0 means double, 0.5 means cut in half. But, you don't need to boost. Relevancy is based on the data in the documents themselves. BTW, q=text%3Atest+pdf does not search for pdf in the text field - field-qualification only applies to a single term, but you can use parentheses: q=text%3A(test+pdf) -- Jack Krupansky -Original Message- From: eShard Sent: Monday, July 22, 2013 12:34 PM To: solr-user@lucene.apache.org Subject: Re: how to improve (keyword) relevance? Sure, let's say the user types in test pdf; we need the results with all the query words to be near the top of the result set. The query will look like this: /select?q=text%3Atest+pdf&wt=xml How do I ensure that the top result set contains all of the query words? How can I boost the first (or second) term when they are both in the same field (i.e. text)? Does this make sense? Please bear with me; I'm still new to the Solr query syntax so I don't even know if I'm asking the right question. Thanks,
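In SolrJ terms, Erick's both-terms boost combined with Jack's debugQuery advice would look roughly like this (field name from the thread):

import org.apache.solr.client.solrj.SolrQuery;

public class BoostBothTerms {
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery("text:(test AND pdf)^10 text:(test pdf)");
        q.set("debugQuery", true); // inspect the explain section of the response
        return q;
    }
}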
Re: IllegalStateException
There has been a _ton_ of work since 4.0, and 4.4 will be out in a day or two. I suspect the best advice is to try 4.4... Best Erick On Mon, Jul 22, 2013 at 2:54 PM, Michael Long ml...@mlong.us wrote: I'm seeing random crashes in solr 4.0 but I don't have anything to go on other than IllegalStateException. Other than checking for corrupt index and out of memory, what other things should I check? org.apache.catalina.core.StandardWrapperValve invoke SEVERE: Servlet.service() for servlet default threw exception java.lang.IllegalStateException at org.apache.catalina.connector.ResponseFacade.sendError(ResponseFacade.java:407) at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:483) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:297) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) at java.lang.Thread.run(Thread.java:662)
Re: how to improve (keyword) relevance?
To add to what Erick said, that *quantifying* is hugely important! How do you measure your search relevance improvements? How are you currently measuring it? How will you see, after you apply any changes, whether relevance was improved and how much? How will you know whether, even for the test queries you are using to evaluate relevance, the end users also see the same sort of improvements, or whether you improved your test queries but made no difference overall, or maybe even made things worse? ... Have a look at: * http://www.slideshare.net/sematext/tag/analytics * http://sematext.com/search-analytics/index.html - it's free, and we regularly use it with our clients with great success Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Tue, Jul 23, 2013 at 7:50 AM, Erick Erickson erickerick...@gmail.com wrote: Another thing I've seen people do is something like text:(test AND pdf)^10 text:(test pdf), so docs with both terms in the text field get boosted a lot, but docs with either one will still get found. But as Jack says, you have to demonstrate a problem before you propose a solution. You say a lot of people are concerned about improving relevance. Just get them to define a good set of search results. Bet they can't, except by looking at specific result lists and saying I like that one more than this one. You gotta quantify this somehow, do A/B testing, whatever, or you'll go mad. Erick On Mon, Jul 22, 2013 at 12:47 PM, Jack Krupansky j...@basetechnology.com wrote: Again, you haven't indicated what the problem is. I mean, have you actually confirmed that a problem exists? Add debugQuery=true to your query and examine the explain section if you believe that Solr has improperly computed any document scores. If you simply want to boost a term in a query, use the ^ operator, which applies to the preceding term. A boost of 1.0 means no change, 2.0 means double, 0.5 means cut in half. But, you don't need to boost. Relevancy is based on the data in the documents themselves. BTW, q=text%3Atest+pdf does not search for pdf in the text field - field-qualification only applies to a single term, but you can use parentheses: q=text%3A(test+pdf) -- Jack Krupansky -Original Message- From: eShard Sent: Monday, July 22, 2013 12:34 PM To: solr-user@lucene.apache.org Subject: Re: how to improve (keyword) relevance? Sure, let's say the user types in test pdf; we need the results with all the query words to be near the top of the result set. The query will look like this: /select?q=text%3Atest+pdf&wt=xml How do I ensure that the top result set contains all of the query words? How can I boost the first (or second) term when they are both in the same field (i.e. text)? Does this make sense? Please bear with me; I'm still new to the Solr query syntax so I don't even know if I'm asking the right question. Thanks,
Re: Solr 4.3.1 - SolrCloud nodes down and lost documents
Neil: Here's a must-read blog about why allocating more memory to the JVM than Solr requires is a Bad Thing: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html It turns out that you actually do yourself harm by allocating more memory to the JVM than it really needs. Of course the problem is figuring out how much it really needs, which is pretty tricky. Your long GC pauses _might_ be ameliorated by allocating _less_ memory to the JVM, counterintuitive as that seems. Best Erick On Mon, Jul 22, 2013 at 5:05 PM, Neil Prosser neil.pros...@gmail.com wrote: I just have a little python script which I run with cron (luckily that's the granularity we have in Graphite). It reads the same JSON the admin UI displays and dumps numeric values into Graphite. I can open source it if you like. I just need to make sure I remove any hacks/shortcuts that I've taken because I'm working with our cluster! On 22 July 2013 19:26, Lance Norskog goks...@gmail.com wrote: Are you feeding Graphite from Solr? If so, how? On 07/19/2013 01:02 AM, Neil Prosser wrote: That was overnight so I was unable to track exactly what happened (I'm going off our Graphite graphs here).
Re: Appending *-wildcard suffix on all terms for querying: move logic from client to server side
It can be done by extending LuceneQParser/SolrQueryParser; see http://wiki.apache.org/solr/SolrPlugins#QParserPlugin. There is a newTermQuery(Term) method; it should be overridden to delegate to the newPrefixQuery() method. Overall, I suggest you consider using EdgeNGramTokenFilter at index time, and then searching with plain term queries. On Tue, Jul 23, 2013 at 2:05 PM, Paul Blanchaert p...@amosis.eu wrote: My client has an installation with 3 different client applications using the same Solr index. These clients all append a * wildcard suffix in the query: the user enters abc def while the search is performed against (abc* def*). We'd like to move the clients away from this wildcard search at the moment we implement a new index. However, at that time the client apps will still need to use this wildcard-suffix search. So the goal is to make the wildcard behaviour (append a * suffix when not already present) configurable on the server side. I thought a tokenizer would do the work, but as wildcard searches are detected before analyzers do their work, this is not an option. Can I enable this without coding? Or should I use a (custom) function query or a custom search handler? Any thought is appreciated. - Kind regards, Paul Blanchaert -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
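A rough Lucene-level sketch of the override Mikhail describes; wiring it into Solr still needs a QParserPlugin around it, so treat this as an illustration rather than a drop-in class:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class SuffixWildcardQueryParser extends QueryParser {
    public SuffixWildcardQueryParser(String field) {
        super(Version.LUCENE_43, field, new StandardAnalyzer(Version.LUCENE_43));
    }

    @Override
    protected Query newTermQuery(Term term) {
        // Treat every bare term as an implicit prefix query: abc -> abc*
        return newPrefixQuery(term);
    }
}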
Re: Running Solr 4 on Sun vs OpenJDK JVM?
Hi Cosimo, Very simple: Oracle 1.7 is your best bet. If you have a large heap and are seeing STW pauses, try G1 - we've been using it and have been happy with it. Ciao, Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Tue, Jul 23, 2013 at 3:17 AM, Cosimo Streppone cos...@streppone.it wrote: Hi, do you have any advice on operating a Solr 4.0 read-only instance with regards to the underlying JVM? In particular I'm wondering about stability and memory usage, but anything else you might add is welcome, when it comes to OpenJDK vs Sun/Oracle Hotspot, v6 vs v7. What are you running, what would you suggest and why? I tried searching for some information, and the old(?) run-on-sun-java6-jre tip is all I get back. This morning I also read that Oracle JDK v7 is apparently based on OpenJDK sources... so? Thanks, -- Cosimo
Re: Solr 4.3.1 - SolrCloud nodes down and lost documents
Hi, On Tue, Jul 23, 2013 at 8:02 AM, Erick Erickson erickerick...@gmail.com wrote: Neil: Here's a must-read blog about why allocating more memory to the JVM than Solr requires is a Bad Thing: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html It turns out that you actually do yourself harm by allocating more memory to the JVM than it really needs. Of course the problem is figuring out how much it really needs, which if pretty tricky. Your long GC pauses _might_ be ameliorated by allocating _less_ memory to the JVM, counterintuitive as that seems. or by using G1 :) See http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/ Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Mon, Jul 22, 2013 at 5:05 PM, Neil Prosser neil.pros...@gmail.com wrote: I just have a little python script which I run with cron (luckily that's the granularity we have in Graphite). It reads the same JSON the admin UI displays and dumps numeric values into Graphite. I can open source it if you like. I just need to make sure I remove any hacks/shortcuts that I've taken because I'm working with our cluster! On 22 July 2013 19:26, Lance Norskog goks...@gmail.com wrote: Are you feeding Graphite from Solr? If so, how? On 07/19/2013 01:02 AM, Neil Prosser wrote: That was overnight so I was unable to track exactly what happened (I'm going off our Graphite graphs here).
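For concreteness, a launch line combining both suggestions, a deliberately modest heap plus G1, might look like java -Xms4g -Xmx4g -XX:+UseG1GC -jar start.jar where the heap sizes are placeholders to tune against your own measurements, not recommendations.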
Re: softCommit doesn't work - ?
First a minor nit: the server.add(doc, time) is a hard commit, not a soft one. But to the rest of it: when you add your 70 docs, do they all have the same id (i.e. the uniqueKey field)? If so, there will be only one document, the last one, since all the earlier ones will be overwritten. Not quite sure why your first example doesn't work; it should. Although killing the process before the commit completes will lose documents in the uncommitted segments. Best Erick On Mon, Jul 22, 2013 at 8:45 PM, tskom tsiedlac...@hotmail.co.uk wrote: Hi, I use Solr 4.3.1. I tried to index about 70 documents using softCommit as below:

SolrInputDocument doc = new SolrInputDocument();
result = fillMetaData(request, doc); // custom one
int softCommit = 1;
solrServer.add(doc, softCommit);

The process ran very fast but there is nothing in the index, neither after 10 sec nor after restarting the server application. In the Solr log I got something like this: 2013-07-23 01:58:01,543 INFO [org.apache.solr.update.processor.LogUpdateProcessor] (http-127.0.0.1-8090-5) [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {add=[Rep_CA_FairyCakes (1441307014244335616)]} 0 3 2013-07-23 01:58:01,546 INFO [org.apache.solr.update.UpdateHandler] (http-127.0.0.1-8090-5) start rollback{} 2013-07-23 01:58:01,547 INFO [org.apache.solr.update.DefaultSolrCoreState] (http-127.0.0.1-8090-5) Creating new IndexWriter... 2013-07-23 01:58:01,547 INFO [org.apache.solr.update.DefaultSolrCoreState] (http-127.0.0.1-8090-5) Waiting until IndexWriter is unused... core=collection1 2013-07-23 01:58:01,547 INFO [org.apache.solr.update.DefaultSolrCoreState] (http-127.0.0.1-8090-5) Rollback old IndexWriter... core=collection1 2013-07-23 01:58:01,617 INFO [org.apache.solr.core.SolrCore] (http-127.0.0.1-8090-5) SolrDeletionPolicy.onInit: commits:num=1 commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@C:\solr\data\index lockFactory=org.apache.lucene.store.NativeFSLockFactory@7ed1f882; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ew,generation=536,filenames=[... several hundred segment file names elided ...]
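Note that the int argument to SolrJ's add(doc, int) is a commitWithin window in milliseconds, so softCommit = 1 asks Solr to commit within 1 ms; it is not a soft-commit flag. A sketch of the explicit alternatives (assumes an existing server and document):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitFlavors {
    public static void index(SolrServer server, SolrInputDocument doc) throws Exception {
        server.add(doc, 10000);          // commitWithin: ask for a commit within 10 s
        server.commit(true, true, true); // explicit commit: waitFlush, waitSearcher, softCommit
    }
}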
Re: Question about field boost
This isn't doing what you think: title^10 content is actually parsed as text:title^10 text:content, where text is my default search field, assuming title is a field. If you look a little farther up the debug output you'll see that. You probably want title:content^100 or some such? Erick On Tue, Jul 23, 2013 at 1:43 AM, Jack Krupansky j...@basetechnology.com wrote: That means that for that document china occurs in the title vs. snowden found in a document but not in the title. -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Tuesday, July 23, 2013 12:52 AM To: solr-user@lucene.apache.org Subject: Re: Question about field boost Is my reading correct that the boost is only applied on china but not snowden? How can that be? My query is: q=china+snowden&qf=title^10 content On Mon, Jul 22, 2013 at 9:43 PM, Joe Zhang smartag...@gmail.com wrote: Thanks for your hint, Jack. Here are the debug results, which I'm having a hard time deciphering (the two terms are china and snowden)...

0.26839527 = (MATCH) sum of:
  0.26839527 = (MATCH) sum of:
    0.26757246 = (MATCH) max of:
      7.9147343E-4 = (MATCH) weight(content:china in 249), product of:
        0.019873314 = queryWeight(content:china), product of:
          1.6649085 = idf(docFreq=46832, maxDocs=91058)
          0.01193658 = queryNorm
        0.039825942 = (MATCH) fieldWeight(content:china in 249), product of:
          4.8989797 = tf(termFreq(content:china)=24)
          1.6649085 = idf(docFreq=46832, maxDocs=91058)
          0.0048828125 = fieldNorm(field=content, doc=249)
      0.26757246 = (MATCH) weight(title:china^10.0 in 249), product of:
        0.5836803 = queryWeight(title:china^10.0), product of:
          10.0 = boost
          4.8898454 = idf(docFreq=1861, maxDocs=91058)
          0.01193658 = queryNorm
        0.45842302 = (MATCH) fieldWeight(title:china in 249), product of:
          1.0 = tf(termFreq(title:china)=1)
          4.8898454 = idf(docFreq=1861, maxDocs=91058)
          0.09375 = fieldNorm(field=title, doc=249)
    8.2282536E-4 = (MATCH) max of:
      8.2282536E-4 = (MATCH) weight(content:snowden in 249), product of:
        0.03407834 = queryWeight(content:snowden), product of:
          2.8549502 = idf(docFreq=14246, maxDocs=91058)
          0.01193658 = queryNorm
        0.024145111 = (MATCH) fieldWeight(content:snowden in 249), product of:
          1.7320508 = tf(termFreq(content:snowden)=3)
          2.8549502 = idf(docFreq=14246, maxDocs=91058)
          0.0048828125 = fieldNorm(field=content, doc=249)

On Mon, Jul 22, 2013 at 9:27 PM, Jack Krupansky j...@basetechnology.com wrote: Maybe you're not doing anything wrong - other than having an artificial expectation of what the true relevance of your data actually is. Many factors go into relevance scoring. You need to look at all aspects of your data. Maybe your terms don't occur in your titles the way you think they do. Maybe you need a boost of 500 or more... Lots of potential maybes. Relevancy tuning is an art and craft, hardly a science. Step one: Know your data, inside and out. Use the debugQuery=true parameter on your queries and see how much of the score is dominated by your query terms in the non-title fields. -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Monday, July 22, 2013 11:06 PM To: solr-user@lucene.apache.org Subject: Question about field boost Dear Solr experts: Here is my query: defType=dismax&q=term1+term2&qf=title^100 content Apparently (at least I thought) my intention is to boost the title field. While I'm getting some non-trivial results, I'm surprised that the documents with both term1 and term2 in the title (I know such docs do exist in my repository) were not returned (or maybe ranked very low). The situation does not change even when I use much larger boost factors. What am I doing wrong?
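For completeness, a SolrJ sketch of the dismax-style request under discussion, with the parser and qf made explicit (boost value taken from the thread):

import org.apache.solr.client.solrj.SolrQuery;

public class FieldBoostQuery {
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery("china snowden");
        q.set("defType", "edismax");
        q.set("qf", "title^10 content"); // boost title matches tenfold
        q.set("debugQuery", true);       // check the parsedquery section
        return q;
    }
}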
solr - Deleting a row from the index, using the configuration files only.
I am updating my Solr index using the deltaQuery and deltaImportQuery attributes in data-config.xml. In my condition I write where MyDoc.LastModificationTime > '${dataimporter.last_index_time}' and after I add a row I trigger an update using data-config.xml. Now, sometimes I delete a row. How can I implement this with the configuration files only (without sending a delete REST command to Solr)? Let's say my object is not deleted but its status is changed to deleted. I don't index that status field, as I want to hold only the live rows (otherwise I could have just filtered on it). Is there a way to do it? Thanks.
Re: dataimporter, custom fields and parsing error
I have tried post.jar and it works when I set the literal.id in solrconfig.xml. I can't pass the id with post.jar (-Dparams=literal.id=abc) because I get an error: could not find or load main class .id=abc. On 20. Jul 2013, at 7:05 PM, Andreas Owen wrote: path was set, text wasn't, but it doesn't make a difference. My importer says 1 row fetched, 0 docs processed, 0 docs skipped. I don't understand how it can have 2 docs indexed with such an output. On 20. Jul 2013, at 12:47 PM, Shalin Shekhar Mangar wrote: Are the path and text fields set to stored in the schema.xml? On Sat, Jul 20, 2013 at 3:37 PM, Andreas Owen a...@conx.ch wrote: They are in my schema; path is typed correctly and the others are default fields which already exist. All the other fields are populated and I can search for them; just path and text aren't. On 19. Jul 2013, at 6:16 PM, Alexandre Rafalovitch wrote: Dumb question: they are in your schema? Spelled right, in the right section, using types also defined? Can you populate them by hand with a CSV file and post.jar? Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Fri, Jul 19, 2013 at 12:09 PM, Andreas Owen a...@conx.ch wrote: I'm using Solr 4.3, which I just downloaded today, and am using only jars that came with it. I have enabled the dataimporter and it runs without error, but the field path (included in schema.xml) and text (file content) aren't indexed. What am I doing wrong? solr-path: C:\ColdFusion10\cfusion\jetty-new collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1 pdf-doc-path: C:\web\development\tkb\internet\public data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/albums/album" dataSource="main"> <!-- transformer="script:GenerateId" -->
      <field column="title" xpath="//title"/>
      <field column="id" xpath="//file"/>
      <field column="path" xpath="//path"/>
      <field column="Author" xpath="//author"/>
      <!-- <field column="tstamp">2013-07-05T14:59:46.889Z</field> -->
      <entity name="tika" processor="TikaEntityProcessor" url="../../../../../web/development/tkb/internet/public/${rec.path}/${rec.id}" dataSource="data">
        <field column="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

docImportUrl.xml:

<?xml version="1.0" encoding="utf-8"?>
<albums>
  <album>
    <author>Peter Z.</author>
    <title>Beratungsseminar kundenbrief</title>
    <description>wie kommuniziert man</description>
    <file>0226520141_e-banking_Checkliste_CLX.Sentinel.pdf</file>
    <path>download/online</path>
  </album>
  <album>
    <author>Marcel X.</author>
    <title>kuchen backen</title>
    <description>torten, kuchen, gebäck ...</description>
    <file>Kundenbrief.pdf</file>
    <path>download/online</path>
  </album>
</albums>

-- Regards, Shalin Shekhar Mangar.
zkHost in solr.xml goes missing after SPLITSHARD using Collections API
Hello all, Every time I issue a SPLITSHARD using Collections API, the zkHost attribute in the solr.xml goes missing. I have to manually edit the solr.xml to add zkHost after every SPLITSHARD. Any thoughts on what could be causing this? Thanks.
Start independent Zookeeper from within Solr install
Assumptions: * you currently have two choices to start Zookeeper: run it embedded within Solr, or download it from the ZooKeeper site and start it independently. * everything you need to run ZooKeeper (embedded or not) is included within the Solr distribution Assuming I've got the above right, then currently starting an embedded ZooKeeper is easy (-DzkRun), and starting an ensemble is irritatingly complex. So, my question is, how hard would it be to start Zookeeper without Solr, but from within the Solr codebase? -DensembleOnly or some such, causes Solr not to load, but Zookeeper still starts. I'm assuming that Jetty would still listen on port 8983, but it wouldn't initialise the Solr webapp: java -DzkRun -DzkEnsembleOnly -DzkHosts=zkhost01:9983,zkhost02:9983,zkhost03:9983 -jar start.jar Is this possible? If it is, I'm happy to have a go at making it happen. Upayavira
filter query result by user
I want to restrict the returned results to only the documents that were created by the user. So I load the CreatedBy attribute into the index and set it to indexed="false", stored="true": <field name="CreatedBy" type="string" indexed="false" stored="true" required="true"/> Then I want to filter by CreatedBy, so in the dashboard I check edismax and add CreatedBy:user1 to the qf field. The resulting query is http:// :8983/solr/vault/select?q=*%3A*&defType=edismax&qf=CreatedBy%3Auser1 Nothing is filtered; all rows are returned. What am I doing wrong?
Re: filter query result by user
Simple: the field needs to be indexed in order to search (or filter) on it. On Tue, Jul 23, 2013 at 3:26 PM, Mysurf Mail stammail...@gmail.com wrote: I want to restrict the returned results to only the documents that were created by the user. So I load the CreatedBy attribute into the index and set it to indexed="false", stored="true": <field name="CreatedBy" type="string" indexed="false" stored="true" required="true"/> Then I want to filter by CreatedBy, so in the dashboard I check edismax and add CreatedBy:user1 to the qf field. The resulting query is http:// :8983/solr/vault/select?q=*%3A*&defType=edismax&qf=CreatedBy%3Auser1 Nothing is filtered; all rows are returned. What am I doing wrong?
Re: filter query result by user
Moreover, you may want to use fq=CreatedBy:user1 for filtering. Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Tue, Jul 23, 2013 at 9:28 AM, Raymond Wiker rwi...@gmail.com wrote: Simple: the field needs to be indexed in order to search (or filter) on it. On Tue, Jul 23, 2013 at 3:26 PM, Mysurf Mail stammail...@gmail.com wrote: I want to restrict the returned results to only the documents that were created by the user. So I load the CreatedBy attribute into the index and set it to indexed="false", stored="true": <field name="CreatedBy" type="string" indexed="false" stored="true" required="true"/> Then I want to filter by CreatedBy, so in the dashboard I check edismax and add CreatedBy:user1 to the qf field. The resulting query is http:// :8983/solr/vault/select?q=*%3A*&defType=edismax&qf=CreatedBy%3Auser1 Nothing is filtered; all rows are returned. What am I doing wrong?
Re: filter query result by user
But I don't want it to be searched on. Let's say the user name is giraffe. I do want the filter to be where CreatedBy = giraffe, but when the user searches for his name I want only documents with the name Giraffe. Since it is indexed, wouldn't it return all rows created by him? Thanks. On Tue, Jul 23, 2013 at 4:28 PM, Raymond Wiker rwi...@gmail.com wrote: Simple: the field needs to be indexed in order to search (or filter) on it. On Tue, Jul 23, 2013 at 3:26 PM, Mysurf Mail stammail...@gmail.com wrote: I want to restrict the returned results to only the documents that were created by the user. So I load the CreatedBy attribute into the index and set it to indexed="false", stored="true": <field name="CreatedBy" type="string" indexed="false" stored="true" required="true"/> Then I want to filter by CreatedBy, so in the dashboard I check edismax and add CreatedBy:user1 to the qf field. The resulting query is http:// :8983/solr/vault/select?q=*%3A*&defType=edismax&qf=CreatedBy%3Auser1 Nothing is filtered; all rows are returned. What am I doing wrong?
Re:
Can anyone remove this spammer please? On Tue, Jul 23, 2013 at 4:47 AM, wired...@yahoo.com wrote: Hi! http://mackieprice.org/cbs.com.network.html
Re: filter query result by user
I am probably using it wrong. http:// ...:8983/solr/vault10k/select?q=*%3A*&defType=edismax&qf=CreatedBy%BLABLA returns all rows. It neglects my qf filter. Should I even use qf for filtering with edismax? (It doesn't say that in the doc http://wiki.apache.org/solr/ExtendedDisMax#qf_.28Query_Fields.29) On Tue, Jul 23, 2013 at 4:32 PM, Mysurf Mail stammail...@gmail.com wrote: But I don't want it to be searched on. Let's say the user name is giraffe. I do want the filter to be where CreatedBy = giraffe, but when the user searches for his name I want only documents with the name Giraffe. Since it is indexed, wouldn't it return all rows created by him? Thanks. On Tue, Jul 23, 2013 at 4:28 PM, Raymond Wiker rwi...@gmail.com wrote: Simple: the field needs to be indexed in order to search (or filter) on it. On Tue, Jul 23, 2013 at 3:26 PM, Mysurf Mail stammail...@gmail.com wrote: I want to restrict the returned results to only the documents that were created by the user. So I load the CreatedBy attribute into the index and set it to indexed="false", stored="true": <field name="CreatedBy" type="string" indexed="false" stored="true" required="true"/> Then I want to filter by CreatedBy, so in the dashboard I check edismax and add CreatedBy:user1 to the qf field. The resulting query is http:// :8983/solr/vault/select?q=*%3A*&defType=edismax&qf=CreatedBy%3Auser1 Nothing is filtered; all rows are returned. What am I doing wrong?
Re: filter query result by user
Hi, Use fq, not qf. It needs to be indexed. Filtering is like searching without scoring. Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Tue, Jul 23, 2013 at 9:39 AM, Mysurf Mail stammail...@gmail.com wrote: I am probably using it wrong. http:// ...:8983/solr/vault10k/select?q=*%3A*&defType=edismax&qf=CreatedBy%BLABLA returns all rows. It neglects my qf filter. Should I even use qf for filtering with edismax? (It doesn't say that in the doc http://wiki.apache.org/solr/ExtendedDisMax#qf_.28Query_Fields.29) On Tue, Jul 23, 2013 at 4:32 PM, Mysurf Mail stammail...@gmail.com wrote: But I don't want it to be searched on. Let's say the user name is giraffe. I do want the filter to be where CreatedBy = giraffe, but when the user searches for his name I want only documents with the name Giraffe. Since it is indexed, wouldn't it return all rows created by him? Thanks. On Tue, Jul 23, 2013 at 4:28 PM, Raymond Wiker rwi...@gmail.com wrote: Simple: the field needs to be indexed in order to search (or filter) on it. On Tue, Jul 23, 2013 at 3:26 PM, Mysurf Mail stammail...@gmail.com wrote: I want to restrict the returned results to only the documents that were created by the user. So I load the CreatedBy attribute into the index and set it to indexed="false", stored="true": <field name="CreatedBy" type="string" indexed="false" stored="true" required="true"/> Then I want to filter by CreatedBy, so in the dashboard I check edismax and add CreatedBy:user1 to the qf field. The resulting query is http:// :8983/solr/vault/select?q=*%3A*&defType=edismax&qf=CreatedBy%3Auser1 Nothing is filtered; all rows are returned. What am I doing wrong?
Re: filter query result by user
There is no such thing as a qf filter - qf is simply a list of names of fields to search for the terms from the query, q, as well as boost factors. Filtering is done with filter queries - fq. -- Jack Krupansky -Original Message- From: Mysurf Mail Sent: Tuesday, July 23, 2013 9:39 AM To: solr-user@lucene.apache.org Subject: Re: filter query result by user I am probably using it wrong. http:// ...:8983/solr/vault10k/select?q=*%3A*&defType=edismax&qf=CreatedBy%BLABLA returns all rows. It neglects my qf filter. Should I even use qf for filtering with edismax? (It doesn't say that in the doc http://wiki.apache.org/solr/ExtendedDisMax#qf_.28Query_Fields.29) On Tue, Jul 23, 2013 at 4:32 PM, Mysurf Mail stammail...@gmail.com wrote: But I don't want it to be searched on. Let's say the user name is giraffe. I do want the filter to be where CreatedBy = giraffe, but when the user searches for his name I want only documents with the name Giraffe. Since it is indexed, wouldn't it return all rows created by him? Thanks. On Tue, Jul 23, 2013 at 4:28 PM, Raymond Wiker rwi...@gmail.com wrote: Simple: the field needs to be indexed in order to search (or filter) on it. On Tue, Jul 23, 2013 at 3:26 PM, Mysurf Mail stammail...@gmail.com wrote: I want to restrict the returned results to only the documents that were created by the user. So I load the CreatedBy attribute into the index and set it to indexed="false", stored="true": <field name="CreatedBy" type="string" indexed="false" stored="true" required="true"/> Then I want to filter by CreatedBy, so in the dashboard I check edismax and add CreatedBy:user1 to the qf field. The resulting query is http:// :8983/solr/vault/select?q=*%3A*&defType=edismax&qf=CreatedBy%3Auser1 Nothing is filtered; all rows are returned. What am I doing wrong?
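Putting the thread's advice together, a minimal SolrJ sketch, assuming CreatedBy has been reindexed with indexed="true" and that the URL and user are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FilterByUser {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/vault");
        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery("CreatedBy:user1"); // fq: filters without affecting scoring
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getResults().getNumFound());
    }
}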
Re: zkHost in solr.xml goes missing after SPLITSHARD using Collections API
Can you try upgrading to the just-released 4.4? Solr.xml persistence had all kinds of bugs in 4.3, which should have been fixed now. Alan Woodward www.flax.co.uk On 23 Jul 2013, at 13:36, Ali, Saqib wrote: Hello all, Every time I issue a SPLITSHARD using Collections API, the zkHost attribute in the solr.xml goes missing. I have to manually edit the solr.xml to add zkHost after every SPLITSHARD. Any thoughts on what could be causing this? Thanks.
Re: solr - Deleting a row from the index, using the configuration files only.
Did you look at: *) $deleteDocById *) $deleteDocByQuery *) deletedPkQuery Just search for delete on https://wiki.apache.org/solr/DataImportHandler If you tried all of those, maybe you need to explain your problem in more specific details. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Tue, Jul 23, 2013 at 8:18 AM, Mysurf Mail stammail...@gmail.com wrote: I am updating my solr index using deltaQuery and deltaImportQuery attributes in data-config.xml. In my condition I write where MyDoc.LastModificationTime '${dataimporter.last_index_time}' then after I add a row I trigger an update using data-config.xml. Now, sometimes I delete a row. How can I implement this with configuration files only (without sending a delete rest command to solr ). Lets say my object is not deleted but its status is changed to deleted. I dont index that status field, as I want to hold only the live rows. (otherwise I could have just filtered it) Is there a way to do it? thanks.
Re: Document Similarity Algorithm at Solr/Lucene
One classic approach is to simply use the full text of the suspect text as well as bigrams and trigrams (phrases) from that text with OR operators. The top results will be the documents that most closely match the subject text. That provides a visual set of similar results. You will then have to apply some heuristic of your own as far as how many top results to look at or what score to cut off at. The use of OR operators assures that similar documents will be found even if not 100% of the words are used. Yes, OR guarantees that your total result count will be high, but scoring assures that the top results will be more relevant. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Tuesday, July 23, 2013 6:16 AM To: solr-user@lucene.apache.org Subject: Re: Document Similarity Algorithm at Solr/Lucene Actually I need a specialized algorithm. I want to use that algorithm to detect duplicate blog posts. 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com Hi, you may leverage and/or improve the MLT component [1]. HTH, Tommaso [1]: http://wiki.apache.org/solr/MoreLikeThis 2013/7/23 Furkan KAMACI furkankam...@gmail.com Hi; sometimes a huge part of a document may exist in another document, as in student plagiarism or the quotation of one blog post in another. Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class to detect this?
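A rough sketch of this approach, building an OR query from unigrams plus quoted bigrams of the suspect text (trigrams extend the same loop); the field name is hypothetical:

import org.apache.solr.client.solrj.SolrQuery;

public class SimilarityQueryBuilder {
    public static SolrQuery build(String suspectText) {
        String[] words = suspectText.toLowerCase().split("\\W+");
        StringBuilder q = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            q.append(words[i]).append(' ');              // unigram
            if (i + 1 < words.length) {                  // quoted bigram (phrase)
                q.append('"').append(words[i]).append(' ')
                 .append(words[i + 1]).append("\" ");
            }
        }
        SolrQuery query = new SolrQuery(q.toString().trim());
        query.set("df", "content"); // hypothetical default search field
        query.set("q.op", "OR");    // OR so partial overlap still matches
        return query;
    }
}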
Re: how number of indexed fields effect performance
Do you need all of the fields loaded every time and are they stored? Maybe there is a document with gigantic content that you don't actually need but it gets deserialized anyway. Try lazy loading setting: enableLazyFieldLoading in solrconfig.xml Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Tue, Jul 23, 2013 at 12:36 AM, Jack Krupansky j...@basetechnology.comwrote: After restarting Solr and doing a couple of queries to warm the caches, are queries already slow/failing, or does it take some time and a number of queries before failures start occurring? One possibility is that you just need a lot more memory for caches for this amount of data. So, maybe the failures are caused by heavy garbage collections. So, after restarting Solr, check how much Java heap is available, then do some warming queries, then check the Java heap available again. Add the debugQuery=true parameter to your queries and look at the timings to see what phases of query processing are taking the most time. Also check whether the reported QTime seems to match actual wall clock time; sometimes formatting of the results and network transfer time can dwarf actual query time. How many fields are you returning on a typical query? -- Jack Krupansky -Original Message- From: Suryansh Purwar Sent: Monday, July 22, 2013 11:06 PM To: solr-user@lucene.apache.org ; j...@basetechnology.com Subject: how number of indexed fields effect performance It was running fine initially when we just had around 100 fields indexed. In this case as well it runs fine but after sometime broken pipe exception starts coming which results in shard getting down. Regards, Suryansh On Tuesday, July 23, 2013, Jack Krupansky wrote: Was all of this running fine previously and only started running slow recently, or is this your first measurement? Are very simple queries (single keyword, no filters or facets or sorting or anything else, and returning only a few fields) working reasonably well? -- Jack Krupansky -Original Message- From: Suryansh Purwar Sent: Monday, July 22, 2013 4:07 PM To: solr-user@lucene.apache.org Subject: how number of indexed fields effect performance Hi, We have a two shard solrcloud cluster with each shard allocated 3 separate machines. We do complex queries involving a number of filter queries coupled with group queries and faceting. All of our machines are 64 bit with 32 gb ram. Our index size is around 10gb with around 8,00,000 documents. We have around 1000 indexed fields per document. 6gb of memeory is allocated to tomcat under which solr is running on each of the six machines. We have a zookeeper ensemble consisting of 3 zookeeper instances running on 3 of the six machines with 4gb memory allocated to each of the zookeeper instance. First solr start taking too much time with Broken pipe exception because of timeout from client side coming again and again, then after sometime a whole shard goes down with one machine at at time followed by other machines. Is having 1000 fields indexed with each document resulting in this problem? If it is so, what would be the ideal number of indexed fields in such environment. Regards, Suryansh
Re: Document Similarity Algorithm at Solr/Lucene
On 7/23/2013 3:33 AM, Furkan KAMACI wrote: Sometimes a huge part of a document may exist in another document. As like in student plagiarism or quotation of a blog post at another blog post. Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) has any class to detect it? Solr is designed for search, not heavy analysis. It might be possible, as Tommaso suggested, to take the MoreLikeThis functionality from Solr and adapt it to this use case, but this isn't really something Solr was designed to do. If you did use MoreLikeThis out of the box, the most it could do is show you similar documents to a specific document, but then you'd have to do your own actual comparison. Solr would not be able to tell you whether it's copied, just that it's similar. Also, it would not be able to easily and quickly do a full comparison across a huge number of documents. You'd be much better off with a tool specifically designed for the purpose. Perhaps Solr's MoreLikeThis capability would be something you could use in creating such a tool, but I couldn't say. Thanks, Shawn
Collection not current after insert
Hi there, My Solr is being fed by Fedora GSearch and when uploading a new resource, the Collection is optimized but not current so the new resource can't be found. I have to go to the Core Admin page and Optimize it from there, in order to make the collection current. Is there anything I should look for to see what the problem is? This is the comms to solr when inserting: DEBUG 2013-07-23 13:27:37,023 (OperationsImpl) resultXml = solrUpdateIndex indexName=FgsIndex insertedltk:13000116/inserted counts insertTotal=1 updateTotal=0 deleteTotal=0 emptyTotal=0 docCount=854 warnCount=0/ /solrUpdateIndex DEBUG 2013-07-23 13:27:37,023 (GTransformer) xsltName=fgsconfigFinal/index/FgsIndex/updateIndexToResultPage DEBUG 2013-07-23 13:27:37,027 (GTransformer) getTransformer transformer=org.apache.xalan.transformer.TransformerImpl@6561b973 uriResolver=null DEBUG 2013-07-23 13:27:37,028 (GenericOperationsImpl) resultXml=?xml version=1.0 encoding=UTF-8? resultPage operation=updateIndex action=fromPid value=ltk:13000116 repositoryName=FgsRepos indexNames= resultPageXslt= dateTime=Tue Jul 23 13:27:36 UTC 2013 updateIndex xmlns:dc=http://purl.org/dc/elements/1.1/; xmlns:foxml=info:fedora/fedora-system:def/foxml# xmlns:zs=http://www.loc.gov/zing/srw/; warnCount=0 docCount=854 deleteTotal=0 updateTotal=0 insertTotal=1 indexName=FgsIndex/ /resultPage INFO 2013-07-23 13:27:37,028 (UpdateListener) Index updated by notification message, returning: ?xml version=1.0 encoding=UTF-8? resultPage operation=updateIndex action=fromPid value=ltk:13000116 repositoryName=FgsRepos indexNames= resultPageXslt= dateTime=Tue Jul 23 13:27:36 UTC 2013 updateIndex xmlns:dc=http://purl.org/dc/elements/1.1/; xmlns:foxml=info:fedora/fedora-system:def/foxml# xmlns:zs=http://www.loc.gov/zing/srw/; warnCount=0 docCount=854 deleteTotal=0 updateTotal=0 insertTotal=1 indexName=FgsIndex/ /resultPage thanks, Alistair -- mov eax,1 mov ebx,0 int 80h
Re: deserializing highlighting json result
The JSON keys within the highlighting object are the document IDs, and then the keys within those objects are the highlighted field names. Again, I repeat my question: Exactly why is it difficult to deserialize? Seems simple enough. -- Jack Krupansky -Original Message- From: Mysurf Mail Sent: Tuesday, July 23, 2013 1:48 AM To: solr-user@lucene.apache.org Subject: Re: deserializing highlighting json result the guid appears as the attribute name and not as id:baf8434a-99a4-4046-8a4d-2f7ec09eafc8. Trying to create an object that holds this guid will create an attribute with name baf8434a-99a4-4046-8a4d-2f7ec09eafc8 On Mon, Jul 22, 2013 at 6:30 PM, Jack Krupansky j...@basetechnology.com wrote: Exactly why is it difficult to deserialize? Seems simple enough. -- Jack Krupansky -Original Message- From: Mysurf Mail Sent: Monday, July 22, 2013 11:14 AM To: solr-user@lucene.apache.org Subject: deserializing highlighting json result When I request a JSON result I get the following structure in the highlighting:

{"highlighting":{
  "394c65f1-dfb1-4b76-9b6c-2f14c9682cc9":{"PackageName":["- <em>Testing</em> channel twenty."]},
  "baf8434a-99a4-4046-8a4d-2f7ec09eafc8":{"PackageName":["- <em>Testing</em> channel twenty."]},
  "0a699062-cd09-4b2e-a817-330193a352c1":{"PackageName":["- <em>Testing</em> channel twenty."]},
  "0b9ec891-5ef8-4085-9de2-38bfa9ea327e":{"PackageName":["- <em>Testing</em> channel twenty."]}}}

It is difficult to deserialize this JSON because the guid is in the attribute name. Is that solvable (using C#)?
Re: Appending *-wildcard suffix on all terms for querying: move logic from client to server side
Thanks Mikhail, I'll go for your EdgeNGramTokenFilter suggestion. - Kind regards, Paul
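For later readers, a sketch of the index-time edge-ngram approach (field name and gram sizes are illustrative): prefixes of each token are indexed, so a plain query term matches without any trailing wildcard.

  <fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- index every prefix of each token up to 15 chars -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>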
Re: Start independent Zookeeper from within Solr install
Curious what the use case is for this? Zookeeper is not an HTTP service so loading it in Jetty by itself doesn't really make sense. I also think this creates more work for the Solr team especially since setting up a production ensemble shouldn't take more than a few minutes once you have the nodes provisioned. On Tue, Jul 23, 2013 at 7:05 AM, Upayavira u...@odoko.co.uk wrote: Assumptions: * you currently have two choices to start Zookeeper: run it embedded within Solr, or download it from the ZooKeeper site and start it independently. * everything you need to run ZooKeeper (embedded or not) is included within the Solr distribution Assuming I've got the above right, then currently starting an embedded ZooKeeper is easy (-DzkRun), and starting an ensemble is irritatingly complex. So, my question is, how hard would it be to start Zookeeper without Solr, but from within the Solr codebase? -DensembleOnly or some such, causes Solr not to load, but Zookeeper still starts. I'm assuming that Jetty would still listen on port 8983, but it wouldn't initialise the Solr webapp: java -DzkRun -DzkEnsembleOnly -DzkHosts=zkhost01:9983,zkhost02:9983,zkhost03:9983 -jar start.jar Is this possible? If it is, I'm happy to have a go at making it happen. Upayavira
Re: zkHost in solr.xml goes missing after SPLITSHARD using Collections API
On 7/23/2013 7:50 AM, Alan Woodward wrote: Can you try upgrading to the just-released 4.4? Solr.xml persistence had all kinds of bugs in 4.3, which should have been fixed now. The 4.4.0 release has been finalized and uploaded, but the download link hasn't been changed yet because the mirror network isn't fully synchronized yet. It is available from many mirrors, but until the website download links get changed, there's not yet a direct way to access it. Here are some generic instructions for situations where the new version is done, but the official announcement isn't out yet: http://lucene.apache.org/solr/ 1) Go to the Solr website (URL above) and click on the latest version download button, which at this moment is 4.3.1. Wait for the redirect to take you to a mirror list. 2) Click on one of the mirrors; the best option is usually the one right on top that the website chose for you. 3) When the file list comes up, click the Parent Directory link. If this isn't showing, it will most likely be labelled with .. instead. 4) If a directory for the new version (in this case 4.4.0) is listed, click on it and then click the file that you want to download. If the new version is not listed, click the Back button on your browser twice, then go back to step 2, but this time choose a different mirror. One last reminder: This only works right before a release is officially announced. These instructions cannot be used while a release is still in development. Thanks, Shawn
Re: Document Similarity Algorithm at Solr/Lucene
if you need a specialized algorithm for detecting blog post plagiarism / quotations (which are different tasks IMHO) I think you have 2 options: 1. implement a dedicated one based on your features / metrics / domain 2. try to fine-tune an existing algorithm that is flexible enough If I were to do it with Solr I'd probably do something like: 1. index original blog posts in Solr (possibly using Jack's suggestion about ngrams / shingles) 2. do MLT queries with the text of candidate copies 3. get the first, say, 2-3 hits 4. mark it as quote / plagiarism 5. eventually train a classifier to help you mark other texts as quote / plagiarism HTH, Tommaso 2013/7/23 Furkan KAMACI furkankam...@gmail.com Actually I need a specialized algorithm. I want to use that algorithm to detect duplicate blog posts. 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com Hi, you may leverage and/or improve the MLT component [1]. HTH, Tommaso [1] : http://wiki.apache.org/solr/MoreLikeThis 2013/7/23 Furkan KAMACI furkankam...@gmail.com Hi; Sometimes a huge part of a document may exist in another document, as in student plagiarism or quotation of a blog post in another blog post. Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class to detect it?
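To make step 2 concrete, an out-of-the-box MLT request might look like this (core URL and field names are illustrative, not from this thread):

  http://localhost:8983/solr/select?q=id:post123&mlt=true&mlt.fl=content&mlt.mintf=1&mlt.mindf=2&mlt.count=3

The top few documents in the moreLikeThis section of the response are then the candidates for steps 3-4.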
Re: Document Similarity Algorithm at Solr/Lucene
Thanks for your comments. 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com if you need a specialized algorithm for detecting blog post plagiarism / quotations (which are different tasks IMHO) I think you have 2 options: 1. implement a dedicated one based on your features / metrics / domain 2. try to fine-tune an existing algorithm that is flexible enough If I were to do it with Solr I'd probably do something like: 1. index original blog posts in Solr (possibly using Jack's suggestion about ngrams / shingles) 2. do MLT queries with the text of candidate copies 3. get the first, say, 2-3 hits 4. mark it as quote / plagiarism 5. eventually train a classifier to help you mark other texts as quote / plagiarism HTH, Tommaso 2013/7/23 Furkan KAMACI furkankam...@gmail.com Actually I need a specialized algorithm. I want to use that algorithm to detect duplicate blog posts. 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com Hi, you may leverage and/or improve the MLT component [1]. HTH, Tommaso [1] : http://wiki.apache.org/solr/MoreLikeThis 2013/7/23 Furkan KAMACI furkankam...@gmail.com Hi; Sometimes a huge part of a document may exist in another document, as in student plagiarism or quotation of a blog post in another blog post. Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class to detect it?
WikipediaTokenizer for Removing Unnecessary Parts
Hi; I have indexed wikipedia data with Solr DIH. However, when I look at the data indexed in Solr I see something like this as well: {| style=text-align: left; width: 50%; table-layout: fixed; border=0 |- valign=top | style=width: 50%| :*[[Ubuntu]] :*[[Fedora]] :*[[Mandriva]] :*[[Linux Mint]] :*[[Debian]] :*[[OpenSUSE]] | *[[Red Hat]] *[[Mageia]] *[[Arch Linux]] *[[PCLinuxOS]] *[[Slackware]] |} However, I want to remove such markup before indexing. I know that there is a WikipediaTokenizer in Lucene, but how can I remove the unnecessary parts (such as links, styles, etc.) with Solr?
Re: Document Similarity Algorithm at Solr/Lucene
Here is a paper that I found useful: http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI furkankam...@gmail.com wrote: Thanks for your comments. 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com if you need a specialized algorithm for detecting blog post plagiarism / quotations (which are different tasks IMHO) I think you have 2 options: 1. implement a dedicated one based on your features / metrics / domain 2. try to fine-tune an existing algorithm that is flexible enough If I were to do it with Solr I'd probably do something like: 1. index original blog posts in Solr (possibly using Jack's suggestion about ngrams / shingles) 2. do MLT queries with the text of candidate copies 3. get the first, say, 2-3 hits 4. mark it as quote / plagiarism 5. eventually train a classifier to help you mark other texts as quote / plagiarism HTH, Tommaso 2013/7/23 Furkan KAMACI furkankam...@gmail.com Actually I need a specialized algorithm. I want to use that algorithm to detect duplicate blog posts. 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com Hi, you may leverage and/or improve the MLT component [1]. HTH, Tommaso [1] : http://wiki.apache.org/solr/MoreLikeThis 2013/7/23 Furkan KAMACI furkankam...@gmail.com Hi; Sometimes a huge part of a document may exist in another document, as in student plagiarism or quotation of a blog post in another blog post. Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class to detect it?
Re: how number of indexed fields affects performance
There was also a bug in the lazy loading of multivalued fields at one point recently in Solr 4.2: https://issues.apache.org/jira/browse/SOLR-4589 (4.x + enableLazyFieldLoading + large multivalued fields + varying fl = pathological CPU load & response time). Do you use multivalued fields very heavily? I'm still not ready to suggest that 1,000 fields is an okay thing to do, but there are still plenty of nuances in Solr performance that could explain the difficulties, before we even get to the 1,000 field issue itself. The real bottom line is that as you increase field count, there are lots of other aspects of Solr memory and performance degradation that increase as well. Some of those factors can be dealt with simply with more memory, more and faster CPU cores, or even more sharding, or other tuning, but not necessarily all of them. I think that I am already on the record on other threads as suggesting that a couple hundred is about the limit for field count for a slam-dunk use of Solr. That doesn't mean you can't go above a couple hundred fields, just that you are in uncharted territory and may need to take extraordinary measures to get everything working satisfactorily. There's no magic hard limit, just a general sense that smaller numbers of fields are like a walk in a park, while higher numbers of fields are like chopping through a jungle. We each have our own threshold for... adventure. We need answers to the previous questions we raised before we can analyze this a lot further. Oh, and make sure there is enough OS system memory available for caching of the index pages. Sometimes, it is little things like this that can crush Solr performance. Unfortunately, Solr is not a packaged solution that automatically and magically auto-configures everything to work just right. Instead, it is a powerful toolkit that lets you do amazing things, but you the developer/architect need to supply amazing intelligence, wisdom, foresight, and insight to get it (and its hardware and software environment) to do those amazing things. -- Jack Krupansky -Original Message- From: Alexandre Rafalovitch Sent: Tuesday, July 23, 2013 9:54 AM To: solr-user@lucene.apache.org Subject: Re: how number of indexed fields affects performance Do you need all of the fields loaded every time, and are they stored? Maybe there is a document with gigantic content that you don't actually need, but it gets deserialized anyway. Try the lazy loading setting: enableLazyFieldLoading in solrconfig.xml. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Tue, Jul 23, 2013 at 12:36 AM, Jack Krupansky j...@basetechnology.com wrote: After restarting Solr and doing a couple of queries to warm the caches, are queries already slow/failing, or does it take some time and a number of queries before failures start occurring? One possibility is that you just need a lot more memory for caches for this amount of data. So, maybe the failures are caused by heavy garbage collections. So, after restarting Solr, check how much Java heap is available, then do some warming queries, then check the Java heap available again. Add the debugQuery=true parameter to your queries and look at the timings to see what phases of query processing are taking the most time.
Also check whether the reported QTime seems to match actual wall clock time; sometimes formatting of the results and network transfer time can dwarf actual query time. How many fields are you returning on a typical query? -- Jack Krupansky -Original Message- From: Suryansh Purwar Sent: Monday, July 22, 2013 11:06 PM To: solr-user@lucene.apache.org ; j...@basetechnology.com Subject: how number of indexed fields affects performance It was running fine initially when we just had around 100 fields indexed. In this case as well it runs fine, but after some time a broken pipe exception starts coming, which results in the shard going down. Regards, Suryansh On Tuesday, July 23, 2013, Jack Krupansky wrote: Was all of this running fine previously and only started running slow recently, or is this your first measurement? Are very simple queries (single keyword, no filters or facets or sorting or anything else, and returning only a few fields) working reasonably well? -- Jack Krupansky -Original Message- From: Suryansh Purwar Sent: Monday, July 22, 2013 4:07 PM To: solr-user@lucene.apache.org Subject: how number of indexed fields affects performance Hi, We have a two-shard SolrCloud cluster, with each shard allocated 3 separate machines. We do complex queries involving a number of filter queries coupled with group queries and faceting. All of our machines are 64-bit with 32 GB RAM. Our index size is around 10 GB, with around 800,000
Re: WikipediaTokenizer for Removing Unnecessary Parts
If you use the WikipediaTokenizer it will tag different wiki elements with different types (you can see them in the admin UI), so follow up with a TypeTokenFilter to keep only the types you care about, and I think it will do what you want. On Tue, Jul 23, 2013 at 7:53 AM, Furkan KAMACI furkankam...@gmail.com wrote: Hi; I have indexed wikipedia data with Solr DIH. However, when I look at the data indexed in Solr I see something like this as well: {| style=text-align: left; width: 50%; table-layout: fixed; border=0 |- valign=top | style=width: 50%| :*[[Ubuntu]] :*[[Fedora]] :*[[Mandriva]] :*[[Linux Mint]] :*[[Debian]] :*[[OpenSUSE]] | *[[Red Hat]] *[[Mageia]] *[[Arch Linux]] *[[PCLinuxOS]] *[[Slackware]] |} However, I want to remove such markup before indexing. I know that there is a WikipediaTokenizer in Lucene, but how can I remove the unnecessary parts (such as links, styles, etc.) with Solr?
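A sketch of the analyzer chain Robert describes (the whitelist file name is made up, and the exact TypeTokenFilterFactory attributes are worth double-checking against your Solr version):

  <fieldType name="text_wiki" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <!-- tags each token with a type such as "il" (internal link), "cl" (category link), etc. -->
      <tokenizer class="solr.WikipediaTokenizerFactory"/>
      <!-- keep only the token types listed in wiki_keep_types.txt, dropping table/style markup tokens -->
      <filter class="solr.TypeTokenFilterFactory" types="wiki_keep_types.txt" useWhitelist="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>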
Re: WikipediaTokenizer for Removing Unnecessary Parts
Are you actually seeing that output from the WikipediaTokenizerFactory?? Really? Even if you use the Solr Admin UI analysis page? You should just see the text tokens plus the URLs for links. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Tuesday, July 23, 2013 10:53 AM To: solr-user@lucene.apache.org Subject: WikipediaTokenizer for Removing Unnecessary Parts Hi; I have indexed wikipedia data with Solr DIH. However, when I look at the data indexed in Solr I see something like this as well: {| style=text-align: left; width: 50%; table-layout: fixed; border=0 |- valign=top | style=width: 50%| :*[[Ubuntu]] :*[[Fedora]] :*[[Mandriva]] :*[[Linux Mint]] :*[[Debian]] :*[[OpenSUSE]] | *[[Red Hat]] *[[Mageia]] *[[Arch Linux]] *[[PCLinuxOS]] *[[Slackware]] |} However, I want to remove such markup before indexing. I know that there is a WikipediaTokenizer in Lucene, but how can I remove the unnecessary parts (such as links, styles, etc.) with Solr?
Re: softCommit doesn't work - ?
Thanks for your comment Eric. When I use *server.add(doc);* everything is fine (but it takes a long time to hard commit every single doc), so I am sure docs are uniquely indexed. Maybe I shouldn't call *server.commit();* at all from SolrJ code, so Solr would use the autoCommit/autoSoftCommit configuration defined in solrconfig.xml? Maybe there are some bits missing? -- View this message in context: http://lucene.472066.n3.nabble.com/softCommit-doesn-t-work-tp4079578p4079772.html Sent from the Solr - User mailing list archive at Nabble.com.
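For reference, the solrconfig.xml settings in question look like this (the times are illustrative; openSearcher=false keeps the frequent hard commits from paying the searcher-reopen cost, while the soft commit controls visibility):

  <autoCommit>
    <maxTime>60000</maxTime>          <!-- hard commit (durability) at most once a minute -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>           <!-- soft commit (visibility) at most once a second -->
  </autoSoftCommit>

With these in place, the SolrJ client can simply call server.add(doc) and skip server.commit() entirely.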
Re: XInclude and Document Entity not working on schema.xml
Hello Chris, Thank you for your help. I checked differences between my files and your test files but I didn't find bugs in my files. All my files are in the same directory: collection1/conf

schema.xml content:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE schema [
  <!ENTITY commonschema_types SYSTEM "commonschema_types.xml">
  <!ENTITY commonschema_others SYSTEM "commonschema_others.xml">
]>
<schema name="searchSolrSchema" version="1.5">
  <types>
    <fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
      <!-- FR : french -->
      <!-- least aggressive stemming -->
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="com.kelkoo.search.solr.plugins.stemmer.fr.KelkooFrenchMinimalStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="com.kelkoo.search.solr.plugins.stemmer.fr.KelkooFrenchMinimalStemFilterFactory"/>
      </analyzer>
    </fieldType>
    &commonschema_types;
  </types>
  &commonschema_others;
</schema>

commonschema_types.xml content:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
<!-- int is for exact ids, works with grouped=true and distrib=true -->
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" sortMissingLast="true" omitNorms="true" positionIncrementGap="0"/>
<!-- tint is for numbers that need sorting and/or range queries (precisionStep=4 has better performance than precisionStep=8) and that do *not* need grouping (grouping does not work in distrib=true for tint) -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="4" sortMissingLast="true" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="byte" class="solr.ByteField" omitNorms="true"/>
<fieldType name="float" class="solr.TrieFloatField" sortMissingLast="true" omitNorms="true"/>
<!-- A general text field which tokenizes with StandardTokenizer. omitNorms=true means the (index time) lengthNorm will be the same whatever the number of tokens. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

The &commonschema_others; include works. Do you see something wrong? Unfortunately I cannot use the 4.3.0 version because I'm using the solr.xml sharedLib, which does not work in 4.3.0 (cf. https://issues.apache.org/jira/browse/SOLR-4791). Where can I find the newly voted 4.4? I have this bug with the nightly 4.5-2013-07-18_06-04-44 found here https://builds.apache.org/job/Solr-Artifacts-4.x/lastSuccessfulBuild/artifact/solr/package/ (the 18th of July). Elodie Sannier Kelkoo SAS Société par Actions Simplifiée Au capital de € 4.168.964,30 Siège social : 8, rue du Sentier 75002 Paris 425 093 069 RCS Paris This message and its attachments are confidential and intended solely for their addressees. If you are not the intended recipient of this message, please delete it and notify the sender.
Re: WikipediaTokenizer for Removing Unnecessary Parts
Here is my fieldtype:

<fieldType name="text_tr" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WikipediaTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_tr.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WikipediaTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_tr.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_tr.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

My input at the analysis section of the Solr admin page: {| style=text-align: left; width: 50%; table-layout: fixed; border=0 |- valign=top | style=width: 50%| :*[[Ubuntu]] :*[[Fedora]] :*[[Mandriva]] :*[[Linux Mint]] :*[[Debian]] :*[[OpenSUSE]] | *[[Red Hat]] *[[Mageia]] *[[Arch Linux]] *[[PCLinuxOS]] *[[Slackware]] |} and the output: WT styletextalignleft width50tablelayoutfixed border0valigntop stylewidth50UbuntuFedora MandrivaLinuxMintDebian OpenSUSERedHatMageiaArch Linux PCLinuxOSSlackware SF styletextalignleft width50tablelayoutfixed border0valigntop stylewidth50UbuntuFedora MandrivaLinuxMintDebian OpenSUSERedHatMageiaArch Linux PCLinuxOSSlackware LCF styletextalignleft width50tablelayoutfixed border0valigntop stylewidth50 ubuntufedora mandrivalinuxmintdebian opensuseredhatmageiaarch linuxpclinuxosslackware Any ideas? 2013/7/23 Jack Krupansky j...@basetechnology.com Are you actually seeing that output from the WikipediaTokenizerFactory?? Really? Even if you use the Solr Admin UI analysis page? You should just see the text tokens plus the URLs for links. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Tuesday, July 23, 2013 10:53 AM To: solr-user@lucene.apache.org Subject: WikipediaTokenizer for Removing Unnecessary Parts Hi; I have indexed wikipedia data with Solr DIH. However, when I look at the data indexed in Solr I see something like this as well: {| style=text-align: left; width: 50%; table-layout: fixed; border=0 |- valign=top | style=width: 50%| :*[[Ubuntu]] :*[[Fedora]] :*[[Mandriva]] :*[[Linux Mint]] :*[[Debian]] :*[[OpenSUSE]] | *[[Red Hat]] *[[Mageia]] *[[Arch Linux]] *[[PCLinuxOS]] *[[Slackware]] |} However, I want to remove such markup before indexing. I know that there is a WikipediaTokenizer in Lucene, but how can I remove the unnecessary parts (such as links, styles, etc.) with Solr?
Re:
: Can anyone remove this spammer please? The recent influx is not confined to a single user, or a single list. Nor is there a clear course of action just yet, since the senders in question are all legitimate subscribers who have been active members of the community. There is an open issue to track the recent surge and see if we can address this problem holistically before resorting to forcibly unsubscribing legitimate list members based on messages spoofed on their behalf... https://issues.apache.org/jira/browse/INFRA-6585 ...interested parties should watch that issue, and/or post comments there with any concrete technical suggestions for addressing the problem. -Hoss
RE: spellcheck and search in a same solr request
Solr doesn't support any kind of short-circuiting of the original query to return the results of the corrected query or collation. You just re-issue the corrected query in a second request. This would be a nice feature to add, though. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: smanad [mailto:sma...@gmail.com] Sent: Monday, July 22, 2013 6:29 PM To: solr-user@lucene.apache.org Subject: spellcheck and search in a same solr request Hey, Is there a way to do spellcheck and search (using suggestions returned from spellcheck) in a single Solr request? I am seeing that if my query is spelled correctly, I get results, but if misspelled, I just get suggestions. Any pointers will be very helpful. Thanks, -Manasi -- View this message in context: http://lucene.472066.n3.nabble.com/spellcheck-and-search-in-a-same-solr-request-tp4079571.html Sent from the Solr - User mailing list archive at Nabble.com.
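To illustrate the two-request pattern (the URLs are sketches; parameter names follow the standard spellcheck component): first send the user's raw query with spellchecking on,

  http://localhost:8983/solr/select?q=gren+laptp&spellcheck=true&spellcheck.collate=true

then, if numFound is 0 and the response contains a collation, re-issue that collation as the new q in a second request. The client decides when to substitute the correction; Solr will not do it on its own.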
RE: Use same spell check dictionary across different collections
DirectSolrSpellChecker does not prepare any kind of dictionary. It just uses the term dictionary from the indexed field. So what you are trying to do is impossible. You would think it would be possible with IndexBasedSpellChecker because it creates a dictionary as a sidecar lucene index. But it won't work either because it uses the term dictionary of the field the sidecar index was based on to get term frequencies. So both IndexBased- and DirectSolr- rely on the original field's term dictionary to work. You cannot use either to move one core's field data to another as a spellcheck dictionary. Possibly, you can create a Dictionary file based on the data that is in the field you want to use. Then you can use FileBasedSpellChecker. See http://wiki.apache.org/solr/FileBasedSpellChecker . James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: smanad [mailto:sma...@gmail.com] Sent: Monday, July 22, 2013 5:55 PM To: solr-user@lucene.apache.org Subject: Use same spell check dictionary across different collections I have 2 collections, lets say coll1 and coll2. I configured solr.DirectSolrSpellChecker in coll1 solrconfig.xml and works fine. Now, I want to configure coll2 solrconfig.xml to use SAME spell check dictionary index created above. (I do not want coll2 prepare its own dictionary index but just do spell check against the coll1 Spell dictionary index) Is it possible to do it? Tried out with IndexBasedSpellChecker but could not get it working. Any suggestions? Thanks, -Manasi -- View this message in context: http://lucene.472066.n3.nabble.com/Use-same-spell-check-dictionary-across-different-collections-tp4079566.html Sent from the Solr - User mailing list archive at Nabble.com.
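A minimal sketch of the FileBasedSpellChecker route (the file name and paths are illustrative; see the wiki page above for the authoritative parameter list):

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">file</str>
      <str name="classname">solr.FileBasedSpellChecker</str>
      <!-- plain text file, one term per line -->
      <str name="sourceLocation">spellings.txt</str>
      <str name="characterEncoding">UTF-8</str>
      <str name="spellcheckIndexDir">./spellcheckerFile</str>
    </lst>
  </searchComponent>

Both coll1 and coll2 can point at copies of the same spellings.txt, which gives a shared dictionary at the cost of regenerating that file whenever coll1's source data changes.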
[ANNOUNCE] Apache Solr 4.4 released
July 2013, Apache Solr™ 4.4 available The Lucene PMC is pleased to announce the release of Apache Solr 4.4 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 4.4 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html See the CHANGES.txt file included with the release for a full list of details. Solr 4.4 Release Highlights: * Solr indexes and transaction logs may be stored in HDFS with full read/write capability. * Schemaless mode: Added support for a mode that requires no up-front schema modifications, in which previously unknown fields' types are guessed based on the values in added/updated documents, and are then added to the schema prior to processing the update. Note that the below-described features are also useful independently from schemaless mode operation. * New Parse{Date,Integer,Long,Float,Double,Boolean}UpdateProcessorFactory classes parse/guess the field value class for String-valued and unknown fields. * New AddSchemaFieldsUpdateProcessor: Automatically add new fields to the schema when adding/updating documents with unknown fields. Custom rules map field value class(es) to schema fieldTypes. * A new schemaless mode example configuration, using the above-described field-value-class-guessing and unknown-field-schema-addition features, is provided at solr/example/example-schemaless/. * Core Discovery mode: A new solr.xml format which does not store core information, but instead searches for files named 'core.properties' in the filesystem which tell Solr all the details about that core. The main example and the schemaless example both use this new format. * Schema REST API: Add support for creating copy fields. * A merged segment warmer may now be plugged into solrconfig.xml. * New MaxScoreQParserPlugin: Return max() instead of sum() of terms. * Binary files are now supported in ZooKeeper. * SolrJ's SolrPing object has new methods for ping, enable, and disable. * The Admin UI now supports adding documents to Solr. * Added a PUT command to the Solr ZkCli tool. * New deleteshard collections API that unloads all replicas of a given shard and then removes it from the cluster state. It will remove only those shards which are INACTIVE or have no range. * The Overseer can now optionally assign generic node names so that new addresses can host shards without naming confusion. * The CSV Update Handler now supports optionally adding the line number/ row id to a document. * Added a new system wide info admin handler that exposes the system info that could previously only be retrieved using a SolrCore. Solr 4.4 also includes many other new features as well as numerous optimizations and bugfixes. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) In the coming days, we will also be announcing the first official Solr Reference Guide available for download.
In the meantime, users are encouraged to browse the online version and post comments and suggestions on the documentation: https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.
Re: Start independent Zookeeper from within Solr install
The use case is to prevent the necessity to download something else (zookeeper) when everything needed to run it is (likely) present in the Solr distribution already. Maybe we don't need to start Jetty, maybe we can start Zookeeper with an extra script in the Solr codebase. At present, if you are unfamiliar with ZooKeeper, getting it up and running can be a challenge (I've seen quite a few people fail at it during training scenarios). Upayavira On Tue, Jul 23, 2013, at 03:21 PM, Timothy Potter wrote: Curious what the use case is for this? Zookeeper is not an HTTP service so loading it in Jetty by itself doesn't really make sense. I also think this creates more work for the Solr team especially since setting up a production ensemble shouldn't take more than a few minutes once you have the nodes provisioned. On Tue, Jul 23, 2013 at 7:05 AM, Upayavira u...@odoko.co.uk wrote: Assumptions: * you currently have two choices to start Zookeeper: run it embedded within Solr, or download it from the ZooKeeper site and start it independently. * everything you need to run ZooKeeper (embedded or not) is included within the Solr distribution Assuming I've got the above right, then currently starting an embedded ZooKeeper is easy (-DzkRun), and starting an ensemble is irritatingly complex. So, my question is, how hard would it be to start Zookeeper without Solr, but from within the Solr codebase? -DensembleOnly or some such, causes Solr not to load, but Zookeeper still starts. I'm assuming that Jetty would still listen on port 8983, but it wouldn't initialise the Solr webapp: java -DzkRun -DzkEnsembleOnly -DzkHosts=zkhost01:9983,zkhost02:9983,zkhost03:9983 -jar start.jar Is this possible? If it is, I'm happy to have a go at making it happen. Upayavira
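For what it's worth, the pieces do all ship with Solr today, so a wrapper script could start a bare ZooKeeper along these lines (the jar path and version are illustrative and vary by release; this is a sketch of the idea, not a supported interface):

  java -cp example/solr-webapp/webapp/WEB-INF/lib/zookeeper-3.4.5.jar \
       org.apache.zookeeper.server.quorum.QuorumPeerMain zoo.cfg

where zoo.cfg is an ordinary ZooKeeper config file with dataDir, clientPort, and the server.N entries for the ensemble.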
Re: custom field type plugin
What are the dangers of trying to use a range of 10 billion? Simply a slower index time? Or will I get inaccurate results? I have tried it on a very small sample of documents, and it seemed to work. I could spend some time this week trying to get a more robust (and accurate) dataset loaded to play around with. The reason for the 10 billion is to support being able to query for a region on a chromosome. A user might want to know what genes overlap a point on a specific chromosome. Unless I can use 3-dimensional coordinates (which gave an error when I tried it), I'll need to multiply the coordinates by some offset for each chromosome to be able to normalise the data (at both index and query time). The largest chromosome (chr 1) has almost 250,000,000 base pairs. I could probably squeeze the rest a bit smaller, but I'd rather use one size for all chromosomes, since we have more than just human data to deal with. It would get quite messy otherwise. On 7/22/13 11:50 AM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: Like Hoss said, you're going to have to solve this using http://wiki.apache.org/solr/SpatialForTimeDurations Using PointType is *not* going to work because your durations are multi-valued per document. It would be useful to create a custom field type that wraps the capability outlined on the wiki to make it easier to use without requiring the user to think spatially. You mentioned that these numeric ranges extend upwards of 10 billion or so. Unfortunately, the current prefix tree implementation under the hood for non-geodetic spatial, the QuadTree, is unlikely to scale to numbers that big. I don't know where the boundary is, but I doubt 10B. You could try and see what happens. I'm working (very slowly on very little spare time) on improving the PrefixTree implementations to scale to such large numbers; I hope something will be available this fall. ~ David Smiley Kevin Stone wrote I have a particular use case that I think might require a custom field type, however I am having trouble getting the plugin to work. My use case has to do with genetics data, and we are running into several situations where we need to be able to query multiple regions of a chromosome (or gene, or other object types). All that really boils down to is being able to give a number, e.g. 10234, and return documents that have regions containing the number. So you'd have a document with a list like [1:16090,400:8000,40123:43564], and it should come back because 10234 falls between 1:16090. If there is a better or easier way to do this please speak up. I'd rather not have to use a join on another index, because 1) it's more complex to set up, and 2) we might need to join against something else and you can only do one join at a time. Anyway… I tried creating a field type similar to a PointType just to see if I could get one working. I added the following jars to get it to compile: apache-solr-core-4.0.0, lucene-core-4.0.0, lucene-queries-4.0.0, apache-solr-solrj-4.0.0. I am running solr 4.0.0 on jetty, and put my jar file in a sharedLib folder, and specified it in my solr.xml (I have multiple cores). After starting up solr, I got the line that it picked up the jar: INFO: Adding 'file:/blah/blah/lib/CustomPlugins.jar' to classloader But I get this error about it not being able to find the AbstractSubTypeFieldType class.
Here is the first bit of the trace: SEVERE: null:java.lang.NoClassDefFoundError: org/apache/solr/schema/AbstractSubTypeFieldType at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:791) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) ...etc… Any hints as to what I did wrong? I can provide source code, or a fuller stack trace, config settings, etc. Also, I did try to unpack the solr.war, stick my jar in WEB-INF/lib, then repack. However, when I did that, I get a NoClassDefFoundError for my plugin itself. Thanks, Kevin The information in this email, including attachments, may be confidential and is intended solely for the addressee(s). If you believe you received this email by mistake, please notify the sender by return email as soon as possible. - Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/custom-field-type-plugin-tp4079086p4079494.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: XInclude and Document Entity not working on schema.xml
Elodie: I just tested your configs (as close as I could get, since I don't have the com.kelkoo classes) using the current HEAD of the 4x branch and had no problems with the entity includes. What Java version/vendor are you using? Are you using the provided Jetty or your own servlet container? My best guess is that some combination of JVM version or servlet container implementation is subtly affecting the way the XML files are getting parsed, introducing the xml:base attribute in a way that isn't getting cleanly ignored by the code added to handle that in the issue you noted before ... but w/o being able to reproduce, that's just a guess. : Where can I find the newly voted 4.4? The release announcement just went live on the mailing list, and it's now the main download link on the website... https://lucene.apache.org/solr/downloads.html -Hoss
Re: Collection not current after insert
Hi Alistair, You probably need a commit, and not an optimize. Which version of Solr are you running against? The 4.0 releases have more complications, but generally sending a commit will do. Not sure if GSearch sends one, only partly because I never was able to make it work. :) Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions w: appinions.com http://www.appinions.com/ On Tue, Jul 23, 2013 at 9:57 AM, Alistair Young alistair.yo...@uhi.ac.ukwrote: Hi there, My Solr is being fed by Fedora GSearch and when uploading a new resource, the Collection is optimized but not current so the new resource can't be found. I have to go to the Core Admin page and Optimize it from there, in order to make the collection current. Is there anything I should look for to see what the problem is? This is the comms to solr when inserting: DEBUG 2013-07-23 13:27:37,023 (OperationsImpl) resultXml = solrUpdateIndex indexName=FgsIndex insertedltk:13000116/inserted counts insertTotal=1 updateTotal=0 deleteTotal=0 emptyTotal=0 docCount=854 warnCount=0/ /solrUpdateIndex DEBUG 2013-07-23 13:27:37,023 (GTransformer) xsltName=fgsconfigFinal/index/FgsIndex/updateIndexToResultPage DEBUG 2013-07-23 13:27:37,027 (GTransformer) getTransformer transformer=org.apache.xalan.transformer.TransformerImpl@6561b973uriResolver=null DEBUG 2013-07-23 13:27:37,028 (GenericOperationsImpl) resultXml=?xml version=1.0 encoding=UTF-8? resultPage operation=updateIndex action=fromPid value=ltk:13000116 repositoryName=FgsRepos indexNames= resultPageXslt= dateTime=Tue Jul 23 13:27:36 UTC 2013 updateIndex xmlns:dc=http://purl.org/dc/elements/1.1/; xmlns:foxml=info:fedora/fedora-system:def/foxml# xmlns:zs= http://www.loc.gov/zing/srw/; warnCount=0 docCount=854 deleteTotal=0 updateTotal=0 insertTotal=1 indexName=FgsIndex/ /resultPage INFO 2013-07-23 13:27:37,028 (UpdateListener) Index updated by notification message, returning: ?xml version=1.0 encoding=UTF-8? resultPage operation=updateIndex action=fromPid value=ltk:13000116 repositoryName=FgsRepos indexNames= resultPageXslt= dateTime=Tue Jul 23 13:27:36 UTC 2013 updateIndex xmlns:dc=http://purl.org/dc/elements/1.1/; xmlns:foxml=info:fedora/fedora-system:def/foxml# xmlns:zs= http://www.loc.gov/zing/srw/; warnCount=0 docCount=854 deleteTotal=0 updateTotal=0 insertTotal=1 indexName=FgsIndex/ /resultPage thanks, Alistair -- mov eax,1 mov ebx,0 int 80h
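For reference, an explicit commit can be sent straight to the update handler; a minimal sketch assuming the default core URL:

  curl 'http://localhost:8983/solr/update?commit=true'

If the new resource becomes searchable after that, then GSearch is simply not committing (and the optimize from the Core Admin page only helps because an optimize implies a commit).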
Re: custom field type plugin
Oh cool! I'm glad it at least seemed to work. Can you post your configuration of the field type, and report from Solr's logs what maxLevels is used for this field (it is logged the first time you use the field type)? Maybe there isn't a limit under 10B after all. Some quick'n'dirty calculations I just did indicate there shouldn't be a problem, but real-world usage will be a better proof. Indexing probably won't be terribly slow; queries could get pretty slow if the amount of indexed data is really high. I'd love to hear how it works out for you. Your use-case would benefit a lot from an improved prefix tree implementation. I don't gather how a 3rd dimension would play into this. Support for multi-dimensional spatial is on the drawing board. ~ David Kevin Stone wrote What are the dangers of trying to use a range of 10 billion? Simply a slower index time? Or will I get inaccurate results? I have tried it on a very small sample of documents, and it seemed to work. I could spend some time this week trying to get a more robust (and accurate) dataset loaded to play around with. The reason for the 10 billion is to support being able to query for a region on a chromosome. A user might want to know what genes overlap a point on a specific chromosome. Unless I can use 3-dimensional coordinates (which gave an error when I tried it), I'll need to multiply the coordinates by some offset for each chromosome to be able to normalise the data (at both index and query time). The largest chromosome (chr 1) has almost 250,000,000 base pairs. I could probably squeeze the rest a bit smaller, but I'd rather use one size for all chromosomes, since we have more than just human data to deal with. It would get quite messy otherwise. On 7/22/13 11:50 AM, David Smiley (@MITRE.org) <DSMILEY@> wrote: Like Hoss said, you're going to have to solve this using http://wiki.apache.org/solr/SpatialForTimeDurations Using PointType is *not* going to work because your durations are multi-valued per document. It would be useful to create a custom field type that wraps the capability outlined on the wiki to make it easier to use without requiring the user to think spatially. You mentioned that these numeric ranges extend upwards of 10 billion or so. Unfortunately, the current prefix tree implementation under the hood for non-geodetic spatial, the QuadTree, is unlikely to scale to numbers that big. I don't know where the boundary is, but I doubt 10B. You could try and see what happens. I'm working (very slowly on very little spare time) on improving the PrefixTree implementations to scale to such large numbers; I hope something will be available this fall. ~ David Smiley Kevin Stone wrote I have a particular use case that I think might require a custom field type, however I am having trouble getting the plugin to work. My use case has to do with genetics data, and we are running into several situations where we need to be able to query multiple regions of a chromosome (or gene, or other object types). All that really boils down to is being able to give a number, e.g. 10234, and return documents that have regions containing the number. So you'd have a document with a list like [1:16090,400:8000,40123:43564], and it should come back because 10234 falls between 1:16090. If there is a better or easier way to do this please speak up. I'd rather not have to use a join on another index, because 1) it's more complex to set up, and 2) we might need to join against something else and you can only do one join at a time.
Anyway… I tried creating a field type similar to a PointType just to see if I could get one working. I added the following jars to get it to compile: apache-solr-core-4.0.0, lucene-core-4.0.0, lucene-queries-4.0.0, apache-solr-solrj-4.0.0. I am running solr 4.0.0 on jetty, and put my jar file in a sharedLib folder, and specified it in my solr.xml (I have multiple cores). After starting up solr, I got the line that it picked up the jar: INFO: Adding 'file:/blah/blah/lib/CustomPlugins.jar' to classloader But I get this error about it not being able to find the AbstractSubTypeFieldType class. Here is the first bit of the trace: SEVERE: null:java.lang.NoClassDefFoundError: org/apache/solr/schema/AbstractSubTypeFieldType at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:791) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) ...etc… Any hints as to what I did wrong? I can provide source code, or a fuller stack trace, config settings, etc. Also, I did try to unpack the solr.war, stick my jar in WEB-INF/lib, then repack. However, when I did that, I get a NoClassDefFoundError for my plugin itself. Thanks, Kevin
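For readers who want to see the shape of the SpatialForTimeDurations trick applied to this use case, here is a rough, untested sketch (the field names, bounds, and attribute spellings are assumptions to verify against your Solr 4.x release). Each start:end region is indexed as the point (start, end), and "regions containing P" becomes a rectangle query:

  <fieldType name="region" class="solr.SpatialRecursivePrefixTreeFieldType"
             geo="false" worldBounds="0 0 10000000000 10000000000"/>
  <field name="regions" type="region" indexed="true" stored="true" multiValued="true"/>

A region 1:16090 is indexed as the value "1 16090" (x = start, y = end). To find documents with a region containing 10234, ask for points with start <= 10234 and end >= 10234:

  fq=regions:"Intersects(ENVELOPE(0, 10234, 10000000000, 10234))"

where ENVELOPE takes (minX, maxX, maxY, minY).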
Re: Calculating Solr document score by ignoring the boost field.
: Ok thanks, I just wanted to know whether it is possible to ignore the boost value or : not during score calculation, and as you said it's not. : Now I would have to focus on Nutch to fix the issue and not send boost=0 : to Solr. The index-time boosts are encoded in the field norms -- if you want to ignore them, you could either modify your schema to 'omitNorms=true' on all fields *before* indexing, or you could customize the Similarity implementation you use to be something custom that does not take field norms into account at all. https://cwiki.apache.org/confluence/display/solr/Other+Schema+Elements#OtherSchemaElements-Similarity https://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html -Hoss
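For reference, the schema change Hoss describes is per-field; a minimal sketch (the field name and type are illustrative):

  <field name="title" type="text_general" indexed="true" stored="true" omitNorms="true"/>

Since the index-time boost is folded into the norm, omitting norms discards it (along with length normalization) for that field; a full reindex is required for the change to take effect.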
Re:
On 23 July 2013 21:52, Chris Hostetter hossman_luc...@fucit.org wrote: : Can anyone remove this spammer please? The recent influx is not confined to a single user, or a single list. Nor is there a clear course of action just yet, since the senders in question are all legitimate subscribers who have been active members of the community. There is an open issue to track the recent surge and see if we can address this problem holistically before resorting to forcibly unsubscribing legitimate list members based on messages spoofed on their behalf... https://issues.apache.org/jira/browse/INFRA-6585 ...interested parties should watch that issue, and/or post comments there with any concrete technical suggestions for addressing the problem. Yes, this seems to be an across-the-board attack, and I am seeing this on other mailing lists, and also in personal mail from known friends. The modus operandi seems to be hacking (probably weak) passwords for accounts on sites like Gmail and Yahoo offering email addresses to people at large. I have no immediate worthwhile suggestions on how best to deal with this, but as Hoss says, it is worth bearing in mind that the people in whose names the mail is being sent are also victims. Regards, Gora
Re: Question about field boost
I'm not sure I understand, Erick. I don't have a text field in my schema; title and content are both legal fields. On Tue, Jul 23, 2013 at 5:15 AM, Erick Erickson erickerick...@gmail.com wrote: this isn't doing what you think. title^10 content is actually parsed as text:title^100 text:content where text is my default search field. assuming title is a field. If you look a little farther up the debug output you'll see that. You probably want title:content^100 or some such? Erick On Tue, Jul 23, 2013 at 1:43 AM, Jack Krupansky j...@basetechnology.com wrote: That means that for that document china occurs in the title vs. snowden found in a document but not in the title. -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Tuesday, July 23, 2013 12:52 AM To: solr-user@lucene.apache.org Subject: Re: Question about field boost Is my reading correct that the boost is only applied on china but not snowden? How can that be? My query is: q=china+snowden&qf=title^10 content On Mon, Jul 22, 2013 at 9:43 PM, Joe Zhang smartag...@gmail.com wrote: Thanks for your hint, Jack. Here are the debug results, which I'm having a hard time deciphering (the two terms are china and snowden)... 0.26839527 = (MATCH) sum of: 0.26839527 = (MATCH) sum of: 0.26757246 = (MATCH) max of: 7.9147343E-4 = (MATCH) weight(content:china in 249), product of: 0.019873314 = queryWeight(content:china), product of: 1.6649085 = idf(docFreq=46832, maxDocs=91058) 0.01193658 = queryNorm 0.039825942 = (MATCH) fieldWeight(content:china in 249), product of: 4.8989797 = tf(termFreq(content:china)=24) 1.6649085 = idf(docFreq=46832, maxDocs=91058) 0.0048828125 = fieldNorm(field=content, doc=249) 0.26757246 = (MATCH) weight(title:china^10.0 in 249), product of: 0.5836803 = queryWeight(title:china^10.0), product of: 10.0 = boost 4.8898454 = idf(docFreq=1861, maxDocs=91058) 0.01193658 = queryNorm 0.45842302 = (MATCH) fieldWeight(title:china in 249), product of: 1.0 = tf(termFreq(title:china)=1) 4.8898454 = idf(docFreq=1861, maxDocs=91058) 0.09375 = fieldNorm(field=title, doc=249) 8.2282536E-4 = (MATCH) max of: 8.2282536E-4 = (MATCH) weight(content:snowden in 249), product of: 0.03407834 = queryWeight(content:snowden), product of: 2.8549502 = idf(docFreq=14246, maxDocs=91058) 0.01193658 = queryNorm 0.024145111 = (MATCH) fieldWeight(content:snowden in 249), product of: 1.7320508 = tf(termFreq(content:snowden)=3) 2.8549502 = idf(docFreq=14246, maxDocs=91058) 0.0048828125 = fieldNorm(field=content, doc=249) On Mon, Jul 22, 2013 at 9:27 PM, Jack Krupansky j...@basetechnology.com wrote: Maybe you're not doing anything wrong - other than having an artificial expectation of what the true relevance of your data actually is. Many factors go into relevance scoring. You need to look at all aspects of your data. Maybe your terms don't occur in your titles the way you think they do. Maybe you need a boost of 500 or more... Lots of potential maybes. Relevancy tuning is an art and craft, hardly a science. Step one: Know your data, inside and out. Use the debugQuery=true parameter on your queries and see how much of the score is dominated by your query terms in the non-title fields. -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Monday, July 22, 2013 11:06 PM To: solr-user@lucene.apache.org Subject: Question about field boost Dear Solr experts: Here is my query: defType=dismax&q=term1+term2&qf=title^100 content Apparently (at least I thought) my intention is to boost the title field.
While I'm getting some non-trivial results, I'm surprised that the documents with both term1 and term2 in title (I know such docs do exist in my repository) were not returned (or maybe ranked very low). The situation does not change even when I use much larger boost factors. What am I doing wrong?
Spellcheck field element and collation issues
Hi All, I have an IndexBasedSpellChecker component configured as follows (note the field parameter is set to the spellcheck field): searchComponent name=spellcheck class=solr.SpellCheckComponent str name=queryAnalyzerFieldTypetext_spell/str lst name=spellchecker str name=namedefault/str str name=classnamesolr.IndexBasedSpellChecker/str !-- Load tokens from the following field for spell checking, analyzer for the field's type as defined in schema.xml are used -- * str name=fieldspellcheck/str* str name=spellcheckIndexDir./spellchecker/str float name=thresholdTokenFrequency.0001/float /lst /searchComponent with the corresponding field type for spellcheck: fieldType name=text_spell class=solr.TextField positionIncrementGap=100 omitNorms=true analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=lang/stopwords_en.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory/ filter class=solr.StandardFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=moto_synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=lang/stopwords_en.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory/ filter class=solr.StandardFilterFactory/ /analyzer /fieldType and field: !-- spellcheck field is multivalued because it has the title and markup fields copied into it -- field name=spellcheck type=text_spell stored=false omitTermFreqAndPositions=true multiValued=true/ values from a markup and title field are copied into the spellcheck field. My /select search component has the following defaults: lst name=defaults str name=echoParamsexplicit/str int name=rows10/int str name=dfmarkup_texts title_texts/str !-- Spell checking defaults -- str name=spellchecktrue/str str name=spellcheck.collateExtendedResultstrue/str str name=spellcheck.extendedResultstrue/str str name=spellcheck.maxCollations2/str str name=spellcheck.maxCollationTries5/str str name=spellcheck.count5/str str name=spellcheck.collatetrue/str str name=spellcheck.maxResultsForSuggest5/str str name=spellcheck.alternativeTermCount5/str /lst When I issue a search like this: http://localhost:8981/solr/articles/select?indent=true&spellcheck.q=markup_texts:(Perfrm%20HVC)&q=Perfrm%20HVC&rows=0 I get collations: lst name=collation str name=collationQuerymarkup_texts:(perform hvac)/str int name=hits4/int lst name=misspellingsAndCorrections str name=perfrmperform/str str name=hvchvac/str /lst /lst lst name=collation str name=collationQuerymarkup_texts:(performed hvac)/str int name=hits4/int lst name=misspellingsAndCorrections str name=perfrmperformed/str str name=hvchvac/str /lst /lst However, if I remove the spellcheck.q parameter I do not, i.e.
no collations are returned for the following: http://localhost:8981/solr/articles/select?indent=true&q=Perfrm%20HVC&rows=0 If I specify the fields being searched over for the q parameter I get collations: http://localhost:8981/solr/articles/select?indent=true&q=markup_texts:(Perfrm%20HVC)&rows=0 lst name=collation str name=collationQuerymarkup_texts:(perform hvac)/str int name=hits4/int lst name=misspellingsAndCorrections str name=perfrmperform/str str name=hvchvac/str /lst /lst lst name=collation str name=collationQuerymarkup_texts:(performed hvac)/str int name=hits4/int lst name=misspellingsAndCorrections str name=perfrmperformed/str str name=hvchvac/str /lst /lst I'm a bit confused as to what the value for field should be in the spellcheck component definition. In fact, what is its purpose here? Just as the input for building the spellchecking index? If that is so, then why do I need to even specify the queryAnalyzerFieldType? Also, why do I need to explicitly specify the field in the query or spellcheck.q to get collations? Thanks, and sorry for the rather long question. Brendan
RE: Spellcheck field element and collation issues
For this query: http://localhost:8981/solr/articles/select?indent=true&q=Perfrm%20HVC&rows=0 ...do you get anything back in the spellcheck response? Is it correcting the individual words and not giving collations? Or are you getting no individual word suggestions also? James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: Brendan Grainger [mailto:brendan.grain...@gmail.com] Sent: Tuesday, July 23, 2013 1:47 PM To: solr-user@lucene.apache.org Subject: Spellcheck field element and collation issues [...]
how number of indexed fields effect performance
Hi, Thanks for your suggestions. I'll be able to provide answers to a few of your questions right now; the rest I'll answer after some time. It takes around 150k to 200k queries before it goes down again after restarting it. In a typical query we are returning around 20 fields. Memory utilization peaks only after some time. Regards, Suryansh On Tuesday, July 23, 2013, Jack Krupansky wrote: There was also a bug in the lazy loading of multivalued fields at one point recently in Solr 4.2: https://issues.apache.org/jira/browse/SOLR-4589 (4.x + enableLazyFieldLoading + large multivalued fields + varying fl = pathological CPU load & response time) Do you use multivalued fields very heavily? I'm still not ready to suggest that 1,000 fields is an okay thing to do, but there are still plenty of nuances in Solr performance that could explain the difficulties, before we even get to the 1,000 field issue itself. The real bottom line is that as you increase field count, there are lots of other aspects of Solr memory and performance degradation that increase as well. Some of those factors can be dealt with simply with more memory, more and faster CPU cores, or even more sharding, or other tuning, but not necessarily all of them. I think that I am already on the record on other threads as suggesting that a couple hundred is about the limit for field count for a slam dunk use of Solr. That doesn't mean you can't go above a couple hundred fields, just that you are in uncharted territory and may need to take extraordinary measures to get everything working satisfactorily. There's no magic hard limit, just a general sense that smaller numbers of fields are like a walk in the park, while higher numbers of fields are like chopping through a jungle. We each have our own threshold for... adventure. We need answers to the previous questions we raised before we can analyze this a lot further. Oh, and make sure there is enough OS system memory available for caching of the index pages. Sometimes, it is little things like this that can crush Solr performance. Unfortunately, Solr is not a packaged solution that automatically and magically auto-configures everything to work just right. Instead, it is a powerful toolkit that lets you do amazing things, but you the developer/architect need to supply amazing intelligence, wisdom, foresight, and insight to get it (and its hardware and software environment) to do those amazing things. -- Jack Krupansky -Original Message- From: Alexandre Rafalovitch Sent: Tuesday, July 23, 2013 9:54 AM To: solr-user@lucene.apache.org Subject: Re: how number of indexed fields effect performance Do you need all of the fields loaded every time, and are they stored? Maybe there is a document with gigantic content that you don't actually need, but it gets deserialized anyway. Try the lazy loading setting: enableLazyFieldLoading in solrconfig.xml. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Tue, Jul 23, 2013 at 12:36 AM, Jack Krupansky j...@basetechnology.com wrote: After restarting Solr and doing a couple of queries to warm the caches, are queries already slow/failing, or does it take some time and a number of queries before failures start occurring? One possibility is that you just need a lot more memory for caches for this amount of data. So, maybe the failures are caused by heavy garbage collections. So, after restarting Solr, check how much Java heap is available, then do some warming queries, then check the Java heap available again. Add the debugQuery=true parameter to your queries and look at the timings to see what phases of query processing are taking the most time. Also check whether the reported QTime seems to match actual wall clock time; sometimes formatting of the results and network transfer time can dwarf actual query time. How many fields are you returning on a typical query? -- Jack Krupansky -Original Message- From: Suryansh Purwar Sent: Monday, July 22, 2013 11:06 PM To: solr-user@lucene.apache.org ; j...@basetechnology.com Subject: how number of indexed fields effect performance It was running fine initially when we just had around 100 fields indexed. In this case as well it runs fine, but after some time a broken pipe exception starts coming, which results in the shard going down. Regards, Suryansh On Tuesday, July 23, 2013, Jack Krupansky wrote: Was all of this running fine previously and only started running slow recently, or is this your first measurement? Are very simple queries (single keyword, no filters or facets or sorting or anything else, and
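For reference, the lazy-loading setting Alexandre mentions lives in the query section of solrconfig.xml; a minimal sketch (stock Solr 4.x syntax):

  <query>
    <!-- Load stored fields lazily: only the fields named in fl are
         deserialized up front; the rest are fetched on demand. -->
    <enableLazyFieldLoading>true</enableLazyFieldLoading>
  </query>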
Re: Spellcheck field element and collation issues
Hi James, I get the following response for that query: response lst name=responseHeader int name=status0/int int name=QTime8/int lst name=params str name=indenttrue/str str name=qPerfrm HVC/str str name=rows0/str /lst /lst result name=response numFound=0 start=0/result lst name=spellcheck lst name=suggestions lst name=perfrm int name=numFound3/int int name=startOffset0/int int name=endOffset6/int int name=origFreq0/int arr name=suggestion lst str name=wordperform/str int name=freq4/int /lst lst str name=wordperformed/str int name=freq1/int /lst lst str name=wordperformance/str int name=freq3/int /lst /arr /lst lst name=hvc int name=numFound2/int int name=startOffset7/int int name=endOffset10/int int name=origFreq0/int arr name=suggestion lst str name=wordhvac/str int name=freq4/int /lst lst str name=wordhave/str int name=freq5/int /lst /arr /lst bool name=correctlySpelledfalse/bool /lst /lst /response Thanks Brendan On Tue, Jul 23, 2013 at 3:19 PM, Dyer, James james.d...@ingramcontent.com wrote: [...]
Re: Node down, but not out
I think the best bet here would be a ping-like handler that would simply return the state of only this box in the cluster: something like /admin/state which would return down, active, leader, or recovering. I'm not really sure where to begin however. Any ideas? jim On Mon, Jul 22, 2013 at 12:52 PM, Timothy Potter [via Lucene] ml-node+s472066n4079518...@n3.nabble.com wrote: There is but I couldn't get it to work in my environment on Jetty, see: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201306.mbox/%3CCAJt9Wnib+p_woYODtrSPhF==v8Vx==mDBd_qH=x_knbw-bn...@mail.gmail.com%3E Let me know if you have any better luck. I had to resort to something hacky but was out of time I could devote to such unproductive endeavors ;-) On Mon, Jul 22, 2013 at 10:49 AM, jimtronic [hidden email] wrote: I'm not sure why it went down exactly -- I restarted the process and lost the logs. (d'oh!) An OOM seems likely, however. Is there a setting for killing the processes when solr encounters an OOM? Thanks! Jim -- View this message in context: http://lucene.472066.n3.nabble.com/Node-down-but-not-out-tp4079495p4079856.html Sent from the Solr - User mailing list archive at Nabble.com.
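One possible starting point, sketched against the SolrJ 4.x cloud classes (ZkStateReader, ClusterState, Slice, Replica); treat the class and method names as assumptions to verify against your SolrJ version, and the ZooKeeper address, collection, and node name below as placeholders:

  import org.apache.solr.common.cloud.ClusterState;
  import org.apache.solr.common.cloud.Replica;
  import org.apache.solr.common.cloud.Slice;
  import org.apache.solr.common.cloud.ZkStateReader;

  public class NodeStateCheck {
    public static void main(String[] args) throws Exception {
      // Read cluster state from the same ZooKeeper ensemble the cluster uses
      ZkStateReader reader = new ZkStateReader("localhost:2181", 15000, 15000);
      try {
        reader.createClusterStateWatchersAndUpdate();
        ClusterState state = reader.getClusterState();
        String nodeName = "192.168.1.1:8983_solr"; // placeholder node name
        for (Slice slice : state.getSlices("collection1")) {
          for (Replica replica : slice.getReplicas()) {
            if (nodeName.equals(replica.getStr(ZkStateReader.NODE_NAME_PROP))) {
              // prints "active", "recovering", "down", ...
              System.out.println(slice.getName() + ": "
                  + replica.getStr(ZkStateReader.STATE_PROP));
            }
          }
        }
      } finally {
        reader.close();
      }
    }
  }

A "leader" check would additionally compare the replica against slice.getLeader().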
Re: Spellcheck field element and collation issues
Hi James, If I try: http://localhost:8981/solr/articles/select?indent=true&q=Perfrm%20HVC&rows=0&maxCollationTries=0 I get the same result: response lst name=responseHeader int name=status0/int int name=QTime7/int lst name=params str name=indenttrue/str str name=qPerfrm HVC/str str name=maxCollationTries0/str str name=rows0/str /lst /lst result name=response numFound=0 start=0/result lst name=spellcheck lst name=suggestions lst name=perfrm int name=numFound3/int int name=startOffset0/int int name=endOffset6/int int name=origFreq0/int arr name=suggestion lst str name=wordperform/str int name=freq4/int /lst lst str name=wordperformed/str int name=freq1/int /lst lst str name=wordperformance/str int name=freq3/int /lst /arr /lst lst name=hvc int name=numFound2/int int name=startOffset7/int int name=endOffset10/int int name=origFreq0/int arr name=suggestion lst str name=wordhvac/str int name=freq4/int /lst lst str name=wordhave/str int name=freq5/int /lst /arr /lst bool name=correctlySpelledfalse/bool /lst /lst /response However, you're right that my df field for the /select handler is in fact: str name=dfmarkup_texts title_texts/str I would note that if I specify the query as follows: http://localhost:8981/solr/articles/select?indent=true&q=markup_texts:(Perfrm%20HVC)+OR+title_texts:(Perfrm%20HVC)&rows=0&maxCollationTries=0 which is what I thought specifying a df would effectively do, I get collation results: lst name=collation str name=collationQuery markup_texts:(perform hvac) OR title_texts:(perform hvac) /str int name=hits4/int lst name=misspellingsAndCorrections str name=perfrmperform/str str name=hvchvac/str str name=perfrmperform/str str name=hvchvac/str /lst /lst lst name=collation str name=collationQuery markup_texts:(perform hvac) OR title_texts:(performed hvac) /str int name=hits4/int lst name=misspellingsAndCorrections str name=perfrmperform/str str name=hvchvac/str str name=perfrmperformed/str str name=hvchvac/str /lst /lst I think I'm confused about the relationship between the q parameter and what the field and queryAnalyzerFieldType are for in the spellcheck component definition, i.e. what is this for: str name=fieldspellcheck/str Is it even needed if I've specified how the spelling index terms should be analyzed with: str name=queryAnalyzerFieldTypetext_spell/str Thanks again Brendan On Tue, Jul 23, 2013 at 3:58 PM, Dyer, James james.d...@ingramcontent.com wrote: Try tacking maxCollationTries=0 to the URL and see if the collation returns. If you get a collation, then try the same URL with the collation as the q parameter. Does that get results? My suspicion here is that you are assuming that markup_texts is the default search field for /select but in fact it isn't. [...]
RE: Spellcheck field element and collation issues
Try tacking maxCollationTries=0 to the URL and see if the collation returns. If you get a collation, then try the same URL with the collation as the q parameter. Does that get results? My suspicion here is that you are assuming that markup_texts is the default search field for /select but in fact it isn't. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: Brendan Grainger [mailto:brendan.grain...@gmail.com] Sent: Tuesday, July 23, 2013 2:43 PM To: solr-user@lucene.apache.org Subject: Re: Spellcheck field element and collation issues [...]
socket write error Solrj 4.3.1
Hi all, I'm testing SolrCloud (version 4.3.1) with 2 shards and 1 external ZooKeeper. All is running OK: documents are indexed into the 2 different shards, and select *:* gives me all documents. Now I'm trying to add/index a new document via SolrJ using CloudSolrServer. The code: server = new CloudSolrServer("localhost:2181"); server.setDefaultCollection("tika"); server.setZkConnectTimeout(9); input = new FileInputStream(new File("C:\\sample.pdf")); ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract"); up.addFile(new File("C:\\caca.pdf"), "application/octet-stream"); up.setParam("literal.id", "444"); Parser parser = new PDFParser(); ContentHandler handler = new BodyContentHandler(); Metadata metadata = new Metadata(); ParseContext context = new ParseContext(); parser.parse(input, handler, metadata, context); up.setParam("literal.text", handler.toString()); up.setMethod(SolrRequest.METHOD.POST); server.request(up); server.commit(); input.close(); } catch (MalformedURLException e) { // TODO Auto-generated catch block e.printStackTrace(); } catch (IOException e) { // TODO Auto-generated catch block e.printStackTrace(); } catch (SolrServerException e) { // TODO Auto-generated catch block e.printStackTrace(); } catch (Exception e) { // TODO Auto-generated catch block e.printStackTrace(); } My schema looks like this: fields field name=id type=integer indexed=true stored=true required=true/ field name=title type=string indexed=true stored=true/ field name=author type=string indexed=true stored=true / field name=text type=text_ind indexed=true stored=true / field name=_version_ type=long indexed=true stored=true/ dynamicField name=ignored_* type=string indexed=true stored=true/ /fields where the text_ind type is like this: fieldType name=text_ind class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.LetterTokenizerFactory/ filter class=solr.EdgeNGramFilterFactory minGramSize=3 maxGramSize=25 / filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldType When I execute the code, the following exception is thrown: log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://192.168.1.12:8983/solr/tika_shard1_replica1, http://192.168.1.12:8984/solr/tika_shard2_replica1] at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:333) at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:306) at solrCloud.solrJDemo.main(solrJDemo.java:51) Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://192.168.1.12:8983/solr/tika_shard1_replica1 at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180) at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264) ... 2 more Caused by: org.apache.http.client.ClientProtocolException at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:909) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352) ... 4 more Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity.
The cause lists the reason the original request failed. at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:691) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:522) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906) ... 7
Re: zkHost in solr.xml goes missing after SPLITSHARD using Collections API
Thanks Alan and Shawn. Just installed Solr 4.4, and I am no longer experiencing the issue. Thanks! :) On Tue, Jul 23, 2013 at 7:21 AM, Shawn Heisey s...@elyograg.org wrote: On 7/23/2013 7:50 AM, Alan Woodward wrote: Can you try upgrading to the just-released 4.4? Solr.xml persistence had all kinds of bugs in 4.3, which should have been fixed now. The 4.4.0 release has been finalized and uploaded, but the download link hasn't been changed yet because the mirror network isn't fully synchronized yet. It is available from many mirrors, but until the website download links get changed, there's not yet a direct way to access it. Here are some generic instructions for situations where the new version is done, but the official announcement isn't out yet: http://lucene.apache.org/solr/ 1) Go to the Solr website (URL above) and click on the latest version download button, which at this moment is 4.3.1. Wait for the redirect to take you to a mirror list. 2) Click on one of the mirrors; the best option is usually the one right on top that the website chose for you. 3) When the file list comes up, click the Parent Directory link. If this isn't showing, it will most likely be labelled with .. instead. 4) If a directory for the new version (in this case 4.4.0) is listed, click on it and then click the file that you want to download. If the new version is not listed, click the Back button on your browser twice, then go back to step 2, but this time choose a different mirror. One last reminder: This only works right before a release is officially announced. These instructions cannot be used while a release is still in development. Thanks, Shawn
Processing a lot of results in Solr
Hello Solr users, Question regarding processing a lot of docs returned from a query; I potentially have millions of documents returned back from a query. What is the common design to deal with this? 2 ideas I have are: - create a client service that is multithreaded to handle this - use the Solr pagination to retrieve a batch of rows at a time (start, rows in the Solr Admin console) Any other ideas that I may be missing? Thanks, Matt
RE: Spellcheck field element and collation issues
I don't believe you can specify more than 1 field on df (default field). What you want, I think, is qf (query fields), which is available only if using dismax/edismax. http://wiki.apache.org/solr/SearchHandler#df http://wiki.apache.org/solr/ExtendedDisMax#qf_.28Query_Fields.29 James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: Brendan Grainger [mailto:brendan.grain...@gmail.com] Sent: Tuesday, July 23, 2013 3:22 PM To: solr-user@lucene.apache.org Subject: Re: Spellcheck field element and collation issues [...]
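To make the qf suggestion concrete, the /select defaults from earlier in the thread could be reworked along these lines (a sketch; edismax parameters per the wiki pages above, field names from this thread):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <!-- search both fields; qf replaces the multi-field df -->
      <str name="qf">markup_texts title_texts</str>
    </lst>
  </requestHandler>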
Re: socket write error Solrj 4.3.1
For people who have the same issue, it was solved by adding: str name=fmap.contenttext/str in the /update/extract requestHandler in solrconfig.xml: requestHandler name=/update/extract class=org.apache.solr.handler.extraction.ExtractingRequestHandler lst name=defaults str name=fmap.Last-Modifiedlast_modified/str str name=uprefixignored_/str str name=fmap.contenttext/str /lst lst name=date.formats str-MM-dd/str /lst /requestHandler So there is no need to add the content from SolrJ: p.setParam("literal.text", handler.toString()); Regards -- View this message in context: http://lucene.472066.n3.nabble.com/socket-write-error-Solrj-4-3-1-tp4079869p4079881.html Sent from the Solr - User mailing list archive at Nabble.com.
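For reference, a minimal sketch of the SolrJ side once fmap.content is in place, letting the ExtractingRequestHandler run Tika server-side instead of parsing the PDF in the client (a fragment in the style of the code earlier in this thread, assuming the same imports; the file path and id are placeholders):

  CloudSolrServer server = new CloudSolrServer("localhost:2181");
  server.setDefaultCollection("tika");
  ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
  // Tika extracts the body server-side; fmap.content maps it to the "text" field
  up.addFile(new File("C:\\sample.pdf"), "application/pdf");
  up.setParam("literal.id", "444");
  up.setMethod(SolrRequest.METHOD.POST);
  server.request(up);
  server.commit();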
Re: Spellcheck field element and collation issues
Thanks James. That's it! Now: http://localhost:8981/solr/articles/select?indent=true&q=Perfrm%20HVC&rows=0&maxCollationTries=0 returns: lst name=collation str name=collationQueryperform hvac/str int name=hits4/int lst name=misspellingsAndCorrections str name=perfrmperform/str str name=hvchvac/str /lst /lst lst name=collation str name=collationQueryperformed hvac/str int name=hits4/int lst name=misspellingsAndCorrections str name=perfrmperformed/str str name=hvchvac/str /lst /lst If you have time, I'm still slightly unclear on the field element in the spellcheck configuration. Maybe I should explain how I think it works: 1. You create a relatively unanalyzed field type (e.g. no stemming) 2. You copy text you want to be used to build the spellcheck index into that field. 3. Build the spellcheck sidecar index (or noop if using DirectSpellChecker, in which case I assume it still uses the dedicated spellcheck field the text was copied into). When executing a spellcheck request, Solr uses the analyzer specified in queryAnalyzerFieldType to tokenize the query passed in via the q or spellcheck.q parameter, and this tokenized text is the input to the spellchecking instance. Does that sound right? Thanks Brendan On Tue, Jul 23, 2013 at 5:15 PM, Dyer, James james.d...@ingramcontent.com wrote: I don't believe you can specify more than 1 field on df (default field). What you want, I think, is qf (query fields), which is available only if using dismax/edismax. [...]
maximum number of documents per shard?
still 2.1 billion documents?
RE: Spellcheck field element and collation issues
You've got it. The only other thing is that spellcheck.q does not analyze anything. The whole purpose of this is to allow you to just send raw keywords to be spellchecked. This is handy if you have a complex q parameter (say, you're using local params, etc.) and the SpellingQueryConverter cannot handle it. You could write your own QueryConverter, but it's often just easier to strip out the keywords and send them over with spellcheck.q. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: Brendan Grainger [mailto:brendan.grain...@gmail.com] Sent: Tuesday, July 23, 2013 4:41 PM To: solr-user@lucene.apache.org Subject: Re: Spellcheck field element and collation issues [...]
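For example, taking the fielded query from earlier in the thread, the raw keywords can ride along separately (a hypothetical request built from the URLs above):

  http://localhost:8981/solr/articles/select?q=markup_texts:(Perfrm%20HVC)&spellcheck.q=Perfrm%20HVC&rows=0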
How to make soft commit more reliable?
Currently I am using SOLR 3.5.X and I push updates to SOLR via queue (Active MQ) and perform hard commit every 30 minutes (since my index is relatively big around 30 million documents). I am thinking of using soft commit to implement NRT search but I am worried about the reliability. For ex: If I have the hard autocommit set to 10 minutes and a softcommit every second, new documents will show up every second but in case of JVM crash or power goes out I will lose all the documents after the last hard commit. I was thinking of using a backup database or another SOLR index that I can use as a backup and write the document from queue in both places (one with soft commit, another index with just the push updates with normal hard commits (or) write simultaneously to a db and delete the rows once the hard commit is successful after making sure that we didn't lose any records). Does someone have any other idea to improve the reliability of the push updates when using soft commit? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-make-soft-commit-more-reliable-tp4079892.html Sent from the Solr - User mailing list archive at Nabble.com.
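For what it's worth, Solr 4.x addresses exactly this concern with the transaction log: with updateLog enabled in solrconfig.xml, uncommitted updates are replayed from the tlog on restart after a crash, so frequent soft commits plus infrequent hard commits do not risk the silent loss described above (the updateLog feature does not exist on 3.5). A sketch of the setup described in the post (intervals from the post; stock Solr 4.x syntax):

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- transaction log; replayed on startup after a crash -->
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <autoCommit>
      <maxTime>600000</maxTime>            <!-- hard commit every 10 minutes -->
      <openSearcher>false</openSearcher>   <!-- persist only; don't open a new searcher -->
    </autoCommit>
    <autoSoftCommit>
      <maxTime>1000</maxTime>              <!-- new documents visible every second -->
    </autoSoftCommit>
  </updateHandler>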
Re: Spellcheck field element and collation issues
Perfect, thanks so much. You just cleared up the other little bit, i.e. when the SpellingQueryConverter is used/not used and why you might implement your own. Thanks again. On Tue, Jul 23, 2013 at 6:48 PM, Dyer, James james.d...@ingramcontent.com wrote: [...]
Re: maximum number of documents per shard?
2.1 billion documents (2^31-1 = 2,147,483,647, the maximum value of a Java int, and the count includes deleted documents) per Lucene index, but essentially per Solr shard as well. But don't even think about going that high. In fact, don't plan on going above 100 million unless you do a proof of concept that validates that you get acceptable query and update performance. There is no hard limit besides that 2.1 billion Lucene limit, but... performance will vary. -- Jack Krupansky -Original Message- From: Ali, Saqib Sent: Tuesday, July 23, 2013 6:18 PM To: solr-user@lucene.apache.org Subject: maximum number of documents per shard? still 2.1 billion documents?
Re: problems about solr replication in 4.3
Are you mixing SolrCloud and old-style master/slave? There was a bug a while ago (search the JIRA) where replication was copying the entire index unnecessarily, but I think that was fixed by 4.3. Best Erick On Tue, Jul 23, 2013 at 6:33 AM, xiaoqi belivexia...@gmail.com wrote: hi all, I have two Solr instances: one is the master, one is the replica. Before, I used them under version 3.5 and they worked fine. When I upgraded to version 4.3, I found that when the replica copies the index from the master, it cleans its current index and then copies the new version into its own folder. The slave can't search during this process! I am new to Solr 4 -- is this normal? Any ideas? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/problems-about-solr-replication-in-4-3-tp4079665.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: softCommit doesn't work - ?
Right, issuing a commit after every document is not good practice. Relying on the auto commit parameters in solrconfig.xml is usually best, although I will sometimes issue a commit at the very end of the indexing run. Several things about this thread aren't making sense. First of all, your commitWithin parameter (your server.add(doc, int)) is not soft committing anything; it's just telling Solr to commit the documents at some point in the future. But you should be seeing these after 10 seconds. Check your solrconfig and ensure that your autocommit settings have openSearcher set to true; that could possibly be what you're seeing. The fact that you have all those segments indicates that the commits are going through, so maybe you just have openSearcher set to false. This is almost certainly an assumption you're making that's not so; what you're doing _should_ work. For that matter, what are your soft commit settings in solrconfig.xml? Best Erick On Tue, Jul 23, 2013 at 11:48 AM, tskom tsiedlac...@hotmail.co.uk wrote: Thanks for your comment Eric. When I use server.add(doc); everything is fine (but it takes a long time to hard commit every single doc), so I am sure docs are uniquely indexed. Maybe I shouldn't do server.commit(); at all from the SolrJ code, so SOLR would use the autoCommit/autoSoftCommit configuration defined in solrconfig.xml? Maybe there are some bits missing? -- View this message in context: http://lucene.472066.n3.nabble.com/softCommit-doesn-t-work-tp4079578p4079772.html Sent from the Solr - User mailing list archive at Nabble.com.
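To illustrate Erick's point about openSearcher, a sketch of the two settings in question (solrconfig.xml syntax; the intervals are illustrative): with openSearcher false, hard commits make documents durable but not visible, and it is the soft commit that opens a new searcher.

  <autoCommit>
    <maxTime>15000</maxTime>
    <!-- false: commits persist segments but do NOT make them searchable -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <!-- opens a new searcher cheaply, making recent docs visible -->
    <maxTime>10000</maxTime>
  </autoSoftCommit>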
Re: Question about field boost
Bah! I didn't notice that you'd used edismax, ignore my comments. Sorry for the confusion Erick On Tue, Jul 23, 2013 at 2:34 PM, Joe Zhang smartag...@gmail.com wrote: I'm not sure I understand, Erick. I don't have a text field in my schema; title and content are both legal fields. On Tue, Jul 23, 2013 at 5:15 AM, Erick Erickson erickerick...@gmail.comwrote: this isn't doing what you think. title^10 content is actually parsed as text:title^100 text:content where text is my default search field. assuming title is a field. If you look a little farther up the debug output you'll see that. You probably want title:content^100 or some such? Erick On Tue, Jul 23, 2013 at 1:43 AM, Jack Krupansky j...@basetechnology.com wrote: That means that for that document china occurs in the title vs. snowden found in a document but not in the title. -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Tuesday, July 23, 2013 12:52 AM To: solr-user@lucene.apache.org Subject: Re: Question about field boost Is my reading correct that the boost is only applied on china but not snowden? How can that be? My query is: q=china+snowdenqf=title^10 content On Mon, Jul 22, 2013 at 9:43 PM, Joe Zhang smartag...@gmail.com wrote: Thanks for your hint, Jack. Here is the debug results, which I'm having a hard deciphering (the two terms are china and snowden)... 0.26839527 = (MATCH) sum of: 0.26839527 = (MATCH) sum of: 0.26757246 = (MATCH) max of: 7.9147343E-4 = (MATCH) weight(content:china in 249), product of: 0.019873314 = queryWeight(content:china), product of: 1.6649085 = idf(docFreq=46832, maxDocs=91058) 0.01193658 = queryNorm 0.039825942 = (MATCH) fieldWeight(content:china in 249), product of: 4.8989797 = tf(termFreq(content:china)=24) 1.6649085 = idf(docFreq=46832, maxDocs=91058) 0.0048828125 = fieldNorm(field=content, doc=249) 0.26757246 = (MATCH) weight(title:china^10.0 in 249), product of: 0.5836803 = queryWeight(title:china^10.0), product of: 10.0 = boost 4.8898454 = idf(docFreq=1861, maxDocs=91058) 0.01193658 = queryNorm 0.45842302 = (MATCH) fieldWeight(title:china in 249), product of: 1.0 = tf(termFreq(title:china)=1) 4.8898454 = idf(docFreq=1861, maxDocs=91058) 0.09375 = fieldNorm(field=title, doc=249) 8.2282536E-4 = (MATCH) max of: 8.2282536E-4 = (MATCH) weight(content:snowden in 249), product of: 0.03407834 = queryWeight(content:snowden), product of: 2.8549502 = idf(docFreq=14246, maxDocs=91058) 0.01193658 = queryNorm 0.024145111 = (MATCH) fieldWeight(content:snowden in 249), product of: 1.7320508 = tf(termFreq(content:snowden)=3) 2.8549502 = idf(docFreq=14246, maxDocs=91058) 0.0048828125 = fieldNorm(field=content, doc=249) On Mon, Jul 22, 2013 at 9:27 PM, Jack Krupansky j...@basetechnology.comwrote: Maybe you're not doing anything wrong - other than having an artificial expectation of what the true relevance of your data actually is. Many factors go into relevance scoring. You need to look at all aspects of your data. Maybe your terms don't occur in your titles the way you think they do. Maybe you need a boost of 500 or more... Lots of potential maybes. Relevancy tuning is an art and craft, hardly a science. Step one: Know your data, inside and out. Use the debugQuery=true parameter on your queries and see how much of the score is dominated by your query terms in the non-title fields. 
-- Jack Krupansky -----Original Message----- From: Joe Zhang Sent: Monday, July 22, 2013 11:06 PM To: solr-user@lucene.apache.org Subject: Question about field boost Dear Solr experts: Here is my query: defType=dismax&q=term1+term2&qf=title^100 content Apparently (at least I thought) my intention is to boost the title field. While I'm getting some non-trivial results, I'm surprised that the documents with both term1 and term2 in the title (I know such docs do exist in my repository) were not returned (or maybe ranked very low). The situation does not change even when I use much larger boost factors. What am I doing wrong?
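Spelled out with explicit parameter separators and debug output enabled, the request under discussion would look roughly like this (host, port, and core name are placeholders; "+" in the URL encodes a space):

    http://localhost:8983/solr/collection1/select?defType=dismax&q=china+snowden&qf=title^100+content&debugQuery=true

With dismax (and edismax), each term in q is searched against every field listed in qf, and the best-scoring field wins per term - those are the "max of" nodes in the explain output quoted above. The title boost therefore applies to any term that matches in title; in the debug output, snowden carries only a content clause because that document has no snowden in its title, not because the boost was skipped.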
Re: Processing a lot of results in Solr
Hi Matt, This feature is commonly known as deep paging, and Lucene and Solr have issues with it... take a look at http://solr.pl/en/2011/07/18/deep-paging-problem/ as a potential starting point; it uses filters to bucketize a result set into sub-result sets. Cheers, Tim On Tue, Jul 23, 2013 at 3:04 PM, Matt Lieber mlie...@impetus.com wrote: Hello Solr users, Question regarding processing a lot of docs returned from a query; I potentially have millions of documents returned back from a query. What is the common design to deal with this? Two ideas I have are: - create a client service that is multithreaded to handle this - use the Solr pagination to retrieve a batch of rows at a time (start, rows in the Solr Admin console) Any other ideas that I may be missing? Thanks, Matt NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.
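To make the start/rows approach concrete, here is a minimal SolrJ sketch (assuming a recent SolrJ 4.x; the URL, query, sort field, and page size are placeholder assumptions). Note the caveat that motivates the deep-paging article above: for every page the server must score, collect, and discard the first start documents, so cost grows as you page deeper.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class PagedDump {
      public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        final int rows = 1000;          // batch size per request
        long fetched = 0;
        long numFound = Long.MAX_VALUE; // learned from the first response
        for (int start = 0; fetched < numFound; start += rows) {
          SolrQuery q = new SolrQuery("*:*");
          q.addSort("id", SolrQuery.ORDER.asc); // stable sort so pages don't shift mid-walk
          q.setStart(start);
          q.setRows(rows);
          QueryResponse rsp = server.query(q);
          numFound = rsp.getResults().getNumFound();
          for (SolrDocument doc : rsp.getResults()) {
            // process each document here
            fetched++;
          }
        }
        server.shutdown();
      }
    }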
Re: custom field type plugin
Sorry for the late response. I needed to find the time to load a lot of extra data (closer to what we're anticipating). I have an index with close to 220,000 documents, each with at least two coordinate regions anywhere between -10 billion and +10 billion, but a document could potentially have up to maybe half a dozen regions. The reason for the negatives is that you can read a chromosome either backwards or forwards, so many coordinates can be negative. Here is the schema field definition: fieldType name=geneticLocation class=solr.SpatialRecursivePrefixTreeFieldType multiValued=true geo=false worldBounds=-100000000000 -100000000000 100000000000 100000000000 distErrPct=0 maxDistErr=0.000009 units=degrees / Here is the first query in the log: INFO: geneticLocation{class=org.apache.solr.schema.SpatialRecursivePrefixTreeFieldType,analyzer=org.apache.solr.schema.FieldType$DefaultAnalyzer,args={distErrPct=0, geo=false, multiValued=true, worldBounds=-100000000000 -100000000000 100000000000 100000000000, maxDistErr=0.000009, units=degrees}} strat: RecursivePrefixTreeStrategy(prefixGridScanLevel:46,SPG:(QuadPrefixTree(maxLevels:50,ctx:SpatialContext{geo=false, calculator=CartesianDistCalc, worldBounds=Rect(minX=-1.0E11,maxX=1.0E11,minY=-1.0E11,maxY=1.0E11)}))) maxLevels: 50 Jul 23, 2013 9:11:45 PM org.apache.solr.core.SolrCore execute INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:Intersects(0+60330+6033041244+100)&rows=100} hits=81112 status=0 QTime=122 Here are some other queries to give different timings (the one above brings back quite a lot): INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:Intersects(0+60+69+100)&rows=100} hits=6031 status=0 QTime=10 Jul 23, 2013 9:13:43 PM org.apache.solr.core.SolrCore execute INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:Intersects(0+0+1000+100)&rows=100} hits=500 status=0 QTime=15 Jul 23, 2013 9:14:14 PM org.apache.solr.core.SolrCore execute INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:Intersects(0+7831329+7831329+100)&rows=100} hits=4 status=0 QTime=17 INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:Intersects(-100+-1051057963+-1001057963+0)&rows=100} hits=661 status=0 QTime=8 The query times look pretty fast to me. Certainly I'm pretty impressed. Our other backup solutions (involving SQL) likely wouldn't even touch this in terms of speed. We will be testing this more in depth in the coming month. I am sort of jumping ahead of our team to research possible solutions, since this is something that worried us. Looks like it might work! Thanks, -Kevin On 7/23/13 1:47 PM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: Oh cool! I'm glad it at least seemed to work. Can you post your configuration of the field type and report from Solr's logs what maxLevels is used for this field, which is logged the first time you use the field type? Maybe there isn't a limit under 10B after all. Some quick'n'dirty calculations I just did indicate there shouldn't be a problem, but real-world usage will be a better proof. Indexing probably won't be terribly slow; queries could get pretty slow if the amount of indexed data is really high. I'd love to hear how it works out for you. Your use-case would benefit a lot from an improved prefix tree implementation. I don't gather how a 3rd dimension would play into this. Support for multi-dimensional spatial is on the drawing board. ~ David Kevin Stone wrote: What are the dangers of trying to use a range of 10 billion? Simply a slower index time?
Or will I get inaccurate results? I have tried it on a very small sample of documents, and it seemed to work. I could spend some time this week trying to get a more robust (and accurate) dataset loaded to play around with. The reason for the 10 billion is to support being able to query for a region on a chromosome. A user might want to know what genes overlap a point on a specific chromosome. Unless I can use 3-dimensional coordinates (which gave an error when I tried it), I'll need to multiply the coordinates by some offset for each chromosome to be able to normalise the data (at both index and query time). The largest chromosome (chr 1) has almost 250,000,000 base pairs. I could probably squeeze the rest a bit smaller, but I'd rather use one size for all chromosomes, since we have more than just human data to deal with. It would get quite messy otherwise. On 7/22/13 11:50 AM, David Smiley (@MITRE.org) wrote: Like Hoss said, you're going to have to solve this using http://wiki.apache.org/solr/SpatialForTimeDurations Using PointType is *not* going to work because your durations are multi-valued per document. It would be useful to create a
Re: Processing a lot of results in Solr
Hello Matt, You can consider writing a batch processing handler, which receives a query and, instead of sending results back, writes them into a file which is then available for streaming (it has its own UUID). I am dumping many GBs of data from Solr in a few minutes - your query + streaming writer can go a very long way :) roman On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote: Hello Solr users, Question regarding processing a lot of docs returned from a query; I potentially have millions of documents returned back from a query. What is the common design to deal with this? Two ideas I have are: - create a client service that is multithreaded to handle this - use the Solr pagination to retrieve a batch of rows at a time (start, rows in the Solr Admin console) Any other ideas that I may be missing? Thanks, Matt NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.
Re: Processing a lot of results in Solr
That sounds like a satisfactory solution for the time being - I am assuming you dump the data from Solr in a CSV format? How did you implement the streaming processor? (What tool did you use for this? I'm not familiar with that.) You say it takes only a few minutes to dump the data - how long does it take to stream it back in, and is performance acceptable (~ within minutes)? Thanks, Matt On 7/23/13 6:57 PM, Roman Chyla roman.ch...@gmail.com wrote: Hello Matt, You can consider writing a batch processing handler, which receives a query and, instead of sending results back, writes them into a file which is then available for streaming (it has its own UUID). I am dumping many GBs of data from Solr in a few minutes - your query + streaming writer can go a very long way :) roman On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote: Hello Solr users, Question regarding processing a lot of docs returned from a query; I potentially have millions of documents returned back from a query. What is the common design to deal with this? Two ideas I have are: - create a client service that is multithreaded to handle this - use the Solr pagination to retrieve a batch of rows at a time (start, rows in the Solr Admin console) Any other ideas that I may be missing? Thanks, Matt NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.
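Roman's suggestion is a server-side handler; as a point of comparison, SolrJ can also stream a large result set on the client side without buffering it all in memory, via queryAndStreamResponse. A rough sketch with placeholder URL, query, and field names; note that the server still has to score and collect up to rows documents, so this only addresses client-side memory, not the server-side cost of a huge result set:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.StreamingResponseCallback;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class StreamingDump {
      public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("*:*");
        q.setRows(1000000); // upper bound on documents to pull; large values are still costly server-side

        server.queryAndStreamResponse(q, new StreamingResponseCallback() {
          @Override
          public void streamDocListInfo(long numFound, long start, Float maxScore) {
            System.err.println("numFound=" + numFound);
          }

          @Override
          public void streamSolrDocument(SolrDocument doc) {
            // invoked once per document as it is parsed off the wire,
            // so the full result set never sits in client memory
            System.out.println(doc.getFieldValue("id"));
          }
        });

        server.shutdown();
      }
    }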
Re: custom field type plugin
Kevin, Those are some good query response times, but they could be better. You've configured the field type sub-optimally. Look again at http://wiki.apache.org/solr/SpatialForTimeDurations and note in particular maxDistErr. You've left it at the value that comes pre-configured with Solr, 0.000009, which is ~1 meter measured in degrees, and that value makes no sense when your numeric range is in whole numbers. I suspect you inherited this value from Hoss's slides. **Instead use 1.** (as shown on the wiki). This affects performance in a big way, since you've configured the prefix tree to hold 2.22e18 values (calculated via (max-min) / maxDistErr) as opposed to just 2e10. Your log shows maxLevels is 50 for the quad tree. The comments in QuadPrefixTree (and I put them there once) indicate maxLevels of 50 is about as much as is supported, but again, I'm not certain what the limit really is without validating. Hopefully you can stay clear of 50. To do some tests, try querying just on the edge on either side of an indexed value, to make sure you match the point when you should and miss it when you shouldn't, as you would expect based on the instructions. Also, be sure to read the details under Search on that wiki page, where you are advised to buffer the query shape slightly; you didn't do this in your examples below. This is all a bit of a hack when using a field that internally uses floating point instead of fixed precision. ~ David Smiley On 7/23/13 9:32 PM, Kevin Stone kevin.st...@jax.org wrote: Sorry for the late response. I needed to find the time to load a lot of extra data (closer to what we're anticipating). I have an index with close to 220,000 documents, each with at least two coordinate regions anywhere between -10 billion and +10 billion, but a document could potentially have up to maybe half a dozen regions. The reason for the negatives is that you can read a chromosome either backwards or forwards, so many coordinates can be negative.
Here is the schema field definition: fieldType name=geneticLocation class=solr.SpatialRecursivePrefixTreeFieldType multiValued=true geo=false worldBounds=-100000000000 -100000000000 100000000000 100000000000 distErrPct=0 maxDistErr=0.000009 units=degrees / Here is the first query in the log: INFO: geneticLocation{class=org.apache.solr.schema.SpatialRecursivePrefixTreeFieldType,analyzer=org.apache.solr.schema.FieldType$DefaultAnalyzer,args={distErrPct=0, geo=false, multiValued=true, worldBounds=-100000000000 -100000000000 100000000000 100000000000, maxDistErr=0.000009, units=degrees}} strat: RecursivePrefixTreeStrategy(prefixGridScanLevel:46,SPG:(QuadPrefixTree(maxLevels:50,ctx:SpatialContext{geo=false, calculator=CartesianDistCalc, worldBounds=Rect(minX=-1.0E11,maxX=1.0E11,minY=-1.0E11,maxY=1.0E11)}))) maxLevels: 50 Jul 23, 2013 9:11:45 PM org.apache.solr.core.SolrCore execute INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:Intersects(0+60330+6033041244+100)&rows=100} hits=81112 status=0 QTime=122 Here are some other queries to give different timings (the one above brings back quite a lot): INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:Intersects(0+60+69+100)&rows=100} hits=6031 status=0 QTime=10 Jul 23, 2013 9:13:43 PM org.apache.solr.core.SolrCore execute INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:Intersects(0+0+1000+100)&rows=100} hits=500 status=0 QTime=15 Jul 23, 2013 9:14:14 PM org.apache.solr.core.SolrCore execute INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:Intersects(0+7831329+7831329+100)&rows=100} hits=4 status=0 QTime=17 INFO: [testIndex] webapp=/solr path=/select params={wt=xml&q=humanCoordinate:Intersects(-100+-1051057963+-1001057963+0)&rows=100} hits=661 status=0 QTime=8 The query times look pretty fast to me. Certainly I'm pretty impressed. Our other backup solutions (involving SQL) likely wouldn't even touch this in terms of speed. We will be testing this more in depth in the coming month. I am sort of jumping ahead of our team to research possible solutions, since this is something that worried us. Looks like it might work! Thanks, -Kevin On 7/23/13 1:47 PM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: Oh cool! I'm glad it at least seemed to work. Can you post your configuration of the field type and report from Solr's logs what maxLevels is used for this field, which is logged the first time you use the field type? Maybe there isn't a limit under 10B after all. Some quick'n'dirty calculations I just did indicate there shouldn't be a problem, but real-world usage will be a better proof. Indexing probably won't be terribly slow; queries could get pretty slow if the amount of indexed data is really high. I'd love to hear how it
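Putting David's advice into practice, the change is a single attribute. The following is a sketch based on the wiki page above, not a confirmed final configuration; the worldBounds values mirror the log output (±1e11):

    <fieldType name="geneticLocation"
               class="solr.SpatialRecursivePrefixTreeFieldType"
               multiValued="true"
               geo="false"
               worldBounds="-100000000000 -100000000000 100000000000 100000000000"
               distErrPct="0"
               maxDistErr="1"
               units="degrees" />

With maxDistErr=1 the quad tree only needs to resolve roughly 2e11 distinct positions per axis, rather than the enormous count implied by the default maxDistErr, bringing the required depth to around log2(2e11) ≈ 38 levels - comfortably under the ~50-level ceiling David mentions. Per the wiki, query rectangles should also be buffered slightly (e.g. extend each edge by 0.5) so that values sitting exactly on an edge are matched reliably.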