Debugging on Tika
Hi, I'm using Tika 0.10 for indexing my documents, but I am not getting the expected results when doing a search, even after I delete the index and start over. Some of the words in, for example, a PDF document can be found, but most of them cannot. Is it related to some language setting perhaps? How can I start debugging Tika? Any tips? Thx! -- Smartbit bvba Hoogstraat 13 B-3670 Meeuwen T: +32 11 64 08 80 F: +32 89 46 81 10 W: http://www.smartbit.be E: ark...@smartbit.be
Parallel indexing in Solr
Hi, this topic has probably been covered before, but I haven't had the luck to find the answer. We are running Solr instances with several cores inside, with Solr running out-of-the-box on top of Jetty. I believe Jetty receives all the HTTP requests about indexing new documents and forwards them to the Solr engine. What kind of parallelism does this setup provide? Can more than one index request get processed concurrently? How many? How do I increase the number of index requests that can be handled in parallel? Will I get better parallelism by running on another web container than Jetty, e.g. Tomcat? What is the recommended web container for high-performance production systems? Thanks! Regards, Per Steffensen
Re: error in indexing
Can someone help me? Leonardo -- View this message in context: http://lucene.472066.n3.nabble.com/error-in-indexing-tp3709686p3712495.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Debugging on Tika
Hi Arkadi, You can try to extract text from your documents using Tika's CLI (more details: http://tika.apache.org/0.7/gettingstarted.html). If that succeeds, it means something is going wrong during the indexing. Tika only extracts text and metadata from the documents and sends that text on to Lucene; Lucene itself constructs the index. You can inspect that index using Luke (http://code.google.com/p/luke/). Hope it helps. Oleg

On Fri, Feb 3, 2012 at 10:43 AM, Arkadi Colson ark...@smartbit.be wrote:
> Hi, I'm using Tika 0.10 for indexing my documents but I am not getting the expected results when doing a search. [...]
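To make Oleg's first step concrete, here is a sketch of driving Tika from the command line; the jar filename and input path are assumptions, so adjust them to your download:

```
# Extract plain text from a PDF with the standalone tika-app jar, then
# compare the output against what ends up searchable in the index.
java -jar tika-app-0.10.jar --text mydoc.pdf > extracted.txt
```

If words that are missing from your search results are also missing from extracted.txt, the problem is in extraction; otherwise look at the indexing/analysis side.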
Solr index update approach
hi, I have an opinion mining application running Solr that serves to retrieve documents and perform some analytics using facet queries. It works great, but I have a big issue. Each document has an attribute for opinion that is automatically detected, but users can change it if it's not correct. A document may be shared by several users, each user can change the opinion of the document, and the opinion may be different for each user. The opinion value is crucial here because it's the main facet field on the analytics view. The thing is that Solr does not handle doc updates; right now I need to delete the doc first and recreate the whole doc index entry to change it with the new metadata, and of course this is not fast enough. So I'm probably doing this the wrong way. It seems to me that this is not a good approach and I should not update the index this way; the index should be more static, otherwise I will be reindexing the whole index too often. I'm running Solr with a master/slave topology (2 slaves, replication): the master to write and the slaves to read. The Solr index is fed by a PostgreSQL database. I was wondering about using a NoSQL key-value database to store this kind of metadata and keep the index untouched. This way I could keep the index intact and store the users' custom data there. It would fit if this value were not used by facet queries; that's the problem. So my question is: what would be the best approach to handle this kind of use case with Solr? If it's not a usual use case, consider for example favorite docs. Favorite docs are probably a common use case in information retrieval. How do you handle, for example, favorite docs between users? I'd be very interested to hear about the best approach here. best Arian
Re: Debugging on Tika
> I'm using Tika 0.10 for indexing my documents but I am not getting the expected results when doing a search. Some of the words in for example a PDF document can be found but most of them not.

It could be the maxFieldLength setting in solrconfig.xml. Try setting it to <maxFieldLength>2147483647</maxFieldLength>.
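For context, a sketch of the relevant solrconfig.xml fragment; the enclosing element shown here follows the 3.x example config and may differ in your file:

```
<indexDefaults>
  <!-- The 3.x default of 10000 tokens per field silently truncates long
       extracted documents; raise the cap so the whole body is indexed. -->
  <maxFieldLength>2147483647</maxFieldLength>
</indexDefaults>
```

Note that a change here only affects documents indexed afterwards, so a reindex is needed for already-truncated docs.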
Re: Solr index update approach
Hello Arian, Please look into http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.html ; it can be useful for your purpose. If you need to count facets against an external field, you need to develop your own component; it shouldn't be a big deal. The relevant Solr nuts and bolts are:
http://lucidworks.lucidimagination.com/display/solr/Solr+Field+Types#SolrFieldTypes-WorkingwithExternalFiles
http://wiki.apache.org/solr/FunctionQuery
http://www.lucidimagination.com/blog/2009/07/06/ranges-over-functions-in-solr-14/
Regards

On Fri, Feb 3, 2012 at 3:39 PM, Listas Discussões lis...@arianpasquali.com wrote:
> hi, I have an opinion mining application running solr that serves to retrieve documents and perform some analytics using facet queries. [...]

-- Sincerely yours Mikhail Khludnev Lucid Certified Apache Lucene/Solr Developer Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Which patch 236 to choose for collapse - Solr 3.5
Prateesh: I'm not understanding here. I believe Tamanjit is correct. Your example works if and only if *all* the groups are returned, which happens in the example case but not in the general case. Try your experiment with rows=3 and you'll see (trunk, example). This search:

http://localhost:8983/solr/select?q=*:*&group=true&group.field=manu_exact&group.ngroups=true&rows=3

returns this (lots of stuff removed for clarity):

<response>
  <lst name="responseHeader">
    <lst name="params">
      <str name="group.field">manu_exact</str>
      <str name="group.ngroups">true</str>
      <str name="group">true</str>
      <str name="q">*:*</str>
      <str name="rows">3</str>
    </lst>
  </lst>
  <lst name="grouped">
    <lst name="manu_exact">
      <int name="matches">28</int>
      <int name="ngroups">13</int>
      <arr name="groups">
        <lst>
          <null name="groupValue"/>
          <result name="doclist" numFound="12" start="0"/>
        </lst>
        <lst>
          <str name="groupValue">Samsung Electronics Co. Ltd.</str>
          <result name="doclist" numFound="1" start="0"/>
        </lst>
        <lst>
          <str name="groupValue">Maxtor Corp.</str>
          <result name="doclist" numFound="1" start="0"/>
        </lst>
      </arr>
    </lst>
  </lst>
</response>

The sum of the numFound values is different from matches. Or perhaps I'm misunderstanding your example... Best, Erick

On Fri, Feb 3, 2012 at 12:46 AM, preetesh dubey dubeypreet...@gmail.com wrote:
> Nope! If you are doing grouping, then matches is always the total number of results and ngroups is the number of groups. Every group can have some docs belonging to it, which can be anything according to the provided group.limit parameter. If you take the sum of all the docs of each group, it's equivalent to matches. OK, you can do one experiment: execute a simple query in Solr which returns very few results.
>
> 1) Execute the query *without grouping* in the browser and check the XML/JSON response; it will show you the total number of result matches in numFound, e.g.:
>
> <result name="response" numFound="20" start="0">
>
> Let's say: a) numFound without grouping = 20
>
> 2) Now execute the same query *with grouping* parameters and look at the XML/JSON response in the browser. It will show you the results like this:
>
> <lst name="groupid">
>   <int name="matches">20</int>
>   <int name="ngroups">12</int>
>   <arr name="groups">
>     <lst><str name="groupValue">4362</str><result name="doclist" numFound="1" start="0">...</result></lst>
>     <lst><str name="groupValue">3170</str><result name="doclist" numFound="3" start="0">...</result></lst>
>     ...
>
> b) matches with groups = 20
>
> Now take the sum of the docs of every group: numFound=1, numFound=3, ... Let's say: c) sum of groups = 1 + 3 + ...
>
> You will find a == b == c at the end. Do that experiment and reply back; after doing the sum, compare a), b), c).
>
> On Fri, Feb 3, 2012 at 10:32 AM, tamanjit.bin...@yahoo.co.in wrote:
>> Ummm.. I think there is some confusion here. As per my understanding, matches is the total number of docs which the original query/filter query returned. On these docs grouping is done. So matches may not actually be equal to the total number of docs returned in your result post grouping; that's just a subset of the matches, divided into groups. Is my understanding correct?
>> -- View this message in context: http://lucene.472066.n3.nabble.com/Which-patch-236-to-choose-for-collapse-Solr-3-5-tp3697685p3712195.html Sent from the Solr - User mailing list archive at Nabble.com.
>
> -- Thanks Regards Preetesh Dubey
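To make the rows-truncation point concrete, here is a small self-contained sketch (the response dict mimics the wt=json shape of the grouped response; no live Solr needed) that sums the per-group numFound values and compares them to matches:

```python
# A grouped response as returned with rows=3: only 3 of the 13 groups
# come back, so summing doclist.numFound cannot reach "matches".
response = {
    "grouped": {
        "manu_exact": {
            "matches": 28,
            "ngroups": 13,
            "groups": [
                {"groupValue": None,
                 "doclist": {"numFound": 12, "start": 0}},
                {"groupValue": "Samsung Electronics Co. Ltd.",
                 "doclist": {"numFound": 1, "start": 0}},
                {"groupValue": "Maxtor Corp.",
                 "doclist": {"numFound": 1, "start": 0}},
            ],
        }
    }
}

field = response["grouped"]["manu_exact"]
sum_numfound = sum(g["doclist"]["numFound"] for g in field["groups"])
print(field["matches"], sum_numfound)  # 28 14
```

With rows large enough to return every group the two numbers agree, which is exactly why the earlier experiment looked conclusive.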
Re: error in indexing
Perhaps you could review: http://wiki.apache.org/solr/UsingMailingLists You really haven't shown us what it is that you're doing that generates this error; about all you've said is that it doesn't work. I'd start by trying to index a document with only the required fields for your particular schema (i.e., fields in schema.xml where required="true") and build up from there. Many people use SolrJ to index docs, so I'd assume it's something in your setup, which you haven't shown us. Best, Erick

On Fri, Feb 3, 2012 at 4:05 AM, leonardo2 leonardo.rigut...@gmail.com wrote:
> Can someone help me? [...]
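As a sketch of that first step, assuming the stock example schema where only the id field is required="true" (field name and URL are assumptions), a minimal add over plain HTTP looks like:

```
curl 'http://localhost:8983/solr/update?commit=true' \
  -H 'Content-Type: text/xml' \
  --data-binary '<add><doc><field name="id">test-1</field></doc></add>'
```

If that succeeds, add your remaining fields back one at a time until the error reappears; the last field added is the one to investigate.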
Re: Which patch 236 to choose for collapse - Solr 3.5
Erick, yes, you are correct. But with that example I only wanted to explain to Tamanjit that matches in the Solr response counts all docs which matched the grouped query. Tamanjit, if you want the counts of docs of only the first page according to the rows parameter, then the only way is the one you mentioned: iterate and count. There was a small misunderstanding between us; I thought Tamanjit wanted all matched docs, but I think he wanted to know the docs matched on the first page according to the rows parameter.

On Fri, Feb 3, 2012 at 7:32 PM, Erick Erickson erickerick...@gmail.com wrote:
> Prateesh: I'm not understanding here. I believe Tamanjit is correct. Your example works if and only if *all* the groups are returned, which happens in the example case but not in the general case. [...]

-- Thanks Regards Preetesh Dubey
Re: Parallel indexing in Solr
Unfortunately, the answer is "it depends" (tm). First question: how are you indexing things? SolrJ? post.jar? But some observations:

1) Sure, using multiple cores will give some parallelism. So will using a single core together with something like SolrJ and StreamingUpdateSolrServer, especially with trunk (4.0) and the DocumentsWriterPerThread stuff. In 3.x, you'll see some pauses when segments are merged that you can't get around (per core). See http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/ for an excellent writeup. But whether or not you use several cores should be determined by your problem space, certainly not by trying to increase throughput. Indexing usually takes a back seat to search performance.

2) General settings are hard to come by. If you're sending structured documents that use Tika to parse the data behind the scenes, your performance will be much different (slower) than sending SolrInputDocuments (SolrJ).

3) The recommended servlet container is, generally, "the one you're most comfortable with". Tomcat is certainly popular. That said, use whatever you're most comfortable with until you see a performance problem. Odds are you'll find Solr is at its load limit before your servlet container has problems.

4) Monitor your CPU and fire more requests at it until it hits 100%. Note that there are occasions where the servlet container limits the number of outstanding requests it will allow and queues the ones over that limit (find the magic setting to increase this if it's a problem; it differs by container). If you start to see your response times lengthen with the CPU not fully utilized, that may be the cause.

5) How high is "high performance"? On a stock Solr with the Wikipedia dump (11M docs), all running on my laptop, I see 7K docs/sec indexed. I know of installations that see 60 docs/sec or even less; I'm sending simple docs with SolrJ locally and they're sending huge documents over the wire that Tika handles.

There are just so many variables that it's hard to say anything except "try it and see". Best, Erick

On Fri, Feb 3, 2012 at 3:55 AM, Per Steffensen st...@designware.dk wrote:
> Hi, this topic has probably been covered before, but I haven't had the luck to find the answer. [...]
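As a rough client-side sketch of the parallelism idea above (this is not Solr's API: send_update here is a stub standing in for an HTTP POST to /solr/update, so the snippet runs without a server), firing index requests concurrently looks like:

```python
# Sketch: four concurrent "index requests", analogous to what SolrJ's
# StreamingUpdateSolrServer does with a thread pool on the client side.
from concurrent.futures import ThreadPoolExecutor

def send_update(batch):
    # In a real indexer this would POST the docs to /solr/update;
    # here it just returns the batch size to keep the sketch self-contained.
    return len(batch)

docs = [{"id": str(i)} for i in range(100)]
batches = [docs[i:i + 10] for i in range(0, len(docs), 10)]

with ThreadPoolExecutor(max_workers=4) as pool:
    indexed = sum(pool.map(send_update, batches))
print(indexed)  # 100
```

The server still has to keep up, of course; client-side threads only help if Solr (and the servlet container's request limit) can absorb the concurrent requests.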
Re: Solr index update approach
hi Mikhail, external fields were one of the options, but I was not 100% sure it would fit. I will study this option some more. Thank you so much for your reply. Arian

2012/2/3 Mikhail Khludnev mkhlud...@griddynamics.com:
> Hello Arian, Please look into http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.html ; it can be useful for your purpose. [...]

-- Arian Pasquali FEUP researcher twitter: @arianpasquali www.arianpasquali.com
Re: Shard timeouts on large (1B docs) Solr cluster
timeAllowed can be used outside distributed search. It is used by the TimeLimitingCollector. When the search time reaches timeAllowed, it will stop searching and return the results it could find until then. This can be a problem when using incremental indexing: Lucene starts searching from the bottom of the index and new docs are inserted at the top, so timeAllowed could cause new docs to never appear in the search results. -- View this message in context: http://lucene.472066.n3.nabble.com/Shard-timeouts-on-large-1B-docs-Solr-cluster-tp3691229p3713263.html Sent from the Solr - User mailing list archive at Nabble.com.
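For reference, timeAllowed is passed in milliseconds as a plain request parameter (the host and query here are assumptions):

```
curl 'http://localhost:8983/solr/select?q=*:*&timeAllowed=1000'
```

Results returned under a hit time limit are partial, so counts and facets from such a response should be treated as lower bounds.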
Re: Solr index update approach
> external fields were one of the options, but I was not 100% sure it would fit. I will study this option some more.

I was wondering if Lucene's ToChildBlockJoinQuery and/or ToParentBlockJoinQuery can be a replacement for ExternalFileField: http://www.searchworkings.org/blog/-/blogs/tochildblockjoinquery-in-lucene Also, what are the similarities and differences from Solr's join QueryParser? http://wiki.apache.org/solr/Join
Zero Matches Weirdness
Hi! I am having a weird issue with a search string not producing a match where it should. I can reproduce it with both 3.4 and 3.5. "Where it should" means that I am getting a hit in the Analysis tool in the admin panel, but not in a query via /select. When I try select?q=Am+Heidstamm... I get zero results back. But when I quote the string, select?q=%22Am+Heidstamm%22..., I get several hits. BTW, the token "am" is filtered out in the field "text", since it's in a stopword list. Any ideas on how this can be explained? My defaultSearchField is "text". The field gets its content via several copyField statements. The configuration for "text" is as follows:

<field name="text" type="text_de" indexed="true" stored="false" multiValued="true"/>

The configuration for type text_de is this:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- protect slashes from tokenizer by replacing with something unique -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([A-Z]+)/([0-9]+)/([0-9]+)" replacement="$1ḧ$2ḧ$3"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([0-9]+)/([0-9]+)" replacement="$1ḧ$2"/>
    <!-- protect paragraph symbol from tokenizer -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="§\s*([0-9]+)" replacement="ǚ$1"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.GermanMinimalStemFilterFactory"/>
    <!-- get slashes back in -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="ḧ" replacement="/"/>
    <!-- get paragraph symbols back in -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="ǚ" replacement="§"/>
  </analyzer>
</fieldType>

Log output for the unquoted phrase:

INFO: [] webapp=/solr path=/select params={facet=true&sort=score+desc&fl=sitzung,gremium,betreff,datum,timestamp,score,aktenzeichen,typ,id,anhang&debugQuery=true&start=0&q=Am+Heidstamm&hl.fl=betreff&wt=json&fq=&hl=true&rows=10} hits=0 status=0 QTime=29

... and for the quoted one:

INFO: [] webapp=/solr path=/select params={facet=true&sort=score+desc&fl=sitzung,gremium,betreff,datum,timestamp,score,aktenzeichen,typ,id,anhang&start=0&q=%22Am+Heidstamm%22&hl.fl=betreff&wt=standard&fq=&hl=true&rows=10&version=2.2} hits=14 status=0 QTime=244

Thanks!
Re: SolrCloud war?
UPDATE: I set my app server's [1] system property jetty.port to be equal to the app server's open port and was able to get two Solr shards to talk. The overall properties I set are:

App server domain 1: bootstrap_confdir, collection.configName, jetty.port, solr.solr.home, zkRun
App server domain 2: bootstrap_confdir, collection.configName, jetty.port, solr.solr.home, zkHost

I deployed each war app into the /solr context; I presume it's needed for remote URL addressing. I checked the zookeeper config page and it shows both shards. Awesome. [1] Glassfish 3.1.1

On 02/01/2012 08:50 PM, Mark Miller wrote:
> I have not yet tried to run SolrCloud in another app server, but it shouldn't be a problem. One issue you might have is the fact that we count on hostPort coming from the system property jetty.port. This is set in the default solr.xml; the hostPort defaults to jetty.port. You probably want to explicitly pass -DhostPort= if you are not going to use jetty.port. - Mark Miller lucidimagination.com
>
> On Feb 1, 2012, at 2:44 PM, Darren Govoni wrote:
>> Hi, I'm trying to get the SolrCloud2 examples to work using a war-deployed Solr in Glassfish. The startup properties must be different in this case, because it's having trouble connecting to zookeeper when I deploy the solr war file. Perhaps the embedded zookeeper has trouble running in an app server? Any tips appreciated! Darren
>>
>> On 01/30/2012 06:58 PM, Darren Govoni wrote:
>>> Hi, Is there any issue with running the new SolrCloud deployed as a war in another app server? Has anyone tried this yet? thanks.
Re: Zero Matches Weirdness
What about the query side of the field?

On Fri, Feb 3, 2012 at 6:11 PM, Marian Steinbach mar...@sendung.de wrote:
> Hi! I am having a weird issue with a search string not producing a match where it should. I can reproduce it with both 3.4 and 3.5. [...]

-- Regards, Dmitry Kan
Re: SolrCloud - issues running with embedded zookeeper ensemble
Hi Mark, Thanks for looking into the issue. As for specifying the bootstrap dir for each instance with ZK, it was just a typo on my side. I went back and looked at my script on the second and third nodes and it did not have the bootstrap dir, so I had specified it for only the very FIRST node that registers with ZK.

2. java -DzkRun=ec2-compute-2.amazonaws.com:9983 -Dsolr.solr.home=/home/ec2-user/solrcloud/example/solr -DzkHost=ec2-compute-1.amazonaws.com:9983,ec2-compute-2.amazonaws.com:9983,ec2-compute-3.amazonaws.com:9983 -DnumShards=2 -jar start.jar

Thanks! Dipti

On 2/2/12 8:27 PM, Mark Miller markrmil...@gmail.com wrote:
> Thanks Dipti! One thing that seems off is that you are passing the bootstrap_confdir param on each instance? Other than that, though, the problem you are seeing is indeed a bug, though hidden if using localhost. I'll fix it here: https://issues.apache.org/jira/browse/SOLR-3091 Again, thanks for the detailed report. - mark
>
> On Feb 2, 2012, at 4:44 PM, Dipti Srivastava wrote:
>> Hi Mark, I am trying to set up on 4 AMIs, where 3 of the instances will have the embedded ZK running. Here are the startup commands for all 4. Note that on the 4th instance I do not have the ZK host and bootstrap conf dir specified; the 4th instance throws an exception (earlier in this email chain) at startup. Ideally, I should not have to specify the host for -DzkRun since it is the localhost, but without that I get the exception as well.
>>
>> 1. java -DzkRun=ec2-compute-1.amazonaws.com:9983 -Dsolr.solr.home=/home/ec2-user/solrcloud/example/solr -Dbootstrap_confdir=/home/ec2-user/solrcloud//example/solr/conf -DzkHost=ec2-compute-1.amazonaws.com:9983,ec2-compute-2.amazonaws.com:9983,ec2-compute-3.amazonaws.com:9983 -DnumShards=2 -jar start.jar
>>
>> 2. java -DzkRun=ec2-compute-2.amazonaws.com:9983 -Dsolr.solr.home=/home/ec2-user/solrcloud/example/solr -Dbootstrap_confdir=/home/ec2-user/solrcloud//example/solr/conf -DzkHost=ec2-compute-1.amazonaws.com:9983,ec2-compute-2.amazonaws.com:9983,ec2-compute-3.amazonaws.com:9983 -DnumShards=2 -jar start.jar
>>
>> 3. java -DzkRun=ec2-compute-3.amazonaws.com:9983 -Dsolr.solr.home=/home/ec2-user/solrcloud/example/solr -Dbootstrap_confdir=/home/ec2-user/solrcloud//example/solr/conf -DzkHost=ec2-compute-1.amazonaws.com:9983,ec2-compute-2.amazonaws.com:9983,ec2-compute-3.amazonaws.com:9983 -DnumShards=2 -jar start.jar
>>
>> 4. java -Dsolr.solr.home=/home/ec2-user/solrcloud/example/solr -DzkHost=ec2-compute-1.amazonaws.com:9983,ec2-compute-2.amazonaws.com:9983,ec2-compute-3.amazonaws.com:9983 -DnumShards=2 -jar start.jar
>>
>> Thanks, Dipti
>>
>> On 1/31/12 10:18 AM, Mark Miller markrmil...@gmail.com wrote:
>>> Hey Dipti - Can you give the exact startup cmds you are using for each of the instances? I have got Example C going, so I'll have to try and dig into whatever you are seeing. - mark
>>>
>>> On Jan 27, 2012, at 12:53 PM, Dipti Srivastava wrote:
>>>> Hi Mark, Did you get a chance to look into the issues with running the embedded Zookeeper ensemble, as per Example C, from http://wiki.apache.org/solr/SolrCloud2 ? Hi All, Did anyone else run multiple shards with an embedded zk ensemble successfully? If so, I would like some tips on any issues that you came across. Regards, Dipti
>>>>
>>>> From: diptis dipti.srivast...@apollogrp.edu Date: Fri, 23 Dec 2011 10:32:52 -0700 To: markrmil...@gmail.com Subject: Re: Release build or code for SolrCloud
>>>> Hi Mark, There is some issue with specifying localhost vs actual host names for zk. When I changed my script to specify the actual hostname (which should be local by default), the first, 2nd and 3rd instances came up, which have the embedded zk running. Now I am getting the same exception for the 4th AMI, which is NOT part of the zookeeper ensemble. I want to run zk on only 3 of the 4 instances.
>>>>
>>>> java -Dbootstrap_confdir=./solr/conf -DzkRun=ami-1:9983 -DzkHost=ami-1:9983,ami-2:9983,ami-3:9983 -DnumShards=2 -jar start.jar
>>>>
>>>> Dipti
>>>>
>>>> From: Mark Miller markrmil...@gmail.com Date: Fri, 23 Dec 2011 09:34:52 -0700 To: diptis dipti.srivast...@apollogrp.edu Subject: Re: Release build or code for SolrCloud
>>>> I'm having trouble getting a quorum up using the built-in SolrZkServer as well, so I have not been able to replicate this; I'll have to keep digging. Not sure if it's due to a ZooKeeper update or what yet.
>>>>
>>>> 2011/12/21 Dipti Srivastava dipti.srivast...@apollogrp.edu:
>>>>> Hi Mark, Thanks! So now I am deploying a 4-node cluster on AMIs, and the main instance that bootstraps the config to the zookeeper does not come up; I get an exception as follows. My solrcloud.sh looks like:
>>>>>
>>>>> #!/usr/bin/env bash
>>>>> cd ..
>>>>> rm -r -f example/solr/zoo_data
>>>>> rm -f example/example.log
>>>>> cd example
>>>>> #java -DzkRun -DnumShards=2 -DSTOP.PORT=7983 -DSTOP.KEY=key -jar start.jar 1>example.log 2>&1
>>>>> java -Dbootstrap_confdir=./solr/conf -DzkRun
Re: Zero Matches Weirdness
2012/2/3 Dmitry Kan dmitry@gmail.com: What about query side of the field? It's identical. At least that's what I think, since I didn't specify the type="query" or type="index" attribute for the analyzer part. Marian
Re: Zero Matches Weirdness
Actually, I wouldn't count on it, and would just specify the index and query sides explicitly. Just to play it safe. On Fri, Feb 3, 2012 at 8:34 PM, Marian Steinbach mar...@sendung.de wrote: 2012/2/3 Dmitry Kan dmitry@gmail.com: What about query side of the field? It's identical. At least that's what I think, since I didn't specify the type="query" or type="index" attribute for the analyzer part. Marian -- Regards, Dmitry Kan
Re: Zero Matches Weirdness
No, don't do that. That's definitely not good advice. If the analysis chain is the same for both index and query, just use a single <analyzer>. As for Marian's issue... was there literally a + in the query or was that urlencoded? Try debugQuery=true for both queries and see what you get for the query parsing output. Erik On Feb 3, 2012, at 14:18, Dmitry Kan wrote: Actually, I wouldn't count on it, and would just specify the index and query sides explicitly. Just to play it safe. On Fri, Feb 3, 2012 at 8:34 PM, Marian Steinbach mar...@sendung.de wrote: 2012/2/3 Dmitry Kan dmitry@gmail.com: What about query side of the field? It's identical. At least that's what I think, since I didn't specify the type="query" or type="index" attribute for the analyzer part. Marian -- Regards, Dmitry Kan
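[Editor's note: Erik's point can be illustrated with a schema.xml sketch. Field type names here are made up for illustration; the filter classes are standard Solr factories.]

```xml
<!-- A single <analyzer> applies the same chain at both index and query time: -->
<fieldType name="text_shared" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Equivalent behavior with both sides spelled out explicitly; only
     necessary when the two chains actually need to differ: -->
<fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

When only one analyzer element with no type attribute is present, Solr uses it for both sides, so the two forms above should behave identically.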
Another zero match issue
Hi everyone! I'm also having some zero match weirdness. When I execute this search:

?q=Create+a+self+contained+Part+Module&defType=edismax&qf=location^0.9+text^0.8+fileName^8.0+title^4.0

I get ZERO results. If I remove the fileName qf parameter (an indexed but not stored field), I get 5 hits.

?q=Create+a+self+contained+Part+Module&defType=edismax&qf=location^0.9+text^0.8+title^4.0

Putting quotes around the original query returns the hit, but that shouldn't be required, I would think. Also, removing part of the query text gives the intended results(!):

?q=contained+Part+Module&defType=edismax&qf=location^0.9+text^0.8+fileName^8.0+title^4.0

These search parameters haven't seemed to be a problem until this example. Other searches with the same parameters return their intended results. What are some things I should be looking at? Thanks in advance! Debug info:

<str name="rawquerystring">Create a self contained Part Module</str>
<str name="querystring">Create a self contained Part Module</str>
<str name="parsedquery">+((DisjunctionMaxQuery((fileName:Create^8.0 | title:creat^4.0 | text:creat^0.8 | location:creat^0.9)) DisjunctionMaxQuery((fileName:a^8.0)) DisjunctionMaxQuery((fileName:self^8.0 | title:self^4.0 | text:self^0.8 | location:self^0.9)) DisjunctionMaxQuery((fileName:contained^8.0 | title:contain^4.0 | text:contain^0.8 | location:contain^0.9)) DisjunctionMaxQuery((fileName:Part^8.0 | title:part^4.0 | text:part^0.8 | location:part^0.9)) DisjunctionMaxQuery((fileName:Module^8.0 | title:modul^4.0 | text:modul^0.8 | location:modul^0.9)))~6)</str>
<str name="parsedquery_toString">+(((fileName:Create^8.0 | title:creat^4.0 | text:creat^0.8 | location:creat^0.9) (fileName:a^8.0) (fileName:self^8.0 | title:self^4.0 | text:self^0.8 | location:self^0.9) (fileName:contained^8.0 | title:contain^4.0 | text:contain^0.8 | location:contain^0.9) (fileName:Part^8.0 | title:part^4.0 | text:part^0.8 | location:part^0.9) (fileName:Module^8.0 | title:modul^4.0 | text:modul^0.8 | location:modul^0.9))~6)</str>
Setting solrj server connection timeout
Is the following a reasonable approach to setting a connection timeout with SolrJ?

queryCore.getHttpClient().getHttpConnectionManager().getParams().setConnectionTimeout(15000);

Right now I have all my solr server objects sharing a single HttpClient that gets created using the multithreaded connection manager, where I set the timeout for all of them. Now I will be letting each server object create its own HttpClient object, and using the above statement to set the timeout on each one individually. It'll use up a bunch more memory, as there are 56 server objects, but maybe it'll work better. The total of 56 objects comes about from 7 shards, a build core and a live core per shard, two complete index chains, and for each of those, one server object for access to CoreAdmin and another for the index.

The impetus for this, as it's possible I'm stating an XY problem: currently I have an occasional problem where SolrJ connections throw an exception. When it happens, nothing is logged in Solr. My code is smart enough to notice the problem, send an email alert, and simply try again at the top of the next minute. The simple explanation is that this is a Linux networking problem, but I never had any problem like this when I was using Perl with LWP to keep my index up to date. I sent a message to the list some time ago on this exception, but I never got a response that helped me figure it out.

Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:480)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246)
    at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
    at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:276)
    at com.newscom.idxbuild.solr.Core.getCount(Core.java:325)
    ... 3 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:168)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
    at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
    at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
    at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
    at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)
    ... 7 more

Thanks, Shawn
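[Editor's note: one detail worth separating here is that a *connection* timeout only bounds the TCP connect attempt, while a *socket* (read) timeout bounds waiting for a response - and the `Connection reset` in Shawn's trace happens during the read phase. The distinction can be demonstrated with plain JDK sockets; this sketch is self-contained and uses 192.0.2.1 (a reserved TEST-NET-1 address, RFC 5737) so the connect attempt is expected to fail or time out rather than succeed.]

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectTimeoutDemo {
    public static void main(String[] args) {
        long start = System.nanoTime();
        try (Socket s = new Socket()) {
            // The 500 ms argument bounds ONLY the connection handshake.
            // A read timeout would be set separately via s.setSoTimeout(...).
            s.connect(new InetSocketAddress("192.0.2.1", 80), 500);
            System.out.println("connected (unexpected for a TEST-NET address)");
        } catch (IOException e) {
            long ms = (System.nanoTime() - start) / 1_000_000;
            // Either SocketTimeoutException (timed out) or another
            // IOException (e.g. unreachable), but bounded either way.
            System.out.println("connect ended after ~" + ms + " ms: "
                    + e.getClass().getSimpleName());
        }
    }
}
```

With SolrJ 3.x specifically, CommonsHttpSolrServer also exposes setConnectionTimeout(int) and setSoTimeout(int) directly, which is usually simpler than reaching into the HttpConnectionManager params as in the snippet above; setting both is what actually bounds hung reads.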
Re: Zero Matches Weirdness
2012/2/3 Erik Hatcher erik.hatc...@gmail.com: As for Marian's issue... was there literally a + in the query or was that urlencoded? Try debugQuery=true for both queries and see what you get for the query parsing output.

I tested both + and %20, with and without quotes; it doesn't make a difference whether I use + or %20. Here is the debug output for the unquoted version (zero hits):

"debug": {
  "rawquerystring": "Am Heidstamm",
  "querystring": "Am Heidstamm",
  "parsedquery": "+((DisjunctionMaxQuery((aktenzeichen:Am^10.0)) DisjunctionMaxQuery((text:heidstamm^0.1 | betreff:heidstamm^3.0 | aktenzeichen:Heidstamm^10.0)))~2)",
  "parsedquery_toString": "+(((aktenzeichen:Am^10.0) (text:heidstamm^0.1 | betreff:heidstamm^3.0 | aktenzeichen:Heidstamm^10.0))~2)",
  "QParser": "ExtendedDismaxQParser",
}

And for the quoted version (with hits):

{
  "rawquerystring": "\"Am Heidstamm\"",
  "querystring": "\"Am Heidstamm\"",
  "parsedquery": "+DisjunctionMaxQuery((text:heidstamm^0.1 | betreff:heidstamm^3.0 | aktenzeichen:\"Am Heidstamm\"^10.0))",
  "parsedquery_toString": "+(text:heidstamm^0.1 | betreff:heidstamm^3.0 | aktenzeichen:\"Am Heidstamm\"^10.0)",
  "explain": { },
  "QParser": "ExtendedDismaxQParser",
}

As it seems to me, the +(((aktenzeichen:Am^10.0) (text:heidstamm^0.1 | betreff:heidstamm^3.0 | aktenzeichen:Heidstamm^10.0))~2) condition cannot be fulfilled. I have AND as the default operator. The term (aktenzeichen:Am^10.0) cannot be satisfied. The thing is: why does it even appear there?

This is my current qf: betreff^5.0 aktenzeichen^10.0 body^0.2 text^0.1

I have just changed this to only text^0.1 for the sake of testing, and then it works. It seems as if I haven't quite understood the impact of qf. I thought it would allow me to boost the score based on a string appearing in a field. I didn't expect it to affect what matches and what doesn't. Marian
Re: Zero Matches Weirdness
I just got rid of the one field (aktenzeichen) that never matched in the qf string. Now it works fine. Solved for now. Thanks!
Re: Another zero match issue
: ?q=Create+a+self+contained+Part+Module&defType=edismax&qf=location^0.9+text^0.8+fileName^8.0+title^4.0
:
: I get ZERO results.
:
: If I remove the fileName qf parameter (an indexed but not stored field), I get 5 hits.

lemme guess: fileName doesn't use stopwords but the other fields do, correct? you're getting zero matches because you've told dismax that every clause must match something, and "a" is a clause in your query that gets ignored for every field that uses stopwords, but for fields that don't use stopwords (like fileName) it is kept around, and you get no matches for the whole query unless that clause gets a match. http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/ http://www.lucidimagination.com/search/document/ca18cbded00bdc79#6a30d2ed7914a4d9 https://issues.apache.org/jira/browse/SOLR-3085 ...if you google for "dismax stopwords" you'll find lots of discussion on how/why this happens. in general you really need to think carefully about the fields you put in your qf param, and make sure their query analyzers play nicely with each other in these multi-term query situations. -Hoss
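[Editor's note: the fix Hoss implies is usually to make the stopword handling consistent across every field in qf. A sketch of what that looks like in schema.xml - the fieldType name is made up, and stopwords.txt is assumed to be the same list the other qf fields already use:]

```xml
<fieldType name="text_filename" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Same stopword list as the other qf fields, so a multi-term
         dismax query produces the same number of clauses per field
         and "a" never survives as a mandatory clause. -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The alternative direction - removing stopwords from all qf fields - works just as well; what breaks dismax's mm accounting is the mismatch, not the stopwords themselves.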
Re: Zero Matches Weirdness
Ok, thanks, Erik, good to know. Sorry for the confusion. On Fri, Feb 3, 2012 at 9:42 PM, Erik Hatcher erik.hatc...@gmail.com wrote: No, don't do that. That's definitely not good advice. If the analysis chain is the same for both index and query, just use a single <analyzer>. As for Marian's issue... was there literally a + in the query or was that urlencoded? Try debugQuery=true for both queries and see what you get for the query parsing output. Erik On Feb 3, 2012, at 14:18, Dmitry Kan wrote: Actually, I wouldn't count on it, and would just specify the index and query sides explicitly. Just to play it safe. On Fri, Feb 3, 2012 at 8:34 PM, Marian Steinbach mar...@sendung.de wrote: 2012/2/3 Dmitry Kan dmitry@gmail.com: What about query side of the field? It's identical. At least that's what I think, since I didn't specify the type="query" or type="index" attribute for the analyzer part. Marian -- Regards, Dmitry Kan -- Regards, Dmitry Kan
Re: SolrCloud war?
On Feb 3, 2012, at 1:04 PM, Darren Govoni wrote: I deployed each war app into the /solr context. I presume its needed by remote URL addressing. Yup - but you can override this by setting the hostContext in solr.xml. It defaults to solr as that fits the example jetty distribution. - Mark Miller lucidimagination.com
Re: error in indexing
: Subject: Re: error in indexing

FWIW: it's really crucial to state which version of Solr you are using when you have bugs with error stack traces like this -- going back through the versions i'm *guessing* that you are using Solr 1.4.1 (or possibly older), correct? Based on that assumption (and the stack trace) i *think* your problem is that somehow you are adding a field to your documents where the *name* of the field is null ... but unless you left something out of the java code you posted i'm not really sure how that would be possible. are you sure you don't have any other code adding fields to these SolrInputDocuments?

: output_documents = new ArrayList<SolrInputDocument>();
: while () {
:   sdoc = new SolrInputDocument();
:   sdoc.setField("id", idb);
:   sdoc.setField("file_id", id);
:   sdoc.addField("box_text", zone.Text);
:   final Iterator<WPWord> it_on_words = zone.Words.iterator();
:   while (it_on_words.hasNext()) {
:     final WPWord word = it_on_words.next();
:     final String word_box = word.boxesToString();
:     final String word_payload = word.Text + "|" + word_box;
:     sdoc.addField("word", word_payload);
:   }

-Hoss
ReversedWildcardFilterFactory and PorterStemFilterFactory
I'd like to use both the ReversedWildcardFilterFactory and the PorterStemFilterFactory on a text field that I have, but I'd like to avoid stemming the reversed tokens and also avoid reversing the stemmed tokens. My original thought was to put the ReversedWildcardFilterFactory higher in the chain, but what would the stemmer do then? Would it attempt to stem the reversed tokens, or are they ignored? What is the best way to achieve the result I am looking for in a single field? Again, the goal is: text comes in and is both reversed and stemmed, but I don't want the stemmed tokens reversed and I don't want the reversed tokens stemmed. Is this possible?
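[Editor's note: one common way to keep the two effects from interfering, at the cost of a second field rather than the single field asked for, is two fieldTypes plus a copyField. The names below are made up for illustration; wildcard queries would then target the reversed field and ordinary queries the stemmed one. Note ReversedWildcardFilterFactory belongs only on the index-side analyzer - the query parser detects its presence on the field and reverses leading-wildcard queries itself.]

```xml
<fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- withOriginal="true" keeps the forward token alongside the
         reversed one, so trailing-wildcard queries still work. -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="body" type="text_stemmed" indexed="true" stored="true"/>
<field name="body_rev" type="text_rev" indexed="true" stored="false"/>
<copyField source="body" dest="body_rev"/>
```

In a single field, stacking both filters would not give this separation: whichever filter runs second sees the other's output, so the reversed tokens would be fed to the stemmer (or vice versa).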
Re: frange with multi-valued fields
: Has anyone had experience using frange with multi-valued fields? In : solr 3.5 doing so results in the error: can not use FieldCache on : multivalued field

correct.

: Here's the use case. We have multiple years attached to each document : and want to be able to refine by a year range. We're currently using : the standard range query syntax [ 1900 TO 1910 ] which works, but those : queries are slower than we would like. I've seen reports that using : frange can greatly improve performance. : http://solr.pl/en/2011/05/30/quick-look-frange/

note that there is a mistake in the "Faster implementation" column of the performance table in that article .. the actual data (and the paragraph after the table) indicate that standard range query is faster only for queries that cover a small number of terms from the given field. Yonik got similar results when he did testing on range queries over strings, but the specifics on where the cut-off point was were slightly different... https://yonik.wordpress.com/2009/07/06/ranges-over-functions-in-solr-1-4/

In general you'd have to test it, but for things like years, unless you are dealing with really big spans of time (ie: [1901 TO 20]) and will have ranges that are generally large relative to the total span of data you are dealing with, i seriously doubt frange would be much faster for you even if you had a single-valued field -- and the bottom line is frange won't work with multivalued fields.

forget about frange for a moment, and tell us more about your specific situation. to start with: what field configuration are you using right now for your year field? specifically, are you using TrieIntField? have you tried tuning the options on it? how many unique year values are in your corpus? how big do your ranges usually get?

https://people.apache.org/~hossman/#xyproblem

Your question appears to be an XY Problem ... that is: you are dealing with X, you are assuming Y will help you, and you are asking about Y without giving more details about the X so that we can understand the full issue. Perhaps the best solution doesn't involve Y at all? See Also: http://www.perlmonks.org/index.pl?node_id=542341

-Hoss
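[Editor's note: "tuning the options" on a TrieIntField usually means adjusting precisionStep. A schema sketch - field and type names are hypothetical:]

```xml
<!-- A smaller precisionStep indexes more terms per value but lets a
     numeric range query be answered by visiting fewer terms, which is
     typically what speeds up range refinement on a year field. -->
<fieldType name="tint_year" class="solr.TrieIntField" precisionStep="4"
           omitNorms="true" positionIncrementGap="0"/>
<field name="year" type="tint_year" indexed="true" stored="true" multiValued="true"/>
```

Unlike frange, ordinary range queries over a Trie field work fine on multiValued fields, so this keeps the existing [1900 TO 1910] syntax.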
Re: multiple index analyzer chains on a field
Looking closer I think I asked the wrong question, please disregard and I will start a new chain with that question On Friday, February 3, 2012, Jamie Johnson jej2...@gmail.com wrote: Is it possible to have multiple index analysis chains on a single field?
Performance degradation with distributed search
Hello, I am experimenting with solr distributed search/random sharding (we currently use geo sharding), hoping to gain some performance and also scalability in the future (the index size keeps growing and geo shards are hard to scale). However I'm seeing worse performance with distributed search, on a testing server of 6 shards, 15-core CPU, 24G mem; the index size is about 8G on each shard. With geo sharding it can easily take a 150 QPS load with good response time. Now with distributed search there are timeouts, and the average response time also increases. This is probably no big surprise, since I'm using the same number of shards plus the overhead of distributed search/merge/http network etc. When I looked into the details (slow queries), I found some real issues that I need help with. For example, a query which takes 200ms with geo sharding now times out (2000ms) with distributed search. And each shard query (isShard=true) takes about 1200ms. But if I run the query against the shard only (without distributed search), it only takes 200ms. So I compared the two query URLs; the only difference is that the shard query issued by distributed search has fsv=true. I understand field sort values are needed during the merge process, but I didn't expect that to make this much difference in performance, although we do have a lot of sort orders (about 20 different sort orders). Any suggestion/comment on the performance problem I'm having with distributed search? Is distributed search the right choice for me? What other setup/ideas can I try? thanks, XJ