Re: KeywordTokenizerFactory - trouble with exact matches
Tried the following config for setting autoGeneratePhraseQueries, but it didn't seem to change anything. Tested both true and false.

<fieldType name="keyword" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

Still I do not get any matches when searching for FE 009 without quotes. I set debugQuery to on and this is what it shows. Definitely looks like it does this MultiPhraseQuery thing.

<lst name="debug">
  <str name="rawquerystring">FE 009</str>
  <str name="querystring">FE 009</str>
  <str name="parsedquery">(+(DisjunctionMaxQuery((number:FE)) DisjunctionMaxQuery((number:009))))/no_coord</str>
  <str name="parsedquery_toString">+((number:FE) (number:009))</str>
  <lst name="explain"/>
  <str name="QParser">ExtendedDismaxQParser</str>
</lst>

I also looked into these query parsers, but it looks like the splitting on whitespace is something that is done by the dismax query parser before the terms are passed to any analyzers. And it is vital to me that I can differentiate this on a per-field basis.

*Aleksander Akerø* Systemkonsulent Mobil: 944 89 054 E-post: aleksan...@gurusoft.no *Gurusoft AS* Telefon: 92 44 09 99 Østre Kullerød www.gurusoft.no

2014-01-29 Aleksander Akerø aleksan...@gurusoft.no:

Thanks a lot, I'll try the autoGeneratePhraseQueries property and see how that works. Regarding the reindexing tip, it's a good one, but due to my current on-the-fly setup on the servers at work I basically have to build a project with Maven and deploy to Tomcat, wherein the index lies, and I therefore have to reindex each time, otherwise the index would be empty. Also, I usually use the clean parameter when testing with DIH. So that shouldn't be a problem.

*Aleksander Akerø* Systemkonsulent Mobil: 944 89 054 E-post: aleksan...@gurusoft.no *Gurusoft AS* Telefon: 92 44 09 99 Østre Kullerød www.gurusoft.no

2014-01-29 Alexandre Rafalovitch arafa...@gmail.com:

I think the whitespace might also be the issue. The query gets parsed by a standard component that splits it on space before passing individual components into the field searches. Try enabling autoGeneratePhraseQueries on the field (or field type) and reindexing. See if that makes a difference.

Regards, Alex.
Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Wed, Jan 29, 2014 at 9:55 PM, Aleksander Akerø aleksan...@gurusoft.no wrote:

Update: guessing that this has nothing to do with the tokenizer. Tried to use the string fieldtype as well, but still the same results. So this must have to do with some other Solr config. What confuses me is that when I search for 1005, which is another valid value to search for, it works perfectly - but then again, this query contains no whitespace. Any ideas?
*Aleksander Akerø* Systemkonsulent Mobil: 944 89 054 E-post: aleksan...@gurusoft.no *Gurusoft AS* Telefon: 92 44 09 99 Østre Kullerød www.gurusoft.no

2014-01-29 Aleksander Akerø aleksan...@gurusoft.no:

Thanks for the quick answer, but it doesn't help if I remove the lowercase analyzer like so:

<fieldType name="keyword" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

I still need to add quotes to the search query to get results. And the weird thing is that if I use the analysis page and put in FE 009 (again, without quotes) for both index and query values, it highlights the result as if to show a match, but when I search using the GUI it gives me no results. The same happens when posting directly to the /select requestHandler via GET. This is what I post using GET:

http://mysite.com/solr/corename/select?q=number:FE%20009&qf=number => this does not work
http://mysite.com/solr/corename/select?q=number:%22FE%20009%22&qf=number => this works

Really starting to wonder if I am doing something terribly wrong somewhere. This is my requestHandler btw, pretty basic:

<!-- Default handler -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="defType">edismax</str>
    <str name="q.alt">*:*</str>
    <str name="rows">10</str>
    <str name="fl">*,score</str>
    <str name="qf">number</str>
  </lst>
</requestHandler>

*Aleksander Akerø* Systemkonsulent Mobil: 944 89 054 E-post: aleksan...@gurusoft.no
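For anyone reproducing this: the quotes (or escaped spaces) have to survive all the way into the q parameter. A minimal SolrJ sketch of the same query, escaping the space so the parser never splits the term (the core URL is from the thread and is only a placeholder):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.util.ClientUtils;

public class ExactMatchQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://mysite.com/solr/corename");
        // escapeQueryChars backslash-escapes whitespace and special characters,
        // so "FE 009" stays a single term instead of being split by the parser.
        String term = ClientUtils.escapeQueryChars("FE 009");
        SolrQuery q = new SolrQuery("number:" + term);
        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}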
Lucene Join
Hi, I am trying to find out whether the Lucene joins (not the Solr join) use any filter cache. The API that Lucene exposes for joining is JoinUtil.createJoinQuery(). Where can I find the source code for this API? Thanks in advance. Thanks, Anand
ant eclipse hangs - branch_4x
Hi, earlier I used to be able to successfully run "ant eclipse" from branch_4x. With the newest code (tip of branch_4x today) I can't: "ant eclipse" hangs forever at the point shown by the console output below. I noticed that this problem has been around for a while - not something that happened today. Any idea about what might be wrong? A solution? Help to debug? Regards, Per Steffensen

--- console when running ant eclipse ---
...
resolve:
[echo] Building solr-example-DIH...
ivy-availability-check:
[echo] Building solr-example-DIH...
ivy-fail:
ivy-configure:
[ivy:configure] :: loading settings :: file = /Some/Path/ws_kepler_apache_lucene_solr_4x/branch_4x/lucene/ivy-settings.xml
resolve:
resolve:
[echo] Building solr-core...
ivy-availability-check:
[echo] Building solr-core...
ivy-fail:
ivy-fail:
ivy-configure:
[ivy:configure] :: loading settings :: file = /Some/Path/ws_kepler_apache_lucene_solr_4x/branch_4x/lucene/ivy-settings.xml
resolve:
HERE IT JUST HANGS FOREVER
---
Re: Lucene Join
Look in Lucene's join module? Mike McCandless http://blog.mikemccandless.com

On Thu, Jan 30, 2014 at 4:15 AM, anand chandak anand.chan...@oracle.com wrote:

Hi, I am trying to find out whether the Lucene joins (not the Solr join) use any filter cache. The API that Lucene exposes for joining is JoinUtil.createJoinQuery(). Where can I find the source code for this API? Thanks in advance. Thanks, Anand
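The source lives in the lucene/join module, class org.apache.lucene.search.join.JoinUtil. A minimal sketch of calling it against an existing index (the index path and field names are placeholders; as far as I can tell the join collects terms with its own collectors rather than using any Solr-style filter cache):

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.join.JoinUtil;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.store.FSDirectory;

public class JoinExample {
    public static void main(String[] args) throws Exception {
        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
        IndexSearcher searcher = new IndexSearcher(reader);
        // Collect the "parentId" values of docs matching fromQuery,
        // then search for those values in the "childParentId" field.
        Query fromQuery = new TermQuery(new Term("type", "parent"));
        Query joinQuery = JoinUtil.createJoinQuery(
            "parentId",       // fromField
            false,            // multipleValuesPerDocument
            "childParentId",  // toField
            fromQuery, searcher, ScoreMode.None);
        TopDocs hits = searcher.search(joinQuery, 10);
        System.out.println("joined hits: " + hits.totalHits);
        reader.close();
    }
}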
Concurrency handling in DataImportHandler
Hi All, can I please know how concurrency is handled in the DIH? What happens if multiple /dataimport requests are issued to the same datasource? I'm doing some custom processing at the end of the dataimport process in an EventListener, configured in data-config.xml as below:

<document name="stanboldata" onImportEnd="com.solr.stanbol.processor.StanbolEventListener">

Will each DIH request create a new EventListener object? I'm copying some field values from my custom processor, configured in the /dataimport request handler, to a static Map in my StanbolEventListener class. I need to figure out how to handle concurrency when data is copied to my EventListener object to perform the rest of my update process. Thanks, Dileepa
Re: Concurrency handling in DataImportHandler
I would particularly like to know how DIH handles concurrency in JDBC database connections during dataimport:

<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/solrtest" user="usr1" password="123" batchSize="1"/>

Thanks, Dileepa

On Thu, Jan 30, 2014 at 4:05 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi All, can I please know how concurrency is handled in the DIH? What happens if multiple /dataimport requests are issued to the same datasource? I'm doing some custom processing at the end of the dataimport process in an EventListener, configured in data-config.xml as below:

<document name="stanboldata" onImportEnd="com.solr.stanbol.processor.StanbolEventListener">

Will each DIH request create a new EventListener object? I'm copying some field values from my custom processor, configured in the /dataimport request handler, to a static Map in my StanbolEventListener class. I need to figure out how to handle concurrency when data is copied to my EventListener object to perform the rest of my update process. Thanks, Dileepa
Re: Concurrency handling in DataImportHandler
Hi All, I triggered a /dataimport for the first 100 rows from my database, and while it was running issued another import request for rows 101-200. In my log I see the exception below; it seems multiple JDBC connections cannot be opened. Does this mean concurrency is not supported in DIH for JDBC datasources? Please share your thoughts on how to tackle concurrency in dataimport.

[Thread-15] ERROR org.apache.solr.handler.dataimport.JdbcDataSource - Ignoring Error when closing connection
java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@1e820764 is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924)
at com.mysql.jdbc.MysqlIO.checkForOutstandingStreamingData(MysqlIO.java:3314)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2477)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2731)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2809)
at com.mysql.jdbc.ConnectionImpl.rollbackNoChecks(ConnectionImpl.java:5165)
at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:5048)
at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4654)
at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1630)
at org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:410)
at org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:395)
at org.apache.solr.handler.dataimport.DocBuilder.closeEntityProcessorWrappers(DocBuilder.java:284)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:273)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)

Thanks, Dileepa

On Thu, Jan 30, 2014 at 4:13 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

I would particularly like to know how DIH handles concurrency in JDBC database connections during dataimport:

<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/solrtest" user="usr1" password="123" batchSize="1"/>

Thanks, Dileepa
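Independent of how DIH serializes imports, if two imports can ever overlap, the shared static Map in the listener needs to be thread-safe. A minimal sketch, assuming the StanbolEventListener from this thread (the map contents and the put() helper are made up for illustration; the EventListener interface here is DIH's org.apache.solr.handler.dataimport.EventListener):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.EventListener;

public class StanbolEventListener implements EventListener {
    // ConcurrentHashMap tolerates concurrent writers without external
    // locking, unlike a plain static HashMap.
    private static final Map<String, String> FIELD_VALUES =
        new ConcurrentHashMap<String, String>();

    // Called from the custom processor during the import (hypothetical helper).
    public static void put(String key, String value) {
        FIELD_VALUES.put(key, value);
    }

    public void onEvent(Context ctx) {
        // Iterate over the entries collected during this import;
        // ConcurrentHashMap iterators are weakly consistent, so a
        // concurrently running import will not break the loop.
        for (Map.Entry<String, String> e : FIELD_VALUES.entrySet()) {
            // ... custom post-import processing per entry ...
        }
        FIELD_VALUES.clear();
    }
}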
Re: Use a field without predefining it in the schema
Thanks, that's a good feature, since I don't have to reindex the whole data, nor restart the Solr app.

2014-01-30 Steve Rowe sar...@gmail.com:

Hakim, all the fields you have added manually to the schema will be kept when you switch to using the managed schema. From the managed schema page on the Solr Reference Guide you linked to (describing what happens after you add <schemaFactory class="ManagedIndexSchemaFactory">...</schemaFactory> to your solrconfig.xml and then restart Solr for the change to take effect): "Once Solr is restarted, the existing schema.xml file is renamed to schema.xml.bak and the contents are written to a file with the name defined as the managedSchemaResourceName." Steve

On Jan 29, 2014, at 7:15 PM, Hakim Benoudjit h.benoud...@gmail.com wrote:

I have found this link: https://cwiki.apache.org/confluence/display/solr/Managed+Schema+Definition+in+SolrConfig . I don't know if it's required to modify the schema (see the link) to make it editable by the REST API. I hope it doesn't clear all the fields that I have added manually to the schema.

2014-01-30 Hakim Benoudjit h.benoud...@gmail.com:

Thanks Steve for the link. It seems very easy to create `new fields` in the `schema` using a `POST request`. But does that mean that I don't have to restart the `solr app`? If so, is this feature available in the latest Solr version (`v4.6`)?

2014-01-29 Alexandre Rafalovitch arafa...@gmail.com:

There is an example in the distribution that shows how new fields are auto-defined. I think it is example-schemaless. The secret is in the UpdateRequestProcessor chain that does cleanup and auto-mapping. Plus - I guess - an automatically generated schema. Just remember that once the field is added the first time, it now exists. So be careful not to send a date-looking thing into what should be a text field.

Regards, Alex.
Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Wed, Jan 29, 2014 at 5:45 AM, Steve Rowe sar...@gmail.com wrote:

Hi Hakim, check out the section of the Solr Reference Guide on modifying the schema via REST API: https://cwiki.apache.org/confluence/display/solr/Schema+API#SchemaAPI-Modifytheschema Steve

On Jan 28, 2014, at 5:00 PM, Hakim Benoudjit h.benoud...@gmail.com wrote:

Hi guys, with the new version of Solr (4.6), can I add a field to the index, knowing that this field doesn't appear (isn't predefined) in the schema? I ask this question because I've seen an issue (on Jira) related to this. Thanks!
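For the curious, adding a field through the Schema API is a single HTTP call once the managed schema is enabled. A rough sketch in plain Java - the core name, field name, and JSON body here are my assumptions, so check the Schema API page linked above for the exact form your version expects:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class AddFieldExample {
    public static void main(String[] args) throws Exception {
        // Assumes a core named "collection1" with the managed schema enabled.
        URL url = new URL("http://localhost:8983/solr/collection1/schema/fields/newfield");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        // Hypothetical field definition; adjust type/flags to your schema.
        String body = "{\"type\":\"text_general\", \"stored\":true, \"indexed\":true}";
        OutputStream out = conn.getOutputStream();
        out.write(body.getBytes("UTF-8"));
        out.close();
        System.out.println("HTTP " + conn.getResponseCode());
    }
}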
Re: KeywordTokenizerFactory - trouble with exact matches
Hi, I have a similar kind of problem, where I want to search for words with spaces in them, and I want to search by stripping all the spaces. I have used the following schema for that:

<fieldType name="nospaces" class="solr.TextField" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[^\w]+" replacement="" replace="all"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[^\w]+" replacement="" replace="all"/>
  </analyzer>
</fieldType>

And:

<field name="text_nospaces" type="nospaces" indexed="true" stored="true" omitNorms="true"/>
<copyField source="text" dest="text_nospaces"/>

But it is not finding the right terms, even though we are stripping the spaces and indexing lowercase values. For example, with "East Enders" indexed, when I search for the text 'east end ers' it returns no values, saying no document found. I realised that Solr uses a QueryParser before passing the query string to the query analyzer defined in the schema, and the query parser tokenizes the query string, so each token is sent separately to the query analyzer. So is there any way I can bypass this query parser, or use a query parser that considers the entire string as a single phrase? At the moment I am using the dismax query parser. Any suggestion would be much appreciated. Thanks, Srinivasa
Re: KeywordTokenizerFactory - trouble with exact matches
Aleksander Akerø, it would be great if you could share how you are handling this on a per-field basis.
Re: KeywordTokenizerFactory - trouble with exact matches
The standard, keyword-oriented query parsers will all treat unquoted, unescaped white space as term delimiters and ignore the white space. There is no way to bypass that behavior. So, your regex will never even see the white space - unless you enclose the text and white space in quotes, or use a backslash to escape each white space character. You can use the "field" and "term" query parsers to pass a query string as if it were fully enclosed in quotes, but that only handles a single term and does not allow for multiple terms or any query operators. For example:

{!field f=myfield}Foo Bar

See: http://wiki.apache.org/solr/QueryParser

You can also pre-configure the field query parser with the defType=field parameter.

-- Jack Krupansky

-----Original Message----- From: Srinivasa7 Sent: Thursday, January 30, 2014 6:37 AM To: solr-user@lucene.apache.org Subject: Re: KeywordTokenizerFactory - trouble with exact matches

Hi, I have a similar kind of problem, where I want to search for words with spaces in them, and I want to search by stripping all the spaces.
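A small sketch of what the {!field} form looks like from SolrJ, using the text_nospaces field from this thread (the Solr URL is a placeholder):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FieldParserQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        // The {!field} local params hand the whole string, spaces included,
        // to the field's analyzer as one phrase, bypassing whitespace splitting.
        SolrQuery q = new SolrQuery("{!field f=text_nospaces}East Enders");
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResults().getNumFound() + " hits");
    }
}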
Re: ant eclipse hangs - branch_4x
Hi Per, you may be seeing the stale-Ivy-lock problem (see IVY-1388). LUCENE-4636 upgraded the bootstrapped Ivy to 2.3.0 to reduce the likelihood of this problem, so the first thing is to make sure you have that version in ~/.ant/lib/ - if not, remove the Ivy jar that's there and run 'ant ivy-bootstrap' to download and put the 2.3.0 jar in place. You should then run the following and remove any files it finds:

find ~/.ivy2/cache -name '*.lck'

That should stop 'ant resolve' from hanging. Steve

On Jan 30, 2014, at 5:06 AM, Per Steffensen st...@designware.dk wrote:

Hi, earlier I used to be able to successfully run "ant eclipse" from branch_4x. With the newest code (tip of branch_4x today) I can't: "ant eclipse" hangs forever at the point shown by the console output below. I noticed that this problem has been around for a while - not something that happened today. Any idea about what might be wrong? A solution? Help to debug? Regards, Per Steffensen
Re: KeywordTokenizerFactory - trouble with exact matches
Hi Srinivasa, yes, I've come to understand that the analyzers will never see the whitespace, thus no need for pattern replacement, as Jack points out. So the solution would be to set which parser to use for the query. Jack has also pointed out that the "field" query parser should work in this particular setting - http://wiki.apache.org/solr/QueryParser

My problem, though, was that I needed this for only one of the fields in the schema; for all the other fields, e.g. name, description etc., I would very much like to make use of the eDisMax functionality. And it seems that only one query parser can be defined per query - in other words: for all fields. Jack, you may correct me if I'm wrong here :)

This particular customer wanted a wildcard search at both ends of the phrase, and that somewhat complicated the problem. I therefore chose to replace all whitespace for this field in SQL at index time, using the DIH, and then use EdgeNGramFilterFactory on both sides of the keyword like the config below, and that seemed to work pretty nicely.

<!-- WildCard search number -->
<fieldType name="keyword" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" side="front"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" side="back"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I also added a bit of extra weighting to the keyword field so that exact matches received a higher score. What this solution doesn't do is exclude values like "EE 009" when searching for "FE 009", but they return far down the list, which for the customer is OK, because usually these results are somewhat related or within the same category.

*Aleksander Akerø* Systemkonsulent Mobil: 944 89 054 E-post: aleksan...@gurusoft.no *Gurusoft AS* Telefon: 92 44 09 99 Østre Kullerød www.gurusoft.no

2014-01-30 Jack Krupansky j...@basetechnology.com:

The standard, keyword-oriented query parsers will all treat unquoted, unescaped white space as term delimiters and ignore the white space. There is no way to bypass that behavior. So, your regex will never even see the white space - unless you enclose the text and white space in quotes, or use a backslash to escape each white space character. You can use the "field" and "term" query parsers to pass a query string as if it were fully enclosed in quotes, but that only handles a single term and does not allow for multiple terms or any query operators. For example: {!field f=myfield}Foo Bar See: http://wiki.apache.org/solr/QueryParser You can also pre-configure the field query parser with the defType=field parameter. -- Jack Krupansky
Re: Not finding part of fulltext field when word ends in dot
The word delimiter filter will turn "26KA" into two tokens, as if you had written "26 KA" without the quotes. The autoGeneratePhraseQueries option will cause the multiple terms to be treated as if they actually were enclosed within quotes; otherwise they will be treated as separate and unquoted terms. If you do enclose "26KA" in quotes in your query, then autoGeneratePhraseQueries is not relevant.

Ah... maybe the problem is that you have preserveOriginal=true in your query analyzer. Do you have your default query operator set to AND? If so, it would treat 26KA as (26 AND KA AND 26KA), which requires 26KA (without the trailing dot) to be in the index. It seems counter-intuitive, but the attributes of the index and query word delimiter filters need to be slightly asymmetric.

-- Jack Krupansky

-----Original Message----- From: Thomas Michael Engelke Sent: Thursday, January 30, 2014 2:16 AM To: solr-user@lucene.apache.org Subject: Re: Not finding part of fulltext field when word ends in dot

I'm not sure I got my problem across. If I understand the snippet of documentation right, autoGeneratePhraseQueries only affects queries that result in multiple tokens, which mine does not. The version also is 3.6.0.1, and we're not planning on upgrading to any 4.x version.

2014-01-29 Jack Krupansky j...@basetechnology.com:

You might want to add autoGeneratePhraseQueries="true" to your field type, but I don't think that would cause a break when going from 3.6 to 4.x. The default for that attribute changed in Solr 3.5. What release was your data indexed using? There may have been some subtle word delimiter filter changes between 3.x and 4.x. Read: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201202.mbox/%3CC0551C512C863540BC59694A118452AA0764A434@ITS-EMBX-03.adsroot.itcs.umich.edu%3E

-----Original Message----- From: Thomas Michael Engelke Sent: Wednesday, January 29, 2014 11:16 AM To: solr-user@lucene.apache.org Subject: Re: Not finding part of fulltext field when word ends in dot

The fieldType definition is a tad on the longer side:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" catenateAll="0" preserveOriginal="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="german/synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="german/german-common-nouns.txt" minWordSize="5" minSubwordSize="4" maxSubwordSize="15" onlyLongestMatch="true"/>
    <filter class="solr.StopFilterFactory" words="german/stopwords.txt" ignoreCase="true" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="german/protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="0" catenateNumbers="0" generateWordParts="1" splitOnCaseChange="1" generateNumberParts="1" catenateAll="0" preserveOriginal="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="german/stopwords.txt" ignoreCase="true" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="german/protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
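When debugging chains like this, it helps to look at the actual tokens the analyzer emits for the problem input - either on the admin analysis page, or programmatically. A minimal sketch (should work on both 3.6 and 4.x; building the Analyzer instance from the chain above is left out):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShowTokens {
    // Prints every token the analyzer produces for the given text,
    // e.g. printTokens(myAnalyzer, "text", "26KA.")
    public static void printTokens(Analyzer analyzer, String field, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}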
Re: how to write an efficient query with a subquery to restrict the search space?
Lucene's default scoring should give you much of what you want - ranking hits on low-frequency terms higher - without any special query syntax: just list out your terms and use OR as your default operator.

-- Jack Krupansky

-----Original Message----- From: svante karlsson Sent: Thursday, January 23, 2014 6:42 AM To: solr-user@lucene.apache.org Subject: how to write an efficient query with a subquery to restrict the search space?

I have a Solr db containing 1 billion records that I'm trying to use in a NoSQL fashion. What I want to do is find the best matches using all search terms, but restrict the search space to the most unique terms. In this example I know that val2 and val4 are rare terms and val1 and val3 are more common. In my real scenario I'll have 20 fields that I want to include or exclude in the inner query depending on the uniqueness of the requested value. My first approach was:

q=field1:val1 OR field2:val2 OR field3:val3 OR field4:val4 AND (field2:val2 OR field4:val4)&rows=100&fl=*

but what I think I get is ... field4:val4 AND (field2:val2 OR field4:val4), and this result is then OR'ed with the rest. If I write

q=(field1:val1 OR field2:val2 OR field3:val3 OR field4:val4) AND (field2:val2 OR field4:val4)&rows=100&fl=*

then what I think I get is two sub-queries that are evaluated separately and then joined - performance-wise this is bad. What's the best way to write these types of queries? Are there any performance issues when running it on several SolrCloud nodes vs a single instance, or should it scale? /svante
SOLR suggester with highlighting
Hello, I am trying to make a typeahead autocomplete with Solr using the suggester. The search will be done for users and group names (which aggregate users), over usernames, bio, web page and other fields. What I want to achieve is a Facebook- or Twitter-like search. For this I need to enrich the result from Solr with additional data (user type, profile URL, avatar URL etc.). The user and group each have an ID field in Solr which corresponds to the ID in the DB used to fetch this information. I am stuck on how to do that. Currently I have the suggester working, but it only returns the suggested value; when I try to return some other attribute of the document it doesn't work. Here is the relevant part of solrconfig.xml:

<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <!-- configure the spellchecker used for autocomplete (dictionary) -->
  <lst name="spellchecker">
    <str name="name">suggester_dictionary</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookupFactory</str>
    <!-- The indexed field to derive suggestions from -->
    <str name="field">autocomplete</str>
    <!-- buildOnCommit must be set to true because the suggester keeps data in memory -->
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler class="solr.SearchHandler" name="/suggest">
  <lst name="defaults">
    <!-- by default use the suggester_dictionary -->
    <str name="spellcheck.dictionary">suggester_dictionary</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.onlyMorePopular">false</str>
  </lst>
  <lst name="invariants">
    <!-- always run the Suggester for queries to this handler -->
    <str name="spellcheck">true</str>
    <!-- collate not needed; the query is tokenized as keyword, we need only suggestions for that term -->
    <str name="spellcheck.collate">false</str>
  </lst>
  <!-- this handler uses only the needed components: suggest defined above -->
  <arr name="components">
    <str>suggest</str>
    <str>highlight</str>
  </arr>
</requestHandler>

and the schema:

<field name="groupid" type="int" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="groupusername" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="groupname" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="grouporuser" type="boolean" indexed="true" stored="true" multiValued="false"/>
<field name="autocomplete" type="text_autocomplete"/>
<copyField source="groupusername" dest="autocomplete"/>
<copyField source="groupname" dest="autocomplete"/>

The query:

http://gruppu.com:8983/solr/suggest?q=*:*&spellcheck.q=jo&spellcheck=true&hl=on&hl.fl=groupid

The response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="jo">
        <int name="numFound">2</int>
        <int name="startOffset">0</int>
        <int name="endOffset">2</int>
        <arr name="suggestion">
          <str>jorge</str>
          <str>jorgen</str>
        </arr>
      </lst>
    </lst>
  </lst>
</response>

I would like to have the groupid and grouporuser fields returned as well... No luck so far.
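Note that the spellcheck/suggest response structure above only carries suggestion strings, not stored fields, which appears to be why groupid never comes back. A minimal SolrJ sketch of reading what the handler does return (the Solr URL is a placeholder):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class SuggestClient {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.setRequestHandler("/suggest");   // route to the /suggest handler above
        q.set("spellcheck.q", "jo");
        QueryResponse rsp = solr.query(q);
        SpellCheckResponse suggestions = rsp.getSpellCheckResponse();
        // Each Suggestion holds only the token and its alternative strings.
        for (SpellCheckResponse.Suggestion s : suggestions.getSuggestions()) {
            System.out.println(s.getToken() + " -> " + s.getAlternatives());
        }
    }
}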
Re: Solr middle-ware?
It would be great if an example were available as part of the Solr release. Please file a Jira request. Maybe this could be one of the GSOC (Google Summer of Code) projects, or maybe somebody/everybody could submit their search middleware code as possible examples, attached to the Jira, so that even if these examples are not formally released, at least people can view and copy them.

-- Jack Krupansky

-----Original Message----- From: Alexandre Rafalovitch Sent: Tuesday, January 21, 2014 8:00 AM To: solr-user@lucene.apache.org Subject: Solr middle-ware?

Hello,

All the Solr documents talk about not exposing Solr directly to the cloud. But I see people keep asking for a thin secure layer in front of Solr that they can talk to from JavaScript, perhaps with some basic extension options. Has anybody actually written one? Open source, or in a community part of a larger project? I would love to be able to point people at something. Is there something particularly difficult about writing one? Does anybody have a story of an aborted attempt or mid-point reversal? I would like to know.

Regards, Alex.

P.s. Personal context: I am thinking of doing a series of lightweight examples of how to use Solr. Like I did for a book, but with a bit more depth and something that can actually be exposed to the live web with live data. I don't want to reinvent the wheel of the thin Solr middleware.

P.p.s. Though I keep thinking that Dart could make an interesting option for the middleware, as it could have the same codebase on the server and in the client. Like NodeJS, but with saner syntax.

Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: Regarding Solr Faceting on the query response.
I believe it's not possible to facet on only the page you are on; faceting is supposed to work only with the full result set. I never tried, but I've never seen a way this could be done.

alexei martchenko Facebook http://www.facebook.com/alexeiramone | Linkedin http://br.linkedin.com/in/alexeimartchenko | Steam http://steamcommunity.com/id/alexeiramone/ | 4sq https://pt.foursquare.com/alexeiramone | Skype: alexeiramone | Github https://github.com/alexeiramone | (11) 9 7613.0966 |

2014-01-30 Mikhail Khludnev mkhlud...@griddynamics.com:

Hello, do you mean setting http://wiki.apache.org/solr/SimpleFacetParameters#facet.mincount to 1, or do you want to facet only the returned page (rows) instead of the full result set (numFound)?

On Thu, Jan 30, 2014 at 6:24 AM, Nilesh Kuchekar kuchekar.nil...@gmail.com wrote:

Yeah, it's a typo... I meant company:Apple. Thanks, Nilesh

On Jan 29, 2014, at 8:59 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

On Thu, Jan 30, 2014 at 3:43 AM, Kuchekar kuchekar.nil...@gmail.com wrote: company=Apple

Did you mean company:Apple? Otherwise, that could be the issue.

Regards, Alex.
Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

-- Sincerely yours, Mikhail Khludnev, Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Solr middle-ware?
Hi; if you need such a thing, and if you/we can define the requirements, I can contribute it to Solr as part of GSOC. Thanks; Furkan KAMACI

2014-01-30 Jack Krupansky j...@basetechnology.com:

It would be great if an example were available as part of the Solr release. Please file a Jira request. Maybe this could be one of the GSOC (Google Summer of Code) projects, or maybe somebody/everybody could submit their search middleware code as possible examples, attached to the Jira, so that even if these examples are not formally released, at least people can view and copy them. -- Jack Krupansky
Re: high memory usage with small data set
Do the used entries in your caches increase in parallel? This would be the case if you aren't updating your index, and would explain it. BTW, take a look at your cache statistics (from the admin page) and look at the cache hit ratios. If they are very small (and my guess is that with 1,500 boolean operations you aren't getting significant re-use), then you're just wasting space; try the cache=false option. Also, how are you measuring memory? It's sometimes confusing that virtual memory can be included, see: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Best, Erick

On Wed, Jan 29, 2014 at 7:49 AM, Johannes Siegert johannes.sieg...@marktjagd.de wrote:

Hi, we are using Apache Solr Cloud in a production environment. When the maximum heap space is reached, Solr access slows down for a small amount of time because the garbage collector is working. We use the following configuration:

- Apache Tomcat as the webserver to run the Solr web application
- 13 indices with about 150 entries (300 MB)
- 5 servers with one replica per index (5 GB max heap space)
- All indices have the following caches:
  - maximum document-cache size is 4096 entries; all other indices have between 64 and 1536 entries
  - maximum query-cache size is 1024 entries; all other indices have between 64 and 768
  - maximum filter-cache size is 1536 entries; all other indices have between 64 and 1024
- the directory-factory implementation is NRTCachingDirectoryFactory
- the index is updated once per hour (no auto commit)
- ca. 5000 requests per hour per server
- large filter queries (up to 15000 bytes and 1500 boolean operations)
- many facet queries (30%)

Behaviour: started with 512 MB heap space. Over several days the heap usage grew until the 5 GB was reached. At this moment the described problem occurs. From this time on the heap usage is between 50 and 90 percent. No OutOfMemoryException occurs.

Questions:
1. Why does Solr use 5 GB of RAM with this small amount of data?
2. What impact do the large filter queries have on RAM usage?

Thanks! Johannes Siegert
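For the large one-off filters mentioned above, the cache=false local param can be set per filter query. A tiny sketch of what that looks like from SolrJ (the filter itself is just a placeholder):

import org.apache.solr.client.solrj.SolrQuery;

public class NonCachedFilter {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("*:*");
        // The cache=false local param tells Solr not to store this filter
        // in the filterCache; useful for huge filters that never repeat.
        q.addFilterQuery("{!cache=false}id:(1 OR 2 OR 3)");
        System.out.println(q);
    }
}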
Re: 4.6 Core Discovery coreRootDirectory not working
I'm traveling and can't pursue this right now, but a couple of questions: /home/user1/solr/core.properties exists in all these cases, right? Tangential, but I'd be very cautious about setting the core root the way you are, since it'll walk each and every directory under /home looking for cores. Perhaps you're just caught in that file-traversal loop (guessing here). Do the log files show anything interesting? I'll be able to respond occasionally between now and next week, since we're on the road...

Best, Erick

On Wed, Jan 29, 2014 at 3:41 PM, Sam Batschelet sbatsche...@mac.com wrote:

On Jan 29, 2014, at 4:31 PM, Sam Batschelet wrote:

Hello, this is my first post to your group. I am in the process of setting up a development environment using Solr. We will require multiple cores managed by multiple users in the following layout. I am running a fairly vanilla version of 4.6.

solrHome: /home/camp/example/solr/solr.xml
cores:
/home/user1/solr/core.properties
/home/user2/solr/core.properties

If I manually add the core from admin, everything works fine - I can index etc. - but when I kill the server the core information is no longer available. I need to delete the core.properties file and recreate the core from admin. I have since learned that this should be done with core discovery, mainly by setting coreRootDirectory, which logically in this case should be /home. But Solr is not finding the core even if I set the directory directly, i.e. /home/user1/solr/ or /home/user1/. I must be missing another config and was hoping for some insight.

## solr.xml
<solr>
<!-- <str name="coreRootDirectory">${coreRootDirectory:/home}</str> -->

Just to point out the obvious before I get 20 responses to such: I did test this without the commenting :).
Re: KeywordTokenizerFactory - trouble with exact matches
Note, the comments about LowerCaseTokenizer were a red herring. You were using LowerCaseFilterFactory - note Filter rather than Tokenizer - so it would just do what you expected: lowercase the entire input. You would have used LowerCaseTokenizerFactory in place of KeywordTokenizerFactory, not as a filter. As for the rest, I expect Jack is right: it's the query parsing above the field input.

Best, Erick

On Thu, Jan 30, 2014 at 6:29 AM, Aleksander Akerø aleksan...@gurusoft.no wrote:

Hi Srinivasa, yes, I've come to understand that the analyzers will never see the whitespace, thus no need for pattern replacement, as Jack points out. So the solution would be to set which parser to use for the query. Jack has also pointed out that the "field" query parser should work in this particular setting - http://wiki.apache.org/solr/QueryParser
Re: KeywordTokenizerFactory - trouble with exact matches
Yes, I actually noted that about the filter vs. tokenizer. It's easy to get confused if you don't have a good understanding of the differences between tokenizers and filters. As for the query parser problem, there's always a workaround, but it was nice to be made aware of - it was sort of a ghost-like problem before. Although it would be great to have the option to disable the splitting on whitespace even for DisMax, I understand that it's probably not the most wanted feature for the next Solr release :)

*Aleksander Akerø* Systemkonsulent Mobil: 944 89 054 E-post: aleksan...@gurusoft.no *Gurusoft AS* Telefon: 92 44 09 99 Østre Kullerød www.gurusoft.no

2014-01-30 Erick Erickson erickerick...@gmail.com:

Note, the comments about LowerCaseTokenizer were a red herring. You were using LowerCaseFilterFactory - note Filter rather than Tokenizer - so it would just do what you expected: lowercase the entire input. You would have used LowerCaseTokenizerFactory in place of KeywordTokenizerFactory, not as a filter. As for the rest, I expect Jack is right: it's the query parsing above the field input. Best, Erick
SolR performance problem
Hi, I am working on Solr 4.2.1 with Jetty, and we are facing some performance issues as well as a heap memory overflow issue. While searching for the actual cause of these exceptions, I applied a load test with different Solr queries. After a few minutes I got the errors below.

WARN:oejs.Response:Committed before 500 {msg=Software caused connection abort: socket write
Caused by: java.net.SocketException: Software caused connection abort: socket write error
SEVERE: null:org.eclipse.jetty.io.EofException

I also tried to set maxIdleTime to 30 milliseconds, but I am still getting the same error. Any ideas? Please help: how can I tackle this? Thanks, Mayur
Re: KeywordTokenizerFactory - trouble with exact matches
I vaguely recall that there was a Jira floating around for multi-word synonyms that dealt with parsing of spaces as well. And Robert Muir has (repeatedly) referred to this query parser feature as a bug. Somehow, eventually, I think it will be dealt with, but the difficulty remains for now.

-- Jack Krupansky

-----Original Message----- From: Aleksander Akerø Sent: Thursday, January 30, 2014 9:31 AM To: solr-user@lucene.apache.org Subject: Re: KeywordTokenizerFactory - trouble with exact matches

Yes, I actually noted that about the filter vs. tokenizer. It's easy to get confused if you don't have a good understanding of the differences between tokenizers and filters. As for the query parser problem, there's always a workaround, but it was nice to be made aware of - it was sort of a ghost-like problem before. Although it would be great to have the option to disable the splitting on whitespace even for DisMax, I understand that it's probably not the most wanted feature for the next Solr release :)
Re: ant eclipse hangs - branch_4x
Hi I used Ivy 2.2.0. Upgraded to 2.3.0. Didn't help. No .lck files were found in ~/.ivy2/cache, so nothing to delete. Deleted the entire ~/.ivy2/cache folder. Didn't help. Debugged a little and found that it was hanging due to the org.apache.hadoop dependencies in solr/core/ivy.xml - if I commented out everything that had to do with hadoop in that ivy.xml, it didn't hang in ant resolve (from solr/core). Finally the problem was solved when I added http://central.maven.org/maven2 to our Artifactory. I do not understand why that was necessary, because we already had http://repo1.maven.org/maven2/ in our Artifactory. Well, never mind - it works for me now. Thanks for the help! Regards, Per Steffensen On 1/30/14 1:11 PM, Steve Rowe wrote: Hi Per, You may be seeing the stale-Ivy-lock problem (see IVY-1388). LUCENE-4636 upgraded the bootstrapped Ivy to 2.3.0 to reduce the likelihood of this problem, so the first thing is to make sure you have that version in ~/.ant/lib/ - if not, remove the Ivy jar that's there and run 'ant ivy-bootstrap' to download and put the 2.3.0 jar in place. You should run the following and remove any files it finds: find ~/.ivy2/cache -name '*.lck' That should stop 'ant resolve' from hanging. Steve On Jan 30, 2014, at 5:06 AM, Per Steffensen st...@designware.dk wrote: Hi Earlier I used to be able to successfully run ant eclipse from branch_4x. With the newest code (tip of branch_4x today) I can't. ant eclipse hangs forever at the point shown by the console output below. I noticed that this problem has been around for a while - not something that happened today. Any idea about what might be wrong? A solution? Help to debug? Regards Per Steffensen --- console when running ant eclipse ---
...
resolve:
[echo] Building solr-example-DIH...
ivy-availability-check:
[echo] Building solr-example-DIH...
ivy-fail:
ivy-configure:
[ivy:configure] :: loading settings :: file = /Some/Path/ws_kepler_apache_lucene_solr_4x/branch_4x/lucene/ivy-settings.xml
resolve:
resolve:
[echo] Building solr-core...
ivy-availability-check:
[echo] Building solr-core...
ivy-fail:
ivy-fail:
ivy-configure:
[ivy:configure] :: loading settings :: file = /Some/Path/ws_kepler_apache_lucene_solr_4x/branch_4x/lucene/ivy-settings.xml
resolve:
HERE IT JUST HANGS FOREVER ---
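For anyone hitting the same hang, Steve's cleanup steps can be scripted roughly like this (a sketch; the -delete flag is standard find usage and the ivy-*.jar glob assumes the usual ivy-<version>.jar naming, neither is from the thread itself):

# replace the bootstrapped Ivy jar with 2.3.0 and clear stale lock files
rm -f ~/.ant/lib/ivy-*.jar
ant ivy-bootstrap
find ~/.ivy2/cache -name '*.lck' -delete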
Re: KeywordTokenizerFactory - trouble with exact matches
I've come across something like this as well - can't remember where, but it was often related to synonym functionality. The following link shows a 3rd party QueryParser that seems to deal with synonyms alongside edismax, and may be interesting to look at: http://wiki.apache.org/solr/QueryParser It is also mentioned as an issue while using the SynonymFilterFactory: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory "The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words sea and biscit separately, and will not know that they match a synonym." Maybe the extended support for synonym handling is what will give us the solution one day. For now I have solved my problem and will leave it at that. *Aleksander Akerø* Systemkonsulent Mobil: 944 89 054 E-post: aleksan...@gurusoft.no *Gurusoft AS* Telefon: 92 44 09 99 Østre Kullerød www.gurusoft.no
Error when restarting solr servers
Hello, I am running SolrCloud with 2 collections, 5 shards and 3 replicas for each collection, and 5 ZooKeeper instances. solr-4.6.0 apache-tomcat-7.0.39 zookeeper-3.4.5 jre1.7.0_21 When I try to restart a Solr server in my cloud I receive these errors:
1861449 [localhost-startStop-1-EventThread] INFO org.apache.solr.cloud.ShardLeaderElectionContext - Running the leader process for shard shard1
1861451 [localhost-startStop-1-EventThread] INFO org.apache.solr.cloud.ShardLeaderElectionContext - Checking if I should try and be the leader.
1861451 [localhost-startStop-1-EventThread] INFO org.apache.solr.cloud.ShardLeaderElectionContext - My last published State was down, I won't be the leader.
1861451 [localhost-startStop-1-EventThread] INFO org.apache.solr.cloud.ShardLeaderElectionContext - There may be a better leader candidate than us - going back into recovery
1861452 [localhost-startStop-1-EventThread] INFO org.apache.solr.update.DefaultSolrCoreState - Running recovery - first canceling any ongoing recovery
1861452 [localhost-startStop-1-EventThread] WARN org.apache.solr.cloud.RecoveryStrategy - Stopping recovery for zkNodeName=core_node3 core=Current1_shard1_replica3
1862223 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy - Finished recovery process. core=Current1_shard1_replica3
1862223 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy - Starting recovery process. core=Current1_shard1_replica3 recoveringAfterStartup=false
1862223 [RecoveryThread] ERROR org.apache.solr.update.UpdateLog - Exception reading versions from log
java.nio.channels.ClosedChannelException
at sun.nio.ch.FileChannelImpl.ensureOpen(Unknown Source)
at sun.nio.ch.FileChannelImpl.read(Unknown Source)
at org.apache.solr.update.ChannelFastInputStream.readWrappedStream(TransactionLog.java:778)
at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
at org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:71)
at org.apache.solr.common.util.FastInputStream.readInt(FastInputStream.java:216)
at org.apache.solr.update.TransactionLog$FSReverseReader.init(TransactionLog.java:696)
at org.apache.solr.update.TransactionLog.getReverseReader(TransactionLog.java:575)
at org.apache.solr.update.UpdateLog$RecentUpdates.update(UpdateLog.java:942)
at org.apache.solr.update.UpdateLog$RecentUpdates.access$000(UpdateLog.java:885)
at org.apache.solr.update.UpdateLog.getRecentUpdates(UpdateLog.java:1042)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:280)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:244)
1862223 [RecoveryThread] ERROR org.apache.solr.cloud.RecoveryStrategy - Error while trying to recover. core=Current1_shard1_replica3:org.apache.solr.common.SolrException: Cloud state still says we are leader.
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:354)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:244)
1862224 [RecoveryThread] ERROR org.apache.solr.cloud.RecoveryStrategy - Recovery failed - trying again... (0) core=Current1_shard1_replica3
1862224 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy - Wait 2.0 seconds before trying to recover again (1)
1862541 [localhost-startStop-1-SendThread(10.0.5.230:2281)] WARN org.apache.zookeeper.ClientCnxn - Session 0x542fd3f2be100e6 for server 10.0.5.230/10.0.5.230:2281, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len11106511 is out of range!
at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:112)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
1862641 [localhost-startStop-1-EventThread] INFO org.apache.solr.cloud.DistributedQueue - Watcher fired on path: null state: Disconnected type None
1862641 [localhost-startStop-1-EventThread] INFO org.apache.solr.cloud.DistributedQueue - Watcher fired on path: null state: Disconnected type None
1862641 [localhost-startStop-1-EventThread] INFO org.apache.solr.cloud.DistributedQueue - Watcher fired on path: null state: Disconnected type None
1862641 [localhost-startStop-1-EventThread] INFO org.apache.solr.cloud.DistributedQueue - Watcher fired on path: null state: Disconnected type None
..
1270268 [http-bio-8201-exec-26] INFO org.apache.solr.handler.admin.CoreAdminHandler - Going to wait for coreNodeName: core_node10, state: recovering, checkLive: true, onlyIfLeader: true
1270268 [http-bio-8201-exec-10] INFO org.apache.solr.handler.admin.CoreAdminHandler - Going to wait for coreNodeName: core_node11, state: recovering,
Re: Required local configuration with ZK solr.xml?
On 1/29/2014 12:48 PM, Jeff Wartes wrote: And that, I think, is my misunderstanding. I had assumed that the link between a node and the collections it belongs to would be the (possibly chroot'ed) zookeeper reference *itself*, not the node's directory structure. Instead, it appears that ZK is simply a repository for the collection configuration, where nodes may look up what they need based on filesystem core references. Work is underway towards a new mode where zookeeper is the ultimate source of truth, and each node will behave accordingly to implement and maintain that truth. I can't seem to locate a Jira issue for it, unfortunately. It's possible that one doesn't exist yet, or that it has an obscure title. Mark Miller is the one who really understands the full details, as he's a primary author of SolrCloud code. Currently, what SolrCloud considers to be truth is dictated by both zookeeper and an amalgamation of which cores each server actually has present. The collections API modifies both. With an older config (all current and future 4.x versions), the latter is in solr.xml. If you're using the new solr.xml format (available 4.4 and later, will be mandatory in 5.0), it's done with Core Discovery. Zookeeper has a list of everything and coordinates the cluster state, but has no real control over the cores that actually exist on each server. When the two sources of truth disagree, nothing happens to fix the situation, manual intervention is required. Any errors in my understanding of SolrCloud are my own. I don't claim that what I just wrote is error-free, but I am pretty sure that it's essentially correct. Thanks, Shawn
Re: Required local configuration with ZK solr.xml?
Thanks Shawn, this was exactly the confirmation I was looking for. I think I have a much better understanding now. The takeaway I have is that SolrCloud's current automation assumes relatively static clusters, and that if I want anything like dynamic scaling, I'm going to have to write my own tooling to add nodes safely. Fortunately, it appears that the necessary CoreAdmin commands don't need much besides the collection name, so it smells like a simple thing to query zookeeper's /collections path (or clusterstate.json) and issue GET requests accordingly when I spin up a new node. If you (or anyone) does happen to recall a reference to the work you alluded to, I'd certainly be interested. I googled around myself for a few minutes, but haven't found anything so far.
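As a rough sketch of what such tooling might send when a new node comes up (host, core and collection names here are hypothetical; CREATE with collection/shard parameters is the standard SolrCloud CoreAdmin call):

http://newnode:8983/solr/admin/cores?action=CREATE&name=mycoll_shard1_replica2&collection=mycoll&shard=shard1

This would be issued once per shard the node should carry, after reading the shard list from clusterstate.json.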
Re: Regarding Solr Faceting on the query response.
Hi Mikhail, I would like my faceting to run only on my result set (i.e. only on the numFound documents), rather than on the whole index. In the example, even when I specify the query 'company:Apple' it gives me faceted results for other companies. This means that it is querying against the whole index, rather than just the result set. Using facet.mincount=1 will give me facet values with counts of at least 1, but to retrieve all the distinct values (Apple, Bose, Chevron, ... Oracle ...) of the facet field (company) it will again query the whole index. What I would like to do is facet only on the result set, i.e. my query (q=company:Apple AND technologies:java) should return only the facet details about 'Apple', since that is all that is present in the result set. But it gives me the list of other company names, which makes me believe that it is querying the whole index to get the distinct values for company.
docs: [
  { id: ABC123, company: [ APPLE ] },
  { id: ABC1234, company: [ APPLE ] },
  { id: ABC1235, company: [ APPLE ] },
  { id: ABC1236, company: [ APPLE ] }
] },
facet_counts: {
  facet_queries: { p_company:ucsf\n: 1 },
  facet_fields: { company: [ APPLE, 4 ] },
  facet_dates: {},
  facet_ranges: {}
}
Thanks. Kuchekar, Nilesh On Thu, Jan 30, 2014 at 2:13 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Hello Do you mean setting http://wiki.apache.org/solr/SimpleFacetParameters#facet.mincount to 1, or do you want to facet only the returned page (rows) instead of the full result set (numFound)? On Thu, Jan 30, 2014 at 6:24 AM, Nilesh Kuchekar kuchekar.nil...@gmail.com wrote: Yeah, it's a typo... I meant company:Apple Thanks Nilesh On Jan 29, 2014, at 8:59 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: On Thu, Jan 30, 2014 at 3:43 AM, Kuchekar kuchekar.nil...@gmail.com wrote: company=Apple Did you mean company:Apple ? Otherwise, that could be the issue. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
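For the record, standard field faceting already counts only documents in the full result set (numFound); values from elsewhere in the index can still be listed with a count of 0 unless they are cut off. A request along these lines (host and core name are placeholders, field names from the thread) should therefore show only companies that actually occur in the matching documents:

http://localhost:8983/solr/corename/select?q=company:Apple AND technologies:java&rows=0&facet=true&facet.field=company&facet.mincount=1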
RES: Regarding Solr Faceting on the query response.
Hi Nilesh, maybe faceting is not the right thing for you, because 'faceting is the arrangement of search results into categories based on indexed terms' (https://cwiki.apache.org/confluence/display/solr/Faceting). Perhaps you could use Result Clustering (https://cwiki.apache.org/confluence/display/solr/Result+Clustering), since the clustering algorithm is applied to the search results of each single query. Hope this helps. Felipe Dantas de Souza Paiva
Adding DocValues in an existing field
Hi, Can I add the docValues feature to an existing field without wiping the actual data? The modification to the schema would be something like this:
<field name="surrogate_id" type="tlong" indexed="true" stored="true" multiValued="false"/>
<field name="surrogate_id" type="tlong" indexed="true" stored="true" multiValued="false" docValues="true"/>
I want to use the actual data to reindex it again into the same collection, but create the docValues in the process - is that possible? I'm using Solr 4.6.1 - Best regards
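One way to do such a reindex from the collection itself is DIH's SolrEntityProcessor; a rough sketch follows (the URL is a placeholder, and this only works because the field is stored - the round trip reads stored values and writes them back through the new schema):

<dataConfig>
  <document>
    <entity name="reindex" processor="SolrEntityProcessor"
            url="http://localhost:8983/solr/collection1"
            query="*:*" rows="500"/>
  </document>
</dataConfig>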
Geospatial clustering + zoom in/out help
Hi, I have an index with 300K docs with lat,lon. I need to cluster the docs based on lat,lon for display in the UI. The user then needs to be able to click on any cluster and zoom in (up to 11 levels deep). I'm using Solr 4.6 and I'm wondering how best to implement this efficiently. Some more specific questions below. I need to: 1) cluster data points at different zoom levels 2) click on a specific cluster and zoom in 3) be able to select a region (bounding box or polygon) and show clusters in the selected area What's the best way to implement this so that queries are fast? What I thought I would try, but maybe there are better ways: * divide the world into NxM large squares and then each of these squares into 4 more squares, and so on - 11 levels deep * at index time, figure out all the squares (at all 11 levels) each data point belongs to and index that info into 11 different fields: e.g. id=1 name=foo lat=x lon=y zoom1=square1_62 zoom2=square1_62_47 zoom3=square1_62_47_33 * at search time, use field collapsing on the zoomX field to get which docs belong to which square on a particular level * calculate the center point of each square (by calculating the mean of the positions of all points in that square) using StatsComponent (facet on the zoomX field, avg on the lat and lon fields) - I would consider those squares as separate clusters (one square is one cluster) and the center points of those squares as the center points of the clusters derived from them I *think* the problems with this approach are that: * there will be many unique values in the fields for deeper zoom levels, which means field collapsing / StatsComponent may not work fast enough * clusters will not look very natural, because I would have many clusters on each zoom level, and what are real geographical clusters would be displayed as multiple clusters since their points would in some cases be dispersed into multiple squares. But that may be OK * a lot will depend on how the squares are calculated - linearly dividing 360 degrees by N to get squares of equal size in degrees would produce issues with real square sizes and the counts of points in each of them So I'm wondering if there is a better way? Thanks, Bojan
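Under the scheme sketched above, a single request per visible zoom level could return both the per-square counts and the mean positions; in 4.x the StatsComponent can be broken down by a facet field via stats.facet. The field names below are the zoomX ones defined above, the host is a placeholder, and this is only a sketch:

http://localhost:8983/solr/select?q=*:*&rows=0&fq=zoom2:square1_62_47&stats=true&stats.field=lat&stats.field=lon&stats.facet=zoom3

The mean entries in the stats response for lat and lon would then give the center point of each child square, i.e. each cluster.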
Is there a way to get Solr to delete an uploaded document after its been indexed?
Hi, My crawler uploads all the documents to Solr for indexing, and they land in a tomcat/temp folder. Over time this folder grows so large that I run out of disk space. So, I wrote a bash script to delete the files and put it in the crontab. However, if I delete the docs too soon, they don't get indexed; too late and I run out of disk. I'm still trying to find the right window... So (and this is probably a long shot), I'm wondering if there's anything in Solr that can delete these docs from /temp after they've been indexed... Thank you,
Re: Required local configuration with ZK solr.xml?
Found it. In case anyone else cares, this appears to be the root issue: https://issues.apache.org/jira/browse/SOLR-5128 Thanks again.
JVM heap constraints and garbage collection
Greetings esteemed Solr-ites, I'm using Solr 3.5 over Tomcat 6. My index has reached 30G. Since my average load during peak hours is becoming quite high, and since I'm finally starting to notice a little bit of performance degradation and intermittent errors (e.g. Solr returned response 0 on perfectly valid reads during load spikes), I think it's time to tune my Slave box before things get out of control. In particular, *I am curious how others are tuning their JVM heap constraints (xms, xmx, etc.) and garbage collection (parallel or concurrent) to meet the needs of Solr*. I am using the Sun JVM version 6, not the fancy third-party offerings. Some more info, FWIW: - Average document size in my index is probably around 6k - Using CentOS - Master-Slave setup. Master gets all the writes, Slave gets all the read requests. It is the *Slave* that is suffering -- the Master seems fine. - The box is an m1.large on AWS EC2. 2 virtual CPUs, 4 ECU, 7.5 GiB RAM - DaemonThreads skyrocket during the aforementioned load spikes Thanks for reading, and to the devs: thanks for an excellent product. -- - Joe
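For context, a common low-pause starting point on the Sun JVM 6 under Tomcat looks roughly like the sketch below; the values are illustrative only, not a recommendation for this specific box, and would go in setenv.sh or wherever CATALINA_OPTS is set:

CATALINA_OPTS="$CATALINA_OPTS -Xms2g -Xmx2g \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly \
  -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/tomcat/gc.log"

Pinning -Xms equal to -Xmx avoids heap-resize pauses, and the GC log makes it possible to check whether collections line up with the observed load spikes.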
TemplateTransformer returns null values
Hi, I am trying a simple transformer on data input using DIH, Solr 4.6. When I run DIH with the query below, I get null values for new_url. What is wrong? I even tried with ${document_solr.id}. This is the data-config.xml:

<entity name="document_solr" transformer="TemplateTransformer,LogTransformer"
        query="select DOC_IDN as id, BILL_IDN as bill_id from document_solr"
        logTemplate="The name is ${document_solr.DOC_IDN}" logLevel="debug">
  <field column="DOC_IDN" name="id"/>
  <field column="BILL_IDN" name="bill_id"/>
  <field column="new_url" template="${document_solr.DOC_IDN}"/>
</entity>

Below is the trace:
8185946 [Thread-29] INFO org.apache.solr.search.SolrIndexSearcher - Opening Searcher@5a5f4cb7 realtime
8185960 [Thread-29] INFO org.apache.solr.handler.dataimport.JdbcDataSource - Creating a connection for entity document_solr with URL: jdbc:oracle:thin:@vluedb01:1521:iedwdev
8186225 [Thread-29] INFO org.apache.solr.handler.dataimport.JdbcDataSource - Time taken for getConnection(): 265
8186226 [Thread-29] DEBUG org.apache.solr.handler.dataimport.JdbcDataSource - Executing SQL: select DOC_IDN as id, BILL_IDN as bill_id from document_solr
8186291 [Thread-29] TRACE org.apache.solr.handler.dataimport.JdbcDataSource - Time taken for sql: 64
8186301 [Thread-29] DEBUG org.apache.solr.handler.dataimport.LogTransformer - The name is
8186303 [Thread-29] DEBUG org.apache.solr.handler.dataimport.LogTransformer - The name is
8186303 [Thread-29] DEBUG org.apache.solr.handler.dataimport.LogTransformer - The name is
`Tom
Re: Is there a way to get Solr to delete an uploaded document after its been indexed?
Well, it's your crawler that submits them, so the crawler should know when to delete them. If you want some sort of trigger from Solr, look at the postCommit hook defined in solrconfig.xml. Though all that gives you is timing, not which documents to deal with. You could probably also plug into the UpdateRequestProcessor chain, where you do have access to the document content. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
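The postCommit hook Alexandre mentions is configured inside <updateHandler> in solrconfig.xml; a sketch using the stock RunExecutableListener (the script name and directory are hypothetical) might look like:

<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">clean-temp.sh</str>
  <str name="dir">/opt/scripts</str>
  <bool name="wait">false</bool>
</listener>

As noted, this only tells the script that a commit happened; the script itself still has to decide which files in tomcat/temp are safe to remove.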
Re: TemplateTransformer returns null values
I think you have a double mapping there:
*) select DOC_IDN as id
*) <field column="DOC_IDN" name="id"/>
Both map DOC_IDN to id, possibly with the second overriding (or shadowing) the first. Try not doing the 'as' part in the select and then look for .id. Or keep the 'as' part and just have an explicit field definition in the second one: <field column="id"/> Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
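Put together, Alexandre's second suggestion applied to the entity from the original post would read roughly like this (an untested sketch against the poster's schema):

<entity name="document_solr" transformer="TemplateTransformer,LogTransformer"
        query="select DOC_IDN as id, BILL_IDN as bill_id from document_solr"
        logTemplate="The name is ${document_solr.id}" logLevel="debug">
  <field column="id"/>
  <field column="bill_id"/>
  <field column="new_url" template="${document_solr.id}"/>
</entity>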
Re: TemplateTransformer returns null values
Thanks Alexandre for the quick response. I tried both ways but still no luck - null values. Is there anything I am doing fundamentally wrong?
query="select DOC_IDN, BILL_IDN from document_fact"
<field column="DOC_IDN" name="id"/>
and
query="select DOC_IDN as id, BILL_IDN as bill_id from document_fact"
<field column="id"/>
Re: Boosting documents by categorical preferences
Chris, Sounds good! Thanks for the tips. I'll be glad to submit my talk to this, as I have a writeup pretty much ready to go. Cheers Amit On Tue, Jan 28, 2014 at 11:24 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : The initial results seem to be kinda promising... of course there are many : more optimizations I could do, like decaying user ratings over time so that : a 5 rating a year ago doesn't count as much as a 5 rating today. : : Hope this helps others. I'll open source what I have soon and post back. If : there is feedback or other thoughts let me know! Hey Amit, Glad to hear your user-based boosting experiments are paying off. I would definitely love to see a more detailed writeup down the road showing off how it affects your final user metrics -- or perhaps even give a session on your technique at ApacheCon? http://events.linuxfoundation.org/events/apachecon-north-america/program/cfp -Hoss http://www.lucidworks.com/
Re: JVM heap constraints and garbage collection
On 1/30/2014 3:20 PM, Joseph Hagerty wrote: I'm using Solr 3.5 over Tomcat 6. My index has reached 30G. snip - The box is an m1.large on AWS EC2. 2 virtual CPUs, 4 ECU, 7.5 GiB RAM One detail that you did not provide was how much of your 7.5GB RAM you are allocating to the Java heap for Solr, but I actually don't think I need that information, because for your index size, you simply don't have enough. If you're sticking with Amazon, you'll want one of the instances with at least 30GB of RAM, and you might want to consider more memory than that. An ideal RAM size for Solr is equal to the size of on-disk data plus the heap space used by Solr and other programs. This means that if your java heap for Solr is 4GB and there are no other significant programs running on the same server, you'd want a minimum of 34GB of RAM for an ideal setup with your index. 4GB of that would be for Solr itself, the remainder would be for the operating system to fully cache your index in the OS disk cache. Depending on your query patterns and how your schema is arranged, you *might* be able to get away with as little as half of your index size just for the OS disk cache, but it's better to make it big enough for the whole index, plus room for growth. http://wiki.apache.org/solr/SolrPerformanceProblems Many people are *shocked* when they are told this information, but if you think about the relative speeds of getting a chunk of data from a hard disk vs. getting the same information from memory, it's not all that shocking. Thanks, Shawn
Re: TemplateTransformer returns null values
Hmm, try the variable reference without the scope: ${id}. I can't remember if the scope is required only for higher-level items. It might also be worth writing a very basic all-fields logger to see what your in-progress map looks like. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
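Such an all-fields logger can be a tiny custom DIH transformer. A sketch against the 4.x DIH API follows (the class name is made up; it would be referenced by its fully qualified name in the entity's transformer attribute):

import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Logs every key/value in the in-progress DIH row, then passes the row through unchanged.
public class AllFieldsLogger extends Transformer {
  private static final Logger LOG = LoggerFactory.getLogger(AllFieldsLogger.class);

  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    for (Map.Entry<String, Object> e : row.entrySet()) {
      LOG.debug("row field {} = {}", e.getKey(), e.getValue());
    }
    return row;
  }
}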
Re: Regarding Solr Faceting on the query response.
Hi Nilesh, I am not sure the faceting code does what you think it does. However, there are different options and you can experiment with whichever one is best for you. They are controlled by the facet.method parameter: http://wiki.apache.org/solr/SimpleFacetParameters#facet.method Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)