Boosting in version 1.2
Hello, Our documents contain three fields: title, keywords, and content. What we want is to give priority to the field keywords, then title, and last content. So in the xml file that is to be indexed we put the following:

  <doc>
    <field name="keywords" boost="3.0">letters</field>
    <field name="title" boost="2.0">This is a test</field>
    <field name="content"><![CDATA[This is a test]]></field>
  </doc>
  <doc>
    <field name="keywords" boost="3.0">foobar</field>
    <field name="title" boost="2.0">This is a test letters</field>
    <field name="content"><![CDATA[This is a test]]></field>
  </doc>
  <doc>
    <field name="keywords" boost="3.0">foobar</field>
    <field name="title" boost="2.0">This is a test</field>
    <field name="content"><![CDATA[This is a test letters]]></field>
  </doc>

In our schema.xml we have put:

  <defaultSearchField>text</defaultSearchField>
  <copyField source="titlesearch" dest="text"/>
  <copyField source="keywords" dest="text"/>
  <copyField source="content" dest="text"/>

Now when we do a search like this

  http://localhost:8666/solr/select/?q=letters&version=2.2&start=0&rows=10&indent=on

we don't always get the document with letters in keywords on top. To get this to work, we need to specify the 3 search fields like this:

  http://localhost:8666/solr/select/?q=content%3Aletters+OR+titlesearch%3Aletters+OR+keywords%3Aletters&version=2.2&start=0&rows=10&indent=on

I was wondering if there is a way in Solr 1.2 to specify more than one default search field, or is the above solution still the way to go? Thank you, Thierry
How does HTMLStripWhitespaceTokenizerFactory work?
Hello, I am trying to use the solr.HTMLStripWhitespaceTokenizerFactory analyzer with no luck. I have a field content that contains the following:

  <field name="content"><![CDATA[test <a href="test">link</a> post]]></field>

When I do a search I get the following result:

  <result name="response" numFound="1" start="0">
    <doc>
      <str name="content">test &lt;a href="test"&gt;link&lt;/a&gt; post</str>
      <str name="id">po_1_NL</str>
      <str name="keywords">post</str>
      <str name="titlesearch">This is a test</str>
    </doc>
  </result>

Is this normal? Shouldn't the html code and the white spaces be removed from the field? This is my config in schema.xml:

  <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>

  <field name="content" type="text_ws" indexed="true" stored="true" omitNorms="false"/>

Can someone help me with this?
How can I use dates to boost my results?
Hi, For my search use, document freshness is a relevant aspect that should be considered to boost results. I have a field in my index like this:

  <field name="created" type="date" indexed="true" stored="true"/>

How can I make good use of this to boost my results? I'm using the DisMaxRequestHandler to boost other textual fields based on the query, but it would improve the result quality a lot if the date were considered when computing the score. Best Regards, Daniel http://www.bbc.co.uk/ This e-mail (and any attachments) is confidential and may contain personal views which are not the views of the BBC unless specifically stated. If you have received it in error, please delete it from your system. Do not use, copy or disclose the information in any way nor act in reliance on it and notify the sender immediately. Please note that the BBC monitors e-mails sent or received. Further communication will signify your consent to this.
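[Editor's note: in Solr of this vintage the usual answer was a DisMax boost function (the bf parameter), e.g. something like recip(rord(created),1,1000,1000) — treat that exact recipe as an assumption to verify against your version. The sketch below only illustrates the underlying idea: multiply the text score by a reciprocal decay in document age. The names recencyBoost and halfLifeDays are illustrative, not Solr API.]

```java
// Sketch of a recency boost: newer documents get a multiplier near 1.0,
// older documents decay toward 0. boosted = textScore * h / (h + ageInDays)
public class RecencyBoost {
    // Returns a multiplier in (0, 1]: 1.0 for a brand-new document,
    // 0.5 once the document is halfLifeDays old, and so on.
    static double recencyBoost(double ageInDays, double halfLifeDays) {
        return halfLifeDays / (halfLifeDays + ageInDays);
    }

    public static void main(String[] args) {
        double textScore = 2.0;
        System.out.println(textScore * recencyBoost(0, 30));   // fresh doc: 2.0
        System.out.println(textScore * recencyBoost(30, 30));  // at half-life: 1.0
    }
}
```

The half-life constant controls how aggressively freshness dominates relevance; a reciprocal shape is convenient because it never sends the score to zero outright.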
Re: Multi-language indexing and searching
Hi Daniel, If it is functionally 'ok' to search in only one language at a time, you could try having one index per language. Each per-language index would have one schema where you would describe field types (the language part coming through stemming/snowball analyzers, per-language stopwords, et al.) and the same field name could be used in each of them. You could either deploy that solution through multiple web-apps (one per language) or try the patch for issue SOLR-215. Regards, Henri Daniel Alheiros wrote: Hi, I'm just starting to use Solr and so far, it has been a very interesting learning process. I wasn't a Lucene user, so I'm learning a lot about both. My problem is: I have to index and search content in several languages. My scenario is a bit different from others that I've already read about in this forum, as my client is the same to search any language and it could be accomplished using a field to define language. My questions are more focused on how to keep the benefits of all the protwords, stopwords and synonyms in a multilanguage situation. Should I create new Analyzers that can deal with the language field of the document? What do you recommend? Regards, Daniel -- View this message in context: http://www.nabble.com/Multi-language-indexing-and-searching-tf3885324.html#a11027333 Sent from the Solr - User mailing list archive at Nabble.com.
Re: Multi-language indexing and searching
Hi Henri. Thanks for your reply. I've just looked at the patch you referred to, but doing this I will lose the out-of-the-box Solr installation... I'll have to create my own Solr application responsible for creating the multiple cores, and I'll have to change my indexing process to something able to route content to a specific core. Can't I have the same index, using one single core, with the same field names being processed by language-specific components based on a field/parameter? I will try to draw what I'm thinking; please forgive me if I'm not using the correct terms, but I'm not an IR expert. Thinking in a workflow:

Indexing: the multilanguage indexer receives some documents; for each document, it verifies the language field: if language = English then process using the EnglishIndexer, else if language = Chinese then process using the ChineseIndexer, else if ...

Querying: the Multilanguage Request Handler receives a request: if parameter language = English then process using the English Request Handler, else if language = Chinese then process using the Chinese Request Handler, else if ...

I can see that in the schema field definitions we have some language-dependent parameters... It can be a problem, as I would like to have the same fields for all requests... Sorry to bother, but before I split all my data this way I would like to be sure that it's the best approach for me. Regards, Daniel On 8/6/07 15:15, Henrib [EMAIL PROTECTED] wrote: Hi Daniel, If it is functionally 'ok' to search in only one lang at a time, you could try having one index per lang. Each per-lang index would have one schema where you would describe field types (the lang part coming through stemming/snowball analyzers, per-lang stopwords al) and the same field name could be used in each of them. You could either deploy that solution through multiple web-apps (one per lang) (or try the patch for issue Solr-215).
Regards, Henri Daniel Alheiros wrote: Hi, I'm just starting to use Solr and so far, it has been a very interesting learning process. I wasn't a Lucene user, so I'm learning a lot about both. My problem is: I have to index and search content in several languages. My scenario is a bit different from other that I've already read in this forum, as my client is the same to search any language and it could be accomplished using a field to define language. My questions are more focused on how to keep the benefits of all the protwords, stopwords and synonyms in a multilanguage situation Should I create new Analyzers that can deal with the language field of the document? What do you recommend? Regards, Daniel
problem with schema.xml
Hi, I just started playing around with Solr 1.2. It has some nice improvements. I noticed that errors in the schema.xml get reported in a verbose way now, but the following steps cause a problem for me:

1. start with a correct schema.xml -> Solr works fine.
2. edit it in a way that it is no longer correct (say, remove the </schema> closing tag) -> Solr still works fine.
3. restart the webapp (through the Tomcat manager interface) -> Solr complains that the schema.xml does not parse, fine.
4. now restart again (without fixing the schema.xml!) -> Solr won't even start up.
5. fix the above problem (add the closing tag) and restart via Tomcat's manager -> the webapp cannot restart, showing that there is a problem: FAIL - Application at context path /furness could not be started

These steps might seem artificial, but assume you don't manage to fix all the typos in your schema.xml on the first attempt. It seems that after the restart Solr gets stuck in some state and I cannot get it up and running via Tomcat's manager, only by restarting Tomcat. Am I missing something? Thanks, mirko
Re: How does HTMLStripWhitespaceTokenizerFactory work?
On 6/8/07, Thierry Collogne [EMAIL PROTECTED] wrote: I am trying to use the solr.HTMLStripWhitespaceTokenizerFactory analyzer with no luck. [...] Is this normal? Shouldn't the html code and the white spaces be removed from the field? For indexing purposes, yes. The stored field you get back will be unchanged though. If you want to see what will be indexed, try the analysis debugger in the admin pages. -Yonik
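[Editor's note: the distinction Yonik draws — stored value returned verbatim, markup stripped only from the indexed tokens — can be made concrete with a small sketch. The naive regex below is NOT how HTMLStripWhitespaceTokenizerFactory works internally; it only mimics the visible effect of index-time analysis, under that stated assumption.]

```java
// The stored field is what a search result returns (unchanged); the
// indexed form is what queries actually match against (markup removed).
public class StoredVsIndexed {
    // Crude tag stripper, for illustration only.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String stored = "test <a href=\"test\">link</a> post";
        System.out.println(stored);            // what the response shows
        System.out.println(stripTags(stored)); // roughly what gets indexed
    }
}
```

This is why searching for "link" matches the document even though the response still shows the original HTML.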
Cannot index '' this character using post.jar
Hi all, I tried to index a document that has '' in it using post.jar, but during the indexing it causes an error and it won't finish the indexing. Can I know why this is and how to prevent it? Thanks! Jeffrey
Re: Boosting in version 1.2
On 8-Jun-07, at 2:07 AM, Thierry Collogne wrote:

> Hello, Our documents contain three fields. title, keywords, content. What we want is to give priority to the field keywords, then title and last content
> In our schema.xml we have put
> <defaultSearchField>text</defaultSearchField>
> <copyField source="titlesearch" dest="text"/>
> <copyField source="keywords" dest="text"/>
> <copyField source="content" dest="text"/>
> Now when we do a search like this
> http://localhost:8666/solr/select/?q=letters&version=2.2&start=0&rows=10&indent=on
> We don't always get the document with letters in keywords on top. To get this to work, we need to specify the 3 search fields like this

I'm surprised that that finds anything -- you've specified a defaultSearchField that doesn't exist in the documents you posted.

> http://localhost:8666/solr/select/?q=content%3Aletters+OR+titlesearch%3Aletters+OR+keywords%3Aletters&version=2.2&start=0&rows=10&indent=on
> I was wondering if there is a way in Solr 1.2 to specify more than one default search field, or is the above solution still the way to go?

This is precisely the situation that the dismax handler was designed for. Plus, you don't have to fiddle around with document boosts. Try:

  qt=dismax
  q=letters
  qf=keywords^3.0 title^2.0 content

-Mike
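[Editor's note: rather than passing qt/qf on every request, the per-field boosts can be registered as defaults in solrconfig.xml. A minimal sketch, assuming Solr 1.2's DisMaxRequestHandler; the handler name and boost values are illustrative.]

```xml
<!-- Registered in solrconfig.xml: clients then only need q=letters&qt=dismax -->
<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <!-- query these fields with per-field weights instead of one default field -->
    <str name="qf">keywords^3.0 title^2.0 content^1.0</str>
  </lst>
</requestHandler>
```

With query-time field boosts in qf, the index-time document/field boosts in the posted XML become unnecessary for this ranking goal.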
Re: problem with schema.xml
I don't use tomcat, so I can't be particularly useful. The behavior you describe does not happen with resin or jetty... My guess is that tomcat is caching the error state. Since fixing the problem is outside the webapp directory, it does not think it has changed so it stays in a broken state. if you touch the .war file, does it restart ok? but i'm just guessing... [EMAIL PROTECTED] wrote: Hi, I just started playing around with Solr 1.2. It has some nice improvements. I noticed that errors in the schema.xml get reported in a verbose way now, but the following steps cause a problem for me: 1. start with a correct schema.xml - Solr works fine 2. edit it in a way that is no longer correct (say, remove the /schema closing tag - Solr works fine 3. restart the webapp (through the Tomcat manager interface) - Solr complains that the schema.xml does not parse, fine. 4. now restart again (without fixing the schema.xml!) - Solr won't even start up 5. fix the above problem (add the closing tag) and restart via Tomcat's manager - the webapp cannot restart showing that there is a problem: FAIL - Application at context path /furness could not be started The following steps might seem artificial, but assume you don't manage to fix all the typos in your schema.xml for the first attempt. It seems after restart Solr gets stuck in some state and I cannot get it up and running by Tomcat's manager, only by restarting Tomcat. Am I missing something? Thanks, mirko
Re: problem with schema.xml
Hi Ryan, I have my .war file located outside the webapps folder (I am using multiple Solr instances with a config as suggested on the wiki: http://wiki.apache.org/solr/SolrTomcat). Nevertheless, I touched the .war file, the config file, the directory under webapps, but nothing seems to be working. Any other suggestions? Is someone else experiencing the same problem? thanks, mirko Quoting Ryan McKinley [EMAIL PROTECTED]: I don't use tomcat, so I can't be particularly useful. The behavior you describe does not happen with resin or jetty... My guess is that tomcat is caching the error state. Since fixing the problem is outside the webapp directory, it does not think it has changed so it stays in a broken state. if you touch the .war file, does it restart ok? but i'm just guessing...
Re: To make sure XML is UTF-8
Tiong Jeffrey wrote: Thought this is not directly related to Solr, but I have XML output from a mysql database, and during indexing the XML output is not working. The problem is that part of the XML output is not in UTF-8 encoding. How can I convert it to UTF-8, and how do I know what kind of encoding it uses in the first place (the data I export from the mysql database)? Thanks!

You won't have any problem with standard JAXP and java.util.* etc. classes, even with complex MySQL data (one column is LATIN1, another is LATIN2, another is ASCII, ...). In Java, use standard classes: String, Long, Date. And use JAXP.
Re: To make sure XML is UTF-8
Thought this is not directly related to Solr, but I have XML output from a mysql database, and during indexing the XML output is not working. The problem is that part of the XML output is not in UTF-8 encoding. How can I convert it to UTF-8, and how do I know what encoding it uses in the first place (the data I export from the mysql database)? Thanks!

How do you generate the XML output? Output itself is usually a raw byte array; it uses a transport and an encoding. If you save it in a file system and forget about the transport-layer encoding, you will get some new problems... "during indexing the XML output is not working" - what exactly happens, and what kind of error messages do you get?
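[Editor's note: the core of the fix being discussed is to decode the exported bytes with the charset they were actually written in, and only then re-encode as UTF-8. A minimal sketch; ISO-8859-1 (Latin-1) is an assumed source charset here — check your actual MySQL table/column collation.]

```java
import java.nio.charset.StandardCharsets;

public class Recode {
    // Decode with the SOURCE charset. A Java String is charset-neutral
    // after decoding, so it can then be written out as valid UTF-8.
    static String latin1ToUtf8String(byte[] latin1Bytes) {
        return new String(latin1Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] exported = {(byte) 0xE9};        // 'é' as a single Latin-1 byte
        String s = latin1ToUtf8String(exported);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // The same character now occupies two bytes in UTF-8.
        System.out.println(s + " -> " + utf8.length + " UTF-8 bytes");
    }
}
```

The common failure mode is decoding Latin-1 bytes as if they were UTF-8, which yields malformed sequences that an XML parser (and hence the Solr update handler) rejects.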
Re: Multi-language indexing and searching
: Can't I have the same index, using one single core, same field names being : processed by language specific components based on a field/parameter? yes, but you don't really need the complexity you describe below ... you don't need separate request handlers per language, just separate fields per language. Assuming you care about 3 concepts: title, author, body ... in a single-language index those might correspond to three fields; in your index they correspond to 3*N fields, where N is the number of languages you want to support... title_french title_english title_german ... author_french author_english ... Documents which are in English only get values for the English fields, documents in French etc. ... unless perhaps you want to support translations of the documents, in which case you can have values in the fields for multiple languages; it's up to you. When a user wants to query in French, you take their input and query against the body_french field and display the title_french field, etc... -Hoss
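[Editor's note: the per-language field layout Hoss describes might look like the schema.xml fragment below. The type names and analyzer chains are illustrative assumptions, not a prescribed configuration.]

```xml
<!-- One field type per language; only the stemmer/stopword config differs -->
<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
<fieldType name="text_fr" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>

<!-- 3*N fields: each document populates only the fields for its language -->
<field name="title_english" type="text_en" indexed="true" stored="true"/>
<field name="title_french"  type="text_fr" indexed="true" stored="true"/>
<field name="body_english"  type="text_en" indexed="true" stored="true"/>
<field name="body_french"   type="text_fr" indexed="true" stored="true"/>
```

The application then picks which fields to query and display based on the user's language, with no per-language request handlers needed.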
Re: Solr 1.2 released
Hello Yonik, This is great news. Will it be a drop-in replacement for 1.1? I.e., do I need to make any changes other than replacing the jar files? I suppose the index files will still be good. Are 1.2 schema files and config files compatible with those of 1.1? -- Best regards, Jack Thursday, June 7, 2007, 7:32:18 AM, you wrote: Solr 1.2 is now available for download! This is the first release since Solr graduated from the Incubator, and includes many improvements, including CSV/delimited-text data loading, time based auto-commit, faster faceting, negative filters, a spell-check handler, sounds-like word filters, regex text filters, and more flexible plugins. Solr releases can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/solr/ -Yonik
Re: Solr 1.2 released
On 6/8/07, Jack L [EMAIL PROTECTED] wrote: This is great news. Will it be a drop-in replacement for 1.1? I.e., do I need to make any changes other than replacing the jar files? I suppose the index files will still be good. Are 1.2 schema files and config files compatible with those of 1.1? It should be easy to upgrade. See the release notes (CHANGES.txt)... there is a section on upgrading from 1.1 -Yonik
Re: Wildcards / Binary searches
: Do you mean something like below ? : field name=autocompletew wo wor word/field yeah, but there are some Tokenizers that make this trivial (EdgeNGramTokenizer i think is the name) : project, definitively not a good practice for portability of indexes. A : duplicate field with an analyser to produce a sortable ASCII version : would be better. exactly ... I think conceptually the methodology for solving the problem is very similar to the way the SpellChecker contrib works: use a very custom index designed for the application (not just look at the terms in the main corpus) and custom logic for using that index. -Hoss
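[Editor's note: the "w wo wor word" expansion Hoss endorses is exactly what an edge n-gram tokenizer emits. A minimal sketch of that expansion, under the assumption of min gram size 1 and no upper bound; it is not the Lucene tokenizer itself.]

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGrams {
    // Every leading prefix of the term: "word" -> w, wo, wor, word.
    // Matching a user's partial input against these grams gives autocomplete.
    static List<String> grams(String term) {
        List<String> out = new ArrayList<>();
        for (int i = 1; i <= term.length(); i++) {
            out.add(term.substring(0, i));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(grams("word")); // [w, wo, wor, word]
    }
}
```

Generating the prefixes at index time trades index size for prefix queries that become simple, fast term lookups.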
RE: Solr 1.2 released
I noticed there is no example/ext directory or the jars that were found there in 1.1 (commons-el.jar, commons-logging.jar, jasper-*.jar, mx4j-*.jar). I have a jar that my Solr plugin depends on. This jar contains a class that needs to be loaded only once per container because it is a JNI library. For that reason, it cannot be placed in a per-webapp lib directory. (I am assuming placing the jars in example/solr/lib is the same as placing them in each webapp's WEB-INF/lib, from the class loading point of view. Am I right?) I tried putting this jar in the top-level lib and example/solr/lib, but the jar wasn't recognized. Where should I put jars shared by multiple webapps? BTW, in order to investigate this, I inspected the start.conf file inside start.jar and it seems the new start.jar is expecting to find ant.jar in this fixed location: /usr/share/java/ Is this intended? (I don't know why jetty needs ant anyway.) -kuro
Re: solr+hadoop = next solr
On 6/7/07, Rafael Rossini [EMAIL PROTECTED] wrote: Hi, Jeff and Mike. Would you mind telling us about the architecture of your solutions a little bit? Mike, you said that you implemented a highly-distributed search engine using Solr as indexing nodes. What does that mean? You guys implemented a master, multi-slave solution for replication? Or the whole index shards for high availability and fail over? Our solution doesn't use solr, but goes directly to lucene. It's built on windows, so the interop communication service is built on .net remoting (tcp based). Microsoft has deprecated ongoing development with .net remoting, in favor of other more standard mechanisms, i.e. http. So, we're looking to migrate our solution to a more community-supported model. The underlying structure sounds similar to what others have done: index shards distributed to various servers, each responsible for a subset of the index. A merging server handles coordination of concurrent thread requests and synchronizes the results as they're returned. The thread coordination and search results interleaving process is functional but not really scalable. It works for our user model, where users tend not to page deeply through results. We want to change that so we can use solr as our primary data source read mechanism for our site. -- j
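[Editor's note: the merging-server step described above — each shard returns its own score-sorted top hits, and a coordinator interleaves them into one globally ordered list — can be sketched as below. All names (ShardMerge, Hit) are illustrative, not from any of the systems discussed.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class ShardMerge {
    record Hit(String id, double score) {}

    // Collect every shard's candidates into a max-heap by score,
    // then pop the global top k. Each shard list is its local top-k.
    static List<Hit> merge(List<List<Hit>> shardResults, int k) {
        PriorityQueue<Hit> heap =
            new PriorityQueue<>((a, b) -> Double.compare(b.score(), a.score()));
        for (List<Hit> shard : shardResults) heap.addAll(shard);
        List<Hit> out = new ArrayList<>();
        while (!heap.isEmpty() && out.size() < k) out.add(heap.poll());
        return out;
    }

    public static void main(String[] args) {
        List<Hit> s1 = List.of(new Hit("a", 0.9), new Hit("b", 0.4));
        List<Hit> s2 = List.of(new Hit("c", 0.7), new Hit("d", 0.2));
        merge(List.of(s1, s2), 3).forEach(h -> System.out.println(h.id()));
        // prints a, c, b
    }
}
```

Deep paging is what makes this costly: to serve page p, every shard must return roughly p*k candidates, which matches the scalability caveat in the post.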
RE: Solr 1.2 released
: I noticed there is no example/ext : directory or jars that was found there : in 1.1 (commons-el.jar, commons-logging.jar, : jasper-*.jar, mx4j-*.jar) the example/ext directory was an entirely Jetty-based artifact. When we upgraded the Jetty used in the example setup, Jetty no longer had an ext directory, so it was removed. : I have a jar that my Solr plugin depends on. : This jar contains a class that needs to be : loaded only once per container because : it is a JNI library. For that reason, it : cannot be placed in a per-webapp lib : directory. (I am assuming placing the jars : in example/solr/lib is same as placing them : in each web app's WEB-INF/lib, from the : class loading point of view. Am I right?) not exactly, a custom classloader is constructed for the ${solr.home}/lib directory, but it is a child loader of the Servlet Context loader, so you are probably right about it being a poor place to put a JNI library. : Where should I put jars shared by multiple : shared apps? that really depends on your servlet container. The scaled-down Jetty instance provided is purely an *example* so that people who want to try solr can do so without needing to download, install, and understand the configuration of a servlet container. If you want to use Jetty 6, then you should read the Jetty docs to learn more about loading classes in the system classloader. Alternately, if you liked Jetty 5 (which is what was used in the Solr 1.1 example) you can use it ... but people really shouldn't count on the servlet container provided to power the example behaving consistently as new versions of Solr come out -- it might switch to tomcat in the next version; it all depends on which one is simpler, smaller, easier to set up for the example, etc... : is expecting to find ant.jar in this : fixed location: : /usr/share/java/ : : Is this intended? (I don't know why jetty : needs ant anyway.) i really can't say ... it's purely a Jetty thing. We make no modifications to Jetty's start.jar -Hoss