Re: Fwd: Language detection for solr 3.6.1
On 07/08/2014 03:17 AM, Poornima Jay wrote:
> I'm using the Google library which I mentioned in my first mail, saying I'm using http://code.google.com/p/language-detection/. I have downloaded the jar file from the URL below:
> https://www.versioneye.com/java/org.apache.solr:solr-langid/3.6.1
> Please let me know from where I need to download the correct jar file.
> Regards,

I don't think you need to download anything. It's included in the Solr 3.6.1 package.

    $ ls contrib/langid/lib
    jsonic-1.2.7.jar          jsonic-NOTICE.txt             langdetect-LICENSE-ASL.txt
    jsonic-LICENSE-ASL.txt    langdetect-1.1-20120112.jar   langdetect-NOTICE.txt

langdetect-1.1-20120112.jar is the one you find on the Google Code site. It isn't developed by Google; it was developed by a Japanese company, Cybozu.

I used this some years ago for comparison purposes, but I don't remember exactly how I did it. You'd have to move the JARs in contrib/langid/lib to Solr's lib directory, and use LangDetectLanguageIdentifierUpdateProcessorFactory instead of TikaLanguageIdentifierUpdateProcessorFactory in the commented-out portion of example/solr/conf/solrconfig.xml (and you need to uncomment that portion, of course).

Hope this helps.

-- 
T. Kuro Kurosaka • Senior Software Engineer
Healthline - The Power of Intelligent Health
www.healthline.com | @Healthline | @HealthlineCorp
Re: ICUTokenizer or StandardTokenizer or ??? for text_all type field that might include non-whitespace langs
On 06/20/2014 04:04 AM, Allison, Timothy B. wrote:
> Let's say a predominantly English document contains a Chinese sentence. If the English field uses the WhitespaceTokenizer with a basic WordDelimiterFilter, the Chinese sentence could be tokenized as one big token (if it doesn't have any punctuation, of course) and will be effectively unsearchable...barring use of wildcards.

In my experiment with Solr 4.6.1, both StandardTokenizer and ICUTokenizer generate one token per Han character, so the Chinese text is searchable, though precision suffers. But in your scenario Chinese text is rare, so some precision loss may not be a real issue.

Kuro
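The difference between the two behaviours discussed above can be sketched in plain Java. This is only a simplified stand-in, not Lucene's WhitespaceTokenizer, StandardTokenizer or ICUTokenizer: the class and method names are made up for illustration, and the sample sentence is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: simulates the two tokenization outcomes, not Lucene code.
class WhitespaceVsPerHan {

    // Split on whitespace only, the way a whitespace-based tokenizer would.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Split on whitespace, then emit every Han character as its own token.
    static List<String> perHanTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String chunk : whitespaceTokens(text)) {
            StringBuilder run = new StringBuilder();
            for (int i = 0; i < chunk.length(); ) {
                int cp = chunk.codePointAt(i);
                i += Character.charCount(cp);
                if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                    // Flush any non-Han run collected so far, then emit the Han character alone.
                    if (run.length() > 0) {
                        tokens.add(run.toString());
                        run.setLength(0);
                    }
                    tokens.add(new String(Character.toChars(cp)));
                } else {
                    run.appendCodePoint(cp);
                }
            }
            if (run.length() > 0) {
                tokens.add(run.toString());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // \u4E2D\u6587\u6587\u672C is a four-character Chinese word with no spaces.
        String text = "English text \u4E2D\u6587\u6587\u672C here";
        System.out.println(whitespaceTokens(text)); // the Chinese run stays one token
        System.out.println(perHanTokens(text));     // each Han character is a token
    }
}
```

With whitespace splitting, a query for a single character from the Chinese run cannot match the one big token; with per-character tokens it can, at the cost of matching unrelated words that share a character.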
Re: Strict mode at searching and indexing
On 05/30/2014 08:29 AM, Erick Erickson wrote:
> I see errors in both cases. Do you
> (1) have schemaless configured, or
> (2) have a dynamic field pattern that matches your non_exist_field?

Maybe

    <!--dynamicField name="*" type="ignored" multiValued="true" /-->

is un-commented in schema.xml?

Kuro
Re: Stemming for Chinese and Japanese
On 05/20/2014 11:31 AM, Geepalem wrote:
> Hi,
> What is the filter to be used to implement stemming for Chinese and Japanese language field types?
> For English, I have used
>     <filter class="solr.SnowballPorterFilterFactory" language="English" />
> and it's working fine.

What do you mean by "working fine"? Try analyzing this with the text_en field type:

    単語は何個ありますか?

This is the Japanese sentence for "How many tokens are there?", and the correct answer is 5, 6 or 7, depending on how you count some compound words. With text_en you should be seeing 10 instead. Try text_ja, and you will see 7.

I don't recommend using text_cjk for Chinese, Japanese and Korean. They are *very* different languages, and you should use a different analyzer for each. StandardTokenizer just doesn't work for Chinese and Japanese at all, since there are no spaces between words in these languages.

Kuro
Any Solrj API to obtain field list?
I'd like to write Solr client code that writes text to a language-specific field, say myfield_es for Spanish, if the field myfield_es is defined in schema.xml, and otherwise to a fall-back field myfield. To do this, I need to obtain a list of defined fields (and dynamic fields) from the server. But I cannot find a suitable SolrJ API. Is there one? I'm using Solr 4.6.1.

I could write code that uses the Schema REST API (https://wiki.apache.org/solr/SchemaRESTAPI), but I would much prefer to use existing code if there is any.

-- 
T. Kuro Kurosaka
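Once the list of defined field names is available from somewhere, the fall-back rule described above is small. The sketch below assumes the names have already been fetched into a Set; FieldChooser and chooseField are hypothetical names, and dynamic field patterns are not handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: picks the language-specific field when the schema defines it.
class FieldChooser {

    // Returns baseField + "_" + languageCode if that field is defined, else baseField.
    static String chooseField(Set<String> definedFields, String baseField, String languageCode) {
        String candidate = baseField + "_" + languageCode;
        return definedFields.contains(candidate) ? candidate : baseField;
    }

    public static void main(String[] args) {
        // Assumed schema contents, for illustration only.
        Set<String> defined = new HashSet<String>(Arrays.asList("id", "myfield", "myfield_es"));
        System.out.println(chooseField(defined, "myfield", "es")); // language-specific field exists
        System.out.println(chooseField(defined, "myfield", "ja")); // falls back to the base field
    }
}
```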
Re: Any Solrj API to obtain field list?
On 05/27/2014 02:29 PM, Jack Krupansky wrote:
> You might consider an update request processor as an alternative. It runs on the server and might be simpler. You can even use the stateless script update processor to avoid having to write any custom Java code.
> -- Jack Krupansky

That's an interesting approach. I'll consider it.

On 05/27/2014 02:04 PM, Sujit Pal wrote:
> Have you looked at IndexSchema? That would offer you methods to query index metadata using SolrJ.
> http://lucene.apache.org/solr/4_7_2/solr-core/org/apache/solr/schema/IndexSchema.html
> -sujit

The question was essentially how to get an IndexSchema in a SolrJ client, hopefully without needing to parse the XML file.

On 05/27/2014 02:16 PM, Ahmet Arslan wrote:
> Hi,
> https://wiki.apache.org/solr/LukeRequestHandler
> make sure numTerms=0 for performance

I'm afraid this won't work, because when the index is empty Luke won't return any fields. And for the fields that have been written, this method returns more information than I need. I just want to know whether a field is valid or not.

Kuro
Re: Any Solrj API to obtain field list?
On 05/27/2014 02:55 PM, Steve Rowe wrote:
> You can call the Schema API from SolrJ - see Shawn Heisey's example code here: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201307.mbox/%3c51daecd2.6030...@elyograg.org%3e
> Steve

It looks like this returns a JSON representation of the fields if I do

    query.setRequestHandler("/schema/fields");

I guess this is the closest SolrJ can do. Thank you, Steve.
Re: Any Solrj API to obtain field list?
On 05/27/2014 04:21 PM, Steve Rowe wrote:
> Shawn's code shows that SolrJ parses the JSON for you into NamedList (response.getResponse()).
> - Steve

Thank you for pointing that out. It wasn't apparent what get(key) returns, since the method signature of getResponse() merely says it returns a NamedList<Object>. After running test code under a debugger, I found that for key="fields" the returned object is an ArrayList<SimpleOrderedMap<NameValuePair>>. This is what I came up with:

    private static final String url = "http://localhost:8983/solr/hlbase";
    private static final SolrServer server = new HttpSolrServer(url);
    ...
    SolrQuery query = new SolrQuery();
    // Send the query to the Schema API endpoint instead of /select.
    query.setRequestHandler("/schema/fields");
    QueryResponse response = server.query(query);
    // For key "fields" the response holds one map per defined field.
    List<SimpleOrderedMap<NameValuePair>> fields =
        (ArrayList<SimpleOrderedMap<NameValuePair>>) response.getResponse().get("fields");
    for (SimpleOrderedMap<NameValuePair> fmap : fields) {
        System.out.println(fmap.get("name"));
    }

Kuro
Re: Solr special characters like '(' and '&'?
I don't think & is special to the parser. Classic examples like AT&T just work, as far as the query parser is concerned. https://wiki.apache.org/solr/SolrQuerySyntax even says that you can escape the special meaning of a character with a backslash. & is special in the URL, however, and there it has to be hex-escaped as %26.

On 04/08/2014 06:37 AM, Peter Kirk wrote:
> Hi
> How to search for Solr special characters like '(' and '&'?

Kuro
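The URL-level escaping mentioned above can be left to the standard library rather than done by hand. A minimal sketch, assuming the query value is sent as a URL parameter; QueryParamEscape and encodeParam are illustrative names, and only java.net.URLEncoder is used.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative helper: percent-encodes a parameter value for use in a URL.
class QueryParamEscape {

    static String encodeParam(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java runtime.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // '&' would otherwise end the q parameter, so it becomes %26.
        System.out.println("q=" + encodeParam("AT&T"));
        // A backslash-escaped '(' for the query parser; both characters get percent-encoded.
        System.out.println("q=" + encodeParam("\\("));
    }
}
```

Note that this handles only the URL layer. Escaping for the query parser (the backslash) is a separate step that happens before the value is URL-encoded.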
Re: Analysis of Japanese characters
Tom,
You should be using JapaneseAnalyzer (kuromoji). Neither CJK nor ICU tokenizes at word boundaries.

On 04/02/2014 10:33 AM, Tom Burton-West wrote:
> Hi Shawn,
>
> I'm not sure I understand the problem and why you need to solve it at the ICUTokenizer level rather than the CJKBigramFilter. Can you perhaps give a few examples of the problem?
>
> Have you looked at the flags for the CJKBigramFilter? You can tell it to make bigrams of different Japanese character sets. For example, the config given in the JavaDocs tells it to make bigrams across 3 of the different Japanese character sets. (Is the issue related to Romaji?)
>
>     <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="false" />
>
> http://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilterFactory.html
>
> Tom
>
> On Wed, Apr 2, 2014 at 1:19 PM, Shawn Heisey s...@elyograg.org wrote:
>> My company is setting up a system for a customer from Japan. We have an existing system that handles primarily English. Here's my general text analysis chain:
>> http://apaste.info/xa5
>> After talking to the customer about problems they are encountering with search, we have determined that some of the problems are caused because ICUTokenizer splits on *any* character set change, including changes between different Japanese character sets.
>> Knowing the risk of this being an XY problem, here's my question: Can someone help me develop a rule file for the ICU Tokenizer that will *not* split when the character set changes from one of the Japanese character sets to another Japanese character set, but still split on other character set changes?
>> Thanks,
>> Shawn
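The splitting behaviour Shawn describes, and the variant he asks for, can be illustrated with a small plain-Java simulation. This is not ICUTokenizer and not an ICU rule file; it only uses java.lang.Character.UnicodeScript to show where script changes fall in a sample string. ScriptSplitDemo and its methods are hypothetical names.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Illustrative simulation of "split on script change", not the ICU implementation.
class ScriptSplitDemo {

    // The three scripts that ordinary Japanese text mixes freely.
    static boolean isJapanese(UnicodeScript s) {
        return s == UnicodeScript.HAN || s == UnicodeScript.HIRAGANA || s == UnicodeScript.KATAKANA;
    }

    // Splits text wherever the script changes. When groupJapanese is true,
    // changes among Han, Hiragana and Katakana do not cause a split.
    static List<String> split(String text, boolean groupJapanese) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        UnicodeScript previous = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            UnicodeScript script = UnicodeScript.of(cp);
            boolean sameGroup = previous != null
                    && (script == previous
                        || (groupJapanese && isJapanese(script) && isJapanese(previous)));
            if (!sameGroup && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            previous = script;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Solr" followed by Han (3 chars), Hiragana (1 char) and Katakana (3 chars).
        String text = "Solr\u65E5\u672C\u8A9E\u306E\u30C6\u30B9\u30C8";
        System.out.println(split(text, false)); // splits at every script change
        System.out.println(split(text, true));  // keeps the Japanese run together
    }
}
```

Even the grouped variant yields one long run rather than words, which is why a morphological analyzer such as the JapaneseAnalyzer recommended above is the better fit for word-level search.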
Re: w/10 ? [was: Partial Counts in SOLR]
On 3/19/14 5:13 PM, Otis Gospodnetic wrote:
> Hi,
> Guessing it's surround query parser's support for "within" backed by span queries.
> Otis

You mean this?
http://wiki.apache.org/solr/SurroundQueryParser

I guess this parser's documentation needs improvement. It doesn't explain or give an example of the w/int syntax at all. (Is this the infix notation of W?) An example would also help explain the difference between W and N; some readers may not understand what "ordered" and "unordered" mean in this context.

Kuro
w/10 ? [was: Partial Counts in SOLR]
In the thread "Partial Counts in SOLR", Salman gave us this sample query:

    ((stock or share*) w/10 (sale or sell* or sold or bought or buy* or purchase* or repurchase*)) w/10 (executive or director)

I'm not familiar with this w/10 notation. What does it mean, and which parser(s) support this syntax?

Kuro
Re: Apache Solr Configuration Problem (Japanese Language)
Andy,
I don't have a direct answer to your question, but I have a question.

On 03/05/2014 07:21 AM, Andy Alexander wrote:
> fq=ss_language:ja&q=製品

I am guessing you have a field called ss_language where the language code of the document is stored, and that you have Solr documents in different languages.

> <str name="parsedquery">+DisjunctionMaxQuery((content:製品)~0.01)</str>

This indicates that your default query field is content. What does the analyzer for this field look like? Does the analyzer work for all the languages that you want to support? Many analyzers are language-dependent and won't work with multilingual fields.

-- 
T. Kuro Kurosaka
What types is supported by Solrj addBean() in the fields of POJO objects?
What types are supported in the POJO objects that are sent to SolrServer.addBean(obj)? A quick glance at DocumentObjectBinder seems to suggest that an arbitrary combination of Collection, List, ArrayList, array ([]), Map and HashMap of primitive types, String and Date is supported, but I'm not too sure. I would also like to know which Solr field types are allowed for each (Java) field type of the object. Is there documentation explaining this?

Kuro
search across cores
If I want to search across cores, can I use (abuse?) distributed search? My simple experiment seems to confirm this, but I'd like to know if there are any drawbacks other than those of distributed search listed here:
https://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations

If all cores are served by the same machine, does a distributed search actually make sub-search requests over HTTP? Or is it clever enough to skip the HTTP connection?

Kuro
Re: Escape \\n from getting highlighted - highlighter component
Your search expression means 'talk' OR 'n' OR 'text'. I think you want to do a phrase search. To do that, enclose the whole thing in double quotes, "talk n text", if you are using one of the Solr standard query parsers.

On 02/17/2014 03:53 PM, Developer wrote:
> Hi,
> When searching for a text like 'talk n text' the highlighter component also adds the <em> tags to the special characters like \n. Is there a way to avoid highlighting the special characters?
> \\r\\n Family Messaging
> is getting replaced as
> \\r\\<em>n</em> Family Messaging

Kuro
Re: geo/spatial search performance comparison using different methods
Thank you, David. I believe the field doesn't need to be multivalued. Can you give me some idea of how much query-time performance gain we can expect by switching from SOLR-2155 to LatLonType?

On 11/06/2013 09:56 AM, Smiley, David W. wrote:
> Hi Kuro,
> I don't know of any benchmarks featuring distance-sort performance. Presumably you are using SOLR-2155 because you have multi-valued spatial fields? If so, LatLonType is not an option. SOLR-2155 sorting performance is *probably* about the same as the equivalent in Solr 4 RPT.
> If you actually do have single valued spatial to sort on, then definitely don't use SOLR-2155 or RPT for that, use LatLonType. It's surely faster but I haven't measured it.
> The best multi-valued distance sort option for Solr 4 is currently this:
> https://issues.apache.org/jira/browse/SOLR-5170
> ~ David

-- 
T. Kuro Kurosaka
geo/spatial search performance comparison using different methods
Are there any performance comparison results available that compare various methods of sorting results by distance (not just filtering) on Solr 3 and 4? We are using Solr 3.5 with the SOLR-2155 patch. I am particularly interested in learning the performance differences among Solr 3 LatLongType, SOLR-2155 GeoHash, the Solr 4 implementation of GeoHash and Solr 4's SpatialRecursivePrefixTreeFieldType (location_rpt). I see comparison of Solr 3 LatLongType vs Solr-2155 3.6.2-work/example/solr/conf/ but it is 2 years old.

-- 
T. Kuro Kurosaka
Re: character encoding issue...
It sounds like the characters were mishandled at index build time. I would use Luke to see whether a character that appears correctly when you change the output to Shift-JIS is actually stored as one Unicode character. I bet it's stored as two characters, each holding the value of the high or low byte of the Shift-JIS character.

There are many possible causes of this. If you are indexing HTML documents from HTTP servers, a server may be configured to send the wrong charset= info in the Content-Type header. If the document comes directly from a file system and doesn't have a META header declaring the charset, then the system assumes a default charset, typically ISO-8859-1 or UTF-8, and misinterprets the Shift-JIS encoded characters.

You need to debug to find out where the characters get corrupted.

On 11/04/2013 11:15 PM, Chris wrote:
> Sorry, was away a bit hence the delay.
> I am inserting java strings into a java bean class, and then doing a addBean() method to insert the POJO into Solr.
> When I query using either tomcat/jetty, I get these special characters. But I have noted, if I change output to Shift-JIS encoding then those characters appear as some Japanese characters I think.
> But then this solution doesn't work for all special characters as I can still see some of them...isn't there an encoding that can cover all the characters whatever they might be? Any ideas on what do I do?
> Regards,
> Chris
>
> On Mon, Nov 4, 2013 at 6:27 PM, Erick Erickson erickerick...@gmail.com wrote:
>> The problem is there are about a dozen places where the character encoding can be mis-configured. The problem you're seeing above actually looks like a problem with the character set configured in your browser, it may have nothing to do with what's actually in Solr.
>> You might write small SolrJ program and see if you can dump the contents in binary and examine to see...
>> Best
>> Erick
>>
>> On Sun, Nov 3, 2013 at 6:39 AM, Rajani Maski rajinima...@gmail.com wrote:
>>> How are you extracting the text that is there in the website[1] you are referring to? Apache Nutch or any other crawler? If yes, initially check whether that crawler engine is giving you data in correct format before you invoke solr index method.
>>> [1] http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/
>>> URI encoding should resolve this problem.
>>>
>>> On Fri, Nov 1, 2013 at 10:50 AM, Chris christu...@gmail.com wrote:
>>>> Hi Rajani,
>>>> I followed the steps exactly as in
>>>> http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/
>>>> However, when I send a query to this new instance in tomcat, I again get the error -
>>>> <str name="fulltxt">Scheduled Groups Maintenance In preparation for the new release roll-out, Diigo groups won’t be accessible on Sept 28 (Mon) around midnight 0:00 PST for several hours. Stay tuned to say hello to Diigo V4 soon!
>>>> location of the text - http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/
>>>> same problem at - http://cn.nytimes.com/business/20130926/c26alibaba/
>>>> All text in title comes like -
>>>> - � </str> <arr name="text"> <str> - � </str> </arr>
>>>> Can you please advise?
>>>> Chris
>>>>
>>>> On Tue, Oct 29, 2013 at 11:33 PM, Rajani Maski rajinima...@gmail.com wrote:
>>>>> Hi,
>>>>> If you are using Apache Tomcat Server, hope you are not missing the below mentioned configuration:
>>>>> <Connector port="port Number" protocol="HTTP/1.1" connectionTimeout="2" redirectPort="8443" *URIEncoding="UTF-8"*/>
>>>>> I had faced similar issue with Chinese Characters and had resolved with the above config.
>>>>> Links for reference:
>>>>> http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/
>>>>> http://blog.sidu.in/2007/05/tomcat-and-utf-8-encoded-uri-parameters.html#.Um_3P3Cw2X8
>>>>> Thanks
>>>>>
>>>>> On Tue, Oct 29, 2013 at 9:20 PM, Chris christu...@gmail.com wrote:
>>>>>> Hi All,
>>>>>> I get characters like - �� - CTA - in the solr index. I am adding Java beans to solr by the addBean() function.
>>>>>> This seems to be a character encoding issue. Any pointers on how to resolve this one?
>>>>>> I have seen that this occurs mostly for Japanese and Chinese characters.

-- 
T. Kuro Kurosaka
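The corruption suspected at the top of this reply (one Shift-JIS character ending up as two stored characters) can be reproduced in a few lines of plain Java. This is a sketch of the suspected mechanism under the assumption that the bytes were decoded as ISO-8859-1; it is not a diagnosis of Chris's actual pipeline, and MojibakeDemo is an illustrative name.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: shows how Shift-JIS bytes read as ISO-8859-1 double the character count.
class MojibakeDemo {

    private static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    // Encode as Shift-JIS, then (wrongly) decode the bytes as ISO-8859-1.
    static String misdecode(String original) {
        byte[] sjisBytes = original.getBytes(SHIFT_JIS);
        return new String(sjisBytes, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: recover the raw bytes and decode them as Shift-JIS.
    static String repair(String corrupted) {
        byte[] rawBytes = corrupted.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, SHIFT_JIS);
    }

    public static void main(String[] args) {
        String original = "\u88FD\u54C1"; // two kanji, written as escapes to avoid source-encoding issues
        String corrupted = misdecode(original);
        System.out.println(original.length() + " characters before, "
                + corrupted.length() + " characters after misdecoding");
        System.out.println("repairable: " + repair(corrupted).equals(original));
    }
}
```

This also explains why switching the output encoding to Shift-JIS appears to "fix" some text: the browser reassembles the byte pairs. The index itself still holds the wrong characters, so the fix belongs at the point where the bytes are first decoded.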
Phrase query with prefix query
Is there a query parser that supports a phrase query with a prefix query at the end, such as "San Fran*"?

-- 
T. Kuro Kurosaka
Re: predefined variables usable in schema.xml ?
I tried to use ${solr.core.instanceDir} in schema.xml with Solr 4.0, where every deployment is multi-core, and it didn't work. Either the description of pre-defined properties on the CoreAdmin wiki page is wrong, or perhaps it only works in solrconfig.xml?

On 11/28/12 5:17 PM, T. Kuro Kurosaka wrote:
> Thank you, Hoss. I found that this SolrWiki page talks about pre-defined properties such as solr.core.instanceDir:
> http://wiki.apache.org/solr/CoreAdmin
> I tried to use ${solr.core.instanceDir} in the default single-core schema.xml, and it didn't work. Is this page wrong, or are these properties available only in multi-core deployments?
Re: predefined variables usable in schema.xml ?
Sorry, correction. ${solr.core.instanceDir} is working in a sense: it is replaced by the core name rather than by a directory path. Earlier, at startup, Solr prints:

    INFO: Creating SolrCore 'collection1' using instanceDir: solr/collection1

But judging from the error message I get, ${solr.core.instanceDir} is replaced by the value collection1 (no solr/). I was hoping that ${solr.core.instanceDir} would be replaced by the absolute path to the examples/core/collection1 directory.

On 11/30/12 2:41 PM, T. Kuro Kurosaka wrote:
> I tried to use ${solr.core.instanceDir} in schema.xml with Solr 4.0, where every deployment is multi-core, and it didn't work. It must be that the description about pre-defined properties in CoreAdmin wiki page is wrong, or it only works in solrconfig.xml, perhaps?
Re: predefined variables usable in schema.xml ?
Thank you, Hoss. I found that this SolrWiki page talks about pre-defined properties such as solr.core.instanceDir:
http://wiki.apache.org/solr/CoreAdmin

I tried to use ${solr.core.instanceDir} in the default single-core schema.xml, and it didn't work. Is this page wrong, or are these properties available only in multi-core deployments?

On 11/27/12 2:27 PM, Chris Hostetter wrote:
> : The default solrconfig.xml seems to suggest ${solr.data.dir} can be used.
> : So I am hoping there is another pre-defined variable like this that points to
> : the solr core directory.
>
> there's nothing special about solr.data.dir ... it's used in the example configs as a convenient way to let you override it on the command line when running the example, otherwise it defaults to the empty string which triggers the default dataDir logic (ie: ./data in the instanceDir)...
>
>     <dataDir>${solr.data.dir:}</dataDir>
>
> : <charFilter class="com.basistech.rlp.solr.RCLUNormalizeCharFilterFactory"
> :     rlpContext="solr/conf/rlp-context-rclu.xml"/>
> :
> : This only works if Solr is started from $SOLR_HOME/example, as it is relative
> : to the current working directory.
>
> if your factories are using the SolrResourceLoader.openResource to load those files then you can change that to just be 'rlpContext="rlp-context-rclu.xml"' and it will just plain work -- the SolrResourceLoader is SolrCloud/ZooKeeper aware, and in standalone mode checks the conf dir, the classpath, and as a last resort attempts to resolve it as a relative path -- if your custom factories just call new File(rlpContext) on the string, then you're stuck using absolute paths, or needing to define system properties at runtime.
>
> -Hoss
predefined variables usable in schema.xml ?
Is there a pre-defined variable that can be used in schema.xml to point to the solr core directory, or to its conf subdirectory? I thought ${solr.home} or perhaps ${solr.solr.home} might work, but they didn't (unless -Dsolr.home=/my/solr/home is supplied, that is).

The default solrconfig.xml seems to suggest ${solr.data.dir} can be used. So I am hoping there is another pre-defined variable like this that points to the solr core directory.

Use case, in case you wonder: We have our custom CharFilter, Tokenizer, TokenFilter, and corresponding factories. Currently we ship a schema.xml that contains lines like:

    <charFilter class="com.basistech.rlp.solr.RCLUNormalizeCharFilterFactory"
        rlpContext="solr/conf/rlp-context-rclu.xml"/>

This only works if Solr is started from $SOLR_HOME/example, as the path is relative to the current working directory. Our customers have to change the value to an absolute path if they'd like to use Tomcat or any web container other than Solr's built-in Jetty.

We'd rather write something like this:

    <charFilter class="com.basistech.rlp.solr.RCLUNormalizeCharFilterFactory"
        rlpContext="${solr.conf.dir}/rlp-context-rclu.xml"/>

Kuro
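The working-directory dependence described above comes from how java.io.File treats relative paths. The sketch below contrasts that with resolving the same file name against a known configuration directory. It uses only the JDK, not Solr's resource loader; the directory /opt/solr/conf and the class and method names are assumptions for illustration.

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: why a relative config path depends on where the JVM was started.
class ConfigPathDemo {

    // What new File(relative) effectively points at: relative to the JVM's working directory.
    static Path resolveAgainstWorkingDir(String relative) {
        return new File(relative).getAbsoluteFile().toPath();
    }

    // Resolving against an explicit conf directory gives the same answer wherever the JVM starts.
    static Path resolveAgainstConfDir(Path confDir, String relative) {
        return confDir.resolve(relative).normalize();
    }

    public static void main(String[] args) {
        // Changes with the directory the container was launched from.
        System.out.println(resolveAgainstWorkingDir("solr/conf/rlp-context-rclu.xml"));
        // Stable, given a known conf directory (hypothetical path).
        System.out.println(resolveAgainstConfDir(Paths.get("/opt/solr/conf"), "rlp-context-rclu.xml"));
    }
}
```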
Re: Any filter to map mutiple tokens into one ?
On 10/14/12 12:19 PM, Jack Krupansky wrote:
> There's a miscommunication here somewhere. Is Solr 4.0 still passing *:* to the analyzer? Show us the parsed query for *:*, as well as the debugQuery explain for the score.

I'm not quite sure what you mean by "the parsed query for *:*". This fake analyzer using NGramTokenizer divides *:* into three tokens, "*", ":" and "*", on purpose, to simulate our Tokenizer's behavior.

An excerpt of the XML results from the query is pasted at the bottom of this message.

> I mean, *:* (MatchAllDocsQuery) has a constant score, so there isn't any way for it to be suboptimal.

That's exactly the point I'd like to raise. No matter what analyzers are assigned to fields, the hit score for *:* must remain 1.0, but that's not happening when an analyzer that divides *:* is in use.

Here's an excerpt of the query response. Notice this element, which should not be there, in my opinion:

    DisjunctionMaxQuery((name:* : *^0.5))

There is a space between * and :, and another space between : and *.

    <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">33</int>
      <lst name="params">
        <str name="indent">on</str>
        <str name="wt"/>
        <str name="version">2.2</str>
        <str name="rows">10</str>
        <str name="defType">edismax</str>
        <str name="pf">name^0.5</str>
        <str name="fl">*,score</str>
        <str name="debugQuery">on</str>
        <str name="start">0</str>
        <str name="q">*:*</str>
        <str name="qt"/>
        <str name="fq"/>
      </lst>
    </lst>
    <result name="response" numFound="32" start="0" maxScore="0.14764866">
      <doc>
        <str name="id">GB18030TEST</str>
        <str name="name">Test with some GB18030 encoded characters</str>
        <arr name="features">
          <str>No accents here</str>
          <str>这是一个功能</str>
          <str>This is a feature (translated)</str>
          <str>这份文件是很有光泽</str>
          <str>This document is very shiny (translated)</str>
        </arr>
        <float name="price">0.0</float>
        <str name="price_c">0,USD</str>
        <bool name="inStock">true</bool>
        <long name="_version_">1415830106215022592</long>
        <float name="score">0.14764866</float>
      </doc>
      ...
    </result>
    <lst name="debug">
      <str name="rawquerystring">*:*</str>
      <str name="querystring">*:*</str>
      <str name="parsedquery">
        (+MatchAllDocsQuery(*:*) DisjunctionMaxQuery((name:* : *^0.5)))/no_coord
      </str>
      <str name="parsedquery_toString">+*:* (name:* : *^0.5)</str>
      <lst name="explain">
        <str name="GB18030TEST">
          0.14764866 = (MATCH) sum of:
            0.14764866 = (MATCH) MatchAllDocsQuery, product of:
              0.14764866 = queryNorm
        </str>
      </lst>
      <str name="QParser">ExtendedDismaxQParser</str>
      <null name="altquerystring"/>
      <null name="boostfuncs"/>
      ...
    </lst>
    </lst>
    </lst>
    </response>
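The three-way split of *:* described in this message, and the kind of folding the thread title asks about, can be sketched over plain string lists. This is not a Lucene TokenFilter and makes no claim about how Solr should handle *:*; TokenFolding, singleCharTokens and fold are hypothetical names used only to make the idea concrete.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: operates on lists of strings, not on a Lucene TokenStream.
class TokenFolding {

    // Mimics a 1-gram tokenizer: every character becomes its own token.
    static List<String> singleCharTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            tokens.add(new String(Character.toChars(cp)));
        }
        return tokens;
    }

    // Replaces every occurrence of the token sequence 'pattern' with one 'replacement' token.
    static List<String> fold(List<String> tokens, List<String> pattern, String replacement) {
        List<String> out = new ArrayList<String>();
        int n = pattern.size();
        for (int i = 0; i < tokens.size(); ) {
            if (n > 0 && i + n <= tokens.size() && tokens.subList(i, i + n).equals(pattern)) {
                out.add(replacement);
                i += n;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> split = singleCharTokens("*:*");
        System.out.println(split);                                           // three one-character tokens
        System.out.println(fold(split, Arrays.asList("*", ":", "*"), "*:*")); // folded back into one token
    }
}
```

As Jack points out elsewhere in the thread, the query parser recognizes *:* before analysis runs, so folding inside the analysis chain would at best be a workaround for the phrase-boost clause, not a change to how the main query is parsed.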
Re: Any filter to map mutiple tokens into one ?
On 10/15/12 10:35 AM, Jack Krupansky wrote:
> And you're absolutely certain you see *:* being passed to your analyzer in the final release of Solr 4.0???

I don't have direct evidence. This is the only theory I have that explains why changing the FieldType causes the sub-optimal scores. If you know of a way to tell whether a tokenizer is really invoked, let me know.
Re: Any filter to map mutiple tokens into one ?
Jack,

I don't think SOLR-3261 describes this issue. I ran the same experiment with Solr 3.6 and the score for all the matches was 0.1626374. The newly released Solr 4.0.0 also returns a suboptimal score of 0.14764866.

Kuro

On 10/12/12 2:03 PM, Jack Krupansky wrote:

I don't have a Solr 3.5 to check, but SOLR-3261, which was fixed in Solr 3.6, may be your culprit. See: https://issues.apache.org/jira/browse/SOLR-3261

So, try Solr 3.6 or 3.6.1 or 4.0 to see if your issue goes away.

-- Jack Krupansky
Re: Any filter to map mutiple tokens into one ?
On 10/11/12 4:47 PM, Jack Krupansky wrote:

The : which normally separates a field name from a term (or quoted string or parenthesized sub-query) is parsed by the query parser before analysis gets called, and *:* is recognized before analysis as well. So, any attempt to recreate *:* in analysis will be too late to affect query parsing and other pre-analysis processing.

That's why I suspect a bug in Solr. The tokenizer shouldn't play any role here, but it is affecting the score calculation. I am seeing evidence that *:* is being passed to my tokenizer. I'm trying to find a way to work around this by reconstructing *:* in the analysis chain.

But, what is it you are really trying to do? What's the real problem? (This sounds like a proverbial XY Problem.)

-- Jack Krupansky
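To make the tokenizer remark above concrete, here is a minimal Python sketch of what happens if the literal string *:* reaches a tokenizer that emits one token per character. The function `unigram_tokenize` is a hypothetical stand-in written for illustration; it is not the Solr NGramTokenizer nor the product tokenizer mentioned in the thread.

```python
def unigram_tokenize(text):
    """Hypothetical stand-in: emit one token per character of the input."""
    return list(text)


tokens = unigram_tokenize("*:*")
print(tokens)            # ['*', ':', '*']
print(" ".join(tokens))  # * : *
```

Joining the three tokens with spaces gives the same "* : *" text that shows up inside DisjunctionMaxQuery in the debug output, which is consistent with the suspicion that the whole query string is being analyzed.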
Re: Any filter to map mutiple tokens into one ?
Jack,

It goes like this:

http://myhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&qt=&wt=&debugQuery=on

and edismax is the default query parser in solrconfig.xml.

There is a field named text_jpn that uses a Tokenizer that we developed as a product, which we can't share here. But I can simulate our situation using NGramTokenizer. After indexing the Solr sample docs normally, stop Solr and insert:

<fieldtype name="text_fake" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory" maxGramSize="1" minGramSize="1" />
  </analyzer>
</fieldtype>

Replace the field definition for name, for example:

<field name="name" type="text_fake" indexed="true" stored="true"/>

In solrconfig.xml, change the default search handler's definition like this:

<str name="defType">edismax</str>
<str name="pf">name^0.5</str>

(I guess I could just have these in the URL.) Start Solr and give this URL:

http://localhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&qt=&wt=&debugQuery=on&explainOther=&hl.fl=

Hopefully you'll see

<float name="score">0.3663672</float>

and

+MatchAllDocsQuery(*:*) DisjunctionMaxQuery((name:* : *^0.5))

in the debug output. The score calculation should not be done when the query is *:*, which has a special meaning, should it? And even if the score calculation is done, *:* shouldn't be fed to Tokenizers, should it?

On 10/12/12 9:44 AM, Jack Krupansky wrote:

Okay, let's back up. First, hold off mixing in your proposed solution until after we understand the actual, original problem:

1. What is your field and field type (with analyzer details)?
2. What is your query parser (defType)?
3. What is your query request URL?
4. What is the parsed query (add debugQuery=true to your query request)? (Actually, I think you gave us that)

I just tried the following query with the fresh 4.0 release and it works fine:

http://localhost:8983/solr/collection1/select?q=*:*&wt=xml&debugQuery=true&defType=edismax

<str name="rawquerystring">*:*</str>

The parsed query is:

<str name="parsedquery">(+MatchAllDocsQuery(*:*))/no_coord</str>

And this was with the 4.0 example schema, adding *.xml and books.json documents.

If you could try your scenario with 4.0 that would be a help. If it's a bug in 3.5 that is fixed now... oh well. I mean, feel free to check the revision history for edismax since the 3.5 release.

-- Jack Krupansky
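The request URL used in this message can also be assembled from its individual parameters, which makes it easier to see what is being sent. The following is only a sketch using Python's standard library; the host, port and parameter values are the example values from the message, and nothing here contacts a server.

```python
from urllib.parse import urlencode

# Parameters of the select request, in the order they appear in the URL.
params = [
    ("indent", "on"),
    ("version", "2.2"),
    ("q", "*:*"),
    ("fq", ""),
    ("start", "0"),
    ("rows", "10"),
    ("fl", "*,score"),
    ("qt", ""),
    ("wt", ""),
    ("debugQuery", "on"),
]

# safe="*" keeps the asterisk literal; ":" and "," are percent-encoded.
query_string = urlencode(params, safe="*")
url = "http://localhost:8983/solr/select?" + query_string
print(url)
```

If defType and pf were passed in the URL instead of solrconfig.xml, as the message speculates, they could presumably be appended to `params` as ("defType", "edismax") and ("pf", "name^0.5") in the same way.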
Any filter to map mutiple tokens into one ?
I am looking for a way to fold a particular sequence of tokens into one token. Concretely, I'd like to detect a three-token sequence of *, : and *, and replace it with a token of the text *:*. I tried SynonymFilter but it seems it can only deal with a single input token.

* : * => *:*

seems to be interpreted as one input token of 5 characters: *, space, :, space and *.

I'm using Solr 3.5.

Background: My tokenizer separates the three-character sequence *:* into 3 tokens of one character each. The edismax parser, when given the query *:*, i.e. find every doc, seems to pass the entire string *:* to the query analyzer (I suspect a bug), and feed the tokenized result to a DisjunctionMaxQuery object, according to this debug output:

<lst name="debug">
<str name="rawquerystring">*:*</str>
<str name="querystring">*:*</str>
<str name="parsedquery">+MatchAllDocsQuery(*:*) DisjunctionMaxQuery((body:* : *~100^0.5 | title:* : *~100^1.2)~0.01)</str>
<str name="parsedquery_toString">+*:* (body:* : *~100^0.5 | title:* : *~100^1.2)~0.01</str>

Notice that there is a space between * and : in DisjunctionMaxQuery((body:* : * ). Probably because of this, the hit score is as low as 0.109, while it is 1.000 if an analyzer that doesn't break *:* is used. So I'd like to stitch together *, :, * into *:* again to make DisjunctionMaxQuery happy.

Thanks.

T. Kuro Kurosaka
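The folding described above can be sketched on a plain list of token strings. This is an illustration of the logic only, written in Python; it is not a Lucene TokenFilter and not an existing Solr filter, and the name `fold_sequence` is made up for this example. A real filter would also have to buffer tokens from the stream and decide what to do with offsets and position increments, which this sketch ignores.

```python
def fold_sequence(tokens, pattern, replacement):
    """Replace each occurrence of the token sequence `pattern` with one token."""
    pattern = list(pattern)
    if not pattern:
        raise ValueError("pattern must not be empty")
    out = []
    i = 0
    n = len(pattern)
    while i < len(tokens):
        if tokens[i:i + n] == pattern:
            # Found the full sequence: emit the single folded token.
            out.append(replacement)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


print(fold_sequence(["*", ":", "*"], ["*", ":", "*"], "*:*"))
# ['*:*']
print(fold_sequence(["a", "*", ":", "*", "b"], ["*", ":", "*"], "*:*"))
# ['a', '*:*', 'b']
```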
Why does Solr (1.4.1) keep so many Tokenizer objects?
While investigating a bug, I found that Solr keeps many Tokenizer objects. This experimental 80-core Solr 1.4.1 system runs on Tomcat. It was continuously sent indexing requests in parallel, and it eventually died due to OutOfMemory. The heap dump that was taken by the JVM shows there were 14477 Tokenizer objects, or about 180 Tokenizer objects per core, at the time it died.

Each core's schema.xml has only 5 fields that use this Tokenizer, so I'd think at most 5 Tokenizers per indexing thread are needed. Tomcat in its default configuration can run up to 200 threads, so at most 1000 Tokenizer objects should be enough.

My colleague ran a similar experiment on a 10-core Solr 3.6 system and observed fewer Tokenizer objects there, but there were still 48 Tokenizers per core. Why does Solr keep this many Tokenizer objects?

Kuro
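The arithmetic behind these figures can be checked in a few lines of Python. All numbers are taken from the message; the variable names are only for illustration, and the bound of 5 Tokenizers per thread is the message's own assumption about what should be needed.

```python
total_tokenizers = 14477  # Tokenizer objects found in the heap dump
cores = 80                # cores in the Solr 1.4.1 system
fields_per_core = 5       # fields in each schema.xml using this Tokenizer
max_threads = 200         # Tomcat thread limit cited in the message

per_core = total_tokenizers / cores
expected_upper_bound = fields_per_core * max_threads
ratio = total_tokenizers / expected_upper_bound

print(round(per_core))        # 181
print(expected_upper_bound)   # 1000
print(f"{ratio:.1f}")         # 14.5
```

So the heap held roughly 14 times as many Tokenizer objects as the estimate of 1000 would suggest.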