Re: Umlauts as Char
On 2011-02-08, Prescott Nasser wrote: in the void substitute function you'll see them: else if ( buffer.charAt( c ) == 'ü' ) { buffer.setCharAt( c, 'u' ); } This does not constitute a character in .NET (that I can figure out) and thus it doesn't compile. The .java file says it is encoded in UTF-8. I was thinking maybe I could do the same thing in VS2010, but I'm not finding a way, and searching on this has been difficult. IIRC VS will recognize UTF-8 encoded files if they start with a byte order mark (BOM), but Java usually doesn't write one. I think I once found the setting for reading/writing UTF-8 in VS; I will need to search for it when at work. If you have a JDK installed, you can use its native2ascii tool to replace non-ASCII characters with Unicode escape sequences that you can then use in C# as well (see Nicolas' post). If you have Ant installed (sorry, can't resist ;-) you can convert the whole tree in one (untested) go with something like <copy todir="will-hold-translated-files" encoding="utf8"> <fileset dir="holds-original-files"/> <escapeunicode/> </copy> Stefan
RE: Umlauts as Char
Stefan somewhat nailed it on the head. My concerns were the Java characters - I can't even search Google or Bing for them. So I can take the source code's word that 'ü' is the u with dots over it (because it says replace umlauts in the source notes). But, I guess, is that really true? Is that perhaps u with a caret over it instead? I'm tempted to take the source at its word and just replace them with the umlaut versions (via character map - thanks Aaron), and then add a comment noting what it originally was in the Java source. What are your guys' thoughts? ~P From: bode...@apache.org To: lucene-net-dev@lucene.apache.org Subject: Re: Umlauts as Char Date: Tue, 8 Feb 2011 06:01:27 +0100 On 2011-02-08, Nicholas Paldino [.NET/C# MVP] wrote: You can simply use the Unicode escape sequence in code and in string/character literals, as specified by section 2.4.2 of the C# spec (http://msdn.microsoft.com/en-us/library/aa664670(v=vs.71).aspx): I think in Prescott's case part of the problem is that he doesn't know which character the sequence is supposed to be. In this case it likely is an ü. else if ( buffer.charAt( c ) == 'ü' ) { buffer.setCharAt( c, 'u' ); } Would become: else if ( buffer.charAt( c ) == '\u00C3¼' ) { buffer.setCharAt( c, 'u' ); } No. The two bytes are part of a two byte UTF-8 sequence making up a single character. Stefan
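For reference, the two bytes that show up as garbage (0xC3 0xBC) are the UTF-8 encoding of the single code point U+00FC, i.e. the umlaut ü - not a u with a caret. A minimal sketch of the substitute branch written with a Unicode escape instead of a literal character (the same \u00FC escape is valid in both Java and C# char literals; buffer is the StringBuilder used by GermanStemmer):

    // equivalent to: buffer.charAt(c) == 'ü', but safe to keep in an ASCII-only source file
    else if ( buffer.charAt( c ) == '\u00FC' ) {
      buffer.setCharAt( c, 'u' );
    }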
[jira] Updated: (LUCENE-2910) Highlighter does not correctly highlight the phrase around 50th term
[ https://issues.apache.org/jira/browse/LUCENE-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shinya Kasatani updated LUCENE-2910: Attachment: HighlighterFix.patch A test case that describes the problem, along with a fix. Highlighter does not correctly highlight the phrase around 50th term Key: LUCENE-2910 URL: https://issues.apache.org/jira/browse/LUCENE-2910 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.9.4 Reporter: Shinya Kasatani Priority: Trivial Attachments: HighlighterFix.patch When you use the Highlighter combined with N-Gram tokenizers such as CJKTokenizer and try to highlight the phrase that appears around 50th term in the field, the highlighted phrase is shorter than expected. e.g. Highlighting fooo in the following text with bigram tokenizer: 0-1-2-3-4-fooo--- Expected: 0-1-2-3-4-<B>fooo</B>--- Actual: 0-1-2-3-4-f<B>ooo</B>--- -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2910) Highlighter does not correctly highlight the phrase around 50th term
Highlighter does not correctly highlight the phrase around 50th term Key: LUCENE-2910 URL: https://issues.apache.org/jira/browse/LUCENE-2910 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.9.4 Reporter: Shinya Kasatani Priority: Trivial Attachments: HighlighterFix.patch When you use the Highlighter combined with N-Gram tokenizers such as CJKTokenizer and try to highlight the phrase that appears around 50th term in the field, the highlighted phrase is shorter than expected. e.g. Highlighting fooo in the following text with bigram tokenizer: 0-1-2-3-4-fooo--- Expected: 0-1-2-3-4-<B>fooo</B>--- Actual: 0-1-2-3-4-f<B>ooo</B>--- -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2910) Highlighter does not correctly highlight the phrase around 50th term
[ https://issues.apache.org/jira/browse/LUCENE-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shinya Kasatani updated LUCENE-2910: Description: When you use the Highlighter combined with N-Gram tokenizers such as CJKTokenizer and try to highlight the phrase that appears around 50th term in the field, the highlighted phrase is shorter than expected. {noformat} e.g. Highlighting fooo in the following text with bigram tokenizer: 0-1-2-3-4-fooo--- Expected: 0-1-2-3-4-<B>fooo</B>--- Actual: 0-1-2-3-4-f<B>ooo</B>--- {noformat} was: When you use the Highlighter combined with N-Gram tokenizers such as CJKTokenizer and try to highlight the phrase that appears around 50th term in the field, the highlighted phrase is shorter than expected. e.g. Highlighting fooo in the following text with bigram tokenizer: 0-1-2-3-4-fooo--- Expected: 0-1-2-3-4-<B>fooo</B>--- Actual: 0-1-2-3-4-f<B>ooo</B>--- Highlighter does not correctly highlight the phrase around 50th term Key: LUCENE-2910 URL: https://issues.apache.org/jira/browse/LUCENE-2910 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.9.4 Reporter: Shinya Kasatani Priority: Trivial Attachments: HighlighterFix.patch When you use the Highlighter combined with N-Gram tokenizers such as CJKTokenizer and try to highlight the phrase that appears around 50th term in the field, the highlighted phrase is shorter than expected. {noformat} e.g. Highlighting fooo in the following text with bigram tokenizer: 0-1-2-3-4-fooo--- Expected: 0-1-2-3-4-<B>fooo</B>--- Actual: 0-1-2-3-4-f<B>ooo</B>--- {noformat} -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2666) ArrayIndexOutOfBoundsException when iterating over TermDocs
[ https://issues.apache.org/jira/browse/LUCENE-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991315#comment-12991315 ] Nick Pellow commented on LUCENE-2666: - Hi Michael, This issue was entirely a problem with our code, and I doubt Lucene could have done a better job. The problem was that on upgrade of the index (done when fields have changed etc), we recreate the index in the same location using {{IndexWriter.create(directory, analyzer, true, MAX_FIELD_LENGTH)}}. Some code was added just before this however, that deleted every single file in the directory. This meant that some other thread performing a search could have seen a corrupt index, thus causing the AIOOBE. The developer was paranoid that IndexWriter.create was leaving old files lying around. I'm glad we got to the bottom of this, and very much so that it was not a bug in Lucene! Thanks again for helping us track this down. Best Regards, Nick Pellow ArrayIndexOutOfBoundsException when iterating over TermDocs --- Key: LUCENE-2666 URL: https://issues.apache.org/jira/browse/LUCENE-2666 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.0.2 Reporter: Shay Banon Attachments: checkindex-out.txt A user got this very strange exception, and I managed to get the index that it happens on. Basically, iterating over the TermDocs causes an AAOIB exception. I easily reproduced it using the FieldCache which does exactly that (the field in question is indexed as numeric). Here is the exception: Exception in thread main java.lang.ArrayIndexOutOfBoundsException: 114 at org.apache.lucene.util.BitVector.get(BitVector.java:104) at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127) at org.apache.lucene.search.FieldCacheImpl$LongCache.createValue(FieldCacheImpl.java:501) at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:183) at org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:470) at TestMe.main(TestMe.java:56) It happens on the following segment: _26t docCount: 914 delCount: 1 delFileName: _26t_1.del And as you can see, it smells like a corner case (it fails for document number 912, the AIOOB happens from the deleted docs). The code to recreate it is simple: FSDirectory dir = FSDirectory.open(new File(index)); IndexReader reader = IndexReader.open(dir, true); IndexReader[] subReaders = reader.getSequentialSubReaders(); for (IndexReader subReader : subReaders) { Field field = subReader.getClass().getSuperclass().getDeclaredField(si); field.setAccessible(true); SegmentInfo si = (SegmentInfo) field.get(subReader); System.out.println(-- + si); if (si.getDocStoreSegment().contains(_26t)) { // this is the probleatic one... 
System.out.println(problematic one...); FieldCache.DEFAULT.getLongs(subReader, __documentdate, FieldCache.NUMERIC_UTILS_LONG_PARSER); } } Here is the result of a check index on that segment: 8 of 10: name=_26t docCount=914 compound=true hasProx=true numFiles=2 size (MB)=1.641 diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.18-194.11.1.el5.centos.plus, os=Linux, mergeDocStores=true, lucene.version=3.0.2 953716 - 2010-06-11 17:13:53, source=merge, os.arch=amd64, java.version=1.6.0, java.vendor=Sun Microsystems Inc.} has deletions [delFileName=_26t_1.del] test: open reader.OK [1 deleted docs] test: fields..OK [32 fields] test: field norms.OK [32 fields] test: terms, freq, prox...ERROR [114] java.lang.ArrayIndexOutOfBoundsException: 114 at org.apache.lucene.util.BitVector.get(BitVector.java:104) at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127) at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:102) at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:616) at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:509) at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299) at TestMe.main(TestMe.java:47) test: stored fields...ERROR [114] java.lang.ArrayIndexOutOfBoundsException: 114 at org.apache.lucene.util.BitVector.get(BitVector.java:104) at org.apache.lucene.index.ReadOnlySegmentReader.isDeleted(ReadOnlySegmentReader.java:34) at org.apache.lucene.index.CheckIndex.testStoredFields(CheckIndex.java:684) at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:512) at
[jira] Commented: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text
[ https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991316#comment-12991316 ] Robert Muir commented on LUCENE-2909: - Is the bug really in NGramTokenFilter? This seems to be a larger problem that would affect all tokenfilters that break larger tokens into smaller ones and recalculate offsets, right? For example: EdgeNGramTokenFilter, ThaiWordFilter, SmartChineseAnalyzer's WordTokenFilter, etc? I think WordDelimiterFilter has special code that might avoid the problem (line 352), so it might be ok. Is there any better way we could solve this: for example maybe instead of the tokenizer calling correctOffset() it gets called somewhere else? This seems to be what is causing the problem. NGramTokenFilter may generate offsets that exceed the length of original text - Key: LUCENE-2909 URL: https://issues.apache.org/jira/browse/LUCENE-2909 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.9.4 Reporter: Shinya Kasatani Assignee: Koji Sekiguchi Priority: Minor Attachments: TokenFilterOffset.patch When using NGramTokenFilter combined with CharFilters that lengthen the original text (such as ß -> ss), the generated offsets exceed the length of the original text. This causes InvalidTokenOffsetsException when you try to highlight the text in Solr. While it is not possible to know the accurate offset of each character once you tokenize the whole text with tokenizers like KeywordTokenizer, NGramTokenFilter should at least avoid generating invalid offsets. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text
[ https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991319#comment-12991319 ] Uwe Schindler commented on LUCENE-2909: --- The problem has nothing to do with CharFilters. This problem always occurs if endOffset - startOffset != termAtt.length(). If you e.g. put a stemmer before ngramming that creates longer tokens (like Portuguese -ã -> -ão or German ß -> ss), you have the same problem. A solution might be to use some factor to correct these offsets: (endOffset - startOffset) / termAtt.length() NGramTokenFilter may generate offsets that exceed the length of original text - Key: LUCENE-2909 URL: https://issues.apache.org/jira/browse/LUCENE-2909 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.9.4 Reporter: Shinya Kasatani Assignee: Koji Sekiguchi Priority: Minor Attachments: TokenFilterOffset.patch When using NGramTokenFilter combined with CharFilters that lengthen the original text (such as ß -> ss), the generated offsets exceed the length of the original text. This causes InvalidTokenOffsetsException when you try to highlight the text in Solr. While it is not possible to know the accurate offset of each character once you tokenize the whole text with tokenizers like KeywordTokenizer, NGramTokenFilter should at least avoid generating invalid offsets. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
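To illustrate the correction factor Uwe mentions above, here is a minimal, hypothetical Java sketch (this is not the patch attached to the issue): a gram's offsets are scaled by the ratio of the original token's span to the possibly longer stemmed term length, so the resulting end offset can never run past the original text.

    // Hypothetical helper: map a gram's end position within the (stemmed) term
    // back into the original token's startOffset/endOffset range.
    static int correctedEndOffset(int tokenStart, int tokenEnd, int termLength, int gramEnd) {
      float scale = (float) (tokenEnd - tokenStart) / termLength; // Uwe's (endOffset - startOffset) / termAtt.length()
      return tokenStart + Math.round(gramEnd * scale);
    }

If the term length equals the original span, this degenerates to tokenStart + gramEnd, i.e. the current behaviour; when the term is longer than the span it compresses the gram offsets into the original range.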
[jira] Commented: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text
[ https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991320#comment-12991320 ] Robert Muir commented on LUCENE-2909: - You are right, some stemmers increase the size, so this assumption that end - start = termAtt.length is a problem. So, between this and LUCENE-2208, I think we need to add some more checks/asserts to BaseTokenStreamTestCase (at least to validate offset end, but maybe some other ideas?) If the highlighter hits this condition, it (rightfully) complains and throws an exception, among other problems. So I think we need to improve this situation everywhere. NGramTokenFilter may generate offsets that exceed the length of original text - Key: LUCENE-2909 URL: https://issues.apache.org/jira/browse/LUCENE-2909 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.9.4 Reporter: Shinya Kasatani Assignee: Koji Sekiguchi Priority: Minor Attachments: TokenFilterOffset.patch When using NGramTokenFilter combined with CharFilters that lengthen the original text (such as ß -> ss), the generated offsets exceed the length of the original text. This causes InvalidTokenOffsetsException when you try to highlight the text in Solr. While it is not possible to know the accurate offset of each character once you tokenize the whole text with tokenizers like KeywordTokenizer, NGramTokenFilter should at least avoid generating invalid offsets. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text
[ https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2909: Attachment: LUCENE-2909_assert.patch Here's a check we can add to BaseTokenStreamTestCase for this condition. NGramTokenFilter may generate offsets that exceed the length of original text - Key: LUCENE-2909 URL: https://issues.apache.org/jira/browse/LUCENE-2909 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.9.4 Reporter: Shinya Kasatani Assignee: Koji Sekiguchi Priority: Minor Attachments: LUCENE-2909_assert.patch, TokenFilterOffset.patch When using NGramTokenFilter combined with CharFilters that lengthen the original text (such as ß -> ss), the generated offsets exceed the length of the original text. This causes InvalidTokenOffsetsException when you try to highlight the text in Solr. While it is not possible to know the accurate offset of each character once you tokenize the whole text with tokenizers like KeywordTokenizer, NGramTokenFilter should at least avoid generating invalid offsets. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
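The patch itself is not reproduced in this thread; the shape of such a check, as a rough sketch against the 3.x attribute API (names assumed here, not taken from LUCENE-2909_assert.patch), would be to assert inside the test helper that no token ever claims an end offset beyond the analyzed input:

    // After analyzing `text`, verify offsets stay within the original input.
    TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    while (ts.incrementToken()) {
      assertTrue("startOffset must not exceed endOffset",
                 offsetAtt.startOffset() <= offsetAtt.endOffset());
      assertTrue("endOffset " + offsetAtt.endOffset() + " exceeds input length " + text.length(),
                 offsetAtt.endOffset() <= text.length());
    }
    ts.end();
    ts.close();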
Re: Threading of JIRA e-mails in gmail?
Just a follow-up to this one: no reply from infra yet, but I simply tried my config. on people.apache.org and it works like a charm, so for Apache committers and gmail users this is probably a life-saver. My config is described in a comment here: https://issues.apache.org/jira/browse/INFRA-3403 Dawid - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Threading of JIRA e-mails in gmail?
Thanks Dawid It is not working for me yet, looking for the reason for that... Doron On Mon, Feb 7, 2011 at 12:48 PM, Dawid Weiss dawid.we...@cs.put.poznan.plwrote: Just a follow-up to this one: no reply from infra yet, but I simply tried my config. on people.apache.org and it works like a charm, so for Apache committers and gmail users this is probably a life-saver. My config is described in a comment here: https://issues.apache.org/jira/browse/INFRA-3403 Dawid - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Threading of JIRA e-mails in gmail?
Looks like my action prompted a response from infra and it's not encouraging -- they're supposedly switching off procmail support on that server soon. Track INFRA-3403 to see what will come out of this, I don't want to spam this list. Eh. Dawid On Mon, Feb 7, 2011 at 1:14 PM, Doron Cohen cdor...@gmail.com wrote: Thanks Dawid It is not working for me yet, looking for the reason for that... Doron On Mon, Feb 7, 2011 at 12:48 PM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote: Just a follow-up to this one: no reply from infra yet, but I simply tried my config. on people.apache.org and it works like a charm, so for Apache committers and gmail users this is probably a life-saver. My config is described in a comment here: https://issues.apache.org/jira/browse/INFRA-3403 Dawid - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Distributed Indexing
I'm saying that deterministic policies are a requirement that *some* people will want. Others might want a random spread. Thus, I'd have deterministic based on ID and random as the two initial implementations. Upayavira NB. In case folks haven't worked it out already, I have been tasked to mentor this group of students in this work, and had the fortune to be able to point them to a task I've already thought a lot about myself, but had no time to do :-) On Sun, 06 Feb 2011 21:57 +, William Mayor m...@williammayor.co.uk wrote: Hi Good call about the policies being deterministic, should've thought of that earlier. We've changed the patch to include this and I've removed the random assignment one (for obvious reasons). Take a look and let me know what's to do. ([1]https://issues.apache.org/jira/browse/SOLR-2341) Cheers William On Thu, Feb 3, 2011 at 5:00 PM, Upayavira [2]u...@odoko.co.uk wrote: On Thu, 03 Feb 2011 15:12 +, Alex Cowell [3]alxc...@gmail.com wrote: Hi all, Just a couple of questions that have arisen. 1. For handling non-distributed update requests (shards param is not present or is invalid), our code currently * assumes the user would like the data indexed, so gets the request handler assigned to /update * executes the request using core.execute() for the SolrCore associated with the original request Is this what we want it to do and is using core.execute() from within a request handler a valid method of passing on the update request? Take a look at how it is done in handler.component.SearchHandler.handleRequestBody(). I'd say try to follow as similar approach as possible. E.g. it is the SearchHandler that does much of the work, branching depending on whether it found a shards parameter. 2. We have partially implemented an update processor which actually generates and sends the split update requests to each specified shard (as designated by the policy). As it stands, the code shares a lot in common with the HttpCommComponent class used for distributed search. Should we look at opening up the HttpCommComponent class so it could be used by our request handler as well or should we continue with our current implementation and worry about that later? I agree that you are going to want to implement an UpdateRequestProcessor. However, it would seem to me that, unlike search, you're not going to want to bother with the existing processor and associated component chain, you're going to want to replace the processor with a distributed version. As to the HttpCommComponent, I'd suggest you make your own educated decision. How similar is the class? Could one serve both needs effectively? 3. Our update processor uses a MultiThreadedHttpConnectionManager to send parallel updates to shards, can anyone give some appropriate values to be used for the defaultMaxConnectionsPerHost and maxTotalConnections params? Won't the values used for distributed search be a little high for distributed indexing? You are right, these will likely be lower for distributed indexing, however I'd suggest not worrying about it for now, as it is easy to tweak later. Upayavira --- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source References 1. https://issues.apache.org/jira/browse/SOLR-2341 2. mailto:u...@odoko.co.uk 3. mailto:alxc...@gmail.com --- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
Re: Distributed Indexing
Surely you want to be implementing an UpdateRequestProcessor, rather than a RequestHandler. The ContentStreamHandlerBase, in the handleRequestBody method, gets an UpdateRequestProcessor and uses it to process the request. What we need is that handleRequestBody method to, as you have suggested, check on the shards parameter, and if necessary call a different UpdateRequestProcessor (a DistributedUpdateRequestProcessor). I don't think we really need it to be configurable at this point. The ContentStreamHandlerBase could just use a single hardwired implementation. If folks want choice of DistributedUpdateRequestProcessor, it can be added later. For configuration, the DistributedUpdateRequestProcessor should get its config from the parent RequestHandler. The configuration I'm most interested in is the DistributionPolicy. And that can be done with a distributionPolicyClass=solr.IDHashDistributionPolicy request parameter, which could potentially be configured in solrconfig.xml as an invariant, or provided in the request by the user if necessary. So, I'd avoid another thing that needs to be configured unless there are real benefits to it (which there don't seem to me to be right now). Upayavira On Sun, 06 Feb 2011 23:08 +, Alex Cowell alxc...@gmail.com wrote: Hey, We're making good progress, but our DistributedUpdateRequestHandler is having a bit of an identity crisis, so we thought we'd ask what other people's opinions are. The current situation is as follows: We've added a method to ContentStreamHandlerBase to check if an update request is distributed or not (based on the presence/validity of the 'shards' parameter). So a non-distributed request will proceed as normal but a distributed request would be passed on to the DistributedUpdateRequestHandler to deal with. The reason this choice is made in the ContentStreamHandlerBase is so that the DistributedUpdateRequestHandler can use the URL the request came in on to determine where to distribute update requests. Eg. an update request is sent to: [1]http://localhost:8983/solr/update/csv?shards=shard1,shard2. .. then the DistributedUpdateRequestHandler knows to send requests to: shard1/update/csv shard2/update/csv Alternatively, if the request wasn't distributed, it would simply be handled by whichever request handler /update/csv uses. Herein lies the problem. The DistributedUpdateRequestHandler is not really a request handler in the same way as the CSVRequestHandler or XmlUpdateRequestHandlers are. If anything, it's more like a plugin for the various existing update request handlers, to allow them to deal with distributed requests - a distributor if you will. It isn't designed to be able to receive and handle requests directly. We would like this DistributedUpdateRequestHandler to be defined in the solrconfig to allow flexibility for setting up multiple different DistributedUpdateRequestHandlers with different ShardDistributionPolicies etc. and also to allow us to get the appropriate instance from the core in the code. There seem to be two paths for doing this: 1. Leave it as an implementation of SolrRequestHandler and hope the user doesn't directly send update requests to it (ie. a request to [2]http://localhost:8983/solr/<distrib update handler path> would most likely cripple something). So it would be defined in the solrconfig something like: <requestHandler name="distrib-update" class="solr.DistributedUpdateRequestHandler" /> 2.
Create a new plugin type for the solrconfig, say updateRequestDistributor, which would involve creating a new interface for the DistributedUpdateRequestHandler to implement, then registering it with the core. It would be defined in the solrconfig something like: <updateRequestDistributor name="distrib-update" class="solr.DistributedUpdateRequestHandler"> <lst name="defaults"> <str name="policy">solr.HashedDistributionPolicy</str> </lst> </updateRequestDistributor> This would mean that it couldn't directly receive requests, but that an instance could still easily be retrieved from the core to handle the distribution of update requests. Any thoughts on the above issue (or a more succinct, descriptive name for the class) are most welcome! Alex References 1. http://localhost:8983/solr/update/csv?shards=shard1,shard2. 2. http://localhost:8983/solr/ --- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
Maintain stopwords.txt and other files
Hello everyone, I am currently developing a search solution based on Apache Solr. Currently I have the problem that I want to offer the user the possibility to maintain synonyms and stopwords in a user-friendly tool. But so far I could not find any way to write the stopwords.txt or synonyms.txt. Are there any other solutions? Currently I have some ideas for how to handle it: 1. Implement another SynonymFilterFactory to allow other datasources like databases. I already saw approaches for that but no solutions yet. 2. Implement a fileWriter request handler to write the stopwords.txt. Are there other solutions which are maybe already implemented? Thanks and best regards Timo Timo Schmidt Entwickler (Diplom Informatiker FH) AOE media GmbH Borsigstr. 3 65205 Wiesbaden Germany Tel. +49 (0) 6122 70 70 7 - 234 Fax. +49 (0) 6122 70 70 7 -199 e-Mail: timo.schm...@aoemedia.de Web: http://www.aoemedia.de/ Pflichtangaben laut Handelsgesetz §37a / Aktiengesetz §35a USt-ID Nr.: DE250247455 Handelsregister: Wiesbaden B Handelsregister Nr.: 22567 Stammsitz: Wiesbaden Creditreform: 625.0209354 Geschäftsführer: Kian Toyouri Gould This e-mail message may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Tokenization and Fuzziness: How to Allow Multiple Strategies?
Hey everyone, Tokenization seems inherently fuzzy and imprecise, yet Lucene does not appear to provide an easy mechanism to account for this fuzziness. Let's take an example, where the document I'm indexing is v1.1.0 mr. jones da...@gmail.com I may want to tokenize this as follows: [v1.1.0, mr, jones, da...@gmail.com] ...or I may want to tokenize this as follows: [v1, 1.0, mr, jones, david, gmail.com] ...or I may want to tokenize it another way. I would think that the best approach would be indexing using multiple strategies, such as: [v1.1.0, v1, 1.0, mr, jones, da...@gmail.com, david, gmail.com] However, this would destroy phrase queries. And while Lucene lets you index multiple tokens at the same position, I haven't found a way to deal with cases where you want to index a set of tokens at one position: nor does that even make sense. For instance, I can't index [david, gmail.com] in the same position as da...@gmail.com. So: - Any thoughts, in general, about how you all approach this fuzziness? Do you just choose one tokenization strategy and hope for the best? - Might there be a way to use multiple strategies and *not* break phrase queries that I'm overlooking? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Tokenization-and-Fuzziness-How-to-Allow-Multiple-Strategies-tp2444956p2444956.html Sent from the Solr - Dev mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
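For what it's worth, the "multiple tokens at the same position" mechanism mentioned above works by emitting extra tokens with a position increment of 0. A self-contained sketch of that mechanism (a hypothetical filter, assuming the Lucene 3.x attribute API; it stacks a lowercased variant on top of each token, the same trick SynonymFilter uses):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    // Hypothetical example: for each token, also emit a lowercased variant at the
    // same position (position increment 0), keeping the original token's offsets.
    public final class StackedVariantFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
      private String pendingVariant;

      public StackedVariantFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (pendingVariant != null) {
          termAtt.setEmpty().append(pendingVariant);
          posIncAtt.setPositionIncrement(0); // stack on top of the previous token
          pendingVariant = null;
          return true;
        }
        if (!input.incrementToken()) {
          return false;
        }
        String variant = termAtt.toString().toLowerCase();
        if (!variant.contentEquals(termAtt)) {
          pendingVariant = variant; // emit it on the next call
        }
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        pendingVariant = null;
      }
    }

As the original question notes, this stacks single alternative terms, not whole alternative token sequences, so it does not by itself solve the phrase-query problem for multi-token splits like da...@gmail.com vs. [david, gmail.com].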
Scoring: Precedent for a Better, Less Fragile Approach?
Hey everyone, I have a question about Lucene/Solr scoring in general. It really feels like a wobbly house of cards that falls down whenever I make the slightest tweak. There are many factors at play in Lucene scoring: they're all fighting with each other, and very often one will completely dominate everything else, when that may not really be the intention. ** The question: might there be a way to enforce strict requirements that certain factors are higher priority than other factors, and/or certain factors shouldn't overtake other factors? Perhaps a set of rules where one factor is considered before even examining another factor? Tuning boost numbers around and hoping for the best seems imprecise and very fragile. ** To make this more concrete, an example: We previously added the scores of multi-field matches together via an OR, so: score(query apple) = score(field1:apple) + score(field2:apple). I changed that to be more in-line with DisMaxParser, namely a max: score(query apple) = max(score(field1:apple), score(field2:apple)). I also modified coord such that coord would only consider actual unique terms (apple vs. orange), rather than terms across multiple fields (field1:apple vs. field2:apple). This seemed like a good idea, but it actually introduced a bug that was previously hidden. Suddenly, documents matching apple in the title and *nothing* in the body were being boosted over apple in the title and apple in the body! I investigated, and it was due to lengthNorm: previously, documents matching apple in both title and body were getting higher scores thanks to summing the field scores (vs. max) as well as a higher coord score. Now that they were no longer getting these boosts, which was beneficial in many respects, the playing field was leveled. And this leveling of the playing field allowed lengthNorm to dominate everything else. Any help would be much appreciated. Thanks! Tavi -- View this message in context: http://lucene.472066.n3.nabble.com/Scoring-Precedent-for-a-Better-Less-Fragile-Approach-tp2445112p2445112.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
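One concrete lever for the "lengthNorm dominates" part of this (a sketch against the pre-4.0 Similarity API, not something the original poster said they use) is to override lengthNorm in a custom Similarity so field length stops contributing, or contributes less steeply than the default 1/sqrt(numTokens):

    import org.apache.lucene.search.DefaultSimilarity;

    // Flattens length normalization so short fields no longer outscore
    // longer fields that contain the same matches.
    public class FlatLengthNormSimilarity extends DefaultSimilarity {
      @Override
      public float lengthNorm(String fieldName, int numTokens) {
        return 1.0f; // or a gentler curve than the default 1/sqrt(numTokens)
      }
    }

Because norms are computed at index time, the custom Similarity has to be in effect on the IndexWriter as well as the searcher (e.g. IndexWriter.setSimilarity or Similarity.setDefault), and the index needs to be rebuilt after changing it.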
Re: [REINDEX] Note: re-indexing required !
Lucene maintains compatibility with earlier stable release index versions, and to some extent transparently upgrades them. But there is no guaranteed compatibility between different in-development indexes. E.g. 3.2 reads 3.1 indexes and upgrades them, but 3.2-dev-snapshot-10 (while happily handling 3.1) may fail reading a 3.2-dev-snapshot-8 index, as they have the same version tag, yet different formats. On Sun, Jan 23, 2011 at 19:18, Earl Hood e...@earlhood.com wrote: On Sat, Jan 22, 2011 at 11:14 PM, Shai Erera ser...@gmail.com wrote: Under LUCENE-2720 the index format of both trunk and 3x has changed. You should re-index any indexes created with either of these code streams. Does the 3x refer to the 3.x development branch? I.e. For those of us using the stable 3.x release of Lucene, will a future 3.x release require rebuilding indexes? --ewh - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Tokenization and Fuzziness: How to Allow Multiple Strategies?
Hi Tavi, solr-...@lucene.apache.org has been deprecated since the Lucene and Solr source trees merged last year. Please use dev@lucene.apache.org instead. However, your question is about *usage* of Lucene/Solr, rather than *development*, so you should be using solr-u...@lucene.apache.org or lucene-u...@lucene.apache.org. Please repost your question to one of these lists. Steve -Original Message- From: Tavi Nathanson [mailto:tavi.nathan...@gmail.com] Sent: Monday, February 07, 2011 12:12 PM To: solr-...@lucene.apache.org Subject: Tokenization and Fuzziness: How to Allow Multiple Strategies? Hey everyone, Tokenization seems inherently fuzzy and imprecise, yet Lucene does not appear to provide an easy mechanism to account for this fuzziness. Let's take an example, where the document I'm indexing is v1.1.0 mr. jones da...@gmail.com I may want to tokenize this as follows: [v1.1.0, mr, jones, da...@gmail.com] ...or I may want to tokenize this as follows: [v1, 1.0, mr, jones, david, gmail.com] ...or I may want to tokenize it another way. I would think that the best approach would be indexing using multiple strategies, such as: [v1.1.0, v1, 1.0, mr, jones, da...@gmail.com, david, gmail.com] However, this would destroy phrase queries. And while Lucene lets you index multiple tokens at the same position, I haven't found a way to deal with cases where you want to index a set of tokens at one position: nor does that even make sense. For instance, I can't index [david, gmail.com] in the same position as da...@gmail.com. So: - Any thoughts, in general, about how you all approach this fuzziness? Do you just choose one tokenization strategy and hope for the best? - Might there be a way to use multiple strategies and *not* break phrase queries that I'm overlooking? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Tokenization-and-Fuzziness-How-to- Allow-Multiple-Strategies-tp2444956p2444956.html Sent from the Solr - Dev mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2666) ArrayIndexOutOfBoundsException when iterating over TermDocs
[ https://issues.apache.org/jira/browse/LUCENE-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991545#comment-12991545 ] Michael McCandless commented on LUCENE-2666: Ahh, thanks for bringing closure Nick! Although, I'm a little confused how removing files from the index while readers are using it, could lead to those exceptions... Note that it's perfectly fine to pass create=true to IW, over an existing index, even while readers are using it; IW will gracefully remove the old files itself, even if open IRs are still using them. IW just makes a new commit point that drops all references to prior segments... ArrayIndexOutOfBoundsException when iterating over TermDocs --- Key: LUCENE-2666 URL: https://issues.apache.org/jira/browse/LUCENE-2666 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.0.2 Reporter: Shay Banon Attachments: checkindex-out.txt A user got this very strange exception, and I managed to get the index that it happens on. Basically, iterating over the TermDocs causes an AAOIB exception. I easily reproduced it using the FieldCache which does exactly that (the field in question is indexed as numeric). Here is the exception: Exception in thread main java.lang.ArrayIndexOutOfBoundsException: 114 at org.apache.lucene.util.BitVector.get(BitVector.java:104) at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127) at org.apache.lucene.search.FieldCacheImpl$LongCache.createValue(FieldCacheImpl.java:501) at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:183) at org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:470) at TestMe.main(TestMe.java:56) It happens on the following segment: _26t docCount: 914 delCount: 1 delFileName: _26t_1.del And as you can see, it smells like a corner case (it fails for document number 912, the AIOOB happens from the deleted docs). The code to recreate it is simple: FSDirectory dir = FSDirectory.open(new File(index)); IndexReader reader = IndexReader.open(dir, true); IndexReader[] subReaders = reader.getSequentialSubReaders(); for (IndexReader subReader : subReaders) { Field field = subReader.getClass().getSuperclass().getDeclaredField(si); field.setAccessible(true); SegmentInfo si = (SegmentInfo) field.get(subReader); System.out.println(-- + si); if (si.getDocStoreSegment().contains(_26t)) { // this is the probleatic one... 
System.out.println(problematic one...); FieldCache.DEFAULT.getLongs(subReader, __documentdate, FieldCache.NUMERIC_UTILS_LONG_PARSER); } } Here is the result of a check index on that segment: 8 of 10: name=_26t docCount=914 compound=true hasProx=true numFiles=2 size (MB)=1.641 diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.18-194.11.1.el5.centos.plus, os=Linux, mergeDocStores=true, lucene.version=3.0.2 953716 - 2010-06-11 17:13:53, source=merge, os.arch=amd64, java.version=1.6.0, java.vendor=Sun Microsystems Inc.} has deletions [delFileName=_26t_1.del] test: open reader.OK [1 deleted docs] test: fields..OK [32 fields] test: field norms.OK [32 fields] test: terms, freq, prox...ERROR [114] java.lang.ArrayIndexOutOfBoundsException: 114 at org.apache.lucene.util.BitVector.get(BitVector.java:104) at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127) at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:102) at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:616) at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:509) at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299) at TestMe.main(TestMe.java:47) test: stored fields...ERROR [114] java.lang.ArrayIndexOutOfBoundsException: 114 at org.apache.lucene.util.BitVector.get(BitVector.java:104) at org.apache.lucene.index.ReadOnlySegmentReader.isDeleted(ReadOnlySegmentReader.java:34) at org.apache.lucene.index.CheckIndex.testStoredFields(CheckIndex.java:684) at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:512) at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299) at TestMe.main(TestMe.java:47) test: term vectorsERROR [114] java.lang.ArrayIndexOutOfBoundsException: 114 at org.apache.lucene.util.BitVector.get(BitVector.java:104) at
[jira] Commented: (LUCENE-2908) clean up serialization in the codebase
[ https://issues.apache.org/jira/browse/LUCENE-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991559#comment-12991559 ] Michael McCandless commented on LUCENE-2908: +1 clean up serialization in the codebase -- Key: LUCENE-2908 URL: https://issues.apache.org/jira/browse/LUCENE-2908 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Fix For: 4.0 Attachments: LUCENE-2908.patch We removed contrib/remote, but forgot to cleanup serialization hell everywhere. this is no longer needed, never really worked (e.g. across versions), and slows development (e.g. i wasted a long time debugging stupid serialization of Similarity.idfExplain when trying to make a patch for the scoring system). -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Resolved: (SOLR-2350) improve post.jar to handle non UTF-8 files
[ https://issues.apache.org/jira/browse/SOLR-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man resolved SOLR-2350. Resolution: Fixed Committed revision 1068149. - trunk Committed revision 1068152. - 3x improve post.jar to handle non UTF-8 files -- Key: SOLR-2350 URL: https://issues.apache.org/jira/browse/SOLR-2350 Project: Solr Issue Type: Improvement Reporter: Hoss Man Assignee: Hoss Man Fix For: 3.1, 4.0 Attachments: SOLR-2350.patch, SOLR-2350.patch Thanks to all the awesomeness Uwe did in SOLR-96, some hard coded limitations/assumptions in the simple post.jar provided for the example files can be cleaned up. Notably: it used to deal with Readers/Writers, and warned people their data had to be UTF-8 (because that's all Solr supported), and now it can deal with raw streams. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Keyword - search statistics
Solr doesn't keep meta data, so if you're asking for some kind of search logging your app has to provide that... Best Erick On Sun, Feb 6, 2011 at 10:46 PM, Selvaraj Varadharajan selvara...@gmail.com wrote: Hi Is there any way i can get 'no of times' a key word searched in SOLR ? *Here is my solr package details* Solr Specification Version: 1.4.0 Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06 12:33:40 Lucene Specification Version: 2.9.1 Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25 -Selvaraj
Re: Keyword - search statistics
You can also use Google Analytics or something like that too to get stats. Bill Bell Sent from mobile On Feb 7, 2011, at 4:31 PM, Erick Erickson erickerick...@gmail.com wrote: Solr doesn't keep meta data, so if you're asking for some kind of search logging your app has to provide that... Best Erick On Sun, Feb 6, 2011 at 10:46 PM, Selvaraj Varadharajan selvara...@gmail.com wrote: Hi Is there any way i can get 'no of times' a key word searched in SOLR ? Here is my solr package details Solr Specification Version: 1.4.0 Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06 12:33:40 Lucene Specification Version: 2.9.1 Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25 -Selvaraj
Re: Keyword - search statistics
Thanks Erick. What about having another core and interpreting the request calls and pooling them in that core? Do we see any performance hit from your point of view? -Selvaraj On Mon, Feb 7, 2011 at 3:31 PM, Erick Erickson erickerick...@gmail.com wrote: Solr doesn't keep meta data, so if you're asking for some kind of search logging your app has to provide that... Best Erick On Sun, Feb 6, 2011 at 10:46 PM, Selvaraj Varadharajan selvara...@gmail.com wrote: Hi Is there any way i can get 'no of times' a key word searched in SOLR ? *Here is my solr package details* Solr Specification Version: 1.4.0 Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06 12:33:40 Lucene Specification Version: 2.9.1 Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25 -Selvaraj
Re: Keyword - search statistics
You have to explain your problem in *much* more detail for anyone to make a really relevant comment, all we can do so far is guess what you're *really* after Best Erick On Mon, Feb 7, 2011 at 8:25 PM, Selvaraj Varadharajan selvara...@gmail.comwrote: Thanks Eric. What about having another core and interpret the request calls and pool it in that core.. ? Do we see any performance hit form your point of view. -Selvaraj On Mon, Feb 7, 2011 at 3:31 PM, Erick Erickson erickerick...@gmail.comwrote: Solr doesn't keep meta data, so if you're asking for some kind of search logging your app has to provide that... Best Erick On Sun, Feb 6, 2011 at 10:46 PM, Selvaraj Varadharajan selvara...@gmail.com wrote: Hi Is there any way i can get 'no of times' a key word searched in SOLR ? *Here is my solr package details* Solr Specification Version: 1.4.0 Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06 12:33:40 Lucene Specification Version: 2.9.1 Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25 -Selvaraj
Umlauts as Char
Hey all, So while digging into the code a bit (and pushed by digy's Arabic conversion yesterday), I started looking at the various other languages we were missing from the Java side. I started porting the GermanAnalyzer, but ran into an issue with the umlauts... http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_9_4/contrib/analyzers/common/src/java/org/apache/lucene/analysis/de/GermanStemmer.java?revision=1040993view=co in the void substitute function you'll see them: else if ( buffer.charAt( c ) == 'ü' ) { buffer.setCharAt( c, 'u' ); } This does not constitute a character in .NET (that I can figure out) and thus it doesn't compile. The .java file says it is encoded in UTF-8. I was thinking maybe I could do the same thing in VS2010, but I'm not finding a way, and searching on this has been difficult. Any ideas? ~Prescott
Re: Keyword - search statistics
You can have a custom SearchComponent and configure a listener to the same. Check out example/solr/config/solrconfig.xml regarding configuring custom query components before and after the default list of components; that can help provide some of this 'aspect' behavior. <arr name="first-components"> <str>myFirstComponentName</str> </arr> More details about SearchComponent are available in the javadoc here: http://bit.ly/giz0b1 . Override the method prepare(ResponseBuilder rb) and do a count based on the q / other parameters that you get access to from the ResponseBuilder. Be aware that this gets in the way of every search request that hits the Solr server. So you need to be careful about how this is persisted (how frequently to the datastore etc.), without being intrusive and adding to the query time. -- Vijay From: Selvaraj Varadharajan selvara...@gmail.com To: dev@lucene.apache.org Sent: Mon, February 7, 2011 5:25:18 PM Subject: Re: Keyword - search statistics Thanks Erick. What about having another core and interpreting the request calls and pooling them in that core? Do we see any performance hit from your point of view? -Selvaraj On Mon, Feb 7, 2011 at 3:31 PM, Erick Erickson erickerick...@gmail.com wrote: Solr doesn't keep meta data, so if you're asking for some kind of search logging your app has to provide that... Best Erick On Sun, Feb 6, 2011 at 10:46 PM, Selvaraj Varadharajan selvara...@gmail.com wrote: Hi Is there any way i can get 'no of times' a key word searched in SOLR ? Here is my solr package details Solr Specification Version: 1.4.0 Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06 12:33:40 Lucene Specification Version: 2.9.1 Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25 -Selvaraj
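As a rough illustration of what Vijay describes (hypothetical class and field names, sketched against the Solr 1.4 SearchComponent API, and keeping the counts in memory only rather than persisting them):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;
    import org.apache.solr.common.params.CommonParams;
    import org.apache.solr.handler.component.ResponseBuilder;
    import org.apache.solr.handler.component.SearchComponent;

    // Counts how often each q string is seen; wire it in via first-components
    // so prepare() runs before the regular query component.
    public class QueryCountComponent extends SearchComponent {
      private final ConcurrentHashMap<String, AtomicLong> counts =
          new ConcurrentHashMap<String, AtomicLong>();

      @Override
      public void prepare(ResponseBuilder rb) {
        String q = rb.req.getParams().get(CommonParams.Q);
        if (q != null) {
          counts.putIfAbsent(q, new AtomicLong());
          counts.get(q).incrementAndGet();
        }
      }

      @Override
      public void process(ResponseBuilder rb) {
        // nothing to do at process time; counting happens in prepare()
      }

      @Override
      public String getDescription() { return "query keyword counter (example)"; }
      @Override
      public String getSource() { return null; }
      @Override
      public String getSourceId() { return null; }
      @Override
      public String getVersion() { return null; }
    }

Flushing the counts to a datastore on a timer, rather than on every request, keeps the persistence cost off the query path, as Vijay cautions.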
[jira] Commented: (SOLR-2348) No error reported when using a FieldCached backed ValueSource for a field Solr knows won't work
[ https://issues.apache.org/jira/browse/SOLR-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991771#comment-12991771 ] Hoss Man commented on SOLR-2348: My hope had been that this would be really straightforward, and a simple inspection of the SchemaField (or FieldType) could be done inside the FieldCacheSource to cover all cases -- except that FieldCacheSource (and its subclasses) is only ever given a field name, and never gets a copy of the FieldType, SchemaField or even the IndexSchema to inspect to ensure that using the FieldCache is viable. This means that we have to take the same basic approach as SOLR-2339 - every FieldType impl that utilizes a FieldCacheSource in its getValueSource method needs to check if FieldCache is viable for that field (ie: indexed, not multivalued) We could rename the checkSortability method I just added in SOLR-2339 into a checkFieldCachibility type method and use it for both purposes, but: * it currently throws exceptions with specific messages like "can not sort on unindexed field: ..." * I seem to recall some folks talking about the idea of expanding FieldCache to support more things like UninvertedField does, in which case being multivalued won't prevent you from using the FieldCache on a field, which would ultimately mean the pre-conditions for using a FieldCacheSource would change. We could imagine the user specifying a function that takes in vector args to use to collapse the multiple values per doc on a per usage basis (ie: in this function query case, use the max value of the multiple values for each doc; in this function query, use the average value of the multiple values for each doc; etc...) With that in mind, I think for now the most straightforward thing to do is to add a checkFieldCacheSource(QParser) method to SchemaField that would be a cut/paste of checkSortability (with the error message wording changed) and make all of the (applicable) FieldType.getValueSource methods call it. In the future, it could evolve differently than checkSortability -- either removing the !multivalued assertion completely, or introspecting the QParser context in some way to determine that necessary info has been provided to know how to use the (hypothetical) multivalued FieldCache (hard to speculate at this point) No error reported when using a FieldCached backed ValueSource for a field Solr knows won't work --- Key: SOLR-2348 URL: https://issues.apache.org/jira/browse/SOLR-2348 Project: Solr Issue Type: Bug Reporter: Hoss Man Assignee: Hoss Man Fix For: 3.1, 4.0 For the same reasons outlined in SOLR-2339, Solr FieldTypes that return FieldCached backed ValueSources should explicitly check for situations where it knows the FieldCache is meaningless. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
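A minimal sketch of what that proposed method could look like (modeled on the description above, not the code that was eventually committed; SolrException and the SchemaField accessors are existing Solr APIs, the method body itself is an assumption):

    // Hypothetical SchemaField method mirroring checkSortability: refuse to build a
    // FieldCache-backed ValueSource for fields where the FieldCache is meaningless.
    public void checkFieldCacheSource(QParser parser) {
      if (!indexed()) {
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
            "can not use FieldCache on unindexed field: " + getName());
      }
      if (multiValued()) {
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
            "can not use FieldCache on multivalued field: " + getName());
      }
    }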
Re: CustomScoreQueryWithSubqueries
Robert: I'm trying to follow the steps that are mentioned in: http://wiki.apache.org/lucene-java/HowToContribute in order to make a patch with my contribution. But, in the source code that I get from: http://svn.apache.org/repos/asf/lucene/dev/trunk/ the class org.apache.lucene.search.Searcher is missing and the only method available to obtain a Scorer from a Weight object is scorer(IndexReader.AtomicReaderContext, ScorerContext). I just checked and class Searcher still exists in Lucene 3.0.3. On which version is the trunk that I've checked out based? The patch that I want to submit is based on Lucene 2.9.1. Thanks in advance. Regards. Fernando. From: Robert Muir rcm...@gmail.com To: dev@lucene.apache.org Sent: Wednesday, February 2, 2011 16:52:58 Subject: Re: CustomScoreQueryWithSubqueries On Wed, Feb 2, 2011 at 2:37 PM, Fernando Wasylyszyn ferw...@yahoo.com.ar wrote: Hi everyone. My name is Fernando and I am a researcher and developer in the R+D lab at Snoop Consulting S.R.L. in Argentina. Based on the patch suggested in LUCENE-1608 (https://issues.apache.org/jira/browse/LUCENE-1608) and in the needs of one of our customers, for whom we are developing a customized search engine on top of Lucene and Solr, we have developed the class CustomScoreQueryWithSubqueries, which is a variation of CustomScoreQuery that allows the use of arbitrary Query objects besides instances of ValueSourceQuery, without the need of wrapping the arbitrary query/ies with the QueryValueSource proposed in Jira, which has the disadvantage of creating an instance of an IndexSearcher in each invocation of the method getValues(IndexReader). If you think that this contribution can be useful for the Lucene community, please let me know the steps in order to contribute. Hi Fernando: https://issues.apache.org/jira/browse/LUCENE-1608 is still an open issue. If you have a better solution, please don't hesitate to upload a patch file to the issue! There are some more detailed instructions here: http://wiki.apache.org/lucene-java/HowToContribute - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Should ASCIIFoldingFilter be deprecated?
ISOLatin1AccentFilter is deprecated, presumably because you can (and should) use MappingCharFilter configured with mapping-ISOLatin1Accent.txt. By that same reasoning, shouldn't ASCIIFoldingFilter be deprecated in favor of using mapping-FoldToASCII.txt ? ~ David Smiley - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Should-ASCIIFoldingFilter-be-deprecated-tp2448919p2448919.html Sent from the Solr - Dev mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Should ASCIIFoldingFilter be deprecated?
AFAIK, ISOLatin1AccentFilter was deprecated because ASCIIFoldingFilter provides a superset of its mappings. I haven't done any benchmarking, but I'm pretty sure that ASCIIFoldingFilter can achieve a significantly higher throughput rate than MappingCharFilter, and given that, it probably makes sense to keep both, to allow people to make the choice about the tradeoff between the flexibility provided by the human-readable (and editable) mapping file and the speed provided by ASCIIFoldingFilter. Steve -Original Message- From: David Smiley (@MITRE.org) [mailto:dsmi...@mitre.org] Sent: Monday, February 07, 2011 10:34 PM To: solr-...@lucene.apache.org Subject: Should ASCIIFoldingFilter be deprecated? ISOLatin1AccentFilter is deprecated, presumably because you can (and should) use MappingCharFilter configured with mapping-ISOLatin1Accent.txt. By that same reasoning, shouldn't ASCIIFoldingFilter be deprecated in favor of using mapping-FoldToASCII.txt ? ~ David Smiley - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Should- ASCIIFoldingFilter-be-deprecated-tp2448919p2448919.html Sent from the Solr - Dev mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Should ASCIIFoldingFilter be deprecated?
: : ISOLatin1AccentFilter is deprecated, presumably because you can (and should) : use MappingCharFilter configured with mapping-ISOLatin1Accent.txt. By that : same reasoning, shouldn't ASCIIFoldingFilter be deprecated in favor of using : mapping-FoldToASCII.txt ? CharFilters and TokenFilters have different purposes though... http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#When_To_use_a_CharFilter_vs_a_TokenFilter (ie: If you use MappingCharFilter, you can't then tokenize on some of the characters you filtered away) : : ~ David Smiley : : - : Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book : -- : View this message in context: http://lucene.472066.n3.nabble.com/Should-ASCIIFoldingFilter-be-deprecated-tp2448919p2448919.html : Sent from the Solr - Dev mailing list archive at Nabble.com. : : - : To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org : For additional commands, e-mail: dev-h...@lucene.apache.org : : -Hoss - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
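To illustrate the distinction Hoss points to, here is an untested sketch against the Lucene 2.9/3.0-era analysis classes (the input text and mappings are made up): MappingCharFilter rewrites characters before the tokenizer runs, so it can change where tokens split, while ASCIIFoldingFilter only ever sees tokens that have already been produced.

    import java.io.StringReader;
    import org.apache.lucene.analysis.ASCIIFoldingFilter;
    import org.apache.lucene.analysis.CharReader;
    import org.apache.lucene.analysis.MappingCharFilter;
    import org.apache.lucene.analysis.NormalizeCharMap;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class FoldingDemo {
        public static void main(String[] args) throws Exception {
            String text = "über Wi-Fi";

            // 1) CharFilter: characters are rewritten before tokenization, so the
            //    tokenizer never sees 'ü' or '-' and splits the already-mapped text.
            NormalizeCharMap map = new NormalizeCharMap();
            map.add("ü", "u");
            map.add("-", " ");
            TokenStream viaCharFilter = new WhitespaceTokenizer(
                    new MappingCharFilter(map, CharReader.get(new StringReader(text))));

            // 2) TokenFilter: tokenization happens first, folding afterwards, so the
            //    hyphenated token stays in one piece and only its letters get folded.
            TokenStream viaTokenFilter = new ASCIIFoldingFilter(
                    new WhitespaceTokenizer(new StringReader(text)));

            dump("char filter ", viaCharFilter);
            dump("token filter", viaTokenFilter);
        }

        private static void dump(String label, TokenStream ts) throws Exception {
            TermAttribute term = ts.addAttribute(TermAttribute.class);
            StringBuilder line = new StringBuilder(label + ":");
            while (ts.incrementToken()) {
                line.append(' ').append(term.term());
            }
            System.out.println(line);
        }
    }

With the char filter mapping '-' to a space, the whitespace tokenizer emits uber / Wi / Fi; with the token filter alone it emits uber / Wi-Fi, because the folding happens after tokenization, which is exactly the "can't tokenize on characters you filtered away" point above.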
[HUDSON] Lucene-Solr-tests-only-trunk - Build # 4621 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/4621/ 1 tests failed. REGRESSION: org.apache.lucene.index.TestIndexWriter.testOptimizeTempSpaceUsage Error Message: optimize used too much temporary space: starting usage was 60814 bytes; max temp usage was 244924 but should have been 243256 (= 4X starting usage) Stack Trace: junit.framework.AssertionFailedError: optimize used too much temporary space: starting usage was 60814 bytes; max temp usage was 244924 but should have been 243256 (= 4X starting usage) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1183) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1115) at org.apache.lucene.index.TestIndexWriter.testOptimizeTempSpaceUsage(TestIndexWriter.java:294) Build Log (for compile errors): [...truncated 3044 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
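For reference, the cap in that assertion works out to 4 x 60814 = 243256 bytes, so the observed peak of 244924 bytes overshoots the allowance by 1668 bytes, i.e. well under one percent.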
Re: CustomScoreQueryWithSubqueries
Hi Fernando, The wiki indeed relates mainly to trunk development. As the wiki page says, "Most development is done on the trunk." You can either use that, or, in order to create a 2.9 patch, check out code from /repos/asf/lucene/java/branches/lucene_2_9. Regards, Doron On Tue, Feb 8, 2011 at 4:56 AM, Fernando Wasylyszyn ferw...@yahoo.com.ar wrote: Robert: I'm trying to follow the steps mentioned in http://wiki.apache.org/lucene-java/HowToContribute in order to make a patch with my contribution. But in the source code that I get from http://svn.apache.org/repos/asf/lucene/dev/trunk/ the class org.apache.lucene.search.Searcher is missing, and the only method available to obtain a Scorer from a Weight object is scorer(IndexReader.AtomicReaderContext, ScorerContext). I just checked and class Searcher still exists in Lucene 3.0.3. On which version is the trunk that I've checked out based? The patch that I want to submit is based on Lucene 2.9.1. Thanks in advance. Regards. Fernando. -- *From:* Robert Muir rcm...@gmail.com *To:* dev@lucene.apache.org *Sent:* Wednesday, February 2, 2011 16:52:58 *Subject:* Re: CustomScoreQueryWithSubqueries On Wed, Feb 2, 2011 at 2:37 PM, Fernando Wasylyszyn ferw...@yahoo.com.ar wrote: Hi everyone. My name is Fernando and I am a researcher and developer in the R+D lab at Snoop Consulting S.R.L. in Argentina. Based on the patch suggested in LUCENE-1608 (https://issues.apache.org/jira/browse/LUCENE-1608) and on the needs of one of our customers, for whom we are developing a customized search engine on top of Lucene and Solr, we have developed the class CustomScoreQueryWithSubqueries, which is a variation of CustomScoreQuery that allows the use of arbitrary Query objects besides instances of ValueSourceQuery, without the need to wrap the arbitrary query (or queries) with the QueryValueSource proposed in Jira, which has the disadvantage of creating an instance of an IndexSearcher in each invocation of the method getValues(IndexReader). If you think that this contribution can be useful for the Lucene community, please let me know the steps in order to contribute. Hi Fernando: https://issues.apache.org/jira/browse/LUCENE-1608 is still an open issue. If you have a better solution, please don't hesitate to upload a patch file to the issue! There are some more detailed instructions here: http://wiki.apache.org/lucene-java/HowToContribute - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
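For example (untested, and assuming the same http://svn.apache.org/repos/asf root used for the trunk URL above), checking out that branch would look like:

    svn checkout http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9 lucene_2_9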
[jira] Created: (SOLR-2351) Allow the MoreLikeThis component to accept filters and use the already parsed query from previous stages (if applicable) as seed.
Allow the MoreLikeThis component to accept filters and use the already parsed query from previous stages (if applicable) as seed. - Key: SOLR-2351 URL: https://issues.apache.org/jira/browse/SOLR-2351 Project: Solr Issue Type: Improvement Components: MoreLikeThis Affects Versions: 1.5 Reporter: Amit Nithian Priority: Minor Fix For: 1.5, 1.3 Currently the MLT component doesn't accept filter queries specified on the URL, which my application needed (I needed to restrict similar results to a lat/long bounding box). This patch also attempts to solve the issue of allowing the dismax boost functions to be used in the MLT component, by OR'ing the query object created by the QueryComponent with the query created by the MLT as part of the final query. In a blank dismax query with no query/phrase clauses this works, although a separate BF definition/parsing would be ideal. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
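To make the requested behaviour concrete, a hypothetical request such as

    /select?q=id:12345&mlt=true&mlt.fl=description&fq=lat:[34.0 TO 35.0]&fq=lon:[-119.0 TO -118.0]

(field names, id and ranges invented; mlt, mlt.fl and fq are standard Solr parameters) would have the MLT component honour the fq clauses when collecting similar documents instead of ignoring them.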
[jira] Updated: (SOLR-2351) Allow the MoreLikeThis component to accept filters and use the already parsed query from previous stages (if applicable) as seed.
[ https://issues.apache.org/jira/browse/SOLR-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Nithian updated SOLR-2351: --- Attachment: mlt.patch Allow the MoreLikeThis component to accept filters and use the already parsed query from previous stages (if applicable) as seed. - Key: SOLR-2351 URL: https://issues.apache.org/jira/browse/SOLR-2351 Project: Solr Issue Type: Improvement Components: MoreLikeThis Affects Versions: 1.5 Reporter: Amit Nithian Priority: Minor Fix For: 1.3, 1.5 Attachments: mlt.patch Currently the MLT component doesn't accept filter queries specified on the URL, which my application needed (I needed to restrict similar results to a lat/long bounding box). This patch also attempts to solve the issue of allowing the dismax boost functions to be used in the MLT component, by OR'ing the query object created by the QueryComponent with the query created by the MLT as part of the final query. In a blank dismax query with no query/phrase clauses this works, although a separate BF definition/parsing would be ideal. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes
[ https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991818#comment-12991818 ] Lance Norskog commented on SOLR-2155: - The lat/long version has to be rotated away from the true poles. Then, the calculations don't blow up at the poles or the equator. The real answer to doing geo and having it always work is to use quaternions. A lat/lon pair is essentially a complex number: latitude is the scalar and longitude rotates back to 0. A quaternion is a 4-valued variation of complex numbers: a + bi + cj + dk where i,j,k are separate values of sqrt(-1), assuming an infinite number of such values. A geo position, projected onto quaternions, gives a subspace. There are a bunch of 3D algorithms which use quaternions because they don't have problems at the (0-1) boundary. The classic apocryphal story is the jet fighter pilot on a test flight: he crossed the equator and the plane flipped upside down. Quaternions don't have this problem. [SLERP|http://en.wikipedia.org/wiki/Slerp] explains the problem of distance on a sphere. How to do distances, box containment, etc., I don't know. I am _so_ not a math guy. Geospatial search using geohash prefixes Key: SOLR-2155 URL: https://issues.apache.org/jira/browse/SOLR-2155 Project: Solr Issue Type: Improvement Reporter: David Smiley Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch There currently isn't a solution in Solr for doing geospatial filtering on documents that have a variable number of points. This scenario occurs when there is location extraction (e.g. via a gazetteer) occurring on free text. None, one, or many geospatial locations might be extracted from any given document, and users want to limit their search results to those occurring in a user-specified area. I've implemented this by furthering the GeoHash based work in Lucene/Solr with a geohash prefix based filter. A geohash refers to a lat-lon box on the earth. Each successive character added further subdivides the box into a 4x8 (or 8x4 depending on the even/odd length of the geohash) grid. The first step in this scheme is figuring out which geohash grid squares cover the user's search query. I've added various extra methods to GeoHashUtils (and added tests) to assist in this purpose. The next step is an actual Lucene Filter, GeoHashPrefixFilter, that uses these geohash prefixes in TermsEnum.seek() to skip to relevant grid squares in the index. Once a matching geohash grid is found, the points therein are compared against the user's query to see if it matches. I created an abstraction GeoShape extended by subclasses named PointDistance... and CartesianBox to support different queried shapes so that the filter need not care about these details. This work was presented at LuceneRevolution in Boston on October 8th. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
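To make the prefix idea in the issue description concrete, here is a small self-contained sketch (plain Java, not Solr's GeoHashUtils; the coordinates and lengths are illustrative only) of the property the filter exploits: a point's geohash starts with the geohash of every cell that encloses it, so seeking to a prefix in the term dictionary visits exactly the indexed points inside that cell.

    // Standard geohash encoding: bits alternate lon/lat and every 5 bits become one base-32 character.
    public class GeoHashSketch {
        private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

        static String encode(double lat, double lon, int length) {
            double[] latRange = {-90.0, 90.0}, lonRange = {-180.0, 180.0};
            StringBuilder hash = new StringBuilder();
            boolean evenBit = true;   // even bits refine longitude, odd bits latitude
            int bit = 0, ch = 0;
            while (hash.length() < length) {
                double[] range = evenBit ? lonRange : latRange;
                double value = evenBit ? lon : lat;
                double mid = (range[0] + range[1]) / 2;
                ch <<= 1;
                if (value >= mid) { ch |= 1; range[0] = mid; } else { range[1] = mid; }
                evenBit = !evenBit;
                if (++bit == 5) { hash.append(BASE32.charAt(ch)); bit = 0; ch = 0; }
            }
            return hash.toString();
        }

        public static void main(String[] args) {
            String fine = encode(42.358, -71.060, 8);    // a point (roughly Boston), 8 characters
            String coarse = encode(42.358, -71.060, 4);  // the 4-character cell containing it
            // Prefix match == containment, which is what seeking on geohash prefixes relies on.
            System.out.println(fine + " startsWith " + coarse + " : " + fine.startsWith(coarse));
        }
    }

Each extra character narrows the cell (the 4x8 / 8x4 subdivision described above), so longer prefixes mean finer grid squares and the filter only has to inspect terms under the prefixes that cover the query shape.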