[ANNOUNCE] Apache Solr 4.9.0 released
25 June 2014, Apache Solr™ 4.9.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.9.0.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

Solr 4.9.0 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of details.

Solr 4.9.0 Release Highlights:

* Numerous optimizations for doc values search-time performance
* Allow a client application to request the minimum achieved replication factor for an update request (single or batch) by sending an optional parameter min_rf.
* Query re-ranking support with the new ReRankingQParserPlugin.
* A new [child ...] DocTransformer for optionally including Block-Join descendant documents inline in the results of a search.
* A new (default) Lucene49NormsFormat to better compress certain cases such as very short fields.

Solr 4.9.0 also includes many other new features as well as numerous optimizations and bugfixes from the corresponding Apache Lucene release.

Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.

On behalf of the Lucene PMC, Happy Searching
[ANNOUNCE] Apache Solr 4.8.1 released
May 2014, Apache Solr™ 4.8.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.8.1.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

Solr 4.8.1 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.8.1 includes 10 bug fixes, as well as Lucene 4.8.1 and its bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details.

Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.
[ANNOUNCE] Apache Solr 4.7.2 released.
April 2014, Apache Solr™ 4.7.2 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.7.2.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

Solr 4.7.2 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.7.2 includes 2 bug fixes, as well as Lucene 4.7.2 and its bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details.

Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.
Re: Unable to get offsets using AtomicReader.termPositionsEnum(Term)
Hello, I think you are confused between two different index structures, probably because of the names of the options in solr.

1. Indexing term vectors: this means given a document, you can go look up a miniature inverted index just for that document. Each document's term vectors hold a term dictionary of the terms in that one document, and optionally things like positions and character offsets. This can be useful if you are examining *many terms* for just a few documents, for example the MoreLikeThis use case. In solr this is activated with termVectors=true. To additionally store positions/offsets information inside the term vectors, it's termPositions and termOffsets, respectively.

2. Indexing character offsets: this means given a term, you can get the offset information along with each position that matched. So really you can think of this as a special form of a payload. This is useful if you are examining *many documents* for just a few terms, for example many highlighting use cases. In solr this is activated with storeOffsetsWithPositions=true. It is unrelated to term vectors.

Hopefully this helps.

On Mon, Mar 10, 2014 at 9:32 PM, Jefferson French jkfaus...@gmail.com wrote: This looks like a codec issue, but I'm not sure how to address it. I've found that a different instance of DocsAndPositionsEnum is instantiated between my code and Solr's TermVectorComponent. Mine: org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum Solr: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVDocsEnum As far as I can tell, I've only used Lucene/Solr 4.6, so I'm not sure where the Lucene 4.1 reference comes from. I've searched through the Solr config files and can't see where to change the codec, but shouldn't the reader use the same codec as was used when the index was created?
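To make the two structures concrete, here is a schema.xml sketch using the options named above (the field names and field type are illustrative, not from this thread):

```xml
<!-- (1) per-document term vectors: a mini inverted index per document,
     optionally carrying positions and character offsets -->
<field name="content_tv" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<!-- (2) character offsets stored with each position in the postings lists;
     unrelated to term vectors -->
<field name="content_po" type="text_general" indexed="true" stored="true"
       storeOffsetsWithPositions="true"/>
```

The second form is the one a postings-based enum (such as the one returned by AtomicReader.termPositionsEnum) reads; without it, startOffset()/endOffset() report -1.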
On Fri, Mar 7, 2014 at 1:37 PM, Jefferson French jkfaus...@gmail.com wrote: We have an API on top of Lucene 4.6 that I'm trying to adapt to running under Solr 4.6. The problem is although I'm getting the correct offsets when the index is created by Lucene, the same method calls always return -1 when the index is created by Solr. In the latter case I can see the character offsets via Luke, and I can even get them from Solr when I access the /tvrh search handler, which uses the TermVectorComponent class. This is roughly how I'm reading character offsets in my Lucene code:

AtomicReader reader = ...
Term term = ...
DocsAndPositionsEnum postings = reader.termPositionsEnum(term);
while (postings.nextDoc() != DocsAndPositionsEnum.NO_MORE_DOCS) {
  for (int i = 0; i < postings.freq(); i++) {
    postings.nextPosition(); // offsets are only defined after nextPosition()
    System.out.println("start: " + postings.startOffset());
    System.out.println("end: " + postings.endOffset());
  }
}

Notice that I want the values for a single term. When run against an index created by Solr, the above calls to startOffset() and endOffset() return -1. Solr's TermVectorComponent prints the correct offsets like this (paraphrased):

IndexReader reader = searcher.getIndexReader();
Terms vector = reader.getTermVector(docId, field);
TermsEnum termsEnum = vector.iterator(null);
BytesRef text;
DocsAndPositionsEnum dpEnum = null;
while ((text = termsEnum.next()) != null) {
  String term = text.utf8ToString();
  int freq = (int) termsEnum.totalTermFreq();
  dpEnum = termsEnum.docsAndPositions(null, dpEnum);
  dpEnum.nextDoc();
  for (int i = 0; i < freq; i++) {
    final int pos = dpEnum.nextPosition();
    System.out.println("start: " + dpEnum.startOffset());
    System.out.println("end: " + dpEnum.endOffset());
  }
}

but in this case it is getting the offsets per doc ID, rather than for a single term, which is what I want. Could anyone tell me: 1. Why I'm not able to get the offsets using my first example, and/or 2. A better way to get the offsets for a given term? Thanks. Jeff
Re: ANNOUNCE: Apache Solr Reference Guide for 4.7
I debugged the PDF a little. FWIW, the following code (using iText) takes it to 9MB:

public static void main(String args[]) throws Exception {
  Document document = new Document();
  PdfSmartCopy copy = new PdfSmartCopy(document, new FileOutputStream("/home/rmuir/Downloads/test.pdf"));
  //copy.setCompressionLevel(9);
  //copy.setFullCompression();
  document.open();
  PdfReader reader = new PdfReader("/home/rmuir/Downloads/apache-solr-ref-guide-4.7.pdf");
  int pages = reader.getNumberOfPages();
  for (int i = 0; i < pages; i++) {
    PdfImportedPage page = copy.getImportedPage(reader, i + 1);
    copy.addPage(page);
  }
  copy.freeReader(reader);
  reader.close();
  document.close();
}

On Wed, Mar 5, 2014 at 10:17 AM, Steve Rowe sar...@gmail.com wrote: Not sure if it's relevant anymore, but a few years ago Atlassian resolved as "won't fix" a request to configure exported PDF compression ratio: https://jira.atlassian.com/browse/CONF-21329. Their suggestion: zip the PDF. I tried that - the resulting zip size is roughly 9MB, so it's definitely compressible. Steve

On Mar 5, 2014, at 10:03 AM, Cassandra Targett casstarg...@gmail.com wrote: You know, I didn't even notice that. It did go up to 30M. I've made a note to look into that before we release the 4.8 version to see if it can be reduced at all. I suspect the screenshots are causing it to balloon - we made some changes to the way they appear in the PDF for 4.7 which may be the cause, but also the software was upgraded and maybe the newer version is handling them differently. Thanks for pointing that out.

On Tue, Mar 4, 2014 at 6:43 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: Has it really gone up in size from 5Mb for the 4.6 version to 30Mb for the 4.7 version? Or are some mirrors playing tricks (mine is: http://www.trieuvan.com/apache/lucene/solr/ref-guide/ ) Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once.
Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Mar 5, 2014 at 1:39 AM, Cassandra Targett ctarg...@apache.org wrote: The Lucene PMC is pleased to announce that we have a new version of the Solr Reference Guide available for Solr 4.7. The 395 page PDF serves as the definitive user's manual for Solr 4.7. It can be downloaded from the Apache mirror network: https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/ Cassandra
Re: Problems with ICUCollationField
you need the solr analysis-extras jar in your classpath, too.

On Wed, Feb 19, 2014 at 6:45 AM, Thomas Fischer fischer...@aon.at wrote: Hello, I'm migrating to solr 4.6.1 and have problems with the ICUCollationField (apache-solr-ref-guide-4.6.pdf, pp. 31 and 100). I consistently get the error message "Error loading class 'solr.ICUCollationField'", even after INFO: Adding 'file:/srv/solr4.6.1/contrib/analysis-extras/lib/icu4j-49.1.jar' to classloader and INFO: Adding 'file:/srv/solr4.6.1/contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-4.6.1.jar' to classloader. Am I missing something? In solr's subversion I found /SVN/solr/contrib/analysis-extras/src/java/org/apache/solr/schema/ICUCollationField.java but no corresponding class in solr 4.6.1's contrib folder. Best Thomas
Re: Problems with ICUCollationField
you need the solr analysis-extras jar itself, too.

On Wed, Feb 19, 2014 at 8:25 AM, Thomas Fischer fischer...@aon.at wrote: Hello Robert, I already added contrib/analysis-extras/lib/ and contrib/analysis-extras/lucene-libs/ via lib directives in solrconfig; this is why the classes mentioned are loaded. Do you know which jar is supposed to contain the ICUCollationField? Best regards Thomas

On 19.02.2014 at 13:54, Robert Muir wrote: you need the solr analysis-extras jar in your classpath, too.
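For reference, a solrconfig.xml sketch of the lib directives involved; the relative paths are assumptions that depend on where your core sits relative to the Solr install, so adjust them to your layout:

```xml
<!-- ICU dependencies and the Lucene ICU analyzers -->
<lib dir="../../../contrib/analysis-extras/lib" regex=".*\.jar"/>
<lib dir="../../../contrib/analysis-extras/lucene-libs" regex=".*\.jar"/>
<!-- the jar that actually contains solr.ICUCollationField lives under dist/ -->
<lib dir="../../../dist" regex="solr-analysis-extras-.*\.jar"/>
```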
Re: Problems with ICUCollationField
Hmm, for standardization of text fields, collation might be a little awkward. For your German umlauts, what do you mean by standardize? Is this to achieve equivalency of e.g. oe to ö in your search terms? In that case, a simpler approach would be to put GermanNormalizationFilterFactory in your chain: http://lucene.apache.org/core/4_6_1/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html

On Wed, Feb 19, 2014 at 9:16 AM, Thomas Fischer fischer...@aon.at wrote: Thanks, that helps! I'm trying to migrate from the now-deprecated ICUCollationKeyFilterFactory I used before to the ICUCollationField. Is there any description of how to achieve this? First tries now yield "ICUCollationField does not support specifying an analyzer.", which makes it complicated, since I used the ICUCollationKeyFilterFactory to standardize my text fields (in particular because of German umlauts). But an ICUCollationField without a LowerCaseFilter, a WhitespaceTokenizer, a LetterTokenizer, etc. doesn't do me much good, I'm afraid. Or is this somehow wrapped into the ICUCollationField? I didn't find ICUCollationField in the solr wiki, and there is not much information in the reference. And the hint "solr.ICUCollationField is included in the Solr analysis-extras contrib - see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib in order to use it" is misleading insofar as this README.txt doesn't mention the solr-analysis-extras-4.6.1.jar in dist. Best Thomas

On 19.02.2014 at 14:27, Robert Muir wrote: you need the solr analysis-extras jar itself, too.
Re: Problems with ICUCollationField
On Wed, Feb 19, 2014 at 10:33 AM, Thomas Fischer fischer...@aon.at wrote:

Hmm, for standardization of text fields, collation might be a little awkward.

I arrived there after using custom rules for a while (see RuleBasedCollator on http://wiki.apache.org/solr/UnicodeCollation) and then being told "For better performance, less memory usage, and support for more locales, you can add the analysis-extras contrib and use ICUCollationKeyFilterFactory instead." (on the same page under "ICU Collation").

For your German umlauts, what do you mean by standardize? Is this to achieve equivalency of e.g. oe to ö in your search terms?

That is the main point, but I might also need the additional normalization of combined characters, like o + combining diaeresis = ö, and probably similar constructions for other languages (like Hungarian).

Sure, but using collation to get normalization is pretty overkill too. Maybe try ICUNormalizer2Filter? This gives you better control over the normalization anyway.

In that case, a simpler approach would be to put GermanNormalizationFilterFactory in your chain: http://lucene.apache.org/core/4_6_1/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html

I'll see how far I get with this, but from the description
• 'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
• 'ae' and 'oe' are replaced by 'a' and 'o', respectively.
this seems to be too far-reaching a reduction: while the identification ä=ae is not very serious and rarely misleading, ä=a might pack together words that shouldn't be; Äsen and Asen are quite different concepts.

I'm not sure that's a mainstream opinion: not only do the default German collation rules conflate these two characters as equivalent at the primary level, but so do many German stemming algorithms. Similar arguments could be made for 'résumé' versus 'resume' and so on. Search isn't an exact science.
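A sketch of what the normalization-without-collation approach could look like in schema.xml. The factory names are the real ones from analyzers-common and analysis-extras, but the exact chain and the field-type name are an illustration, not a recommendation from the thread:

```xml
<fieldType name="text_de_norm" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Unicode normalization: composes o + combining diaeresis into ö -->
    <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf"/>
    <!-- or, for the German-specific ae/ä conflation discussed above: -->
    <!-- <filter class="solr.GermanNormalizationFilterFactory"/> -->
  </analyzer>
</fieldType>
```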
[ANNOUNCE] Apache Solr 4.6.1 released.
January 2014, Apache Solr™ 4.6.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.6.1.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

Solr 4.6.1 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.6.1 includes 29 bug fixes and one optimization, as well as Lucene 4.6.1 and its bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details.

Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.
Re: Tracking down the input that hits an analysis chain bug
This exception comes from OffsetAttributeImpl (e.g. you don't need to index anything to reproduce it). Maybe you have a missing clearAttributes() call (your tokenizer 'returns true' without calling that first)? This could explain it, if something like a StopFilter is also present in the chain: basically the offsets overflow. The test infrastructure in BaseTokenStreamTestCase should be able to detect this as well...

On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies ben...@basistech.com wrote: Using Solr Cloud with 4.3.1. We've got a problem with a tokenizer that manifests as calling OffsetAtt.setOffsets() with invalid inputs. OK, so, we want to figure out what input provokes our code into getting into this pickle. The problem happens on SolrCloud nodes. It manifests as this sort of thing:

Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-1811581632,endOffset=-1811581632

How could we get a document ID so that we can tell which document was being processed?
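The "offsets overflow" failure mode is plain int arithmetic and can be seen without Lucene at all. A minimal sketch (the numbers are chosen only to force the wrap; in the real bug the offsets would accumulate across a reused, never-cleared token stream):

```java
public class OffsetOverflowDemo {
    public static void main(String[] args) {
        int offset = 0;
        // simulate an offset counter that is never reset between documents
        // and keeps accumulating past Integer.MAX_VALUE (2,147,483,647)
        for (int i = 0; i < 3000; i++) {
            offset += 1_000_000;
        }
        // 3,000,000,000 does not fit in an int: it wraps to a large negative
        // value, the same shape as startOffset=-1811581632 in the report
        System.out.println(offset);
    }
}
```

Once a wrapped negative value reaches OffsetAttributeImpl.setOffset, the IllegalArgumentException above is thrown.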
Re: Bad fieldNorm when using morphologic synonyms
no, it's turned on by default in the default similarity. as i said, all that is necessary is to fix your analyzer to emit the proper position increments.

On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: In order to set discountOverlaps to true you must have added the similarity class solr.DefaultSimilarityFactory to the schema.xml, which is commented out by default! As by default this param is false, the above situation is expected with correct positioning, as said. In order to fix the field norms you'd have to reindex with the similarity class which initializes the param to true. Cheers, Manu
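For anyone following along, the schema.xml form being discussed looks like this; the explicit discountOverlaps setting is shown only for clarity (per Robert's point it already defaults to true):

```xml
<similarity class="solr.DefaultSimilarityFactory">
  <bool name="discountOverlaps">true</bool>
</similarity>
```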
Re: Bad fieldNorm when using morphologic synonyms
it's accurate, you are wrong. please, look at setDiscountOverlaps in your similarity. This is really easy to understand.

On Sun, Dec 8, 2013 at 7:23 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Robert, your last reply is not accurate. It's true that the field norms and termVectors are independent. But this issue of higher norms for this case is expected with well-assigned positions. The length norm is assigned from FieldInvertState.length, which is the count of incrementToken calls and not the number of positions! It is the case for WordDelimiterFilter or ReversedWildcardFilter, which do change the norm when expanding a term.
Re: Bad fieldNorm when using morphologic synonyms
Your analyzer needs to set positionIncrement correctly: sounds like it's broken.

On Thu, Dec 5, 2013 at 1:53 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi, we implemented a morphologic analyzer, which stems words at index time. For some reasons, we index both the original word and the stem (at the same position, of course). The stemming is done for a specific language, so other languages are not stemmed at all. Because of that, two documents with the same number of terms may have different termVector sizes. A document which contains many words that get stemmed will have a double-sized termVector. This behaviour affects the relevance score in a BAD way: the fieldNorm of these documents reduces their score. This is NOT the wanted behaviour in our case. We are looking for a way to mark the stemmed words (at index time, of course) so they won't affect the fieldNorm. Does such a way exist? Do you have another idea?
Re: Bad fieldNorm when using morphologic synonyms
termvectors have nothing to do with any of this. please, fix your analyzer first. if you want to add a synonym, it should have a position increment of zero. i bet exact phrase queries aren't working correctly either.

On Fri, Dec 6, 2013 at 12:50 AM, Isaac Hebsh isaac.he...@gmail.com wrote: 1) positions look all right (to me). 2) fieldNorm is determined by the size of the termVector, isn't it? The termVector size isn't affected by the positions.

On Fri, Dec 6, 2013 at 10:46 AM, Robert Muir rcm...@gmail.com wrote: Your analyzer needs to set positionIncrement correctly: sounds like it's broken.
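One stock way to index "original word + stem at the same position" without a custom analyzer is KeywordRepeatFilter: it emits each token twice, marks one copy as a keyword so the stemmer leaves it alone, and the stemmed copy keeps a position increment of zero. A schema.xml sketch (the field-type name and choice of stemmer are illustrative, not from this thread):

```xml
<fieldType name="text_stemmed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- duplicate each token; the keyword-marked copy bypasses the stemmer -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <!-- drop the duplicate when the stem equals the original -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Combined with discountOverlaps, tokens at position increment zero then stop inflating the field norm.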
Re: Why do people want to deploy to Tomcat?
which example? there are so many. On Wed, Nov 13, 2013 at 1:00 PM, Mark Miller markrmil...@gmail.com wrote: RE: the example folder It’s something I’ve been pushing towards moving away from for a long time - see https://issues.apache.org/jira/browse/SOLR-3619 Rename 'example' dir to 'server' and pull examples into an 'examples’ directory Part of a push I’ve been on to own the Container level (people are now on board with that for 5.0), add start scripts, and other niceties that we should have but don’t yet. Even our config files should move away from being an “example” and end up more like a default starting template. Like a database, it should be simple to create a collection without needing to deal with config - you want to deal with the config when you need to, not face it all up front every time it is time to create a new collection. IMO, the name example is historical - most people already use it this way, the name just confuses matters. - Mark On Nov 13, 2013, at 12:30 PM, Shawn Heisey s...@elyograg.org wrote: On 11/13/2013 5:29 AM, Dmitry Kan wrote: Reading that people have considered deploying example folder is slightly strange to me. No wonder they are confused and confuse their ops. I do use the stripped jetty included in the example, but my setup is not a straight copy of the example directory. I removed a lot of it and changed how jars get loaded. I built my own init script from scratch, tailored for my setup. I'll start a new thread with my init script and some info about how I installed Solr. Thanks, Shawn
Re: Background merge errors with Solr 4.4.0 on Optimize call
I think it's a bug, but that's just my opinion. i sent a patch to dev@ for thoughts.

On Tue, Oct 29, 2013 at 6:09 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, so you're saying that merging indexes where a field has been removed isn't handled. So you have some documents that do have a "what" field, but your schema doesn't have it, is that true? It _seems_ like you could get by by putting the _what_ field back into your schema, just not sending any data to it in new docs. I'll let others who understand merging better than me chime in on whether this is a case that should be handled or a bug. I pinged the dev list to see what the opinion is. Best, Erick

On Mon, Oct 28, 2013 at 6:39 PM, Matthew Shapiro m...@mshapiro.net wrote: Sorry for reposting after I just sent in a reply, but I just looked at the error trace closer and noticed:

Caused by: java.lang.IllegalArgumentException: no such field what

The 'what' field was removed by request of the customer, as they wanted the logic behind what gets queried in the what field to be code-side instead of solr-side (for easier changing without having to re-index everything; I didn't feel strongly either way, and since they are paying me, I took it out). This makes me wonder if it's crashing while merging because a field that used to be there is now gone. However, this seems odd to me, as Solr doesn't even let me delete the old data; instead it's leaving my collection in an extremely bad state, and the only remedy I can think of is to nuke the index at the filesystem level. If this is indeed the cause of the crash, is the only way to delete a field to completely empty your index first?

On Mon, Oct 28, 2013 at 6:34 PM, Matthew Shapiro m...@mshapiro.net wrote: Thanks for your response. You were right, solr is logging to the catalina.out file for tomcat.
When I click the optimize button in solr's admin interface, the following logs are written: http://apaste.info/laup About JVM memory: solr's admin interface lists JVM memory at 3.1% (221.7MB dark grey, 512.56MB light grey, 6.99GB total).

On Mon, Oct 28, 2013 at 6:29 AM, Erick Erickson erickerick...@gmail.com wrote: For Tomcat, the Solr output often goes to catalina.out by default, so the output might be there. You can configure Solr to send the logs most anywhere you please, but without some specific setup on your part the log output just goes to the default for the servlet. I took a quick glance at the code, but since the merges are happening in the background, there's not much context for where that error is thrown. How much memory is there for the JVM? I'm grasping at straws a bit... Erick

On Sun, Oct 27, 2013 at 9:54 PM, Matthew Shapiro m...@mshapiro.net wrote: I am working on implementing solr as the search backend for our web system. So far things have been going well, but today I made some schema changes and now things have broken. I updated the schema.xml file and reloaded the core (via the admin interface). No errors were reported in the logs. I then pushed 100 records to be indexed. A call to commit afterwards seemed fine; however, my next call to optimize caused the following errors:

java.io.IOException: background merge hit exception: _2n(4.4):C4263/154 _30(4.4):C134 _32(4.4):C10 _31(4.4):C10 into _37 [maxNumSegments=1]
null:java.io.IOException: background merge hit exception: _2n(4.4):C4263/154 _30(4.4):C134 _32(4.4):C10 _31(4.4):C10 into _37 [maxNumSegments=1]

Unfortunately, googling for "background merge hit exception" came up with 2 things: a corrupt index or not enough free space. The host machine that's hosting solr has 227 out of 229GB free (according to df -h), so that's not it.
I then ran CheckIndex on the index and got the following results: http://apaste.info/gmGU As someone who is new to solr and lucene, as far as I can tell this means my index is fine. So I am at a loss. I'm fairly sure I could delete my data directory and rebuild it, but I am more interested in finding out why it is having issues, what the best way to fix it is, and how to prevent it from happening when this goes into production. Does anyone have any advice that may help?

As an aside, I do not have a stacktrace for you because the solr admin page isn't giving me one. I tried looking in the logs directory under my solr directory, but it does not contain any logs. I opened up my ~/tomcat/lib/log4j.properties file and saw http://apaste.info/0rTL, which didn't really help me find log files. Doing a 'find . | grep solr.log' didn't really help either. Any help finding the log files (which may help find the actual cause of this) would also be appreciated.
Re: Problems installing Solr4 in Jetty9
On Sat, Aug 17, 2013 at 3:59 AM, Chris Collins ch...@geekychris.com wrote: I am using 4.4 in an embedded mode and found that it has a dependency on hadoop 2.0.5-alpha, which in turn depends on jetty 6.1.26, which I think pre-dates electricity :-}

I think this is only a test dependency?
Re: PostingsHighlighter returning fields which don't match
On Wed, Aug 14, 2013 at 3:53 AM, ses stew...@ssims.co.uk wrote: We are trying out the new PostingsHighlighter with Solr 4.2.1 and finding that the highlighting section of the response includes self-closing tags for all the fields in hl.fl (by default for edismax it is all fields in qf) where there are no highlighting matches. In contrast, the same query on Solr 4.0.0 without PostingsHighlighter returns only the fields containing highlighting matches. Here is a simplified example of the highlighting response for a document with no matches in the fields specified by hl.fl:

with PostingsHighlighter:

<response>
...
<lst name="highlighting">
  <lst name="Z123456">
    <arr name="A1"/>
    <arr name="A2"/>
    <arr name="A3"/>
    ...
  </lst>
</lst>
</response>

without PostingsHighlighter:

<response>
...
<lst name="highlighting">
  <lst name="Z123456"/>
</lst>
</response>

Do you want to open a JIRA issue to just change the behavior?

This is a big problem for us, as we have a large number of fields in a dynamic field, and we believe every time a highlighted response comes back it is sending us a very large number of self-closing tags, which bloats the response to an unreasonable size (in some cases 100MB+).

Unrelated: if your queries actually go against a large number of fields, I'm not sure how efficient this highlighter will be. That's because at some number of N fields, it will be much more efficient to use a document-oriented term vector approach (e.g. the standard highlighter/fast-vector-highlighter).
Re: Who's cleaning the Fieldcache?
On Wed, Aug 14, 2013 at 5:29 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : why? Those are my sort fields and they are occupying a lot of space (doubled : in this case but I see that sometimes I have three or four old segment : references) : : Is there something I can do to remove those old references? I tried to reload : the core and it seems the old references are discarded (i.e. garbage : collected) but I believe it is not a good workaround; I would avoid reloading the : core for every replication cycle. You don't need to reload the core to get rid of the old FieldCaches -- in fact, there is nothing about reloading the core that will guarantee old FieldCaches get removed. FieldCaches are managed using a WeakHashMap - so once the IndexReaders associated with those FieldCaches are no longer used, they will be garbage collected when and if the JVM's garbage collector gets around to it. If they sit around after you are done with them, they might look like they take up a lot of memory, but that just means your JVM heap has that memory to spare and hasn't needed to clean them up yet. I don't think this is correct. When you register an entry in the fieldcache, it registers event listeners on the segment's core so that when it's close()d, any entries are purged rather than waiting on GC. See FieldCacheImpl.java
Re: Who's cleaning the Fieldcache?
On Wed, Aug 14, 2013 at 5:58 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : FieldCaches are managed using a WeakHashMap - so once the IndexReaders : associated with those FieldCaches are no longer used, they will be garbage : collected when and if the JVM's garbage collector gets around to it. : : If they sit around after you are done with them, they might look like : they take up a lot of memory, but that just means your JVM heap has that memory : to spare and hasn't needed to clean them up yet. : : I don't think this is correct. : : When you register an entry in the fieldcache, it registers event : listeners on the segment's core so that when it's close()d, any entries : are purged rather than waiting on GC. : : See FieldCacheImpl.java Ah ... sweet. I didn't realize that got added. (In any case: it looks like a WeakHashMap is still used in case the listeners never get called, correct?) I think it might be the other way around: I think it was a weak map from the beginning; the close listeners were then added sometime in the 3.x series, so we registered purge events as an optimization. But one way to look at it is: readers should really get closed, so why have the weak map and not just a regular hashmap? Even if we want to keep the weak map (seriously, I don't care, and I don't want to be the guy fielding complaints on this), I'm going to open an issue with a patch that removes it and fails tests in @AfterClass if there are any entries. This way it's totally clear if/when/where anything is relying on GC today here and we can at least look at that.
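The two eviction strategies discussed in this thread can be sketched with a stdlib-only toy (this is not Lucene's actual FieldCacheImpl, whose reader keys and listener API differ; all names here are invented): a cache that registers a close listener so entries are purged eagerly, instead of relying on a WeakHashMap plus the garbage collector.

```java
import java.util.HashMap;
import java.util.Map;

// Toy cache illustrating eager purge-on-close: the entry is removed by a
// close listener instead of waiting for GC, as described for FieldCacheImpl.
public class PurgeOnCloseDemo {
    // Stand-in for a segment reader that notifies a listener on close().
    static class ToyReader {
        private Runnable onClose;
        void addCloseListener(Runnable listener) { this.onClose = listener; }
        void close() { if (onClose != null) onClose.run(); }
    }

    static final Map<ToyReader, long[]> CACHE = new HashMap<>();

    static long[] getLongs(ToyReader reader) {
        return CACHE.computeIfAbsent(reader, r -> {
            // Register the purge listener the first time we cache for r.
            r.addCloseListener(() -> CACHE.remove(r));
            return new long[] {1, 2, 3}; // pretend these are field values
        });
    }

    public static void main(String[] args) {
        ToyReader reader = new ToyReader();
        getLongs(reader);
        System.out.println("entries before close: " + CACHE.size());
        reader.close(); // listener fires; entry purged without any GC involvement
        System.out.println("entries after close: " + CACHE.size());
    }
}
```

With the listener in place, the map is empty immediately after `close()`; a WeakHashMap alone would only shrink whenever the collector happened to run.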
Re: Split Shard Error - maxValue must be non-negative
Did you do a (real) commit before trying to use this? I am not sure how this splitting works, but at least the merge option requires that. I can't see this happening unless you are somehow splitting a 0-document index (or the splitter is creating 0-document splits), so this is likely just a symptom of https://issues.apache.org/jira/browse/LUCENE-5116 On Tue, Aug 13, 2013 at 6:46 AM, Srivatsan ranjith.venkate...@gmail.com wrote: Hi, I am experimenting with the Solr 4.4.0 split shard feature. When I split the shard I get the following exception:

java.lang.IllegalArgumentException: maxValue must be non-negative (got: -1)
    at org.apache.lucene.util.packed.PackedInts.bitsRequired(PackedInts.java:1184)
    at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:140)
    at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
    at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
    at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
    at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:2488)
    at org.apache.solr.update.SolrIndexSplitter.split(SolrIndexSplitter.java:125)
    at org.apache.solr.update.DirectUpdateHandler2.split(DirectUpdateHandler2.java:766)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleSplitAction(CoreAdminHandler.java:284)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:611)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:209)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:368)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:679)

How to resolve this problem? -- View this message in context: http://lucene.472066.n3.nabble.com/Split-Shard-Error-maxValue-must-be-non-negative-tp4084220.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Split Shard Error - maxValue must be non-negative
Well, I meant before, but I just took a look, and this is implemented differently than the merge one. In any case, I think it's the same bug, because I think the only way this can happen is if somehow this splitter is trying to create a 0-document split (or maybe a split containing all deletions). On Tue, Aug 13, 2013 at 8:22 AM, Srivatsan ranjith.venkate...@gmail.com wrote: Ya, I am performing a commit after the split request is submitted to the server. -- View this message in context: http://lucene.472066.n3.nabble.com/Split-Shard-Error-maxValue-must-be-non-negative-tp4084220p4084256.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Split Shard Error - maxValue must be non-negative
On Tue, Aug 13, 2013 at 11:39 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: The splitting code calls commit before it starts the splitting. It creates a LiveDocsReader using a bitset created by the split. This reader is merged to an index using addIndexes. Shouldn't the addIndexes code then ignore all such 0-document segments? Not in 4.4: https://issues.apache.org/jira/browse/LUCENE-5116
Re: Is there a way to store binary data (byte[]) in DocValues?
On Mon, Aug 12, 2013 at 8:38 AM, Mathias Lux m...@itec.uni-klu.ac.at wrote: Hi! I'm basically searching for a method to put byte[] data into Lucene DocValues of type BINARY (see [1]). Currently only primitives and Strings are supported according to [1]. I know that this can be done with a custom update handler, but I'd like to avoid that. Can you describe a little bit what kind of operations you want to do with it? I don't really know how BinaryField is typically used, but maybe it could support this option. On the other hand, adding it to BinaryField might not buy you much without some additional stuff, depending upon what you need to do. If you really want to sort/facet on the thing, SORTED(SET) would probably be a better implementation: it doesn't care that the values are binary. BINARY, SORTED, and SORTED_SET actually all take byte[]; the difference is:
* SORTED: deduplicates/compresses the unique byte[] values and gives each document an ordinal number that reflects sort order (for sorting/faceting/grouping/etc.)
* SORTED_SET: similar, except each document has a set (which can be empty) of ordinal numbers (e.g. for faceting multivalued fields)
* BINARY: just stores the byte[] for each document (no deduplication, no compression, no ordinals, nothing).
So for sorting/faceting, BINARY is generally not very efficient unless something custom is going on: for example, Lucene's faceting package stores the values elsewhere in a separate taxonomy index, so it uses this type just to encode a delta-compressed ordinal list for each document. For scoring factors/function queries, encoding the values inside NUMERIC(s) [up to 64 bits each] might still be best on average: the compression applied here is surprisingly efficient.
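The SORTED idea described above can be illustrated with a stdlib-only toy (this is not Lucene's actual encoding; all class and method names here are invented): the unique byte[] values are deduplicated and sorted once into a dictionary, and each document then stores only a small ordinal into that dictionary, instead of its own copy of the bytes as BINARY would.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy sketch of SORTED doc values: per-document byte[] values become a
// (sorted unique dictionary, per-document ordinal) pair.
public class SortedDvSketch {
    public static int[] buildOrds(byte[][] perDoc, List<byte[]> dictOut) {
        // TreeMap with unsigned-lexicographic order deduplicates and sorts.
        TreeMap<byte[], Integer> unique = new TreeMap<>(SortedDvSketch::compareUnsigned);
        for (byte[] v : perDoc) unique.putIfAbsent(v, null);
        int ord = 0;
        for (byte[] v : unique.keySet()) { unique.put(v, ord++); dictOut.add(v); }
        int[] ords = new int[perDoc.length];
        for (int doc = 0; doc < perDoc.length; doc++) ords[doc] = unique.get(perDoc[doc]);
        return ords;
    }

    // Lucene compares bytes as unsigned values; mimic that here.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[][] docs = { {2, 2}, {1}, {2, 2}, {3} }; // docs 0 and 2 share a value
        List<byte[]> dict = new ArrayList<>();
        int[] ords = buildOrds(docs, dict);
        System.out.println("unique values: " + dict.size()); // 3, not 4
        System.out.println("ords: " + java.util.Arrays.toString(ords)); // [1, 0, 1, 2]
    }
}
```

The ordinal order reflects the sort order of the values, which is why sorting and faceting can work on ordinals alone without touching the bytes.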
Re: Is there a way to store binary data (byte[]) in DocValues?
On Mon, Aug 12, 2013 at 12:25 PM, Mathias Lux m...@itec.uni-klu.ac.at wrote: Another reason for not using the SORTED_SET and SORTED implementations is that Solr currently works with Strings for those, and I want to have a small memory footprint for millions of images ... which does not go well with immutables. Just as a side note, again these work with byte[]. It happens to be the case that Solr uses these for its StringField (converting the strings to bytes), but if you wanted to use them with BinaryField you could (they just take BytesRef).
Re: Purging unused segments.
On Fri, Aug 9, 2013 at 7:48 PM, Erick Erickson erickerick...@gmail.com wrote: So is there a good way, without optimizing, to purge any segments not referenced in the segments file? Actually I doubt that optimizing would even do it if I _could_; any phantom segments aren't visible from the segments file anyway... I don't know why you have these files (windows? deletion policy?) but maybe you are interested in this: http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/IndexWriter.html#deleteUnusedFiles%28%29
Re: Invalid UTF-8 character 0xfffe during shard update
On Mon, Aug 5, 2013 at 11:42 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : I agree with you, 0xfffe is a special character, that is why I was asking : how it's handled in solr. : In my document, 0xfffe does not appear at the beginning, it's in the : content. Unless I'm misunderstanding something (and it's very likely that I am)... 0xfffe is not a special character -- it is explicitly *not* a character in Unicode at all; it is set aside as not a character, specifically so that the character 0xfeff can be used as a BOM, and if the BOM is read incorrectly, it will cause an error. XML doesn't allow control characters like this; it defines Char as: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Re: Invalid UTF-8 character 0xfffe during shard update
On Mon, Aug 5, 2013 at 3:03 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : 0xfffe is not a special character -- it is explicitly *not* a character in : Unicode at all; it is set aside as not a character, specifically so : that the character 0xfeff can be used as a BOM, and if the BOM is read : incorrectly, it will cause an error. : : XML doesn't allow control characters like this; it defines Char as: But is that even relevant? I thought FFFE was *not* a control character? I thought it was completely invalid in Unicode. It's totally relevant. FFFE is a Unicode codepoint, but it's a noncharacter. It's just that XML disallows FFFE and noncharacters, but allows other noncharacters (like 9). These are allowed but discouraged: http://www.w3.org/TR/xml11/#charsets
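The Char production quoted in this thread can be turned into a small, stdlib-only predicate (a hypothetical helper, written directly from the XML 1.0 grammar above) that makes the distinction concrete: U+FFFE is rejected, while the BOM code point U+FEFF and tab are accepted.

```java
// Checks a code point against the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
public class XmlCharCheck {
    public static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXmlChar(0xFFFE)); // false: the noncharacter in question
        System.out.println(isXmlChar(0xFEFF)); // true: the BOM code point itself is allowed
        System.out.println(isXmlChar(0x9));    // true: tab is explicitly allowed
    }
}
```

This is the check that makes a document containing a raw 0xFFFE in its content invalid XML, regardless of where in the content it appears.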
Re: WikipediaTokenizer for Removing Unnecesary Parts
If you use WikipediaTokenizer it will tag different wiki elements with different types (you can see this in the admin UI). So then follow up with TypeTokenFilter to keep only the types you care about, and I think it will do what you want. On Tue, Jul 23, 2013 at 7:53 AM, Furkan KAMACI furkankam...@gmail.com wrote: Hi; I have indexed Wikipedia data with the Solr DIH. However, when I look at the data indexed in Solr I see something like this as well: {| style=text-align: left; width: 50%; table-layout: fixed; border=0 |- valign=top | style=width: 50%| :*[[Ubuntu]] :*[[Fedora]] :*[[Mandriva]] :*[[Linux Mint]] :*[[Debian]] :*[[OpenSUSE]] | *[[Red Hat]] *[[Mageia]] *[[Arch Linux]] *[[PCLinuxOS]] *[[Slackware]] |} However, I want to remove them before indexing. I know that there is a WikipediaTokenizer in Lucene, but how can I remove the unnecessary parts (like links, style, etc.) with Solr?
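A sketch of that chain as a Solr fieldType (the type name, the `wikitypes.txt` file, and the exact attribute names are assumptions to verify against your Solr version's TypeTokenFilterFactory documentation):

```xml
<!-- sketch: keep only plain-text token types; drop wiki markup types -->
<fieldType name="text_wiki" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WikipediaTokenizerFactory"/>
    <!-- wikitypes.txt lists the token types to keep, one per line -->
    <filter class="solr.TypeTokenFilterFactory" types="wikitypes.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With `useWhitelist="true"` the listed types are the ones kept; without it, the listed types are the ones removed, so check which direction you want before deploying.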
Re: Using per-segment FieldCache or DocValues in custom component?
Where do you get the docid from? Usually it's best to just look at the whole algorithm: e.g. docids come from per-segment readers by default anyway, so ideally you want to access any per-document things from that same SegmentReader. As far as supporting docvalues, the FieldCache API passes through to docvalues transparently if it's enabled for the field. On Mon, Jul 1, 2013 at 4:55 PM, Michael Ryan mr...@moreover.com wrote: I have some custom code that uses the top-level FieldCache (e.g., FieldCache.DEFAULT.getLongs(reader, foobar, false)). I'd like to redesign this to use the per-segment FieldCaches so that re-opening a Searcher is fast(er). In most cases, I've got a docId and I want to get the value for a particular single-valued field for that doc. Is there a good place to look to see example code of per-segment FieldCache use? I've been looking at PerSegmentSingleValuedFaceting, but hoping there might be something less confusing :) Also thinking DocValues might be a better way to go for me... is there any documentation or example code for that? -Michael
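If you do start from a top-level docid, the per-segment translation boils down to docBase arithmetic. A stdlib-only sketch (the names are invented; this mirrors the kind of lookup Lucene's ReaderUtil.subIndex performs over reader.leaves()):

```java
// Given each segment's starting top-level docid (its docBase), find which
// segment a global docid lives in, then subtract that docBase to get the
// segment-local docid to use with the per-segment FieldCache.
public class DocBaseDemo {
    // e.g. segments with maxDoc 10, 25, 5 have docBases {0, 10, 35}
    public static int subIndex(int docId, int[] docBases) {
        int seg = 0;
        while (seg + 1 < docBases.length && docBases[seg + 1] <= docId) seg++;
        return seg;
    }

    public static void main(String[] args) {
        int[] docBases = {0, 10, 35};
        int docId = 12;
        int seg = subIndex(docId, docBases);
        int localDoc = docId - docBases[seg];
        System.out.println("segment=" + seg + " local=" + localDoc); // segment=1 local=2
    }
}
```

In real code the docBases come from each AtomicReaderContext, and as the reply notes, it is usually better to restructure the algorithm so docids never leave their segment in the first place.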
Re: Are there any plans to change example directory layout?
If you have a good idea... just do it. Open an issue. On Jun 11, 2013 9:34 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: I think it is quite hard for beginners that the basic Solr example directory is competing for attention with other, nested, examples. I see quite a lot of questions about which directory inside 'example' to pay attention to and which to ignore, etc. Actually, this is so confusing, I am not even sure how to put this in writing. Basically, is anybody aware of people looking into the example directory structure? A JIRA maybe? Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: Requesting to add into a Contributor Group
Done. Let us know if you have any problems. On Sat, May 4, 2013 at 10:12 AM, Krunal jariwalakru...@gmail.com wrote: Dear Sir, Kindly add me to the contributor group so I can help contribute to the Solr wiki. My email id: jariwalakru...@gmail.com Login name: Krunal Specific changes I would like to make to begin with are: - Correct the link to Ajax Solr here http://wiki.apache.org/solr/SolrJS which is wrong; the correct link should be https://github.com/evolvingweb/ajax-solr/wiki - Add our company data here http://wiki.apache.org/solr/Support We offer Solr integration services on the .NET platform at Xcellence-IT. And a business division of ours, nopAccelerate, offers a Solr integration plugin for nopCommerce along with other nopCommerce performance optimization services. We have been working on Solr for the last year and will be happy to contribute back by helping the community maintain and update the Wiki. If this is not allowed, then kindly let us know and I will send you our company details so you can make the changes too. Thanks, awaiting your response. Krunal *Krunal Jariwala* *Cell:* +91-98251-07747 *Best time to call:* 9am to 7pm (IST) GMT +5.30
Re: Solr using a ridiculous amount of memory
On Sun, Mar 24, 2013 at 4:19 AM, John Nielsen j...@mcb.dk wrote: Schema with DocValues attempt at solving problem: http://pastebin.com/Ne23NnW4 Config: http://pastebin.com/x1qykyXW This schema isn't using docvalues, due to a typo in your config: it should not be DocValues=true but docValues=true. Are you not getting an error? Solr needs to throw an exception if you provide invalid attributes to the field. Nothing is more frustrating than having a typo or something in your configuration that Solr just ignores, reporting no error, while not working the way you want. I'll look into this (I already intend to add these checks to analysis factories for the same reason). Separately, if you really want the terms data and so on to remain on disk, it is not enough to just enable docvalues for the field. The default implementation uses the heap. So if you want that, you need to set docValuesFormat=Disk on the fieldtype. This will keep the majority of the data on disk, and only some key data structures in heap memory. This might have a significant performance impact depending upon what you are doing, so you need to test that.
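The two fixes described above might look like this in schema.xml (the field and type names here are invented; verify that your 4.x version supports docValuesFormat before relying on it):

```xml
<!-- note the lowercase d: docValues, not DocValues -->
<fieldType name="string_dv_disk" class="solr.StrField"
           docValues="true" docValuesFormat="Disk"/>
<field name="sort_field" type="string_dv_disk" indexed="true" stored="false"/>
```

With docValuesFormat="Disk" on the fieldType, the bulk of the docvalues data stays on disk and only the key lookup structures live on the heap, which is the behavior the original poster was after.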
Re: Fuzzy Suggester and exactMatchFirst
On Sun, Mar 17, 2013 at 8:19 PM, Eoghan Ó Carragáin eoghan.ocarrag...@gmail.com wrote: I can see why the Fuzzy Suggester sees college as a match for colla but expected the exactMatchFirst parameter to ensure that suggestions beginning with colla to be weighted higher than fuzzier matches. I have spellcheck.onlyMorePopular set to true, in case this makes a difference. Am I misunderstanding what exactMatchFirst is supposed to do? Is there a way to ensure suggestions matching exactly what the user has entered rank higher than fuzzy matches? I think exactMatchFirst is unrelated to typo-correction: it only ensures that if you type the whole suggestion exactly that the weight is completely ignored. This means if you type 'college' and there is an actual suggestion of 'college' it will be weighted above 'colleges' even if colleges has a much higher weight. On the other hand what you want (i think) is to punish the weights of suggestions that required some corrections. Currently I don't think there is any way to do that: * NOTE: This suggester does not boost suggestions that * required no edits over suggestions that did require * edits. This is a known limitation. I think the trickiest part about this is how the punishment formula should work. Because today this thing makes no assumptions as to how you came up with your suggestion weights... But feel free to open a JIRA issue if you have ideas !
Re: Out of Memory doing a query Solr 4.2
On Fri, Mar 15, 2013 at 6:46 AM, raulgrande83 raulgrand...@hotmail.com wrote: Thank you for your help. I'm afraid it won't be so easy to change the JVM version, because it is required at the moment. It seems that Solr 4.2 supports Java 1.6 at least. Is that correct? Could you find any clue as to what is happening in the attached traces? It would be great to know why it is happening now, because it was working for Solr 3.5. It's probably not an OOM at all. Instead, it's more likely the IBM JVM is miscompiling our code and producing large integers, like it does quite often. For example, we had to disable testing it completely recently for this reason. If someone were to report a JIRA issue that mentioned IBM, I'd make the same comment there but in general not take it seriously at all, due to the kind of bugs I've seen from that JVM. The fact that the IBM JVM didn't miscompile 3.5's code is irrelevant.
Re: Out of Memory doing a query Solr 4.2
On Thu, Mar 14, 2013 at 12:07 PM, raulgrande83 raulgrand...@hotmail.com wrote: JVM: IBM J9 VM(1.6.0.2.4) I don't recommend using this JVM.
Re: Using suggester for smarter phrase autocomplete
On Wed, Mar 13, 2013 at 11:07 AM, Eric Wilson wilson.eri...@gmail.com wrote: I'm trying to use the suggester for auto-completion with Solr 4. I have followed the example configuration for phrase suggestions at the bottom of this wiki page: http://wiki.apache.org/solr/Suggester This shows how to use a text file with the following text for phrase suggestions:

# simple auto-suggest phrase dictionary for testing
# note this uses tabs as separator!
the first phrase	1.0
the second phrase	2.0
testing 1234	3.0
foo	5.0
the fifth phrase	2.0
the final phrase	4.0

This seems to be working in the expected way. If I query for 'the f' I receive the following suggestions:

<str>the final phrase</str> <str>the fifth phrase</str> <str>the first phrase</str>

I would like to deal with the case where the user is interested in 'the foo'. When 'the fo' is entered, there will be no suggestions. Is it possible to provide both the phrase matches and the matches for individual words, so that when the user-entered text is no longer part of any actual phrase, there are still suggestions to be made for the final word? Is it really the case that you want matches for individual words, or just to handle e.g. the stopwords case like 'the fo' - foo? The latter can be done with AnalyzingSuggester (configure a StopFilter on the analyzer).
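One way to wire up the AnalyzingSuggester suggestion above (a sketch against the Solr 4.x suggester component; the component name, field names, and the `text_suggest` field type are assumptions, and `text_suggest`'s analyzer chain would need to include a solr.StopFilterFactory so that 'the fo' analyzes to 'fo' and can still complete against 'foo'):

```xml
<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.AnalyzingLookupFactory</str>
    <!-- the analyzer of this field type (with a StopFilter) is applied to
         both the dictionary entries and the user's input before matching -->
    <str name="suggestAnalyzerFieldType">text_suggest</str>
    <str name="field">suggest_field</str>
  </lst>
</searchComponent>
```

Because the same analyzer normalizes both sides, stopwords in the typed prefix no longer prevent a match against the stored phrases.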
Re: It seems a issue of deal with chinese synonym for solr
I agree. Actually, that top-level logic is fine; it's the loop that follows that's wrong: it needs to look at the position increment and do the right thing. Want to open a JIRA issue? On Mon, Mar 11, 2013 at 9:15 PM, 李威 li...@antvision.cn wrote: In org.apache.solr.parser.SolrQueryParserBase, there is a function: protected Query newFieldQuery(Analyzer analyzer, String field, String queryText, boolean quoted) throws SyntaxError The code below can't process Chinese correctly: BooleanClause.Occur occur = positionCount > 1 && operator == AND_OPERATOR ? BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD; For example, “北京市” and “北京” are synonyms; if I search 北京市动物园, the expected parse result is +(北京市 北京) +动物园, but actually it would be parsed to +北京市 +北京 +动物园. The code can process English, because English words are separated by spaces and occupy only one position. In order to process Chinese, I think it should judge by position increment, not by position count. Could you help take a look? Thanks, Wei Li
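The fix being discussed — grouping by position increment rather than by position count — can be sketched with a stdlib-only toy (invented names; this is not SolrQueryParserBase itself): tokens with a position increment of 0 sit at the same position as the previous token (i.e. they are stacked synonyms) and should be OR'd inside one clause, while each increment of 1 starts a new required position.

```java
import java.util.ArrayList;
import java.util.List;

// Group tokens into positions using position increments: posInc == 0 means
// "same position as the previous token" (a synonym), posInc >= 1 starts a
// new position. Each group then becomes one clause of the query.
public class PosIncGrouping {
    public static List<List<String>> group(String[] terms, int[] posIncs) {
        List<List<String>> positions = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            if (posIncs[i] > 0 || positions.isEmpty()) positions.add(new ArrayList<>());
            positions.get(positions.size() - 1).add(terms[i]);
        }
        return positions;
    }

    public static void main(String[] args) {
        // 北京市 and 北京 are synonyms: the second token has posInc 0
        String[] terms = {"北京市", "北京", "动物园"};
        int[] posIncs = {1, 0, 1};
        // Two positions: [北京市, 北京] and [动物园], i.e. +(北京市 北京) +动物园
        System.out.println(group(terms, posIncs));
    }
}
```

Counting positions alone cannot distinguish this token stream from three separate words, which is exactly why the original loop produces +北京市 +北京 +动物园.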
[ANNOUNCE] Apache Solr 4.2 released
March 2013, Apache Solr™ 4.2 available The Lucene PMC is pleased to announce the release of Apache Solr 4.2 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 4.2 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html See the CHANGES.txt file included with the release for a full list of details. Solr 4.2 Release Highlights: * A read side REST API for the schema. Always wanted to introspect the schema over http? Now you can. Looks like the write side will be coming next. * DocValues have been integrated into Solr. DocValues can be loaded up a lot faster than the field cache and can also use different compression algorithms as well as in RAM or on Disk representations. Faceting, sorting, and function queries all get to benefit. How about the OS handling faceting and sorting caches off heap? No more tuning 60 gigabyte heaps? How about a snappy new per segment DocValues faceting method? Improved numeric faceting? Sweet. * Collection Aliasing. Got time based data? Want to re-index in a temporary collection and then swap it into production? Done. Stay tuned for Shard Aliasing. * Collection API responses. The collections API was still very new in 4.0, and while it improved a fair bit in 4.1, responses were certainly needed, but missed the cut off. Initially, we made the decision to make the Collection API super fault tolerant, which made responses tougher to do. No one wants to hunt through logs files to see how things turned out. Done in 4.2. * Interact with any collection on any node. 
Until 4.2, you could only interact with a node in your cluster if it hosted at least one replica of the collection you wanted to query/update. No longer - query any node, whether it has a piece of your intended collection or not and get a proxied response. * Allow custom shard names so that new host addresses can take over for retired shards. Working on Amazon without elastic ips? This is for you. * Lucene 4.2 optimizations such as compressed term vectors. Solr 4.2 also includes many other new features as well as numerous optimizations and bugfixes. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Happy searching, Lucene/Solr developers
Re: MockAnalyzer in Lucene: attach stemmer or any custom filter?
For 3.4, extend ReusableAnalyzerBase. On Fri, Feb 15, 2013 at 12:06 PM, Dmitry Kan solrexp...@gmail.com wrote: Thanks a lot, Robert. I need to study the link you sent a bit more closely. I have tried to override the Analyzer class, but couldn't find a method createComponents(String fieldName, Reader reader) in LUCENE_34. Instead, there is a method required to override: tokenStream(String fieldName, Reader reader). Is there a way of incorporating the custom filter into the TokenStream? Dmitry On Thu, Feb 14, 2013 at 5:37 PM, Robert Muir rcm...@gmail.com wrote: MockAnalyzer is really just MockTokenizer+MockTokenFilter+ Instead you just define your own analyzer chain using MockTokenizer. This is the way all of Lucene's own analysis tests work, e.g. http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/test/org/apache/lucene/analysis/en/TestEnglishMinimalStemFilter.java On Thu, Feb 14, 2013 at 7:40 AM, Dmitry Kan solrexp...@gmail.com wrote: Hello, Asked a question on SO: http://stackoverflow.com/questions/14873207/mockanalyzer-in-lucene-attach-stemmer-or-any-custom-filter Is there a way to configure a stemmer or a custom filter with the MockAnalyzer class? Version: LUCENE_34 Dmitry
Re: MockAnalyzer in Lucene: attach stemmer or any custom filter?
MockAnalyzer is really just MockTokenizer+MockTokenFilter+ Instead you just define your own analyzer chain using MockTokenizer. This is the way all of Lucene's own analysis tests work, e.g. http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/test/org/apache/lucene/analysis/en/TestEnglishMinimalStemFilter.java On Thu, Feb 14, 2013 at 7:40 AM, Dmitry Kan solrexp...@gmail.com wrote: Hello, Asked a question on SO: http://stackoverflow.com/questions/14873207/mockanalyzer-in-lucene-attach-stemmer-or-any-custom-filter Is there a way to configure a stemmer or a custom filter with the MockAnalyzer class? Version: LUCENE_34 Dmitry
Re: Exception when trying to save to a field with storeOffsetsWithPositions=true
On Tue, Jan 22, 2013 at 12:23 PM, Meng Muk meng@uniqueinteractive.com wrote: If I set the field type to text_en however it works. I'm guessing something in the way the text is being analyzed is causing this exception to appear? Is there a limitation in how storeOffsetsWithPositions should be used? IndexWriter will refuse broken offsets up-front if you use this feature: it's strict about this and will throw an exception at index time if the analyzer is broken. You can see the list of broken analysis components here: https://issues.apache.org/jira/browse/LUCENE-4641 If you really want to use one of these broken analysis components, you can use another highlighter, but it probably just means you won't see these analyzer bugs until search time (InvalidTokenOffsetsExceptions and so on).
[ANNOUNCE] Apache Solr 3.6.2 released
25 December 2012, Apache Solr™ 3.6.2 available The Lucene PMC and Santa Claus are pleased to announce the release of Apache Solr 3.6.2. Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites. This release is a bug fix release for version 3.6.1. It contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-3x-redir.html (see note below). See the CHANGES.txt file included with the release for a full list of details. Solr 3.6.2 Release Highlights: * Fixed ConcurrentModificationException during highlighting, if all fields were requested. * Fixed edismax queryparser to apply minShouldMatch to implicit boolean queries. * Several bugfixes to the DataImportHandler. * Bug fixes from Apache Lucene 3.6.2. Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Happy holidays and happy searching, Lucene/Solr developers
Re: Japanese exact match results do not show on top of results
I think you are hitting SOLR-3589. There is a vote underway for a 3.6.2 release that contains this fix. On Dec 20, 2012 6:29 PM, kirpakaro khem...@yahoo.com wrote: Hi folks, I am having a couple of problems with Japanese data: 1. it is not properly indexing all the data 2. displaying the exact match result on top and then 90% match and 80% match etc. does not work. I am using Solr 3.6.1 and using text_ja as the fieldType. Here is the schema:

<field name="q" type="text_ja" indexed="true" stored="true"/>
<field name="qs" type="text_general" indexed="false" stored="true" multiValued="true"/>
<field name="q_e" type="string" indexed="true" stored="true"/>
<copyField source="q" dest="q_e" maxChars="250"/>

What I want to achieve is that if there is an exact query match it should provide the results from q_e, followed by results from partial matches in the q field; and if there is nothing in the q_e field then partial matches should come from the q field. This is how I specify the query:

http://localhost:7983/zoom/jp/select/?q=鹿児島 鹿児島銀行&rows=10&version=2.2&qf=query+query_exact^1&mm=90%25&pf=q^1+q_e^10 OR version=2.2&rows=10&qf=q+q_e^1&pf=query^10+query_exact^1

Somehow the exact query matches do not come on top, though the data contains them. It is puzzling that all the documents do not get indexed properly, but if I change the q field to string and q_e to text_ja then all the records are indexed properly — but that still does not solve the problem of exact matches on top followed by partial matches.
The text_ja field uses: <filter class="solr.JapaneseBaseFormFilterFactory"/> <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="../../../solr/conf/lang/stoptags_ja.txt" enablePositionIncrements="true"/> <filter class="solr.CJKWidthFilterFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="../../../solr/conf/lang/stopwords_ja.txt" enablePositionIncrements="true"/> <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/> <filter class="solr.LowerCaseFilterFactory"/> How can I solve this problem? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Japanese-exact-match-results-do-not-show-on-top-of-results-tp4028422.html Sent from the Solr - User mailing list archive at Nabble.com.
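For anyone trying to reproduce a request like the one in the thread, the query string needs proper `&` separators and URL encoding of the Japanese text. A minimal sketch in Python; the host, path, and field names (q, q_e) are taken from the thread and are unlikely to match your setup:

```python
from urllib.parse import urlencode

# Boost the exact-match string field (q_e) well above the tokenized
# field (q) so exact matches sort first; field names follow the thread.
params = {
    "q": "鹿児島 鹿児島銀行",
    "defType": "dismax",
    "qf": "q^1 q_e^10",   # search both fields, exact-match field boosted 10x
    "pf": "q_e^10",       # extra phrase boost on the exact-match field
    "mm": "90%",
    "rows": 10,
}
url = "http://localhost:7983/zoom/jp/select/?" + urlencode(params)
print(url)
```

urlencode takes care of percent-encoding the multi-byte characters and the literal `%` in `mm=90%`, which are easy to get wrong when pasting a URL by hand.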
Re: ICUTokenizer labels number as Han character?
Your attachment didn't come through: I think the list strips them. Maybe just open a JIRA and attach your screenshots? Or put them elsewhere and include a link? As far as the ultimate behavior, I think it's correct. Keep in mind tokens don't really get a script value: runs of untokenized text do. Common is stuff like numbers/punctuation/etc. that just keeps the run whatever it was before (e.g. Han). And the bigram filter only bigrams text with certain token types (NUM is not one of them), so emitting a singleton is correct. On Wed, Dec 19, 2012 at 5:10 PM, Tom Burton-West tburt...@umich.edu wrote: Hello, I don't know if the Solr admin panel is lying, or if this is a weird bug. The string 1986年 gets analyzed by the ICUTokenizer, with 1986 being identified as type:NUM and script:Han. Then the CJKBigramFilter identifies 1986 as type:NUM and script:Han, and 年 as type:Single and script:Common. This doesn't seem right. I couldn't fit the whole analysis output on one screen, so there are two screenshots attached. Any clues as to what is going on and whether it is a problem? Tom
Re: order question on solr multi value field
I agree with James. Actually, Lucene tests will fail if a codec violates this. And it goes much deeper than that: from the Lucene APIs, when you call IndexReader.document() with your StoredFieldVisitor, it must visit the fields in the original order added. So even if you do: add("title", "title value 1"); add("body", "body value"); add("title", "title value 2"); currently stored fields must be returned in exactly this order: title1, then body, then title2. This is pretty annoying :) I don't think it's truly necessary to maintain that crazy guarantee, but I'm pretty sure something tests it somewhere. In my opinion it's too restrictive and prevents useful optimizations. But in my opinion, title1 should always come back before title2, in the order you added them, just like today. On Tue, Dec 18, 2012 at 10:54 AM, Dyer, James james.d...@ingramcontent.com wrote: I would say such a guarantee is implied by the javadoc for Analyzer#getPositionIncrementGap. It says this value is an increment to be added to the next token emitted from tokenStream. http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/analysis/Analyzer.html#getPositionIncrementGap%28java.lang.String%29 Also compare unofficial documentation such as Lucene in Action, 2nd ed., section 4.7.1: Lucene logically appends the tokens...sequentially. Having multi-valued fields stay in the order in which they were added to the Document is a guarantee that many, many users depend on. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -----Original Message----- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Tuesday, December 18, 2012 9:30 AM To: solr-user@lucene.apache.org Subject: Re: order question on solr multi value field If there is no official guarantee in the Javadoc for the code, then there is no official guarantee. Period. If somebody wants an official, contractual guarantee, a Jira should be filed to create one. To put it simply: are the values a list or a set?
-- Jack Krupansky -----Original Message----- From: Erik Hatcher Sent: Tuesday, December 18, 2012 9:40 AM To: solr-user@lucene.apache.org Cc: solr-user@lucene.apache.org Subject: Re: order question on solr multi value field I don't know of an official guarantee of maintaining order, but it's definitely guaranteed and relied upon to retain order. Many will scream if this changes. Indexed doesn't matter here, because what you get back are the stored values whether the field is indexed or not. Erik On Dec 18, 2012, at 3:04, hellorsanjeev sanjeev.dhi...@3pillarglobal.com wrote: Thank you for the quick response :) I have the same observation, and I too believe that there is no reason for Solr to reorder a multi-valued field. But would you stay firm on your conclusion if I say that my multi-valued field was indexed? Please note: in my one year of experience with Solr, it has always returned the values in insertion order, irrespective of whether the field was indexed or not. My main concern is that because I couldn't find this documented anywhere, it might happen that in Solr 4.0 or later they start reordering them. If they do, there will be a big problem for us :) -- View this message in context: http://lucene.472066.n3.nabble.com/order-question-on-solr-multi-value-field-tp4027695p4027713.html
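To make the guarantee being discussed concrete, consider an update document like the following (field names are hypothetical, mirroring the add() calls above). The stored values of the multi-valued title field come back in exactly the order they were added:

```xml
<add>
  <doc>
    <field name="id">1</field>
    <field name="title">title value 1</field>
    <field name="body">body value</field>
    <field name="title">title value 2</field>
  </doc>
</add>
```

A query returning this document will list "title value 1" before "title value 2" in the title field's values, regardless of whether title is indexed.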
Re: Regexp and speed
On Fri, Nov 30, 2012 at 12:13 PM, Roman Chyla roman.ch...@gmail.com wrote: The code is here: https://github.com/romanchyla/montysolr/blob/solr-trunk/contrib/adsabs/src/test/org/adsabs/lucene/BenchmarkAuthorSearch.java Should the benchmark perhaps not be called a 'benchmark'? Do you think it may be too simplistic? Can we expect some bad surprises somewhere? I think maybe a few surprises: since it extends LuceneTestCase and uses RandomIndexWriter, newSearcher, and so on, the benchmark results can be confusing. This stuff is fantastic to use for tests, but for benchmarks it may cause confusion. For example, you might run it and get the SimpleText codec, it might wrap the IndexSearcher with slow things like ParallelReader, and you might get horrific merge parameters, and so on.
Re: Skewed IDF in multi lingual index
Hi again Markus. Sorry for the slow reply here. I'm confused: are you saying the score goes negative? Are you sure there are no 3.x segments? Can you check that docCount is not -1? Do you happen to have a test? Can you share your modified similarity, or give more details? I just want to make sure there isn't a bug in Lucene here (we verify this statistic currently in CheckIndex and other places, but there is always the possibility). On Mon, Nov 12, 2012 at 7:39 AM, Markus Jelsma markus.jel...@openindex.io wrote: I'd like to add that multiplicative boosting on very scarce properties, e.g. when you want to boost on a boolean value of which there are only very few, causes a problem in scoring when using docCount instead of maxDoc. If docCount is one, IDF will be ~0.3, and with the fieldWeight you'll end up with a score below 0. Because of this, the product of all multiplicative boosts will be lower than the product of similar boosts, lowering the document in rank instead of boosting it. -----Original message----- From: Markus Jelsma markus.jel...@openindex.io Sent: Fri 09-Nov-2012 10:23 To: solr-user@lucene.apache.org Subject: RE: Skewed IDF in multi lingual index Robert, Tom, That's it indeed! Using maxDoc as the numerator as opposed to docCount yields very skewed results for an unevenly distributed multi-lingual index. We have one language dominating the other twenty, so the dominating language contains no rare terms compared to the others. We're now checking results using docCount and it seems alright. I do have to get used to the fact that document scores are now roughly 1000 times higher than before, but I'm already very happy with CollectionStatistics and will see if all works well. Any other tips to share? Thanks, Markus -----Original message----- From: Robert Muir rcm...@gmail.com Sent: Thu 08-Nov-2012 17:44 To: solr-user@lucene.apache.org Subject: Re: Skewed IDF in multi lingual index Hi Markus: how are the languages distributed across documents?
Imagine I have a text_en field and a text_fr field. Let's say I have 100 documents: 95 are English and only 5 are French. So the text_en field is populated 95% of the time, and the text_fr field 5% of the time. But the default IDF computation doesn't look at things this way: it always uses '100' as maxDoc. So in such a situation, any terms against text_fr are rare :) The first thing I would look at is treating this situation as merging results from an English index with 95 docs and a French index with 5 docs. So I would consider overriding the two idfExplain methods (term and phrase) to use CollectionStatistics.docCount() instead of CollectionStatistics.maxDoc(). The former would be 95 for the English field (instead of 100), and 5 for the French field (instead of 100). I don't think this will solve all your problems, but it might help. Note: you must ensure your index is fully upgraded to 4.0 to try this statistic; otherwise it will return -1 if you have any 3.x segments in your index. On Thu, Nov 8, 2012 at 11:13 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, We're testing a large multi-lingual index with _LANG fields for each language, using dismax to query them all. Users provide, explicitly or implicitly, language preferences that we use for either additive or multiplicative boosting on the language of the document. However, additive boosting is not adequate because it cannot overcome the extremely high IDF values for the same word in another language, so regardless of the preference, foreign documents are returned. Multiplicative boosting solves this problem but has the downside that it doesn't allow us, with a standard qf=field^boost, to prefer documents in another language above the preferred language, because the multiplicative boost is so strong. We do use the def function (boost=def(query($qq),.3)) to prevent one boost query from returning 0 and thus a product of 0 for all boost queries.
But it doesn't help that much. This all comes down to IDF differences between the languages; even common words such as country names like `india` show large differences in IDF. Is there anyone with some hints or experiences to share about skewed IDF in such an index? Thanks, Markus
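The skew described in this thread can be seen numerically with the classic Lucene-style idf formula, 1 + ln(numDocs / (docFreq + 1)), here sketched in plain Python using the hypothetical numbers from the thread (100 documents total, 5 of them French):

```python
import math

def idf(doc_freq, num_docs):
    # Classic Lucene (TFIDFSimilarity-style) inverse document frequency.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# A term appearing in 4 of the 5 French documents:
skewed = idf(4, 100)  # maxDoc: whole index counted, so the term looks rare
fair = idf(4, 5)      # docCount: only docs that have text_fr counted

print(f"idf with maxDoc=100: {skewed:.2f}")
print(f"idf with docCount=5: {fair:.2f}")
```

With maxDoc the French term gets an idf near 4.0 even though it occurs in 80% of the French documents; with docCount it drops to 1.0, matching intuition. This is only an illustration of the statistic, not of Lucene's full scoring.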
Re: Error loading class solr.CJKBigramFilterFactory
On Wed, Nov 14, 2012 at 8:12 AM, Frederico Azeiteiro frederico.azeite...@cision.com wrote: To do some further testing I installed Solr 3.5.0 using the default Jetty server. When I tried to start Solr using the same schema I got: SEVERE: org.apache.solr.common.SolrException: Error loading class 'solr.CJKBigramFilterFactory' This filter was added in 3.6, so it's expected that it wouldn't be found.
Re: Error loading class solr.CJKBigramFilterFactory
I'm sure. I added it to 3.6 ;) You must have something funky with your Tomcat configuration, like an exploded war with different versions of jars or some other form of jar hell. On Wed, Nov 14, 2012 at 9:32 AM, Frederico Azeiteiro frederico.azeite...@cision.com wrote: Are you sure about that? We have it working on: Solr Specification Version: 3.5.0.2011.11.22.14.54.38 Solr Implementation Version: 3.5.0 1204988 - simon - 2011-11-22 14:54:38 Lucene Specification Version: 3.5.0 Lucene Implementation Version: 3.5.0 1204988 - simon - 2011-11-22 14:46:51 Current Time: Wed Nov 14 17:30:07 WET 2012 Server Start Time: Wed Nov 14 11:40:36 WET 2012 ?? Thanks, Frederico -----Original Message----- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Wednesday, 14 November 2012 16:28 To: solr-user@lucene.apache.org Subject: Re: Error loading class solr.CJKBigramFilterFactory On Wed, Nov 14, 2012 at 8:12 AM, Frederico Azeiteiro frederico.azeite...@cision.com wrote: To do some further testing I installed Solr 3.5.0 using the default Jetty server. When I tried to start Solr using the same schema I got: SEVERE: org.apache.solr.common.SolrException: Error loading class 'solr.CJKBigramFilterFactory' This filter was added in 3.6, so it's expected that it wouldn't be found.
Re: Does ICUFoldingFilterFactory make CJKWidthFilterFactory unnecessary?
Yes, it's a subset. On Nov 14, 2012 1:18 PM, Shawn Heisey s...@elyograg.org wrote: I am using ICUFoldingFilterFactory in my Solr schema. Now I am looking at adding CJKBigramFilterFactory, and I've noticed that it often goes with CJKWidthFilterFactory. Here are the relevant Javadocs for my question: http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html http://lucene.apache.org/core/4_0_0/analyzers-icu/org/apache/lucene/analysis/icu/ICUFoldingFilter.html The descriptions of these two classes suggest that if I already have ICUFoldingFilter, I do not need CJKWidthFilter. Do I have that right or wrong? Thanks, Shawn
Re: URL parameters to use FieldAnalysisRequestHandler
I think the UI uses this behind the scenes, as in there is no more analysis.jsp like before? So maybe try using something like Burp Suite while using the analysis UI in your browser to see what requests it's sending. On Tue, Nov 13, 2012 at 11:00 AM, Tom Burton-West tburt...@umich.edu wrote: Hello, I would like to send a request to the FieldAnalysisRequestHandler. The javadoc lists the parameter names, such as analysis.field, but sending those as URL parameters does not seem to work: mysolr.umich.edu/analysis/field?analysis.name=title&q=fire-fly Leaving out the analysis prefix doesn't work either: mysolr.umich.edu/analysis/field?name=title&q=fire-fly No matter what field I specify, the analysis returned is for the default field. (See response excerpt below.) Is there a page somewhere that shows the correct syntax for sending GET requests to the FieldAnalysisRequestHandler? Tom <lst name="analysis"> <lst name="field_types"/> <lst name="field_names"> <lst name="ocr"
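For reference, the parameter names the FieldAnalysisRequestHandler documentation lists are analysis.fieldname, analysis.fieldtype, analysis.fieldvalue, and analysis.query, not analysis.name. A sketch of building such a request; the host and field name are taken from the thread and may not match your setup:

```python
from urllib.parse import urlencode

params = {
    "analysis.fieldname": "title",      # analyze using this field's chain
    "analysis.fieldvalue": "fire-fly",  # text to run through the index analyzer
    "analysis.query": "fire-fly",       # optionally also run the query analyzer
    "wt": "xml",
}
url = "http://mysolr.umich.edu/analysis/field?" + urlencode(params)
print(url)
```

Using analysis.fieldtype instead of analysis.fieldname analyzes against a field type rather than a concrete field.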
Re: customize solr search/scoring for performance
Whenever I look at Solr users' stacktraces for disjunctions, I always notice they get BooleanScorer2. Is there some reason for this, or is it not intentional (e.g. maybe an in-order collector is always being used, when it's possible at least in simple cases to allow for out-of-order hits)? When I examine test contributions from clover reports (e.g. https://builds.apache.org/job/Lucene-Solr-Clover-4.x/49/clover-report/), I notice that only Lucene tests and Solr spellchecking tests actually hit BooleanScorer's collect. All other Solr tests hit BooleanScorer2. If it's possible to allow for an out-of-order collector in some common cases (e.g. large disjunctions w/ minShouldMatch generated by Solr queryparsers), it could be a nice performance improvement. On Mon, Nov 12, 2012 at 3:48 PM, jchen2000 jchen...@yahoo.com wrote: The following was generated from jvisualvm. It seems like the performance is heavily tied to scoring. Any idea/pointer on how to customize that part? http://lucene.472066.n3.nabble.com/file/n4019850/profilingResult.png -- View this message in context: http://lucene.472066.n3.nabble.com/customize-solr-search-scoring-for-performance-tp4019444p4019850.html
Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters
On Wed, Nov 7, 2012 at 11:45 AM, Daniel Brügge daniel.brue...@googlemail.com wrote: Hi, I am running a SolrCloud cluster with the 4.0.0 version. I have a stopwords file which is in the correct encoding. What makes you think that? Note: "Because I can read it" is not the correct answer. Ensure any of your stopwords files etc. are in UTF-8. This is often different from the encoding your computer uses by default if you open a file, start typing in it, and press save.
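A quick way to actually verify the claim, rather than eyeballing the file, is to try decoding it as UTF-8. A sketch that writes a German stopword in Latin-1 to demonstrate how a wrongly encoded file fails the check (the file names are throwaway temporaries, not real Solr config):

```python
import os
import tempfile

def is_utf8(path):
    """Return True if the file decodes cleanly as UTF-8."""
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        return True
    except UnicodeDecodeError:
        return False

# A stopwords file accidentally saved in Latin-1 (e.g. by a misconfigured editor):
bad = tempfile.NamedTemporaryFile(delete=False, suffix=".txt")
bad.write("für\nüber\n".encode("latin-1"))
bad.close()

# The same content correctly saved as UTF-8:
good = tempfile.NamedTemporaryFile(delete=False, suffix=".txt")
good.write("für\nüber\n".encode("utf-8"))
good.close()

bad_ok, good_ok = is_utf8(bad.name), is_utf8(good.name)
print(bad_ok, good_ok)
os.unlink(bad.name)
os.unlink(good.name)
```

The umlauts are the giveaway: pure-ASCII stopword files look identical in any common encoding, which is why this problem tends to surface only for languages with non-ASCII characters.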
Re: Skewed IDF in multi lingual index
Hi Markus: how are the languages distributed across documents? Imagine I have a text_en field and a text_fr field. Lets say I have 100 documents, 95 are english and only 5 are french. So the text_en field is populated 95% of the time, and the text_fr 5% of the time. But the default IDF computation doesnt look at things this way: it always uses '100' as maxDoc. So in such a situation, any terms against text_fr are rare :) The first thing i would look at, is treating this situation as merging results from a english index with 95 docs and a french index with 5 docs. So I would consider overriding the two idfExplain methods (term and phrase) to use CollectionStatistics.docCount() instead of CollectionStatistics.maxDoc() The former would be 95 for the english field (instead of 100), and 5 for the french field (instead of 100). I dont think this will solve all your problems: but it might help. Note: you must ensure your index is fully upgraded to 4.0 to try this statistic, otherwise it will return -1 if you have any 3.x segments in your index. On Thu, Nov 8, 2012 at 11:13 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, We're testing a large multi lingual index with _LANG fields for each language and using dismax to query them all. Users provide, explicit or implicit, language preferences that we use for either additive or multiplicative boosting on the language of the document. However, additive boosting is not adequate because it cannot overcome the extremely high IDF values for the same word in another language so regardless of the the preference, foreign documents are returned. Multiplicative boosting solves this problem but has the other downside as it doesn't allow us with standard qf=field^boost to prefer documents in another language above the preferred language because the multiplicative is so strong. We do use the def function (boost=def(query($qq),.3)) to prevent one boost query to return 0 and thus a product of 0 for all boost queries. 
But it doesn't help that much This all comes down to IDF differences between the languages, even common words such as country names like `india` show large differences in IDF. Is here anyone with some hints or experiences to share about skewed IDF in such an index? Thanks, Markus
Re: Where can I find an example of a 4.0 contraction file?
You have a character encoding issue: this is telling you the file is not correctly encoded as UTF-8. On Thu, Nov 1, 2012 at 6:11 PM, dm_tim dm_...@yahoo.com wrote: I should have mentioned I tried that. I get the following exception: SEVERE: Unable to create core: core0 java.lang.RuntimeException: java.nio.charset.MalformedInputException: Input length = 1 Any other suggestions? Regards, Tim -- View this message in context: http://lucene.472066.n3.nabble.com/Where-can-I-find-an-example-of-a-4-0-contraction-file-tp4017699p4017705.html
Re: Unable to build trunk
You will have to use 'find' on your .ivy2! On Wed, Oct 31, 2012 at 6:32 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, Where is that lock file located? I triggered it again (in another contrib) and will trigger it again in the future, and I don't want to remove my Ivy cache each time :) Thanks -----Original message----- From: Robert Muir rcm...@gmail.com Sent: Tue 30-Oct-2012 15:14 To: solr-user@lucene.apache.org Subject: Re: Unable to build trunk It's not wonky. You just have to ensure you have nothing else (like some IDE, or a build somewhere else) using Ivy; then it's safe to remove the .lck file there. I turned on this locking so that it hangs instead of causing cache corruption, but Ivy only has a SimpleLockFactory, so if you ^C at the wrong time, it might leave a .lck file. On Tue, Oct 30, 2012 at 9:27 AM, Erick Erickson erickerick...@gmail.com wrote: Not sure if it's relevant, but sometimes the ivy caches are wonky. Try deleting (on OS X) ~/.ivy2 recursively and building again? Of course your next build will download a bunch of jars... FWIW, Erick On Tue, Oct 30, 2012 at 5:38 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, Since yesterday we're unable to build trunk, and also a clean checkout from trunk. We can compile the sources but not the example or dist.
It hangs on resolve and after a while prints the following: resolve: [ivy:retrieve] [ivy:retrieve] :: problems summary :: [ivy:retrieve] WARNINGS [ivy:retrieve] module not found: com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] local: tried [ivy:retrieve] /home/markus/.ivy2/local/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.0.4/ivys/ivy.xml [ivy:retrieve]-- artifact com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4!randomizedtesting-runner.jar: [ivy:retrieve] /home/markus/.ivy2/local/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.0.4/jars/randomizedtesting-runner.jar [ivy:retrieve] shared: tried [ivy:retrieve] /home/markus/.ivy2/shared/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.0.4/ivys/ivy.xml [ivy:retrieve]-- artifact com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4!randomizedtesting-runner.jar: [ivy:retrieve] /home/markus/.ivy2/shared/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.0.4/jars/randomizedtesting-runner.jar [ivy:retrieve] public: tried [ivy:retrieve] http://repo1.maven.org/maven2/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.0.4/randomizedtesting-runner-2.0.4.pom [ivy:retrieve] sonatype-releases: tried [ivy:retrieve] http://oss.sonatype.org/content/repositories/releases/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.0.4/randomizedtesting-runner-2.0.4.pom [ivy:retrieve] working-chinese-mirror: tried [ivy:retrieve] http://mirror.netcologne.de/maven2/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.0.4/randomizedtesting-runner-2.0.4.pom [ivy:retrieve]-- artifact com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4!randomizedtesting-runner.jar: [ivy:retrieve] http://mirror.netcologne.de/maven2/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.0.4/randomizedtesting-runner-2.0.4.jar [ivy:retrieve] :: [ivy:retrieve] :: UNRESOLVED DEPENDENCIES :: [ivy:retrieve] 
:: [ivy:retrieve] :: com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4: not found [ivy:retrieve] :: [ivy:retrieve] ERRORS [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] [ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE
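Robert's 'find' suggestion amounts to locating and deleting stale .lck files under the Ivy cache (the shell equivalent would be find ~/.ivy2 -name '*.lck' -delete, run only when nothing else is using Ivy). A Python sketch of the same cleanup, demonstrated on a throwaway directory standing in for ~/.ivy2 rather than a real cache:

```python
import pathlib
import tempfile

def remove_lock_files(cache_dir):
    """Delete every Ivy .lck file under cache_dir; return how many were removed."""
    removed = 0
    for lck in pathlib.Path(cache_dir).rglob("*.lck"):
        lck.unlink()
        removed += 1
    return removed

# Build a tiny fake cache: one stale lock file next to a real artifact.
cache = pathlib.Path(tempfile.mkdtemp())
(cache / "cache" / "com.example").mkdir(parents=True)
(cache / "cache" / "com.example" / "ivy.xml.lck").touch()
(cache / "cache" / "com.example" / "artifact.jar").touch()

removed = remove_lock_files(cache)
print(removed)  # only the .lck file is removed; the jar stays
```

Only the lock files are touched, so the downloaded jars survive and the next build does not have to re-fetch everything, unlike deleting ~/.ivy2 wholesale.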
Re: Unable to build trunk
It's not wonky. You just have to ensure you have nothing else (like some IDE, or a build somewhere else) using Ivy; then it's safe to remove the .lck file there. I turned on this locking so that it hangs instead of causing cache corruption, but Ivy only has a SimpleLockFactory, so if you ^C at the wrong time, it might leave a .lck file. On Tue, Oct 30, 2012 at 9:27 AM, Erick Erickson erickerick...@gmail.com wrote: Not sure if it's relevant, but sometimes the ivy caches are wonky. Try deleting (on OS X) ~/.ivy2 recursively and building again? Of course your next build will download a bunch of jars... FWIW, Erick On Tue, Oct 30, 2012 at 5:38 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, Since yesterday we're unable to build trunk, and also a clean checkout from trunk. We can compile the sources but not the example or dist. It hangs on resolve and after a while prints the following: resolve: [ivy:retrieve] [ivy:retrieve] :: problems summary :: [ivy:retrieve] WARNINGS [ivy:retrieve] module not found: com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] local: tried [ivy:retrieve] /home/markus/.ivy2/local/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.0.4/ivys/ivy.xml [ivy:retrieve]-- artifact com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4!randomizedtesting-runner.jar: [ivy:retrieve] /home/markus/.ivy2/local/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.0.4/jars/randomizedtesting-runner.jar [ivy:retrieve] shared: tried [ivy:retrieve] /home/markus/.ivy2/shared/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.0.4/ivys/ivy.xml [ivy:retrieve]-- artifact com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4!randomizedtesting-runner.jar: [ivy:retrieve] /home/markus/.ivy2/shared/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.0.4/jars/randomizedtesting-runner.jar [ivy:retrieve] public: tried [ivy:retrieve]
http://repo1.maven.org/maven2/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.0.4/randomizedtesting-runner-2.0.4.pom [ivy:retrieve] sonatype-releases: tried [ivy:retrieve] http://oss.sonatype.org/content/repositories/releases/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.0.4/randomizedtesting-runner-2.0.4.pom [ivy:retrieve] working-chinese-mirror: tried [ivy:retrieve] http://mirror.netcologne.de/maven2/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.0.4/randomizedtesting-runner-2.0.4.pom [ivy:retrieve]-- artifact com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4!randomizedtesting-runner.jar: [ivy:retrieve] http://mirror.netcologne.de/maven2/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.0.4/randomizedtesting-runner-2.0.4.jar [ivy:retrieve] :: [ivy:retrieve] :: UNRESOLVED DEPENDENCIES :: [ivy:retrieve] :: [ivy:retrieve] :: com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4: not found [ivy:retrieve] :: [ivy:retrieve] ERRORS [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] impossible to acquire lock for com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] impossible to acquire lock for 
com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4 [ivy:retrieve] [ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS BUILD FAILED /home/markus/src/solr/trunk/solr/build.xml:336: The following error occurred while executing this line: /home/markus/src/solr/trunk/solr/common-build.xml:345: The following error occurred while executing this line: /home/markus/src/solr/trunk/solr/common-build.xml:388: The following error occurred while executing this line: /home/markus/src/solr/trunk/lucene/common-build.xml:316: impossible to resolve dependencies: resolve failed - see output for details Total time: 18 minutes 19 seconds As you can
Re: Improving performance for use-case where large (200) number of phrase queries are used?
On Wed, Oct 24, 2012 at 11:09 AM, Aaron Daubman daub...@gmail.com wrote: Greetings, We have a Solr instance in use that gets some perhaps atypical queries and suffers from poor (2 second) QTimes. Documents (~2,350,000) in this instance are mainly comprised of various descriptive fields, such as multi-word (phrase) tags; an average document contains 200-400 phrases like this across several different multi-valued field types. A custom QueryComponent has been built that functions somewhat like a very specific MoreLikeThis. A seed document is specified via the incoming query; its terms are retrieved, boosted both by query parameters and by fields within the document that specify term weighting, and sorted by this custom boosting; then a second query is crafted by taking the top 200 (sorted by the custom boosting) resulting field values, paired with their fields, and searching for documents matching these 200 values. A few more ideas: * Use shingles, e.g. to turn two-word phrases into single terms (how long is your average phrase?). * In addition to the above, for phrases with 2 terms, maybe consider just a boolean conjunction of the shingled phrases instead of a real phrase query: e.g. "more like this" becomes (more_like AND like_this). This would have some false positives. * Use a more aggressive stopwords list for your MorePhrasesLikeThis. * Reduce this number 200, and instead work harder to prune out which phrases are the most descriptive from the seed document, e.g. based on heuristics like their frequency or location within that seed document, so your query isn't so massive.
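The shingle idea from the first bullet can be expressed in the schema with ShingleFilterFactory. A sketch only; the field type name, tokenizer choice, and separator are illustrative, not taken from the thread:

```xml
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "more like this" yields more_like and like_this, so a two-word
         phrase match becomes a conjunction of single terms instead of a
         positional phrase query -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="2" outputUnigrams="true" tokenSeparator="_"/>
  </analyzer>
</fieldType>
```

The trade-off is index size: every adjacent word pair becomes an extra term, which is usually acceptable for short tag-like fields such as the ones described above.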
Re: ICUTokenizer ArrayIndexOutOfBounds
Calling reset() is a mandatory part of the consumer lifecycle before calling incrementToken(); see: https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html A lot of people don't consume these correctly; that's why these tokenizers now try to throw exceptions if you do it wrong, rather than silently producing wrong results. If you really want to test that your consumer code (queryparser, whatever) is doing this correctly, test your code with MockTokenizer/MockAnalyzer in the test-framework package. That has a little state machine with a lot more checks. On Wed, Oct 17, 2012 at 6:56 AM, Shane Perry thry...@gmail.com wrote: Hi, I've been playing around with using the ICUTokenizer from 4.0.0. Using the code below, I was receiving an ArrayIndexOutOfBounds exception on the call to tokenizer.incrementToken(). Looking at the ICUTokenizer source, I can see why this is occurring (usableLength defaults to -1). ICUTokenizer tokenizer = new ICUTokenizer(myReader); CharTermAttribute termAtt = tokenizer.getAttribute(CharTermAttribute.class); while (tokenizer.incrementToken()) { System.out.println(termAtt.toString()); } After poking around a little more, I found that I can just call tokenizer.reset() (which initializes usableLength to 0) right after constructing the object (org.apache.lucene.analysis.icu.segmentation.TestICUTokenizer does a similar step in its super class). I was wondering if someone could explain why I need to call tokenizer.reset() prior to using the tokenizer for the first time. Thanks in advance, Shane
[ANNOUNCE] Apache Solr 4.0 released.
October 12 2012, Apache Solr™ 4.0 available. The Lucene PMC is pleased to announce the release of Apache Solr 4.0. Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 4.0 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html See the CHANGES.txt file included with the release for a full list of details. Solr 4.0 Release Highlights: The largest set of features goes by the development code-name SolrCloud and involves bringing easy scalability to Solr. See http://wiki.apache.org/solr/SolrCloud for more details. * Distributed indexing designed from the ground up for near real-time (NRT) and NoSQL features such as realtime-get, optimistic locking, and durable updates. * High availability with no single points of failure. * Apache Zookeeper integration for distributed coordination and cluster metadata and configuration storage. * Immunity to split-brain issues due to Zookeeper's Paxos distributed consensus protocols. * Updates sent to any node in the cluster are automatically forwarded to the correct shard and replicated to multiple nodes for redundancy. * Queries sent to any node automatically perform a full distributed search across the cluster with load balancing and fail-over. * A collection management API. * Smart SolrJ client (CloudSolrServer) that knows to send documents only to the shard leaders. Solr 4.0 includes more NoSQL features for those using Solr as a primary data store: * Update durability – A transaction log ensures that even uncommitted documents are never lost.
* Real-time Get – The ability to quickly retrieve the latest version of a document, without the need to commit or open a new searcher * Versioning and Optimistic Locking – combined with real-time get, this allows read-update-write functionality that ensures no conflicting changes were made concurrently by other clients. * Atomic updates - the ability to add, remove, change, and increment fields of an existing document without having to send in the complete document again. Many additional improvements include: * New spatial field types with polygon support. * Pivot Faceting – Multi-level or hierarchical faceting where the top constraints for one field are found for each top constraint of a different field. * Pseudo-fields – The ability to alias fields, or to add metadata along with returned documents, such as function query values and results of spatial distance calculations. * A spell checker implementation that can work directly from the main index instead of creating a sidecar index. * Pseudo-Join functionality – The ability to select a set of documents based on their relationship to a second set of documents. * Function query enhancements including conditional function queries and relevancy functions. * New update processors to facilitate modifying documents prior to indexing. * A brand new web admin interface, including support for SolrCloud and improved error reporting * Numerous bug fixes and optimizations. Noteworthy changes since 4.0-BETA: * New spatial field types with polygon support. * Various Admin UI improvements. * SolrCloud related performance optimizations in writing the transaction log, PeerSync recovery, Leader election, and ClusterState caching. * Numerous bug fixes and optimizations. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. 
It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Happy searching, Apache Lucene/Solr Developers
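The atomic update and optimistic locking features listed in the announcement above can be sketched via the JSON payload a client might send to Solr's update handler. This is a minimal illustration only; the document id, field names, and `_version_` value are hypothetical.

```python
import json

# Atomic update: modify individual fields of an existing document without
# resending the whole document. "set" and "inc" are Solr 4 atomic-update
# operations; including "_version_" enables optimistic locking (the update
# is rejected if the stored version no longer matches).
update = [{
    "id": "book-1",                 # hypothetical document id
    "price": {"set": 12.99},        # replace the field value
    "popularity": {"inc": 1},       # increment a numeric field
    "_version_": 1234567890,        # expected current version (hypothetical)
}]

payload = json.dumps(update)
print(payload)
```

A client would POST this payload to the collection's `/update` endpoint with a JSON content type; if `_version_` does not match the latest stored version, Solr returns a conflict instead of silently overwriting concurrent changes.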
Re: Using additional dictionary with DirectSolrSpellChecker
On Wed, Oct 10, 2012 at 9:02 AM, O. Klein kl...@octoweb.nl wrote: I don't want to tweak the threshold. For majority of cases it works fine. It's for cases where term has low frequency but is spelled correctly. If you lower the threshold you would also get incorrect spelled terms as suggestions. Yeah there is no real magic here when the corpus contains typos. This existing docFreq heuristic was just borrowed from the old index-based spellchecker. I do wonder if using # of occurrences (totalTermFreq) instead of # of documents with the term (docFreq) would improve the heuristic. In all cases I think if you want to also integrate a dictionary or something, it seems like this could somehow be done with the File-based spellchecker?
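The docFreq vs. totalTermFreq distinction discussed above can be shown with a toy corpus (made-up documents, not the actual spellchecker code):

```python
from collections import Counter

# Toy corpus: each entry is one document's token list (hypothetical data).
docs = [
    ["the", "solr", "search", "server"],
    ["solr", "solr", "solr", "index"],   # "solr" repeats within one doc
    ["teh", "search", "index"],          # "teh" is a typo
]

doc_freq = Counter()          # number of documents containing the term
total_term_freq = Counter()   # total occurrences across all documents
for doc in docs:
    total_term_freq.update(doc)
    doc_freq.update(set(doc))

# docFreq counts a term repeated inside one document only once;
# totalTermFreq rewards terms that occur often overall.
print(doc_freq["solr"], total_term_freq["solr"])  # 2 vs 4
print(doc_freq["teh"], total_term_freq["teh"])    # 1 vs 1
```

A frequency-based heuristic built on either statistic still cannot distinguish a rare-but-correct term from a typo with the same counts, which is why a supplemental dictionary (e.g. via the file-based spellchecker) is attractive.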
Re: Indexing in Solr: invalid UTF-8
On Tue, Sep 25, 2012 at 2:02 PM, Patrick Oliver Glauner patrick.oliver.glau...@cern.ch wrote: Hi Thanks. But I see that 0xd835 is missing in this list (see my exceptions). What's the best way to get rid of all of them in Python? I am new to unicode in Python but I am sure that this use case is quite frequent. I don't really know python either: so I could be wrong here but are you just taking these binary .PDF and .DOC files and treating them as UTF-8 text and sending them to Solr? If so, I don't think that will work very well. Maybe instead try parsing these binary files with something like Tika to get at the actual content and send that? (it seems some people have developed python integration for this, e.g. http://redmine.djity.net/projects/pythontika/wiki)
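For reference, 0xD835 is a UTF-16 surrogate code unit (U+D800–U+DFFF), which is never valid on its own in UTF-8. A minimal Python sketch (illustrative only, not the poster's actual pipeline; the function name is made up) that strips lone surrogates from extracted text before sending it to Solr:

```python
def clean_for_solr(text: str) -> str:
    """Drop surrogate code points (U+D800-U+DFFF), which are invalid
    in UTF-8/XML and will be rejected when posting to Solr."""
    return "".join(ch for ch in text if not 0xD800 <= ord(ch) <= 0xDFFF)

# Text mechanically extracted from binary files can contain lone surrogates:
dirty = "math symbol \ud835 leftover"
print(clean_for_solr(dirty))  # surrogate removed, rest of the text kept
```

That said, per the advice above, the better fix is to run the binary PDF/DOC files through a real content extractor such as Tika rather than treating their raw bytes as text.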
Re: SOLR memory usage jump in JVM
On Thu, Sep 20, 2012 at 3:09 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: By the way while looking for upgrading to JDK7, the release notes say under section known issues about the PorterStemmer bug: ...The recommended workaround is to specify -XX:-UseLoopPredicate on the command line. Is this still not fixed, or won't fix? How in the world can we fix it? Oracle released a broken java version: there's nothing we can do about that. Go take it up with them. -- lucidworks.com
Re: Solr - Lucene Debuging help
On Mon, Sep 10, 2012 at 4:43 PM, BadalChhatbar badal...@yahoo.com wrote: Steve, Those document tips didn't help. errors i m getting are like (_TestUtil cannot be resolved). Did you do these two steps: 1. ant eclipse 2. refresh your project -- lucidworks.com
Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor
On Fri, Sep 7, 2012 at 2:19 PM, Tom Burton-West tburt...@umich.edu wrote: Thanks Robert, I'll have to spend some time understanding the default codec for Solr 4.0. Did I miss something in the changes file? http://lucene.apache.org/core/4_0_0-BETA/ see the file formats section, especially http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termdictionary (since blocktree covers term dictionary and terms index) I'll be digging into the default codec docs and testing sometime in next week or two (with a 2 billion term index) If I understand it well enough, I'll be happy to draft some changes up for either the wiki or the Solr example solrconfig.xml file. Right, I think we should remove these parameters. Does this mean that the default codec will reduce memory use for the terms index enough so I don't need to use either of these settings to deal with my 2 billion term indexes? Probably. I don't know enough about your terms or how much RAM you have to say for sure. If not, just customize blocktree's params with a CodecFactory in solr, or even pick another implementation (FixedGap, VariableGap, whatever). The interval/divisor stuff is mostly only useful if you are not reindexing from scratch: e.g. if you are gonna plop your 3.x index into 4.x then you should set those to whatever you were using before, since it will be using PreflexCodec to read those. -- lucidworks.com
[ANNOUNCE] Apache Solr 4.0-beta released.
14 August 2012, Apache Solr™ 4.0-beta available The Lucene PMC is pleased to announce the release of Apache Solr 4.0-beta. Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 4.0-beta is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html?ver=4.0b See the CHANGES.txt file included with the release for a full list of details. Highlights of changes since 4.0-alpha: * Added a Collection management API for Solr Cloud. * Solr Admin UI now clearly displays failures related to initializing SolrCores * Updatable documents can create a document if it doesn't already exist, or you can force that the document must already exist. * Full delete-by-query support for Solr Cloud. * Default to NRTCachingDirectory for improved near-realtime performance. * Improved Solrj client performance with Solr Cloud: updates are only sent to leaders by default. * Various other API changes, optimizations and bug fixes. This is a beta for early adopters. The guarantee for this beta release is that the index format will be the 4.0 index format, supported through the 5.x series of Lucene/Solr, unless there is a critical bug (e.g. that would cause index corruption) that would prevent this. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Happy searching, Lucene/Solr developers
Re: how to retrieve total token count per collection/index
On Thu, Aug 9, 2012 at 10:20 AM, tech.vronk t...@vronk.net wrote: Hello, I wonder how to figure out the total token count in a collection (per index), i.e. the size of a corpus/collection measured in tokens. You want to use this statistic, which tells you number of tokens for an indexed field: http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/index/Terms.html#getSumTotalTermFreq%28%29 -- lucidimagination.com
Re: how to retrieve total token count per collection/index
On Thu, Aug 9, 2012 at 4:24 PM, tech.vronk t...@vronk.net wrote: Is there any 3.6 equivalent for this, before I install and run 4.0? I can't seem to find a corresponding class (org.apache.lucene.index.Terms) in 3.6. Unfortunately 3.6 does not carry this statistic: there is really no clear delineation of 'field' in 3.x; all the terms are just one big sorted list of field+term, so there are no field-level statistics at all! These are new in 4.0 -- lucidimagination.com
Re: Highlighting error InvalidTokenOffsetsException: Token oedipus exceeds length of provided text sized 11
On Fri, Aug 3, 2012 at 12:38 AM, Justin Engelman jus...@smalldemons.com wrote: I have an autocomplete index that I return highlighting information for but am getting an error with certain search strings and fields on Solr 3.5. try the 3.6 release: * LUCENE-3642, SOLR-2891, LUCENE-3717: Fixed bugs in CharTokenizer, n-gram tokenizers/filters, compound token filters, thai word filter, icutokenizer, pattern analyzer, wikipediatokenizer, and smart chinese where they would create invalid offsets in some situations, leading to problems in highlighting. -- lucidimagination.com
Re: Using Solr-319 with Solr 3.6.0
On Fri, Aug 3, 2012 at 12:57 PM, Himanshu Jindal himanshujin...@gmail.com wrote: filter class=solr.SynonymFilterFactory synonyms=synonyms_ja.txt ignoreCase=true expand=true tokenFactory=solr.JapaneseTokenizerFactory randomAttribute=randomValue/ I think you have a typo here, it should be tokenizerFactory, not tokenFactory -- lucidimagination.com
Re: Memory leak?? with CloseableThreadLocal with use of Snowball Filter
On Thu, Aug 2, 2012 at 3:13 AM, Laurent Vaills laurent.vai...@gmail.com wrote: Hi everyone, Is there any chance to get this backported for a 3.6.2 ? Hello, I personally have no problem with it: but it's really technically not a bugfix, just an optimization. It also doesn't solve the actual problem if you have a tomcat threadpool configuration recycling threads too fast. There will be other performance problems. -- lucidimagination.com
Re: Memory leak?? with CloseableThreadLocal with use of Snowball Filter
On Tue, Jul 31, 2012 at 2:34 PM, roz dev rozde...@gmail.com wrote: Hi All I am using Solr 4 from trunk and using it with Tomcat 6. I am noticing that when we are indexing lots of data with 16 concurrent threads, Heap grows continuously. It remains high and ultimately most of the stuff ends up being moved to Old Gen. Eventually, Old Gen also fills up and we start getting into excessive GC problem. Hi: I don't claim to know anything about how tomcat manages threads, but really you shouldn't have all these objects. In general snowball stemmers should be reused per-thread-per-field. But if you have a lot of fields*threads, especially if there really is high thread churn on tomcat, then this could be bad with snowball: see eks dev's comment on https://issues.apache.org/jira/browse/LUCENE-3841 I think it would be useful to see if you can tune tomcat's threadpool as he describes. Separately: Snowball stemmers are currently really ram-expensive for stupid reasons. Each one creates a ton of Among objects, e.g. an EnglishStemmer today is about 8KB. I'll regenerate these and open a JIRA issue: the snowball code generator in their svn was improved recently, and each one now takes about 64 bytes instead (the Among objects are static and reused). Still this won't really solve your problem, because the analysis chain could have other heavy parts in initialization, but it seems good to fix. As a workaround until then you can also just use the good old PorterStemmer (PorterStemFilterFactory in solr). It's not exactly the same as using Snowball(English) but it's pretty close and also much faster. -- lucidimagination.com
Re: ICUCollation throws exception
) Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/filter: class org.apache.solr.schema.ICUCollationField at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:168) at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:356) at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95) at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:142) ... 34 more Caused by: java.lang.ClassCastException: class org.apache.solr.schema.ICUCollationField at java.lang.Class.asSubclass(Class.java:3018) at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:409) at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:430) at org.apache.solr.util.plugin.AbstractPluginLoader.create(AbstractPluginLoader.java:86) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:142) ... 38 more Jul 16, 2012 5:27:48 PM org.apache.solr.core.CoreContainer create INFO: Creating SolrCore 'viaf' using instanceDir: /usr/local/swissbib/solr.versions/configs/current.home/viaf Jul 16, 2012 5:27:48 PM org.apache.solr.core.SolrResourceLoader init **end of Exception*** 2012/7/21 Robert Muir rcm...@gmail.com Can you include the entire exception? This is really necessary! On Tue, Jul 17, 2012 at 2:58 AM, Oliver Schihin oliver.schi...@unibas.ch wrote: Hello According to release notes from 4.0.0-ALPHA, SOLR-2396, I replaced ICUCollationKeyFilterFactory with ICUCollationField in our schema. 
But this throws an exception, see the following excerpt from the log: Jul 16, 2012 5:27:48 PM org.apache.solr.common.SolrException log SEVERE: null:org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType alphaOnlySort: Pl ugin init failure for [schema.xml] analyzer/filter: class org.apache.solr.schema.ICUCollationField at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:168) at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:359) The deprecated filter of ICUCollationKeyFilterFactory is working without any problem. This is how I did the schema (with the deprecated filter): !-- field type for sort strings -- fieldType name=alphaOnlySort class=solr.TextField sortMissingLast=true omitNorms=true analyzer tokenizer class=solr.KeywordTokenizerFactory/ filter class=solr.ICUCollationKeyFilterFactory locale=de@collation=phonebook strength=primary / /analyzer /fieldType Do I have to replace jars in /contrib/analysis-extras/, or any other hints of what might be wrong in my install and configuration? Thanks a lot Oliver -- lucidimagination.com -- lucidimagination.com
Re: ICUCollation throws exception
Can you include the entire exception? This is really necessary! On Tue, Jul 17, 2012 at 2:58 AM, Oliver Schihin oliver.schi...@unibas.ch wrote: Hello According to release notes from 4.0.0-ALPHA, SOLR-2396, I replaced ICUCollationKeyFilterFactory with ICUCollationField in our schema. But this throws an exception, see the following excerpt from the log: Jul 16, 2012 5:27:48 PM org.apache.solr.common.SolrException log SEVERE: null:org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType alphaOnlySort: Pl ugin init failure for [schema.xml] analyzer/filter: class org.apache.solr.schema.ICUCollationField at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:168) at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:359) The deprecated filter of ICUCollationKeyFilterFactory is working without any problem. This is how I did the schema (with the deprecated filter): !-- field type for sort strings -- fieldType name=alphaOnlySort class=solr.TextField sortMissingLast=true omitNorms=true analyzer tokenizer class=solr.KeywordTokenizerFactory/ filter class=solr.ICUCollationKeyFilterFactory locale=de@collation=phonebook strength=primary / /analyzer /fieldType Do I have to replace jars in /contrib/analysis-extras/, or any other hints of what might be wrong in my install and configuration? Thanks a lot Oliver -- lucidimagination.com
Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document
On Thu, Jul 19, 2012 at 12:10 AM, Aaron Daubman daub...@gmail.com wrote: Greetings, I've been digging in to this for two days now and have come up short - hopefully there is some simple answer I am just not seeing: I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as identically as possible (given deprecations) and indexing the same document. Why did you do this? If you want the exact same scoring, use the exact same analysis. This means specifying luceneMatchVersion = 2.9, and the exact same analysis components (even if deprecated). I have taken the field values for the example below and run them through /admin/analysis.jsp on each solr instance. Even for the problematic docs/fields, the results are almost identical. For the example below, the t_tag values for the problematic doc: 1.4.1: 162 values 3.6.0: 164 values This is why: you changed your analysis. -- lucidimagination.com
Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document
On Thu, Jul 19, 2012 at 11:11 AM, Aaron Daubman daub...@gmail.com wrote: Apologies if I didn't clearly state my goal/concern: I am not looking for the exact same scoring - I am looking to explain scoring differences. Deprecated components will eventually go away, time moves on, etc... etc... I would like to be able to run current code, and should be able to - the part that is sticking is being able to *explain* the difference in results. OK: I totally missed that, sorry! To explain why you see such a large difference: these length normalizations are computed at index time and fit inside a *single byte* by default. This is to keep RAM usage low for many documents and many fields with norms (since it's #fieldsWithNorms * #documents bytes in RAM). So this is lossy: basically you can think of there being only 256 possible values. So when you increased the number of terms only slightly by changing your analysis, this happened to bump you over the edge, rounding you up to the next value. More information: http://lucene.apache.org/core/3_6_0/scoring.html http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html By the way, if you don't like this: 1. if you can still live with a single byte, maybe plug in your own Similarity class into 3.6, overriding decodeNormValue/encodeNormValue. For example, you could use a different SmallFloat configuration that has less range but more precision for your use case (if your docs are all short or whatever) 2. otherwise, if you feel you need more than a single byte, check out 4.0-ALPHA: you aren't limited to a single byte there. -- lucidimagination.com
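The lossiness described above can be illustrated with a simplified single-byte quantization. This is not Lucene's actual SmallFloat encoding, just a sketch of the idea that nearby field lengths can round to different one-byte values:

```python
import math

def length_norm(num_terms: int) -> float:
    # Classic Lucene-style length normalization: 1/sqrt(#terms).
    return 1.0 / math.sqrt(num_terms)

def encode_byte(norm: float) -> int:
    # Simplified lossy one-byte encoding: only 256 representable values.
    # (Lucene's real SmallFloat float-to-byte encoding differs.)
    return min(255, int(norm * 256))

def decode_byte(b: int) -> float:
    return b / 256.0

# Nearby field lengths may round to the same byte... or just miss:
for n in (160, 161, 162, 163, 164, 165):
    b = encode_byte(length_norm(n))
    print(n, b, decode_byte(b))
```

With this particular quantization, 162 terms and 164 terms land in different buckets, so two nearly identical analyses can produce visibly different fieldNorm values, which mirrors the 162-vs-164-token situation in the thread.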
Re: Solr 4.0 IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
On Tue, Jul 10, 2012 at 3:11 AM, Vadim Kisselmann v.kisselm...@gmail.com wrote: Hi folks, my Test-Server with Solr 4.0 from trunk(version 1292064 from late february) throws this exception... Can you run Lucene's checkIndex tool on your index? If that is clean, can you try a newer version? This could be a number of things, including something already fixed. auto commit error...:java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2650) at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2804) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2786) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:391) at org.apache.solr.update.CommitTracker.run(CommitTracker.java:197) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Do you have another exception in your logs? To my knowledge, in all cases that IndexWriter throws an OutOfMemoryError, the original OutOfMemoryError is also rethrown (not just this IllegalStateException noting that at some point, it hit OOM). My Server has 24GB RAM, 8GB for JVM. I index round about 20 docs per seconds, my index is small with 10Mio docs. It runs about a couple of weeks and then suddenly i get this errors.. I can't see any problems in VisualVM with my GC. 
It's all ok, memory consumption is about 6GB, no swapping, no i/o problems..it's all green:) What's going on on this machine?:) My uncommitted docs are gone, right? Yes, your commit failed. -- lucidimagination.com
Re: problem adding new fields in DIH
Hello, This is because Solr's Codec implementation defers to the schema, to determine how the field should be indexed. When a core is reloaded, the IndexWriter is not closed but the existing writer is kept around: so you are basically trying to index to the old version of schema before the reload. I feel like we should fix this, but I only have two ideas: 1. turn off per-field codec support by default, so that if you want to e.g. set a field to use MemoryPostingsFormat or Pulsing, you must explicitly enable a per-field codec configuration in solrconfig.xml. This would parallel how Similarity works, and is probably ok since this is pretty expert stuff. Then you would have no issues, but if someone wanted per-field codec support they would have to make the tradeoff that reloading a core still leaves them indexing with the old configuration. 2. close and reopen the indexwriter on core reloads. On Mon, Jul 9, 2012 at 3:36 PM, Brent Mills bmi...@uship.com wrote: We're having an issue when we add or change a field in the db-data-config.xml and schema.xml files in solr. Basically whenever I add something new to index I add it to the database, then the data config, then add the field to the schema to index, reload the core, and do a full import. This has worked fine until we upgraded to an iteration of 4.0 (we are currently on 4.0 alpha). Now sometimes when we go through this process solr throws errors about the field not being found. The only way to fix this is to restart tomcat and everything immediately starts working fine again. The interesting thing is that this is only a problem if the database is returning a value for that field and only in the documents that have a value. The field shows up in the schema browser in solr, it just has no data in it. If I completely remove it from the database but leave it in the schema and dataconfig files there is no issue. Also of note, this is happening on 2 different machines. Here's the trace SEVERE: Exception while solr commit. 
java.lang.IllegalArgumentException: no such field test at org.apache.solr.core.DefaultCodecFactory$1.getPostingsFormatForField(DefaultCodecFactory.java:49) at org.apache.lucene.codecs.lucene40.Lucene40Codec$1.getPostingsFormatForField(Lucene40Codec.java:52) at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.addField(PerFieldPostingsFormat.java:94) at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:335) at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85) at org.apache.lucene.index.TermsHash.flush(TermsHash.java:117) at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53) at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:82) at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:480) at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:422) at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:554) at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2547) at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2683) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2663) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:414) at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:82) at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:919) at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:154) at org.apache.solr.handler.dataimport.SolrWriter.commit(SolrWriter.java:107) at org.apache.solr.handler.dataimport.DocBuilder.finish(DocBuilder.java:304) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:256) at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:399) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:380) -- lucidimagination.com
Re: problem adding new fields in DIH
Thanks again for reporting this Brent. I opened a JIRA issue: https://issues.apache.org/jira/browse/SOLR-3610 On Mon, Jul 9, 2012 at 3:36 PM, Brent Mills bmi...@uship.com wrote: We're having an issue when we add or change a field in the db-data-config.xml and schema.xml files in solr. Basically whenever I add something new to index I add it to the database, then the data config, then add the field to the schema to index, reload the core, and do a full import. This has worked fine until we upgraded to an iteration of 4.0 (we are currently on 4.0 alpha). Now sometimes when we go through this process solr throws errors about the field not being found. The only way to fix this is to restart tomcat and everything immediately starts working fine again. The interesting thing is that this is only a problem if the database is returning a value for that field and only in the documents that have a value. The field shows up in the schema browser in solr, it just has no data in it. If I completely remove it from the database but leave it in the schema and dataconfig files there is no issue. Also of note, this is happening on 2 different machines. Here's the trace SEVERE: Exception while solr commit. 
java.lang.IllegalArgumentException: no such field test at org.apache.solr.core.DefaultCodecFactory$1.getPostingsFormatForField(DefaultCodecFactory.java:49) at org.apache.lucene.codecs.lucene40.Lucene40Codec$1.getPostingsFormatForField(Lucene40Codec.java:52) at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.addField(PerFieldPostingsFormat.java:94) at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:335) at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85) at org.apache.lucene.index.TermsHash.flush(TermsHash.java:117) at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53) at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:82) at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:480) at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:422) at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:554) at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2547) at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2683) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2663) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:414) at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:82) at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:919) at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:154) at org.apache.solr.handler.dataimport.SolrWriter.commit(SolrWriter.java:107) at org.apache.solr.handler.dataimport.DocBuilder.finish(DocBuilder.java:304) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:256) at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:399) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:380) -- lucidimagination.com
[ANNOUNCE] Apache Solr 4.0-alpha released.
3 July 2012, Apache Solr™ 4.0-alpha available The Lucene PMC is pleased to announce the release of Apache Solr 4.0-alpha. Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 4.0-alpha is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html?ver=4.0a See the CHANGES.txt file included with the release for a full list of details. Solr 4.0-alpha Release Highlights: The largest set of features goes by the development code-name “Solr Cloud” and involves bringing easy scalability to Solr. See http://wiki.apache.org/solr/SolrCloud for more details. * Distributed indexing designed from the ground up for near real-time (NRT) and NoSQL features such as realtime-get, optimistic locking, and durable updates. * High availability with no single points of failure. * Apache Zookeeper integration for distributed coordination and cluster metadata and configuration storage. * Immunity to split-brain issues due to Zookeeper's Paxos distributed consensus protocols. * Updates sent to any node in the cluster are automatically forwarded to the correct shard and replicated to multiple nodes for redundancy. * Queries sent to any node automatically perform a full distributed search across the cluster with load balancing and fail-over. Solr 4.0-alpha includes more NoSQL features for those using Solr as a primary data store: * Update durability – A transaction log ensures that even uncommitted documents are never lost. 
* Real-time Get – The ability to quickly retrieve the latest version of a document, without the need to commit or open a new searcher * Versioning and Optimistic Locking – combined with real-time get, this allows read-update-write functionality that ensures no conflicting changes were made concurrently by other clients. * Atomic updates - the ability to add, remove, change, and increment fields of an existing document without having to send in the complete document again. There are many other features coming in Solr 4, such as * Pivot Faceting – Multi-level or hierarchical faceting where the top constraints for one field are found for each top constraint of a different field. * Pseudo-fields – The ability to alias fields, or to add metadata along with returned documents, such as function query values and results of spatial distance calculations. * A spell checker implementation that can work directly from the main index instead of creating a sidecar index. * Pseudo-Join functionality – The ability to select a set of documents based on their relationship to a second set of documents. * Function query enhancements including conditional function queries and relevancy functions. * New update processors to facilitate modifying documents prior to indexing. * A brand new web admin interface, including support for SolrCloud. This is an alpha release for early adopters. The guarantee for this alpha release is that the index format will be the 4.0 index format, supported through the 5.x series of Lucene/Solr, unless there is a critical bug (e.g. that would cause index corruption) that would prevent this. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Happy searching, Lucene/Solr developers
Re: Exception when optimizing index
On Thu, Jun 7, 2012 at 5:50 AM, Rok Rejc rokrej...@gmail.com wrote: - java.runtime.name = OpenJDK Runtime Environment - java.runtime.version = 1.6.0_22-b22 ... As far as I see from the JIRA issue, I have the patch attached (as mentioned, I have a trunk version from May 12). Any ideas? It's not guaranteed that the patch will work around all hotspot bugs related to http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5091921 Since you can reproduce, is it possible for you to re-test the scenario with a newer JVM (e.g. 1.7.0_04) just to rule that out? -- lucidimagination.com
Re: Solr1.4 and threads ....
On Wed, Jun 13, 2012 at 4:38 PM, Benson Margulies bimargul...@gmail.com wrote: Does this suggest anything to anyone? Other than that we've misanalyzed the logic in the tokenizer and there's a way to make it burp on one thread? It might suggest the different TokenStream instances refer to some shared object that is not thread-safe: we had bugs like this before (e.g. sharing a JDK Collator is OK, but ICU ones are not thread-safe, so you must clone them). Because of this we beefed up our base analysis class (http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/lucene/test-framework/src/java/org/apache/lucene/analysis/BaseTokenStreamTestCase.java) to find thread-safety bugs like this. I recommend just grabbing the test-framework.jar (we release it as an artifact), extending that class, and writing a test like: public void testRandomStrings() throws Exception { checkRandomData(random, analyzer, 10); } (or use the one in the branch, it's even been improved since 3.6) -- lucidimagination.com
Re: per-fieldtype similarity not working
On Fri, Jun 8, 2012 at 5:04 AM, Markus Jelsma markus.jel...@openindex.io wrote: Thanks Robert, The difference in scores is clear now so it shouldn't matter, as queryNorm doesn't affect ranking but coord does. Can you explain why coord is left out now, why it is considered to skew results, and why queryNorm skews results? And which specific new ranking algorithms do they confuse, BM25F? I think it's easiest to compare the two TF normalization functions. DefaultSimilarity really needs something like this because its function (sqrt) grows very fast for a single term. On the other hand, consider BM25's: tf/(tf+lengthNorm). It saturates rather quickly for a single term, so when multiple terms are being scored, huge numbers of occurrences of a single term won't dominate the overall score. You can see this visually here (give it a second to load, and imagine documentLength = averageDocumentLength and k=1.2): http://www.wolframalpha.com/input/?i=plot+sqrt%28x%29%2C+x%2F%28x%2B1.2%29%2C+x%3D1+to+100 Also, I would expect the default SchemaSimilarityFactory to behave the same as DefaultSimilarity; this might raise some further confusion down the line. That's OK: I'd rather the very expert case (per-field scoring) be trickier than have a trap for people that try to use any algorithm other than TFIDFSimilarity. -- lucidimagination.com
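The saturation difference described above can be checked numerically; a quick sketch comparing the two tf normalizations (k=1.2 and documentLength = averageDocumentLength, as in the WolframAlpha plot):

```python
import math

def default_sim_tf(tf):
    # DefaultSimilarity's tf(): sqrt grows without bound
    return math.sqrt(tf)

def bm25_tf(tf, k=1.2):
    # BM25-style form: saturates below 1.0 for a single term
    return tf / (tf + k)

# sqrt keeps growing, while the BM25 form flattens out quickly,
# so a huge count of one term cannot dominate the overall score.
for tf in (1, 10, 100):
    print(tf, round(default_sim_tf(tf), 3), round(bm25_tf(tf), 3))
```

Going from tf=10 to tf=100 roughly triples the sqrt score but barely moves the BM25 score, which is exactly the "saturation" being described.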
Re: per-fieldtype similarity not working
On Fri, Jun 1, 2012 at 5:13 AM, Markus Jelsma markus.jel...@openindex.io wrote: Thanks but i am clearly missing something? We declare the similarity in the fieldType just as in the example and looking at the example again i don't see how it's being done differently. What am i missing and where do i miss it? :) Hi Markus, check out the last line at the bottom: <!-- default similarity, defers to the fieldType --> <similarity class="solr.SchemaSimilarityFactory"/> When this is set, it means IndexSearcher/IndexWriter use a PerFieldSimilarityWrapper that delegates based on the Solr schema fieldType. Note this is just a simple ordinary similarity impl (http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/search/similarities/SchemaSimilarityFactory.java), you could also write your own that works differently. -- lucidimagination.com
Re: per-fieldtype similarity not working
On Fri, Jun 1, 2012 at 11:39 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi! Ah, it makes sense now! This globally configured similarity now returns a fieldType-defined similarity if available, and if not, the standard Lucene similarity. This would, i assume, mean that the two defined similarities below without per-fieldType declared similarities would always yield the same results? Not true: note that two methods (coord and queryNorm) are not per-field but global across the entire query tree. By default these are disabled in the wrapper, as they respectively skew or confuse most modern scoring algorithms (e.g., all the new ranking algorithms in Lucene 4). So if you want to do per-field scoring where *all* of your sims are vector-space, it could make sense to customize (e.g. subclass) SchemaSimilarityFactory and do something useful for these methods. -- lucidimagination.com
Re: per-fieldtype similarity not working
On Thu, May 31, 2012 at 11:23 AM, Markus Jelsma markus.jel...@openindex.io wrote: We simply declare the following in our fieldType: <similarity class="FQCN"/> That's not enough, see the example: http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/schema-sim.xml -- lucidimagination.com
Re: boost not showing up in Solr 3.6 debugQueries?
On Thu, May 17, 2012 at 4:51 PM, Tom Burton-West tburt...@umich.edu wrote: But in Solr 3.6 I am not seeing the boost factor called out. On the other hand it looks like it may now be incorporated in the queryNorm (Please see example below). Is there a bug in Solr 3.6 debugQueries? Is there some new behavior regarding boosts and queryNorms? Or am I missing something obvious? Your queries are different: your first example is a simple TermQuery, the second example is a BooleanQuery. If you have a booleanquery(green frog) with a boost of 5, it incorporates its boost into the query norm passed down to its children. So when leaf nodes normalize their weight, it includes all the boosts from the parent hierarchy. You can see what I mean if you look at BooleanWeight.normalize(). Because of how this is done, 3.x's explain confusingly only shows the leaf node's explicit boost, since that's all it really knows. To see what I mean, try something like booleanquery(green^2 frog^3)^5. In 4.x these boosts are split apart and kept separate from the query norm, so we could actually improve the explanations here, I think. -- lucidimagination.com
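The boost propagation described above can be sketched as simple arithmetic for the booleanquery(green^2 frog^3)^5 case (a toy calculation of effective per-leaf boosts, not Lucene's actual Weight API):

```python
# Sketch: a parent query's boost multiplies into the norm its children see,
# so the effective boost at each leaf is the product of the leaf's own boost
# and every ancestor boost above it.

def effective_boosts(parent_boost, leaf_boosts):
    return {term: parent_boost * b for term, b in leaf_boosts.items()}

# booleanquery(green^2 frog^3)^5
boosts = effective_boosts(5.0, {"green": 2.0, "frog": 3.0})
# green is effectively boosted by 10, frog by 15 -- yet a 3.x explain
# would only show the leaf boosts (2 and 3), with the 5 hidden inside
# the query norm.
```

This is why the explain output looks like the boost "disappeared": the parent's factor of 5 is folded into the norm rather than shown on each leaf.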
Re: Language analyzers
On Wed, May 16, 2012 at 10:17 AM, anarchos78 rigasathanasio...@hotmail.com wrote: Hello, Is it possible to use two language analyzers for one fieldType? Let's say Greek and English (for indexing and querying). For Greek and English it's easy: they use totally different characters, so none of their token filters will conflict with each other. Just use StandardTokenizer and two stop filters (Greek and English), two stemmers (Greek and English), and so on. -- lucidimagination.com
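As a rough illustration of the chain being suggested, a fieldType along these lines could work (factory names and stopword file names are illustrative sketches only; check the analysis factories and filter ordering available in your Solr version):

```xml
<!-- Hypothetical mixed Greek/English fieldType: one tokenizer, then
     Greek and English stop filters and stemmers chained back to back. -->
<fieldType name="text_el_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.GreekLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords_el.txt"/>
    <filter class="solr.StopFilterFactory" words="stopwords_en.txt"/>
    <filter class="solr.GreekStemFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

Because Greek and English share no characters, each stop filter and stemmer simply passes the other language's tokens through untouched.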
Re: FrenchLightStemFilterFactory : normalizing tokens longer than 4 characters and having repeated characters in it
On Wed, May 16, 2012 at 8:28 AM, Tanguy Moal tanguy.m...@gmail.com wrote: Any ideas, someone? I think this is important since this could produce weird results on collections with numbers mixed in text. I agree, I think we should just add a Character.isLetter(ch) condition to the un-double check? Thanks for bringing this up. Do you want to open a JIRA issue? http://wiki.apache.org/solr/HowToContribute -- lucidimagination.com
Re: apostrophe / ayn / alif
On Tue, May 15, 2012 at 2:47 PM, Naomi Dushay ndus...@stanford.edu wrote: We are using the ICUFoldingFilterFactory with great success to fold diacritics so searches with and without the diacritics get the same results. We recently discovered we have some Korean records that use an alif diacritic instead of an apostrophe, and this diacritic is NOT getting folded. Has anyone experienced this for alif or ayn characters? Do you have a solution? What do you mean by alif diacritic in Korean? Alif (ا) isn't a diacritic and isn't used in Korean. Or did you mean the Arabic dagger alif ( ٰ )? That is not a diacritic in Unicode (though it is a combining mark). -- lucidimagination.com
Re: Implementing multiterm chain for ICUCollationKeyFilterFactory
On Thu, May 3, 2012 at 9:35 AM, OliverS oliver.schi...@unibas.ch wrote: Hello I read and tried a lot, but somehow I don't fully understand and it doesn't work. I'm working on Solr 4.0 (latest trunk) and use ICUCollationKeyFilterFactory for my main field type. Now, wildcard queries don't work, even though ICUCollationKeyFilterFactory seems to be a MultiTermAwareComponent (http://lucene.apache.org/solr/api/org/apache/solr/analysis/class-use/MultiTermAwareComponent.html). This filter implements that interface solely to support range queries in collation order (in addition to sort), so that it has all the Lucene functionality. Wildcards and even prefix queries simply won't work, because these are binary keys intended just for this purpose. If you want to do text-ish queries like this, you need to use a text field. -- lucidimagination.com
Re: Error with distributed search and Suggester component (Solr 3.4)
On Wed, May 2, 2012 at 12:16 PM, Ken Krugler kkrugler_li...@transpac.com wrote: What confuses me is that Suggester says it's based on SpellChecker, which supposedly does work with shards. It is based on the spellchecker APIs, but the spellchecker's ranking is based on simple comparators like string similarity, whereas suggesters use weights. When the spellchecker merges from shards, it just merges all their top-N into one set and recomputes this same distance stuff over again. So a suggester can't possibly work correctly like this (forget about any technical details): it can't make assumptions about the weights you provided. If they were, e.g., log() weights from your query logs, then it needs to do log-summation across the shards for the final combined weight to be correct. This is specific to how you originally computed the weights you gave it; it certainly cannot just recompute anything the way the spellchecker does :) Anyway, if you really want to do it, maybe https://issues.apache.org/jira/browse/SOLR-2848 is helpful. The background is that in 3.x there is really only one spellchecker impl (AbstractLuceneSpellChecker, or something like that). I don't think distributed spellcheck works with any other SpellChecker subclasses in 3.x; I think it's wired to only work with the Abstract-Lucene ones. When we added another subclass to 4.0, DirectSpellChecker, James saw that it was broken here and cleaned up the APIs so that spellcheckers can override this merge() operation. Unfortunately I forgot to commit those refactorings James did (which let any spellchecker override merge()ing) to the 3.x branch, but the ideas might be useful. -- lucidimagination.com
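To make the log-weight point concrete, here is a toy sketch (the weights-as-log(frequency) scheme is the hypothetical one from the reply above, not something Solr computes for you):

```python
import math

# Hypothetical: each shard stores a suggestion's weight as log(frequency)
# over its slice of the query logs. Merging the log values directly is
# meaningless; you must go back to frequency space, sum, and re-log.

def merge_log_weights(*shard_weights):
    return math.log(sum(math.exp(w) for w in shard_weights))

w1 = math.log(600.0)   # "solr cloud" seen 600 times in shard 1's logs
w2 = math.log(400.0)   # ...and 400 times in shard 2's logs
combined = merge_log_weights(w1, w2)
# combined == log(1000), which neither max(w1, w2) nor w1 + w2 gives
```

This is why a generic merge (take everyone's top-N and re-rank by the raw numbers) cannot be correct: the right combination rule depends entirely on how the weights were computed in the first place.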
Re: Error with distributed search and Suggester component (Solr 3.4)
On Tue, May 1, 2012 at 6:48 PM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi list, Does anybody know if the Suggester component is designed to work with shards? I'm not really sure it is? They would probably have to override the default merge implementation specified by SpellChecker. But all of the current suggesters pump out over 100,000 QPS on my machine, so I'm wondering how useful this would be. And if it were useful, merging results from different machines is pretty inefficient; for suggest you would instead shard by term, so that you need only contact a single host. -- lucidimagination.com
Re: Language Identification
On Mon, Apr 23, 2012 at 1:27 PM, Bai Shen baishen.li...@gmail.com wrote: I was under the impression that Solr does Tika and the language identifier that Shuyo did. The page at http://wiki.apache.org/solr/LanguageDetection lists them both: <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory"/> <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory"/> Again, I'm just trying to understand why it was moved to Solr. Because it offers a number of features above Tika's implementation, and is available under the Apache 2.0 License, so we are free to do that. -- lucidimagination.com
Re: Special characters in synonyms.txt on Solr 3.5
On Fri, Apr 20, 2012 at 12:10 PM, carl.nordenf...@bwinparty.com carl.nordenf...@bwinparty.com wrote: Directly injecting the letter ö into synonyms like so: island, ön island, ön renders the following exception on startup (both lines render the same error): java.lang.RuntimeException: java.nio.charset.MalformedInputException: Input length = 3 at org.apache.solr.analysis.FSTSynonymFilterFactory.inform(FSTSynonymFilterFactory.java:92) at org.apache.solr.analysis.SynonymFilterFactory.inform(SynonymFilterFactory.java:50) The synonyms file needs to be in UTF-8 encoding. -- lucidimagination.com
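A quick sketch of why a non-UTF-8 'ö' triggers this kind of MalformedInputException (Python used for illustration; the byte-level reasoning is the same on the Java side):

```python
# 'ö' saved in Latin-1 is the single byte 0xF6, which is not a valid byte
# in UTF-8, so a UTF-8 reader fails on it. The same text saved as UTF-8
# (two bytes, 0xC3 0xB6) decodes fine.

latin1_bytes = "island, ön".encode("latin-1")
utf8_bytes = "island, ön".encode("utf-8")

try:
    latin1_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:   # analogous to Java's MalformedInputException
    failed = True
```

In practice: re-save synonyms.txt as UTF-8 (without relying on the platform default encoding) and the startup error goes away.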
Re: maxMergeDocs in Solr 3.6
On Thu, Apr 19, 2012 at 11:54 AM, Burton-West, Tom tburt...@umich.edu wrote: Hello all, I'm getting ready to upgrade from Solr 3.4 to Solr 3.6 and I noticed that maxMergeDocs is no longer in the example solrconfig.xml. Has maxMergeDocs been deprecated? Or does the TieredMergePolicy ignore it? It's not applicable to TieredMergePolicy. When TieredMergePolicy was added, some previous global options were 'interpreted' for backwards compatibility: useCompoundFile(X) -> setUseCompoundFile(X); mergeFactor(X) -> setMaxMergeAtOnce(X) AND setSegmentsPerTier(X). However, in my opinion there is an easier, less confusing, more systematic approach: don't set these 'global' params, but specify what you want directly on TieredMergePolicy. For example, look at the javadocs of TieredMergePolicy here: http://lucene.staging.apache.org/core/3_6_0/api/core/org/apache/lucene/index/TieredMergePolicy.html You would simply configure it like:

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnceExplicit">19</int>
  <int name="segmentsPerTier">9</int>
  <double name="noCFSRatio">1.0</double>
</mergePolicy>

This will invoke setMaxMergeAtOnceExplicit(19), setSegmentsPerTier(9), and setNoCFSRatio(1.0). You can do the same thing with any of the TieredMergePolicy setters you see in the Lucene javadocs. -- lucidimagination.com
Re: [Solr 4.0] what is stored in .tim index file format?
This is the term dictionary for 4.0's default codec (currently the BlockTree implementation). .tim is the on-disk portion of the terms (similar in function to .tis in previous releases); .tip is the in-memory terms index (similar in function to .tii in previous releases). On Tue, Apr 17, 2012 at 6:37 AM, Lyuba Romanchuk lyuba.romanc...@gmail.com wrote: Hi, I have an index of ~31G where 27% of the index size is .fdt files (8.5G), 20% - .fdx files (6.2G), 37% - .frq files (11.6G), 16% - .tim files (5G). I didn't manage to find the description for .tim files. Can you help me with this? Thank you. Best regards, Lyuba -- lucidimagination.com
[ANNOUNCE] Apache Solr 3.6 released
12 April 2012, Apache Solr™ 3.6.0 available The Lucene PMC is pleased to announce the release of Apache Solr 3.6.0. Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites. This release contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html (see note below). See the CHANGES.txt file included with the release for a full list of details. Solr 3.6.0 Release Highlights: * New SolrJ client connector using Apache HttpComponents HttpClient (SOLR-2020) * Many analyzer factories are now multi-term query aware, allowing for things like field-type-aware lowercasing when building prefix and wildcard queries. (SOLR-2438) * New Kuromoji morphological analyzer tokenizes Japanese text, producing both compound words and their segmentation. (SOLR-3056) * Range Faceting (Dates & Numbers) is now supported in distributed search (SOLR-1709) * HTMLStripCharFilter has been completely re-implemented, fixing many bugs and greatly improving performance (LUCENE-3690) * StreamingUpdateSolrServer now supports the javabin format (SOLR-1565) * New LFU Cache option for use in Solr's internal caches. (SOLR-2906) * Memory performance improvements to all FST based suggesters (SOLR-2888) * New WFSTLookupFactory suggester supports finer-grained ranking for suggestions. 
(LUCENE-3714) * New options for configuring the amount of concurrency used in distributed searches (SOLR-3221) * Many bug fixes Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Happy searching, Lucene/Solr developers