[ANNOUNCE] Apache Solr 4.9.0 released

2014-06-25 Thread Robert Muir
25 June 2014, Apache Solr™ 4.9.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.9.0

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search.  Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.9.0 is available for immediate download at:
  http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of
details.

Solr 4.9.0 Release Highlights:

* Numerous optimizations for doc values search-time performance

* Allow a client application to request the minimum achieved replication
  factor for an update request (single or batch) by sending an optional
  parameter min_rf (see the sketch after this list).

* Query re-ranking support with the new ReRankingQParserPlugin.

* A new [child ...] DocTransformer for optionally including Block-Join
  descendant documents inline in the results of a search.

* A new (default) Lucene49NormsFormat to better compress certain cases
  such as very short fields.
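
For illustration, a rough sketch of how client requests might exercise two of
these highlights (the collection name and the query/document values below are
placeholders, not part of the release notes):

  # SolrCloud update: ask Solr to report whether a replication factor of 2 was achieved
  /solr/collection1/update?min_rf=2

  # Re-rank the top 1000 hits of the main query with a second query via the re-rank parser
  /solr/collection1/select?q=hello&rq={!rerank reRankQuery=$rqq reRankDocs=1000 reRankWeight=3}&rqq=greeting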


Solr 4.9.0 also includes many other new features as well as numerous
optimizations and bugfixes of the corresponding Apache Lucene release.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases.  It is possible that the mirror you are using
may not have replicated the release yet.  If that is the case, please
try another mirror.  This also goes for Maven access.

On behalf of the Lucene PMC,
Happy Searching


[ANNOUNCE] Apache Solr 4.8.1 released

2014-05-20 Thread Robert Muir
May 2014, Apache Solr™ 4.8.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.8.1

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.8.1 is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.8.1 includes 10 bug fixes, as well as Lucene 4.8.1 and its bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.


[ANNOUNCE] Apache Solr 4.7.2 released.

2014-04-15 Thread Robert Muir
April 2014, Apache Solr™ 4.7.2 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.7.2

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.7.2 is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.7.2 includes 2 bug fixes, as well as Lucene 4.7.2 and its bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.


Re: Unable to get offsets using AtomicReader.termPositionsEnum(Term)

2014-03-10 Thread Robert Muir
Hello, I think you are confused between two different index
structures, probably because of the name of the options in solr.

1. indexing term vectors: this means given a document, you can go
lookup a miniature inverted index just for that document. That means
each document has term vectors which has a term dictionary of the
terms in that one document, and optionally things like positions and
character offsets. This can be useful if you are examining *many
terms* for just a few documents. For example: the MoreLikeThis use
case. In solr this is activated with termVectors=true. To additionally
store positions/offsets information inside the term vectors its
termPositions and termOffsets, respectively.

2. indexing character offsets: this means given a term, you can get
the offset information along with each position that matched. So
really you can think of this as a special form of a payload. This is
useful if you are examining *many documents* for just a few terms. For
example, many highlighting use cases. In solr this is activated with
storeOffsetsWithPositions=true. It is unrelated to term vectors.

Hopefully this helps.
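
For example, a minimal schema.xml sketch of the two options described above
(field and type names here are placeholders):

  <!-- 1. per-document term vectors, with positions/offsets stored inside the vectors -->
  <field name="body_tv" type="text_general" indexed="true" stored="true"
         termVectors="true" termPositions="true" termOffsets="true"/>

  <!-- 2. character offsets stored in the postings, unrelated to term vectors -->
  <field name="body_offsets" type="text_general" indexed="true" stored="true"
         storeOffsetsWithPositions="true"/>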

On Mon, Mar 10, 2014 at 9:32 PM, Jefferson French jkfaus...@gmail.com wrote:
 This looks like a codec issue, but I'm not sure how to address it. I've
 found that a different instance of DocsAndPositionsEnum is instantiated
 between my code and Solr's TermVectorComponent.

 Mine:
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum
 Solr: 
 org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVDocsEnum

 As far as I can tell, I've only used Lucene/Solr 4.6, so I'm not sure where
 the Lucene 4.1 reference comes from. I've searched through the Solr config
 files and can't see where to change the codec, but shouldn't the reader use
 the same codec as used when the index was created?


 On Fri, Mar 7, 2014 at 1:37 PM, Jefferson French jkfaus...@gmail.comwrote:

 We have an API on top of Lucene 4.6 that I'm trying to adapt to running
 under Solr 4.6. The problem is although I'm getting the correct offsets
 when the index is created by Lucene, the same method calls always return -1
 when the index is created by Solr. In the latter case I can see the
 character offsets via Luke, and I can even get them from Solr when I access
 the /tvrh search handler, which uses the TermVectorComponent class.

 This is roughly how I'm reading character offsets in my Lucene code:

 AtomicReader reader = ...
 Term term = ...
 DocsAndPositionsEnum postings = reader.termPositionsEnum(term);
 while (postings.nextDoc() != DocsAndPositionsEnum.NO_MORE_DOCS) {
   for (int i = 0; i < postings.freq(); i++) {
     System.out.println("start: " + postings.startOffset());
     System.out.println("end: " + postings.endOffset());
   }
 }


 Notice that I want the values for a single term. When run against an index
 created by Solr, the above calls to startOffset() and endOffset() return
 -1. Solr's TermVectorComponent prints the correct offsets like this
 (paraphrased):

 IndexReader reader = searcher.getIndexReader();
 Terms vector = reader.getTermVector(docId, field);
 TermsEnum termsEnum = vector.iterator(null);
 DocsAndPositionsEnum dpEnum = null;
 BytesRef text;
 while ((text = termsEnum.next()) != null) {
   String term = text.utf8ToString();
   int freq = (int) termsEnum.totalTermFreq();
   dpEnum = termsEnum.docsAndPositions(null, dpEnum);
   dpEnum.nextDoc();
   for (int i = 0; i < freq; i++) {
     final int pos = dpEnum.nextPosition();
     System.out.println("start: " + dpEnum.startOffset());
     System.out.println("end: " + dpEnum.endOffset());
   }
 }


 but in this case it is getting the offsets per doc ID, rather than a
 single term, which is what I want.

 Could anyone tell me:

1. Why I'm not able to get the offsets using my first example, and/or
2. A better way to get the offsets for a given term?

 Thanks.

Jeff











Re: ANNOUNCE: Apache Solr Reference Guide for 4.7

2014-03-05 Thread Robert Muir
I debugged the PDF a little. FWIW, the following code (using iText)
takes it to 9MB:

  public static void main(String args[]) throws Exception {
    Document document = new Document();
    PdfSmartCopy copy = new PdfSmartCopy(document,
        new FileOutputStream("/home/rmuir/Downloads/test.pdf"));
    //copy.setCompressionLevel(9);
    //copy.setFullCompression();
    document.open();
    PdfReader reader = new PdfReader("/home/rmuir/Downloads/apache-solr-ref-guide-4.7.pdf");
    int pages = reader.getNumberOfPages();
    for (int i = 0; i < pages; i++) {
      PdfImportedPage page = copy.getImportedPage(reader, i + 1);
      copy.addPage(page);
    }
    copy.freeReader(reader);
    reader.close();
    document.close();
  }


On Wed, Mar 5, 2014 at 10:17 AM, Steve Rowe sar...@gmail.com wrote:
 Not sure if it’s relevant anymore, but a few years ago Atlassian resolved as
 “won’t fix” a request to configure exported PDF compression ratio:
 https://jira.atlassian.com/browse/CONF-21329.  Their suggestion: zip the 
 PDF.  I tried that - the resulting zip size is roughly 9MB, so it’s 
 definitely compressible.

 Steve

 On Mar 5, 2014, at 10:03 AM, Cassandra Targett casstarg...@gmail.com wrote:

 You know, I didn't even notice that. It did go up to 30M.

 I've made a note to look into that before we release the 4.8 version to see
 if it can be reduced at all. I suspect the screenshots are causing it to
 balloon - we made some changes to the way they appear in the PDF for 4.7
 which may be the cause, but also the software was upgraded and maybe the
 newer version is handling them differently.

 Thanks for pointing that out.


 On Tue, Mar 4, 2014 at 6:43 PM, Alexandre Rafalovitch 
 arafa...@gmail.comwrote:

 Has it really gone up in size from 5Mb for 4.6 version to 30Mb for 4.7
 version? Or some mirrors are playing tricks (mine is:
 http://www.trieuvan.com/apache/lucene/solr/ref-guide/ )

 Regards,
   Alex.
 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)


 On Wed, Mar 5, 2014 at 1:39 AM, Cassandra Targett ctarg...@apache.org
 wrote:
 The Lucene PMC is pleased to announce that we have a new version of the
 Solr Reference Guide available for Solr 4.7.

 The 395 page PDF serves as the definitive user's manual for Solr 4.7. It
 can be downloaded from the Apache mirror network:

 https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/

 Cassandra




Re: Problems with ICUCollationField

2014-02-19 Thread Robert Muir
you need the solr analysis-extras jar in your classpath, too.
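
A rough solrconfig.xml sketch of the lib directives this implies (the relative
paths assume a standard 4.6.1 distribution layout and are only an example):

  <lib dir="../../../contrib/analysis-extras/lib" regex=".*\.jar"/>
  <lib dir="../../../contrib/analysis-extras/lucene-libs" regex=".*\.jar"/>
  <!-- the jar that actually contains solr.ICUCollationField -->
  <lib path="../../../dist/solr-analysis-extras-4.6.1.jar"/>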



On Wed, Feb 19, 2014 at 6:45 AM, Thomas Fischer fischer...@aon.at wrote:

 Hello,

 I'm migrating to solr 4.6.1 and have problems with the ICUCollationField
 (apache-solr-ref-guide-4.6.pdf, pp. 31 and 100).

 I get consistently the error message
 Error loading class 'solr.ICUCollationField'.
 even after
 INFO: Adding
 'file:/srv/solr4.6.1/contrib/analysis-extras/lib/icu4j-49.1.jar' to
 classloader
 and
 INFO: Adding
 'file:/srv/solr4.6.1/contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-4.6.1.jar'
 to classloader.

 Am I missing something?

 In solr's subversion I found

 /SVN/solr/contrib/analysis-extras/src/java/org/apache/solr/schema/ICUCollationField.java
 but no corresponding class in solr4.6.1's contrib folder.

 Best
 Thomas




Re: Problems with ICUCollationField

2014-02-19 Thread Robert Muir
you need the solr analysis-extras jar itself, too.



On Wed, Feb 19, 2014 at 8:25 AM, Thomas Fischer fischer...@aon.at wrote:

 Hello Robert,

 I already added
 contrib/analysis-extras/lib/
 and
 contrib/analysis-extras/lucene-libs/
 via lib directives in solrconfig, this is why the classes mentioned are
 loaded.

 Do you know which jar is supposed to contain the ICUCollationField?

 Best regards
 Thomas



 Am 19.02.2014 um 13:54 schrieb Robert Muir:

  you need the solr analysis-extras jar in your classpath, too.
 
 
 
  On Wed, Feb 19, 2014 at 6:45 AM, Thomas Fischer fischer...@aon.at
 wrote:
 
  Hello,
 
  I'm migrating to solr 4.6.1 and have problems with the ICUCollationField
  (apache-solr-ref-guide-4.6.pdf, pp. 31 and 100).
 
  I get consistently the error message
  Error loading class 'solr.ICUCollationField'.
  even after
  INFO: Adding
  'file:/srv/solr4.6.1/contrib/analysis-extras/lib/icu4j-49.1.jar' to
  classloader
  and
  INFO: Adding
 
 'file:/srv/solr4.6.1/contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-4.6.1.jar'
  to classloader.
 
  Am I missing something?
 
  I solr's subversion I found
 
 
 /SVN/solr/contrib/analysis-extras/src/java/org/apache/solr/schema/ICUCollationField.java
  but no corresponding class in solr4.6.1's contrib folder.
 
  Best
  Thomas
 
 




Re: Problems with ICUCollationField

2014-02-19 Thread Robert Muir
Hmm, for standardization of text fields, collation might be a little
awkward.

For your german umlauts, what do you mean by standardize? is this to
achieve equivalency of e.g. oe to ö in your search terms?

In that case, a simpler approach would be to put
GermanNormalizationFilterFactory in your chain:
http://lucene.apache.org/core/4_6_1/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html
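
A rough fieldType sketch with that filter in the chain (the type name and
tokenizer choice are placeholders):

  <fieldType name="text_de_norm" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.GermanNormalizationFilterFactory"/>
    </analyzer>
  </fieldType>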


On Wed, Feb 19, 2014 at 9:16 AM, Thomas Fischer fischer...@aon.at wrote:

 Thanks, that helps!

 I'm trying to migrate from the now deprecated ICUCollationKeyFilterFactory
 I used before to the ICUCollationField.
 Is there any description how to achieve this?

 First tries now yield

 ICUCollationField does not support specifying an analyzer.

 which makes it complicated since I used the ICUCollationKeyFilterFactory
 to standardize my text fields (in particular because of German Umlauts).
 But an ICUCollationField without LowerCaseFilter, a WhitespaceTokenizer, a
 LetterTokenizer, etc. doesn't do me much good, I'm afraid.
 Or is this somehow wrapped into the ICUCollationField?

 I didn't find ICUCollationField  in the solr wiki and not much information
 in the reference.
 And the hint

 solr.ICUCollationField is included in the Solr analysis-extras contrib -
 see solr/contrib/analysis-extras/README.txt for instructions on which jars
 you need to add to your SOLR_HOME/lib in order to use it.

 is misleading insofar as this README.txt doesn't mention the
 solr-analysis-extras-4.6.1.jar in dist.

 Best
 Thomas


 Am 19.02.2014 um 14:27 schrieb Robert Muir:

  you need the solr analysis-extras jar itself, too.
 
 
 
  On Wed, Feb 19, 2014 at 8:25 AM, Thomas Fischer fischer...@aon.at
 wrote:
 
  Hello Robert,
 
  I already added
  contrib/analysis-extras/lib/
  and
  contrib/analysis-extras/lucene-libs/
  via lib directives in solrconfig, this is why the classes mentioned are
  loaded.
 
  Do you know which jar is supposed to contain the ICUCollationField?
 
  Best regards
  Thomas
 
 
 
  Am 19.02.2014 um 13:54 schrieb Robert Muir:
 
  you need the solr analysis-extras jar in your classpath, too.
 
 
 
  On Wed, Feb 19, 2014 at 6:45 AM, Thomas Fischer fischer...@aon.at
  wrote:
 
  Hello,
 
  I'm migrating to solr 4.6.1 and have problems with the
 ICUCollationField
  (apache-solr-ref-guide-4.6.pdf, pp. 31 and 100).
 
  I get consistently the error message
  Error loading class 'solr.ICUCollationField'.
  even after
  INFO: Adding
  'file:/srv/solr4.6.1/contrib/analysis-extras/lib/icu4j-49.1.jar' to
  classloader
  and
  INFO: Adding
 
 
 'file:/srv/solr4.6.1/contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-4.6.1.jar'
  to classloader.
 
  Am I missing something?
 
  I solr's subversion I found
 
 
 
 /SVN/solr/contrib/analysis-extras/src/java/org/apache/solr/schema/ICUCollationField.java
  but no corresponding class in solr4.6.1's contrib folder.
 
  Best
  Thomas
 
 
 
 




Re: Problems with ICUCollationField

2014-02-19 Thread Robert Muir
On Wed, Feb 19, 2014 at 10:33 AM, Thomas Fischer fischer...@aon.at wrote:


  Hmm, for standardization of text fields, collation might be a little
  awkward.

 I arrived there after using custom rules for a while (see
 RuleBasedCollator on http://wiki.apache.org/solr/UnicodeCollation) and
 then being told
 For better performance, less memory usage, and support for more locales,
 you can add the analysis-extras contrib and use
 ICUCollationKeyFilterFactory instead. (on the same page under ICU
 Collation).

  For your german umlauts, what do you mean by standardize? is this to
  achieve equivalency of e.g. oe to ö in your search terms?

 That is the main point, but I might also need the additional normalization
 of combined characters like
 o+  ̈ = ö and probably similar constructions for other languages (like
 Hungarian).


Sure but using collation to get normalization is pretty overkill too. Maybe
try ICUNormalizer2Filter? This gives you better control over the
normalization anyway.
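
A minimal sketch of that suggestion, assuming the analysis-extras jars are on
the classpath (the nfkc_cf normalization form is just one possible choice):

  <fieldType name="text_icunorm" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
    </analyzer>
  </fieldType>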



  In that case, a simpler approach would be to put
  GermanNormalizationFilterFactory in your chain:
 
 http://lucene.apache.org/core/4_6_1/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html

 I'll see how far I get with this, but from the description
 • 'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
 • 'ae' and 'oe' are replaced by 'a', and 'o', respectively.
 this seems to be too far-reaching a reduction: while the identification
 ä=ae is not very serious and rarely misleading, ä=a might pack words
 together that shouldn't be, Äsen and Asen are quite different concepts,


I'm not sure thats a mainstream opinion: not only do the default german
collation rules conflate these two characters as equivalent at primary
level, but so do many german stemming algorithms. Similar arguments could
be made for 'résumé' versus 'resume' and so on. Search isn't an exact
science.


[ANNOUNCE] Apache Solr 4.6.1 released.

2014-01-28 Thread Robert Muir
January 2014, Apache Solr™ 4.6.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.6.1

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.6.1 is available for immediate download at:
http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.6.1 includes 29 bug fixes and one optimization as well as
Lucene 4.6.1 and its bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.


Re: Tracking down the input that hits an analysis chain bug

2014-01-03 Thread Robert Muir
This exception comes from OffsetAttributeImpl (e.g. you dont need to
index anything to reproduce it).

Maybe you have a missing clearAttributes() call (your tokenizer
'returns true' without calling that first)? This could explain it, if
something like a StopFilter is also present in the chain: basically
the offsets overflow.

the test stuff in BaseTokenStreamTestCase should be able to detect
this as well...
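
A toy sketch of the expected pattern (this is not the poster's tokenizer, just
an illustration of where clearAttributes() belongs):

  import java.io.IOException;
  import java.io.Reader;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

  // Emits the whole input as a single token; the point is calling clearAttributes()
  // before populating attributes, and correctOffset() for the offsets.
  public final class WholeInputTokenizer extends Tokenizer {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private boolean done = false;

    public WholeInputTokenizer(Reader input) {
      super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (done) {
        return false;
      }
      clearAttributes();   // reset all attributes before emitting a new token
      done = true;
      int length = 0;
      char[] buffer = termAtt.buffer();
      int read;
      while ((read = input.read(buffer, length, buffer.length - length)) != -1) {
        length += read;
        if (length == buffer.length) {
          buffer = termAtt.resizeBuffer(buffer.length + 256);
        }
      }
      termAtt.setLength(length);
      offsetAtt.setOffset(correctOffset(0), correctOffset(length));
      return length > 0;
    }

    @Override
    public void reset() throws IOException {
      super.reset();
      done = false;
    }
  }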

On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies ben...@basistech.com wrote:
 Using Solr Cloud with 4.3.1.

 We've got a problem with a tokenizer that manifests as calling
 OffsetAtt.setOffsets() with invalid inputs. OK, so, we want to figure out
 what input provokes our code into getting into this pickle.

 The problem happens on SolrCloud nodes.

 The problem manifests as this sort of thing:

 Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
 SEVERE: java.lang.IllegalArgumentException: startOffset must be
 non-negative, and endOffset must be = startOffset,
 startOffset=-1811581632,endOffset=-1811581632

 How could we get a document ID so that we can tell which document was being
 processed?


Re: Bad fieldNorm when using morphologic synonyms

2013-12-09 Thread Robert Muir
no, its turned on by default in the default similarity.

as i said, all that is necessary is to fix your analyzer to emit the
proper position increments.

On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand
manuel.lenorm...@gmail.com wrote:
 In order to set discountOverlaps to true you must have added the
 <similarity class="solr.DefaultSimilarityFactory"/> to the schema.xml, which
 is commented out by default!

 As by default this param is false, the above situation is expected with
 correct positioning, as said.

 In order to fix the field norms you'd have to reindex with the similarity
 class which initializes the param to true.

 Cheers,
 Manu


Re: Bad fieldNorm when using morphologic synonyms

2013-12-08 Thread Robert Muir
its accurate, you are wrong.

please, look at setDiscountOverlaps in your similarity. This is really
easy to understand.

On Sun, Dec 8, 2013 at 7:23 AM, Manuel Le Normand
manuel.lenorm...@gmail.com wrote:
 Robert, you last reply is not accurate.
 It's true that the field norms and termVectors are independent. But this
 issue of higher norms for this case is expected with well assigned
 positions. The LengthNorm is assigned as FieldInvertState.length which is
 the count of incrementToken and not num of positions! It is the case for
 wordDelimiterFilter or ReversedWildcardFilter which do change the norm when
 expanding a term.


Re: Bad fieldNorm when using morphologic synonyms

2013-12-06 Thread Robert Muir
Your analyzer needs to set positionIncrement correctly: sounds like its broken.

On Thu, Dec 5, 2013 at 1:53 PM, Isaac Hebsh isaac.he...@gmail.com wrote:
 Hi,
 we implemented a morphologic analyzer, which stems words on index time.
 For some reasons, we index both the original word and the stem (on the same
 position, of course).
 The stemming is done on a specific language, so other languages are not
 stemmed at all.

 Because of that, two documents with the same amount of terms, may have
 different termVector size. document which contains many words that being
 stemmed, will have a double sized termVector. This behaviour affects the
 relevance score in a BAD way. the fieldNorm of these documents reduces
 thier score. This is NOT the wanted behaviour in our case.

 We are looking for a way to mark the stemmed words (on index time, of
 course) so they won't affect the fieldNorm. Do such a way exist?

 Do you have another idea?


Re: Bad fieldNorm when using morphologic synonyms

2013-12-06 Thread Robert Muir
termvectors have nothing to do with any of this.

please, fix your analyzer first. if you want to add a synonym, it
should be position increment of zero.

i bet exact phrase queries aren't working correctly either.
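
A toy TokenFilter sketch of that advice (stemOf() is a placeholder, not a real
stemmer; the point is the positionIncrement of zero on the injected token):

  import java.io.IOException;
  import java.util.Locale;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
  import org.apache.lucene.util.AttributeSource;

  // Stacks an extra "stem" token on top of each original token at the same position,
  // so the injected token does not inflate the field length used for norms.
  public final class StackedStemFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt =
        addAttribute(PositionIncrementAttribute.class);
    private AttributeSource.State pendingOriginal = null;

    public StackedStemFilter(TokenStream input) {
      super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (pendingOriginal != null) {
        restoreState(pendingOriginal);        // copy the original token's attributes
        pendingOriginal = null;
        String original = termAtt.toString();
        termAtt.setEmpty().append(stemOf(original));
        posIncAtt.setPositionIncrement(0);    // same position as the original token
        return true;
      }
      if (!input.incrementToken()) {
        return false;
      }
      pendingOriginal = captureState();       // remember the original to stack a copy next call
      return true;
    }

    @Override
    public void reset() throws IOException {
      super.reset();
      pendingOriginal = null;
    }

    private String stemOf(String term) {
      return term.toLowerCase(Locale.ROOT);   // stand-in for a real stemmer
    }
  }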

On Fri, Dec 6, 2013 at 12:50 AM, Isaac Hebsh isaac.he...@gmail.com wrote:
 1) positions look all right (for me).
 2) fieldNorm is determined by the size of the termVector, isn't it? the
 termVector size isn't affected by the positions.


 On Fri, Dec 6, 2013 at 10:46 AM, Robert Muir rcm...@gmail.com wrote:

 Your analyzer needs to set positionIncrement correctly: sounds like its
 broken.

 On Thu, Dec 5, 2013 at 1:53 PM, Isaac Hebsh isaac.he...@gmail.com wrote:
  Hi,
  we implemented a morphologic analyzer, which stems words on index time.
  For some reasons, we index both the original word and the stem (on the
 same
  position, of course).
  The stemming is done on a specific language, so other languages are not
  stemmed at all.
 
  Because of that, two documents with the same amount of terms, may have
  different termVector size. document which contains many words that being
  stemmed, will have a double sized termVector. This behaviour affects the
  relevance score in a BAD way. the fieldNorm of these documents reduces
  thier score. This is NOT the wanted behaviour in our case.
 
  We are looking for a way to mark the stemmed words (on index time, of
  course) so they won't affect the fieldNorm. Do such a way exist?
 
  Do you have another idea?



Re: Why do people want to deploy to Tomcat?

2013-11-13 Thread Robert Muir
which example? there are so many.

On Wed, Nov 13, 2013 at 1:00 PM, Mark Miller markrmil...@gmail.com wrote:
 RE: the example folder

 It’s something I’ve been pushing towards moving away from for a long time - 
 see https://issues.apache.org/jira/browse/SOLR-3619 Rename 'example' dir to 
 'server' and pull examples into an 'examples’ directory

 Part of a push I’ve been on to own the Container level (people are now on 
 board with that for 5.0), add start scripts, and other niceties that we 
 should have but don’t yet.

 Even our config files should move away from being an “example” and end up 
 more like a default starting template. Like a database, it should be simple 
 to create a collection without needing to deal with config - you want to deal 
 with the config when you need to, not face it all up front every time it is 
 time to create a new collection.

 IMO, the name example is historical - most people already use it this way, 
 the name just confuses matters.

 - Mark


 On Nov 13, 2013, at 12:30 PM, Shawn Heisey s...@elyograg.org wrote:

 On 11/13/2013 5:29 AM, Dmitry Kan wrote:
 Reading that people have considered deploying example folder is slightly
 strange to me. No wonder they are confused and confuse their ops.

 I do use the stripped jetty included in the example, but my setup is not a 
 straight copy of the example directory. I removed a lot of it and changed 
 how jars get loaded.  I built my own init script from scratch, tailored for 
 my setup.

 I'll start a new thread with my init script and some info about how I 
 installed Solr.

 Thanks,
 Shawn




Re: Background merge errors with Solr 4.4.0 on Optimize call

2013-10-29 Thread Robert Muir
I think its a bug, but thats just my opinion. i sent a patch to dev@
for thoughts.

On Tue, Oct 29, 2013 at 6:09 PM, Erick Erickson erickerick...@gmail.com wrote:
 Hmmm, so you're saying that merging indexes where a field
 has been removed isn't handled. So you have some documents
 that do have a what field, but your schema doesn't have it,
 is that true?

 It _seems_ like you could get by by putting the _what_ field back
 into your schema, just not sending any data to it in new docs.

 I'll let others who understand merging better than me chime in on
 whether this is a case that should be handled or a bug. I pinged the
 dev list to see what the opinion is

 Best,
 Erick


 On Mon, Oct 28, 2013 at 6:39 PM, Matthew Shapiro m...@mshapiro.net wrote:

 Sorry for reposting after I just sent in a reply, but I just looked at the
 error trace closer and noticed


1. Caused by: java.lang.IllegalArgumentException: no such field what


 The 'what' field was removed by request of the customer as they wanted the
 logic behind what gets queried in the what field to be code side instead
 of solr side (for easier changing without having to re-index everything.  I
 didn't feel strongly either way and since they are paying me, I took it
 out).

 This makes me wonder if its crashing while merging because a field that
 used to be there is now gone.  However, this seems odd to me as Solr
 doesn't even let me delete the old data and instead its leaving my
 collection in an extremely bad state, with the only remedy I can think of
 is to nuke the index at the filesystem level.

 If this is indeed the cause of the crash, is the only way to delete a field
 to first completely empty your index first?


 On Mon, Oct 28, 2013 at 6:34 PM, Matthew Shapiro m...@mshapiro.net wrote:

  Thanks for your response.
 
  You were right, solr is logging to the catalina.out file for tomcat.
  When
  I click the optimize button in solr's admin interface the following logs
  are written: http://apaste.info/laup
 
  About JVM memory, solr's admin interface is listing JVM memory at 3.1%
  (221.7MB is dark grey, 512.56MB light grey and 6.99GB total).
 
 
  On Mon, Oct 28, 2013 at 6:29 AM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  For Tomcat, the Solr is often put into catalina.out
  as a default, so the output might be there. You can
  configure Solr to send the logs most anywhere you
  please, but without some specific setup
  on your part the log output just goes to the default
  for the servlet.
 
  I took a quick glance at the code but since the merges
  are happening in the background, there's not much
  context for where that error is thrown.
 
  How much memory is there for the JVM? I'm grasping
  at straws a bit...
 
  Erick
 
 
  On Sun, Oct 27, 2013 at 9:54 PM, Matthew Shapiro m...@mshapiro.net
 wrote:
 
   I am working at implementing solr to work as the search backend for
 our
  web
   system.  So far things have been going well, but today I made some
  schema
   changes and now things have broken.
  
   I updated the schema.xml file and reloaded the core (via the admin
   interface).  No errors were reported in the logs.
  
   I then pushed 100 records to be indexed.  A call to Commit afterwards
   seemed fine, however my next call for Optimize caused the following
  errors:
  
   java.io.IOException: background merge hit exception:
   _2n(4.4):C4263/154 _30(4.4):C134 _32(4.4):C10 _31(4.4):C10 into _37
   [maxNumSegments=1]
  
   null:java.io.IOException: background merge hit exception:
   _2n(4.4):C4263/154 _30(4.4):C134 _32(4.4):C10 _31(4.4):C10 into _37
   [maxNumSegments=1]
  
  
   Unfortunately, googling for background merge hit exception came up
   with 2 thing: a corrupt index or not enough free space.  The host
   machine that's hosting solr has 227 out of 229GB free (according to df
   -h), so that's not it.
  
  
   I then ran CheckIndex on the index, and got the following results:
   http://apaste.info/gmGU
  
  
   As someone who is new to solr and lucene, as far as I can tell this
   means my index is fine. So I am coming up at a loss. I'm somewhat sure
   that I could probably delete my data directory and rebuild it but I am
   more interested in finding out why is it having issues, what is the
   best way to fix it, and what is the best way to prevent it from
   happening when this goes into production.
  
  
   Does anyone have any advice that may help?
  
  
   As an aside, i do not have a stacktrace for you because the solr admin
   page isn't giving me one.  I tried looking in my logs file in my solr
   directory, but it does not contain any logs.  I opened up my
   ~/tomcat/lib/log4j.properties file and saw http://apaste.info/0rTL,
   which didnt really help me find log files.  Doing a 'find . | grep
   solr.log' didn't really help either.  Any help for finding log files
   (which may help find the actual cause of this) would also be
   appreciated.
  
 
 
 



Re: Problems installing Solr4 in Jetty9

2013-08-17 Thread Robert Muir
On Sat, Aug 17, 2013 at 3:59 AM, Chris Collins ch...@geekychris.com wrote:
 I am using 4.4 in an embedded mode and found that it has a dependency on 
 hadoop 2.0.5. alpha that in turn depends on jetty 6.1.26 which I think 
 pre-dates electricity :-}


I think this is only a test dependency ?


Re: PostingsHighlighter returning fields which don't match

2013-08-14 Thread Robert Muir
On Wed, Aug 14, 2013 at 3:53 AM, ses stew...@ssims.co.uk wrote:

 We are trying out the new PostingsHighlighter with Solr 4.2.1 and finding
 that the highlighting section of the response includes self-closing tags
 for
 all the fields in hl.fl (by default for edismax it is all fields in qf)
 where there are no highlighting matches. In contrast the same query on Solr
 4.0.0 without PostingsHighlighter it returns only the fields containing
 highlighting matches.

 here is a simplified example of the highlighting response for a document
 with no matches in the fields specified by hl.fl:
 with PostingsHighlighter:
 <response>
   ...
   <lst name="highlighting">
     <lst name="Z123456">
       <arr name="A1"/>
       <arr name="A2"/>
       <arr name="A3"/>
       ...
     </lst>
   </lst>
 </response>

 without PostingsHighlighter:
 <response>
   ...
   <lst name="highlighting">
     <lst name="Z123456"/>
   </lst>
 </response>


Do you want to open a JIRA issue to just change the behavior?


 This is a big problem for us as we have a large number of fields in a
 dynamic field and we believe every time a highlighted response comes back
 it
 is sending us a very large number of self-closing tags which bloats the
 response to an unreasonable size (in some cases 100MB+).


Unrelated: If your queries actually go against a large number of fields,
I'm not sure how efficient this highlighter will be. Thats because at some
number of N fields, it will be much more efficient to use a
document-oriented term vector approach (e.g. standard
highlighter/fast-vector-highlighter).


Re: Who's cleaning the Fieldcache?

2013-08-14 Thread Robert Muir
On Wed, Aug 14, 2013 at 5:29 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : why? Those are my sort fields and they are occupying a lot of space (doubled
 : in this case but I see that sometimes I have three or four old segment
 : references)
 :
 : Is there something I can do to remove those old references? I tried to 
 reload
 : the core and it seems the old references are discarded (i.e. garbage
 : collected) but I believe is not a good workaround, I would avoid to reload 
 the
 : core for every replication cycle.

 You don't need to reload the core to get rid of the old FieldCaches -- in
 fact, there is nothing about reloading the core that will guarantee old
 FieldCaches get removed.

 FieldCaches are managed using a WeakHashMap - so once the IndexReaders
 associated with those FieldCaches are no longer used, they will be garbage
 collected when and if the JVM's garbage collector gets around to it.

 if they sit around after you are done with them, they might look like they
 take up a lot of memory, but that just means your JVM Heap has that memory
 to spare and hasn't needed to clean them up yet.

I don't think this is correct.

When you register an entry in the fieldcache, it registers event
listeners on the segment's core so that when its close()d, any entries
are purged rather than waiting on GC.

See FieldCacheImpl.java


Re: Who's cleaning the Fieldcache?

2013-08-14 Thread Robert Muir
On Wed, Aug 14, 2013 at 5:58 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 :  FieldCaches are managed using a WeakHashMap - so once the IndexReaders
 :  associated with those FieldCaches are no longer used, they will be garbage
 :  collected when and if the JVM's garbage collector gets around to it.
 : 
 :  if they sit around after you are done with them, they might look like they
 :  take up a lot of memory, but that just means your JVM Heap has that memory
 :  to spare and hasn't needed to clean them up yet.
 :
 : I don't think this is correct.
 :
 : When you register an entry in the fieldcache, it registers event
 : listeners on the segment's core so that when its close()d, any entries
 : are purged rather than waiting on GC.
 :
 : See FieldCacheImpl.java

 Ah ... sweet.  I didn't realize that got added.

 (In any case: it looks like a WeakHashMap is still used in case the
 listeners never get called, correct?)


I think it might be the other way around: i think it was weakmap
before always, the close listeners were then added sometime in 3.x
series, so we registered purge events as an optimization.

But one way to look at it is: readers should really get closed, so why
have the weak map and not just a regular hashmap.

Even if we want to keep the weak map (seriously i dont care, and i
dont want to be the guy fielding complaints on this), I'm going to
open with an issue with a patch that removes it and fails tests in
@afterclass if there is any entries. This way its totally clear
if/when/where anything is relying on GC today here and we can at
least look at that.


Re: Split Shard Error - maxValue must be non-negative

2013-08-13 Thread Robert Muir
did you do a (real) commit before trying to use this?
I am not sure how this splitting works, but at least the merge option
requires that.

i can't see this happening unless you are somehow splitting a 0
document index (or, if the splitter is creating 0 document splits)
so this is likely just a symptom of
https://issues.apache.org/jira/browse/LUCENE-5116

On Tue, Aug 13, 2013 at 6:46 AM, Srivatsan ranjith.venkate...@gmail.com wrote:
 Hi,

 I am experimenting with solr 4.4.0 split shard feature. When i split the
 shard i am getting the following exception.

 /java.lang.IllegalArgumentException: maxValue must be non-negative (got: -1)
 at
 org.apache.lucene.util.packed.PackedInts.bitsRequired(PackedInts.java:1184)
 at
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:140)
 at
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
 at
 org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
 at 
 org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
 at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
 at 
 org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:2488)
 at
 org.apache.solr.update.SolrIndexSplitter.split(SolrIndexSplitter.java:125)
 at
 org.apache.solr.update.DirectUpdateHandler2.split(DirectUpdateHandler2.java:766)
 at
 org.apache.solr.handler.admin.CoreAdminHandler.handleSplitAction(CoreAdminHandler.java:284)
 at
 org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
 at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at
 org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:611)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:209)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
 at
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
 at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
 at
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
 at
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
 at
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
 at
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:368)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
 at
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
 at 
 org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Thread.java:679)/


 How to resolve this problem?




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Split-Shard-Error-maxValue-must-be-non-negative-tp4084220.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Split Shard Error - maxValue must be non-negative

2013-08-13 Thread Robert Muir
Well, i meant before, but i just took a look and this is implemented
differently than the merge one.

In any case, i think its the same bug, because I think the only way
this can happen is if somehow this splitter is trying to create a
0-document split (or maybe a split containing all deletions).

On Tue, Aug 13, 2013 at 8:22 AM, Srivatsan ranjith.venkate...@gmail.com wrote:
 Ya i am performing commit after split request is submitted to server.



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Split-Shard-Error-maxValue-must-be-non-negative-tp4084220p4084256.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Split Shard Error - maxValue must be non-negative

2013-08-13 Thread Robert Muir
On Tue, Aug 13, 2013 at 11:39 AM, Shalin Shekhar Mangar
shalinman...@gmail.com wrote:
 The splitting code calls commit before it starts the splitting. It creates
 a LiveDocsReader using a bitset created by the split. This reader is merged
 to an index using addIndexes.

 Shouldn't the addIndexes code then ignore all such 0-document segments?



Not in 4.4: https://issues.apache.org/jira/browse/LUCENE-5116


Re: Is there a way to store binary data (byte[]) in DocValues?

2013-08-12 Thread Robert Muir
On Mon, Aug 12, 2013 at 8:38 AM, Mathias Lux m...@itec.uni-klu.ac.at wrote:
 Hi!

 I'm basically searching for a method to put byte[] data into Lucene
 DocValues of type BINARY (see [1]). Currently only primitives and
 Strings are supported according to [1].

 I know that this can be done with a custom update handler, but I'd
 like to avoid that.


Can you describe a little bit what kind of operations you want to do with it?
I don't really know how BinaryField is typically used, but maybe it
could support this option. On the other hand adding it to BinaryField
might not buy you much without some additional stuff depending upon
what you need to do. Like if you really want to do sort/facet on the
thing, SORTED(SET) would probably be a better implementation: it
doesnt care that the values are binary.

BINARY, SORTED, and SORTED_SET actually all take byte[]: the difference is:
* SORTED: deduplicates/compresses the unique byte[]'s and gives each
document an ordinal number that reflects sort order (for
sorting/faceting/grouping/etc)
* SORTED_SET: similar, except each document has a set (which can be
empty), of ordinal numbers (e.g. for faceting multivalued fields)
* BINARY: just stores the byte[] for each document (no deduplication,
no compression, no ordinals, nothing).

So for sorting/faceting: BINARY is generally not very efficient unless
there is something custom going on: for example lucene's faceting
package stores the values elsewhere in a separate taxonomy index, so
it uses this type just to encode a delta-compressed ordinal list for
each document.

For scoring factors/function queries: encoding the values inside
NUMERIC(s) [up to 64 bits each] might still be best on average: the
compression applied here is surprisingly efficient.
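
For reference, a rough sketch of how the three flavours look at the Lucene
level (field names and values are made up):

  import org.apache.lucene.document.BinaryDocValuesField;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.NumericDocValuesField;
  import org.apache.lucene.document.SortedSetDocValuesField;
  import org.apache.lucene.util.BytesRef;

  static Document exampleDoc(byte[] imageSignature) {
    Document doc = new Document();
    // BINARY: just the raw byte[] per document (no deduplication, no ordinals)
    doc.add(new BinaryDocValuesField("signature", new BytesRef(imageSignature)));
    // SORTED_SET: deduplicated byte[] values, each document gets a set of ordinals
    doc.add(new SortedSetDocValuesField("tags", new BytesRef("landscape")));
    doc.add(new SortedSetDocValuesField("tags", new BytesRef("night")));
    // NUMERIC: up to 64 bits per document, compressed
    doc.add(new NumericDocValuesField("brightness", 42L));
    return doc;
  }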


Re: Is there a way to store binary data (byte[]) in DocValues?

2013-08-12 Thread Robert Muir
On Mon, Aug 12, 2013 at 12:25 PM, Mathias Lux m...@itec.uni-klu.ac.at wrote:

 Another thing for not using the the SORTED_SET and SORTED
 implementations is, that Solr currently works with Strings on that and
 I want to have a small memory footprint for millions of images ...
 which does not go well with immutables.

Just as a side note, again these work with byte[]. It happens to be
the case that solr uses these for its StringField (converting the
strings to bytes), but if you wanted to use these with BinaryField you
could (they just take BytesRef).


Re: Purging unused segments.

2013-08-09 Thread Robert Muir
On Fri, Aug 9, 2013 at 7:48 PM, Erick Erickson erickerick...@gmail.com wrote:

 So is there a good way, without optimizing, to purge any segments not
 referenced in the segments file? Actually I doubt that optimizing would
 even do it if I _could_, any phantom segments aren't visible from the
 segments file anyway...


I dont know why you have these files (windows? deletion policy?) but
maybe you are interested in this:

http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/IndexWriter.html#deleteUnusedFiles%28%29


Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Robert Muir
On Mon, Aug 5, 2013 at 11:42 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : I agree with you, 0xfffe is a special character, that is why I was asking
 : how it's handled in solr.
 : In my document, 0xfffe does not appear at the beginning, it's in the
 : content.

 Unless i'm missunderstanding something (and it's very likely that i am)...

 0xfffe is not a special character -- it is explicitly *not* a character in
 Unicode at all, it is set aside as not a character, specifically so
 that the character 0xfeff can be used as a BOM, and if the BOM is read
 incorrectly, it will cause an error.

XML doesnt allow control character like this, it defines character as:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate
blocks, FFFE, and FFFF. */


Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Robert Muir
On Mon, Aug 5, 2013 at 3:03 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 :  0xfffe is not a special character -- it is explicitly *not* a character in
 :  Unicode at all, it is set asside as not a character. specifically so
 :  that the character 0xfeff can be used as a BOM, and if the BOM is read
 :  incorrectly, it will cause an error.
 :
 : XML doesnt allow control character like this, it defines character as:

 But is that even relevant?  I thought FFFE was *not* a control character?
 I thought it was completely invalid in Unicode.


its totally relevant. FFFE is a unicode codepoint, but its a noncharacter.

It's just that XML disallows the FFFE and FFFF noncharacters, but allows
other noncharacters (like 9)
These are allowed but discouraged: http://www.w3.org/TR/xml11/#charsets


Re: WikipediaTokenizer for Removing Unnecesary Parts

2013-07-23 Thread Robert Muir
If you use wikipediatokenizer it will tag different wiki elements with
different types (you can see it in the admin UI).

so then followup with typetokenfilter to only filter the types you care
about, and i think it will do what you want.
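
A rough analyzer sketch of that idea (the types file name is a placeholder;
the useWhitelist flag, if available in your version, keeps only the listed
types instead of removing them, and the admin UI analysis page shows which
type names the tokenizer emits):

  <fieldType name="text_wiki" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WikipediaTokenizerFactory"/>
      <!-- keep only the token types you care about -->
      <filter class="solr.TypeTokenFilterFactory" types="wiki-keep-types.txt" useWhitelist="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>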

On Tue, Jul 23, 2013 at 7:53 AM, Furkan KAMACI furkankam...@gmail.comwrote:

 Hi;

 I have indexed wikipedia data with Solr DIH. However when I look data that
 is indexed at Solr I something like that as well:

 {| style=text-align: left; width: 50%; table-layout: fixed; border=0
 |- valign=top
 | style=width: 50%|
 :*[[Ubuntu]]
 :*[[Fedora]]
 :*[[Mandriva]]
 :*[[Linux Mint]]
 :*[[Debian]]
 :*[[OpenSUSE]]
 |
 *[[Red Hat]]
 *[[Mageia]]
 *[[Arch Linux]]
 *[[PCLinuxOS]]
 *[[Slackware]]
 |}

 However I want to remove them before indexing. I know that there is a
 WikipediaTokenizer in Lucene but how can I remove unnecessary parts ( as
 like links, style, etc..) with Solr?



Re: Using per-segment FieldCache or DocValues in custom component?

2013-07-02 Thread Robert Muir
Where do you get the docid from? Usually its best to just look at the whole
algorithm, e.g. docids come from per-segment readers by default anyway so
ideally you want to access any per-document things from that same
segmentreader.

As far as supporting docvalues, FieldCache API passes thru to docvalues
transparently if its enabled for the field.
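
A rough sketch of that per-segment lookup, using the field name from the
question (everything else here is schematic):

  import java.io.IOException;
  import org.apache.lucene.index.AtomicReaderContext;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.FieldCache;

  // Resolve a top-level docId to its segment, then hit that segment's FieldCache entry.
  static long lookupLong(IndexReader topReader, int topLevelDocId) throws IOException {
    for (AtomicReaderContext leaf : topReader.leaves()) {
      int localDocId = topLevelDocId - leaf.docBase;
      if (localDocId >= 0 && localDocId < leaf.reader().maxDoc()) {
        FieldCache.Longs values =
            FieldCache.DEFAULT.getLongs(leaf.reader(), "foobar", false);
        return values.get(localDocId);
      }
    }
    throw new IllegalArgumentException("docId out of range: " + topLevelDocId);
  }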

On Mon, Jul 1, 2013 at 4:55 PM, Michael Ryan mr...@moreover.com wrote:

 I have some custom code that uses the top-level FieldCache (e.g.,
 FieldCache.DEFAULT.getLongs(reader, foobar, false)). I'd like to redesign
 this to use the per-segment FieldCaches so that re-opening a Searcher is
 fast(er). In most cases, I've got a docId and I want to get the value for a
 particular single-valued field for that doc.

 Is there a good place to look to see example code of per-segment
 FieldCache use? I've been looking at PerSegmentSingleValuedFaceting, but
 hoping there might be something less confusing :)

 Also thinking DocValues might be a better way to go for me... is there any
 documentation or example code for that?

 -Michael



Re: Are there any plans to change example directory layout?

2013-06-11 Thread Robert Muir
If you have a good idea... Just do it. Open an issue
On Jun 11, 2013 9:34 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

 I think it is quite hard for beginners that basic solr example
 directory is competing for attention with other - nested - examples. I
 see quite a lot of questions on which directory inside 'example' to
 pay attention to and which to ignore, etc.

 Actually, this is so confusing, I am not even sure how to put this in
 writing.

 Basically, is anybody aware of people looking into example directory
 structure? A JIRA maybe?

 Regards,
Alex.
 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)



Re: Requesting to add into a Contributor Group

2013-05-05 Thread Robert Muir
done. let us know if you have any problems.

On Sat, May 4, 2013 at 10:12 AM, Krunal jariwalakru...@gmail.com wrote:

 Dear Sir,

 Kindly add me to the contributor group to help me contribute to the Solr
 wiki.

 My Email id: jariwalakru...@gmail.com
 Login Name: Krunal

 Specific changes I would like to make to begin with are:

 - Correct Link of Ajax Solr here http://wiki.apache.org/solr/SolrJS which
 is wrong, the correct link should be
 https://github.com/evolvingweb/ajax-solr/wiki

 - Add our company data here http://wiki.apache.org/solr/Support

 We offer Solr integration service on Dot Net Platform at Xcellence-IT.

 And business division of ours, i.e. nopAccelerate - offers a Solr
 Integration Plugin for nopCommerce along with other nopCommerce performance
 optimization services.


 We have been working on Solr since last 1 years and will be happy to
 contribute back by helping community maintain  update Wiki. If this is not
 allowed, then kindly let us know so I will send you our Company details so
 you can make changes too.

 Thanks,

 Awaiting your response.

 Krunal

 *Krunal Jariwala*


 *Cell:* +91-98251-07747

 *Best time to Call:* 9am to 7pm (IST) GMT +5.30



Re: Solr using a ridiculous amount of memory

2013-03-24 Thread Robert Muir
On Sun, Mar 24, 2013 at 4:19 AM, John Nielsen j...@mcb.dk wrote:

 Schema with DocValues attempt at solving problem:
 http://pastebin.com/Ne23NnW4
 Config: http://pastebin.com/x1qykyXW


This schema isn't using docvalues, due to a typo in your config.
it should not be DocValues=true but docValues=true.

Are you not getting an error? Solr needs to throw exception if you
provide invalid attributes to the field. Nothing is more frustrating
than having a typo or something in your configuration and solr just
ignores this, reports no error, and doesnt work the way you want.
I'll look into this (I already intend to add these checks to analysis
factories for the same reason).

Separately, if you really want the terms data and so on to remain on
disk, it is not enough to just enable docvalues for the field. The
default implementation uses the heap. So if you want that, you need to
set docValuesFormat=Disk on the fieldtype. This will keep the
majority of the data on disk, and only some key datastructures in heap
memory. This might have significant performance impact depending upon
what you are doing so you need to test that.
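
A rough schema sketch of both points (field and type names are placeholders;
per-field docValuesFormat also needs the schema codec factory enabled in
solrconfig.xml):

  <!-- solrconfig.xml -->
  <codecFactory class="solr.SchemaCodecFactory"/>

  <!-- schema.xml: note the lowercase docValues attribute -->
  <fieldType name="string_dv_disk" class="solr.StrField" docValuesFormat="Disk"/>
  <field name="category" type="string_dv_disk" indexed="true" stored="false" docValues="true"/>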


Re: Fuzzy Suggester and exactMatchFirst

2013-03-18 Thread Robert Muir
On Sun, Mar 17, 2013 at 8:19 PM, Eoghan Ó Carragáin
eoghan.ocarrag...@gmail.com wrote:

 I can see why the Fuzzy Suggester sees college as a match for colla but
 expected the exactMatchFirst parameter to ensure that suggestions beginning
 with colla to be weighted higher than fuzzier matches. I
 have spellcheck.onlyMorePopular set to true, in case this makes a
 difference.

 Am I misunderstanding what exactMatchFirst is supposed to do? Is there a
 way to ensure suggestions matching exactly what the user has entered rank
 higher than fuzzy matches?


I think exactMatchFirst is unrelated to typo-correction: it only
ensures that if you type the whole suggestion exactly that the weight
is completely ignored.
This means if you type 'college' and there is an actual suggestion of
'college' it will be weighted above 'colleges' even if colleges has a
much higher weight.

On the other hand what you want (i think) is to punish the weights of
suggestions that required some corrections. Currently I don't think
there is any way to do that:

 * NOTE: This suggester does not boost suggestions that
 * required no edits over suggestions that did require
 * edits.  This is a known limitation.

I think the trickiest part about this is how the punishment formula
should work. Because today this thing makes no assumptions as to how
you came up with your suggestion weights...

But feel free to open a JIRA issue if you have ideas !


Re: Out of Memory doing a query Solr 4.2

2013-03-15 Thread Robert Muir
On Fri, Mar 15, 2013 at 6:46 AM, raulgrande83 raulgrand...@hotmail.com wrote:
 Thank you for your help. I'm afraid it won't be so easy to change the JVM
 version, because it is required at the moment.

 It seems that Solr 4.2 supports Java 1.6 at least. Is that correct?

 Could you find any clue of what is happening in the attached traces? It
 would be great to know why it is happening now, because it was working for
 Solr 3.5.

Its probably not an OOM at all. instead its more likely IBM JVM is
probably miscompiling our code and producing large integers, like it
does quite often. For example, we had to disable testing it completely
recently for this reason. If someone were to report a JIRA issue that
mentioned IBM, I'd make the same comment there but in general not take
it seriously at all due to the kind of bugs i've seen from that JVM.

The fact that IBM JVM didnt miscompile 3.5's code is irrelevant.


Re: Out of Memory doing a query Solr 4.2

2013-03-14 Thread Robert Muir
On Thu, Mar 14, 2013 at 12:07 PM, raulgrande83 raulgrand...@hotmail.com wrote:
 JVM: IBM J9 VM(1.6.0.2.4)

I don't recommend using this JVM.


Re: Using suggester for smarter phrase autocomplete

2013-03-13 Thread Robert Muir
On Wed, Mar 13, 2013 at 11:07 AM, Eric Wilson wilson.eri...@gmail.com wrote:
 I'm trying to use the suggester for auto-completion with Solr 4. I have
 followed the example configuration for phrase suggestions at the bottom of
 this wiki page:
 http://wiki.apache.org/solr/Suggesterhttps://mail.manta.com/owa/redir.aspx?C=a570b5bb74f64f4fb810ba260e304ec5URL=http%3a%2f%2fwiki.apache.org%2fsolr%2fSuggester

 This shows how to use a text file with the following text for phrase
 suggestions:

 # simple auto-suggest phrase dictionary for testing
 # note this uses tabs as separator!
 the first phrase	1.0
 the second phrase	2.0
 testing 1234	3.0
 foo	5.0
 the fifth phrase	2.0
 the final phrase	4.0

 This seems to be working in the expected way. If I query for "the f" I
 receive the following suggestions:

  <str>the final phrase</str>
  <str>the fifth phrase</str>
  <str>the first phrase</str>

 I would like to deal with the case where the user is interested in "the
 foo". When "the fo" is entered, there will be no suggestions. Is it
 possible to provide both the phrase matches, and the matches for individual
 words, so that when the user-entered text is no longer part of any actual
 phrase, there are still suggestions to be made for the final word?


Is it really the case that you want matches for individual words, or
just to handle e.g. the stopwords case like 'the fo' -> foo?

The latter can be done with AnalyzingSuggester (configure a StopFilter
on the analyzer).
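
A minimal sketch of that idea (my example, not code from Lucene), assuming the
4.0 analysis APIs: an analyzer that drops English stopwords, which you would
hand to AnalyzingSuggester so that a query like 'the fo' reduces to 'fo' and
can still prefix-match 'foo':

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class StopwordSuggestAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
    // drop stopwords such as 'the' so they don't block prefix matching
    TokenStream filtered = new StopFilter(Version.LUCENE_40, source, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    return new TokenStreamComponents(source, filtered);
  }
}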


Re: It seems a issue of deal with chinese synonym for solr

2013-03-12 Thread Robert Muir
I agree. Actually that top-level logic is fine. It's the loop that
follows that's wrong: it needs to look at position increment and do the
right thing.

Want to open a JIRA issue?
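
Roughly, the idea would be something like this (a hedged sketch of the
approach, not the actual SolrQueryParserBase code): tokens whose position
increment is 0 are synonyms stacked on the previous token and should be OR'd
together, while each new position becomes its own required clause.

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class PositionAwareFieldQuery {
  public static Query build(Analyzer analyzer, String field, String text) throws IOException {
    BooleanQuery top = new BooleanQuery();
    BooleanQuery current = null;
    TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      if (posIncAtt.getPositionIncrement() > 0 || current == null) {
        current = new BooleanQuery();             // new position: new required group
        top.add(current, BooleanClause.Occur.MUST);
      }
      // a stacked synonym (increment 0) joins the current group as SHOULD
      current.add(new TermQuery(new Term(field, termAtt.toString())), BooleanClause.Occur.SHOULD);
    }
    ts.end();
    ts.close();
    return top;
  }
}

With 北京市/北京 as synonyms, 北京市动物园 would then come out as +(北京市 北京) +动物园
rather than +北京市 +北京 +动物园.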

On Mon, Mar 11, 2013 at 9:15 PM, 李威 li...@antvision.cn wrote:
 in org.apache.solr.parser.SolrQueryParserBase, there is a function: 
 protected Query newFieldQuery(Analyzer analyzer, String field, String 
 queryText, boolean quoted)  throws SyntaxError

 The below code can't process chinese rightly.

   BooleanClause.Occur occur = positionCount > 1 && operator ==
 AND_OPERATOR ?
 BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD;

 

 For example, "北京市" and "北京" are synonyms. If I search 北京市动物园, the expected
 parse result is +(北京市 北京) +动物园, but actually it would be parsed to +北京市
 +北京 +动物园.

 The code can process English, because English words are separated by spaces and
 each has only one position.

 In order to process Chinese, I think it should decide by position increment, but
 not by position count.

 Could you help take a look?




 Thanks,

 Wei Li


[ANNOUNCE] Apache Solr 4.2 released

2013-03-11 Thread Robert Muir
March 2013, Apache Solr™ 4.2 available
The Lucene PMC is pleased to announce the release of Apache Solr 4.2

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search.  Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.2 is available for immediate download at:
   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of details.

Solr 4.2 Release Highlights:

* A read side REST API for the schema. Always wanted to introspect the
schema over http? Now you can. Looks like the write side will be
coming next.

* DocValues have been integrated into Solr. DocValues can be loaded up
a lot faster than the field cache and can also use different
compression algorithms as well as in RAM or on Disk representations.
Faceting, sorting, and function queries all get to benefit. How about
the OS handling faceting and sorting caches off heap? No more tuning
60 gigabyte heaps? How about a snappy new per segment DocValues
faceting method? Improved numeric faceting? Sweet.

* Collection Aliasing. Got time based data? Want to re-index in a
temporary collection and then swap it into production? Done. Stay
tuned for Shard Aliasing.

* Collection API responses. The collections API was still very new in
4.0, and while it improved a fair bit in 4.1, responses were certainly
needed, but missed the cut off. Initially, we made the decision to
make the Collection API super fault tolerant, which made responses
tougher to do. No one wants to hunt through logs files to see how
things turned out. Done in 4.2.

* Interact with any collection on any node. Until 4.2, you could only
interact with a node in your cluster if it hosted at least one replica
of the collection you wanted to query/update. No longer - query any
node, whether it has a piece of your intended collection or not and
get a proxied response.

* Allow custom shard names so that new host addresses can take over
for retired shards. Working on Amazon without elastic ips? This is for
you.

* Lucene 4.2 optimizations such as compressed term vectors.

Solr 4.2 also includes many other new features as well as numerous
optimizations and bugfixes.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases.  It is possible that the mirror you
are using may not have replicated the release yet.  If that is the
case, please try another mirror.  This also goes for Maven access.

Happy searching,
Lucene/Solr developers


Re: MockAnalyzer in Lucene: attach stemmer or any custom filter?

2013-02-15 Thread Robert Muir
For 3.4, extend ReusableAnalyzerBase

On Fri, Feb 15, 2013 at 12:06 PM, Dmitry Kan solrexp...@gmail.com wrote:
 Thanks a lot, Robert.

 I need to study a bit more closely the link you have sent. I have tried to
 override the Analyzer class, but couldn't find a method 
 createComponents(String
 fieldName,Reader reader) in LUCENE_34. Instead, there is a method required
 to override: tokenStream(String fieldName, Reader reader). Is there a way
 of incorporating the custom filter into the TokenStream?


 Dmitry

 On Thu, Feb 14, 2013 at 5:37 PM, Robert Muir rcm...@gmail.com wrote:

 MockAnalyzer is really just MockTokenizer+MockTokenFilter+

 Instead you just define your own analyzer chain using MockTokenizer.
 This is the way all lucene's own analysis tests work: e.g.

 http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/test/org/apache/lucene/analysis/en/TestEnglishMinimalStemFilter.java

 On Thu, Feb 14, 2013 at 7:40 AM, Dmitry Kan solrexp...@gmail.com wrote:
  Hello,
 
  Asked a question on SO:
 
 
 http://stackoverflow.com/questions/14873207/mockanalyzer-in-lucene-attach-stemmer-or-any-custom-filter
 
  Is there a way to configure a stemmer or a custom filter with the
  MockAnalyzer class?
  Version: LUCENE_34
 
  Dmitry



Re: MockAnalyzer in Lucene: attach stemmer or any custom filter?

2013-02-14 Thread Robert Muir
MockAnalyzer is really just MockTokenizer+MockTokenFilter+

Instead you just define your own analyzer chain using MockTokenizer.
This is the way all lucene's own analysis tests work: e.g.
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/test/org/apache/lucene/analysis/en/TestEnglishMinimalStemFilter.java

On Thu, Feb 14, 2013 at 7:40 AM, Dmitry Kan solrexp...@gmail.com wrote:
 Hello,

 Asked a question on SO:

 http://stackoverflow.com/questions/14873207/mockanalyzer-in-lucene-attach-stemmer-or-any-custom-filter

 Is there a way to configure a stemmer or a custom filter with the
 MockAnalyzer class?
 Version: LUCENE_34

 Dmitry


Re: Exception when trying to save to a field with storeOffsetsWithPositions=true

2013-01-22 Thread Robert Muir
On Tue, Jan 22, 2013 at 12:23 PM, Meng Muk
meng@uniqueinteractive.com wrote:

 If I set the field type to text_en however it works, I'm guessing
 something in the way the text is being analyzed is causing this exception
 to appear? Is there a limitation in how storeOffsetsWithPositions should be
 used?


IndexWriter will refuse broken offsets up-front if you use this
feature: it's strict about this and will throw an exception at
index-time if the analyzer is broken.

You can see the list of broken analysis components here:
https://issues.apache.org/jira/browse/LUCENE-4641

If you really want to use one of these broken analysis components, you
can use another highlighter, but it probably just means you won't see
these analyzer bugs until search time (InvalidTokenOffsetsExceptions
and so on).


[ANNOUNCE] Apache Solr 3.6.2 released

2012-12-25 Thread Robert Muir
25 December 2012, Apache Solr™ 3.6.2 available

The Lucene PMC and Santa Claus are pleased to announce the release of
Apache Solr 3.6.2.

Solr is the popular, blazing fast open source enterprise search
platform from the Apache Lucene project. Its major features include
powerful full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
distributed search and index replication, and it powers the search and
navigation features of many of the world's largest internet sites.

This release is a bug fix release for version 3.6.1. It contains
numerous bug fixes, optimizations, and improvements, some of which are
highlighted below.  The release is available for immediate download
at: http://lucene.apache.org/solr/mirrors-solr-3x-redir.html (see note
below).

See the CHANGES.txt file included with the release for a full list of details.

Solr 3.6.2 Release Highlights:

 * Fixed ConcurrentModificationException during highlighting, if all
fields were requested.

 * Fixed edismax queryparser to apply minShouldMatch to implicit
boolean queries.

 * Several bugfixes to the DataImportHandler.

 * Bug fixes from Apache Lucene 3.6.2.

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases.  It is possible that the mirror you
are using may not have replicated the release yet.  If that is the
case, please try another mirror.  This also goes for Maven access.

Happy holidays and happy searching,

Lucene/Solr developers


Re: Japanese exact match results do not show on top of results

2012-12-20 Thread Robert Muir
I think you are hitting SOLR-3589. There is a vote underway for a 3.6.2
that contains this fix.
On Dec 20, 2012 6:29 PM, kirpakaro khem...@yahoo.com wrote:

 Hi folks,

 I am having a couple of problems with Japanese data: 1. it is not
 properly indexing all the data; 2. displaying the exact match result on top
 and then 90% match and 80% match etc. does not work.
  I am using solr3.6.1 and using text_ja as the fieldType; here is the schema:


<field name="q" type="text_ja" indexed="true" stored="true" />
<field name="qs" type="text_general" indexed="false" stored="true"
 multiValued="true"/>
<field name="q_e" type="string" indexed="true" stored="true" />

  <copyField source="q" dest="q_e" maxChars="250"/>

 what I want to achieve is that if there is an exact query match it should
 provide the results from q_e followed by results from partial match from q
 field and if there is nothing in q_e field then partial matches should come
 from q field.  This is how I specify the query

 http://localhost:7983/zoom/jp/select/?q=鹿児島
 鹿児島銀行&rows=10&version=2.2&qf=query+query_exact^1&mm=90%25&pf=q^1+q_e^10
 OR
 version=2.2&rows=10&qf=q+q_e^1&pf=query^10+query_exact^1

 somehow the exact query matches results do not come on top, though the data
 contains it. It is puzzling that all the documents do not get indexed
 properly, but if I change the q field to string and q_e to text_ja then all
 the records are indexed properly, but that still does not solve the problem
 of exact match on top followed by partial matches.

 text_ja field uses:
 <filter class="solr.JapaneseBaseFormFilterFactory"/>
 <filter class="solr.JapanesePartOfSpeechStopFilterFactory"
 tags="../../../solr/conf/lang/stoptags_ja.txt"
 enablePositionIncrements="true"/>
   <filter class="solr.CJKWidthFilterFactory"/>
 <filter class="solr.StopFilterFactory" ignoreCase="true"
 words="../../../solr/conf/lang/stopwords_ja.txt"
 enablePositionIncrements="true" />
  <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
   <filter class="solr.LowerCaseFilterFactory"/>

  How to solve this problem,

 Thanks













Re: ICUTokenizer labels number as Han character?

2012-12-19 Thread Robert Muir
Your attachment didn't come through: I think the list strips them.
Maybe just open a JIRA and attach your screenshots? Or put them
elsewhere and just include a link?

As far as the ultimate behavior goes, I think it's correct. Keep in mind
tokens don't really get a script value: runs of untokenized text do.
Common is stuff like numbers/punctuation/etc. that just keeps the run
whatever it was before (e.g. Han).

And the bigram filter only bigrams text with certain token types (NUM
is not one of them), so making a singleton is correct.

On Wed, Dec 19, 2012 at 5:10 PM, Tom Burton-West tburt...@umich.edu wrote:
 Hello,

 Don't know if the Solr admin panel is lying, or if this is a weird bug.
 The string: 1986年  gets analyzed by the ICUTokenizer with 1986 being
 identified as type:NUM and script:Han.  Then the CJKBigram filter identifies
 1986 as type:Num and script:Han and 年 as type:Single and script: Common.

 This doesn't seem right.   Couldn't fit the whole analysis output on one
 screen so there are two screenshots attached.

 Any clues as to what is going on and whether it is a problem?

 Tom


Re: order question on solr multi value field

2012-12-18 Thread Robert Muir
I agree with James. Actually lucene tests will fail if a codec violates this.

Actually it goes much deeper than this.

From the Lucene APIs, when you call IndexReader.document() with your
StoredFieldVisitor, it must visit the fields in the original order they were
added.

so even if you do:

add(title, title value 1);
add(body, body value);
add(title, title value 2);

Currently stored fields must be returned in exactly this order:
title1, then body, then title2.

This is pretty annoying :) I don't think it's truly necessary to
maintain that crazy guarantee, but I'm pretty sure something tests it
somewhere. In my opinion it's too restrictive and prevents useful
optimizations.

But in my opinion, title1 should always come back before title2 in the
order you added them just like today.
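
Just to illustrate the guarantee being discussed (a small self-contained
sketch against the 4.x APIs, not test code from Lucene itself):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class StoredOrderDemo {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));
    Document doc = new Document();
    doc.add(new TextField("title", "title value 1", Field.Store.YES));
    doc.add(new TextField("body", "body value", Field.Store.YES));
    doc.add(new TextField("title", "title value 2", Field.Store.YES));
    writer.addDocument(doc);
    writer.close();

    DirectoryReader reader = DirectoryReader.open(dir);
    // expected: "title value 1" before "title value 2", i.e. insertion order
    for (String title : reader.document(0).getValues("title")) {
      System.out.println(title);
    }
    reader.close();
  }
}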

On Tue, Dec 18, 2012 at 10:54 AM, Dyer, James
james.d...@ingramcontent.com wrote:
 I would say such a guarantee is implied by the javadoc to 
 Analyzer#getPositionIncrementGap .  It says this value is an increment to 
 be added to the next token emitted from tokenStream.

 http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/analysis/Analyzer.html#getPositionIncrementGap%28java.lang.String%29

 Also compare unofficial documentation such as Lucene In Action 2nd ed, 
 section 4.7.1:  Lucene logically appends the tokens...sequentially.

 Having multi-valued fields stay in the order in which they were added to the 
 Document is a guarantee that many many users depend on.

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Jack Krupansky [mailto:j...@basetechnology.com]
 Sent: Tuesday, December 18, 2012 9:30 AM
 To: solr-user@lucene.apache.org
 Subject: Re: order question on solr multi value field

 If there is no official guarantee in the Javadoc for the code then there
 is no official guarantee. Period. If somebody wants an official, contractual
 guarantee, a Jira should be filed to do so. To put it simply, are the values
 a list or a set?

 -- Jack Krupansky

 -Original Message-
 From: Erik Hatcher
 Sent: Tuesday, December 18, 2012 9:40 AM
 To: solr-user@lucene.apache.org
 Cc: solr-user@lucene.apache.org
 Subject: Re: order question on solr multi value field

 I don't know of an official guarantee of maintaining order but it's
 definitely guaranteed and relied upon to retain order.  Many will scream if
 this changed.

 Indexed doesn't matter here because what you get back are the stored values
 no matter if the field is indexed or not.

Erik

 On Dec 18, 2012, at 3:04, hellorsanjeev sanjeev.dhi...@3pillarglobal.com
 wrote:

 thank you for quick response :)

 I also have the same observation and I too believe that there is no reason
 for Solr to reorder a multi value field.

 But would you stay firm on your conclusion if I say that my multi value
 field was indexed?

 Please note - as per my one year of experience with Solr, it always returned
 the values in the insertion order irrespective of whether the field was
 indexed or not.

 My main concern is because I couldn't find it documented anywhere, it
 might
 happen that in Solr 4.0 or later, they start reordering them. If they do
 then there will be a big problem for us :)








Re: Regexp and speed

2012-11-30 Thread Robert Muir
On Fri, Nov 30, 2012 at 12:13 PM, Roman Chyla roman.ch...@gmail.com wrote:


 The code here:

 https://github.com/romanchyla/montysolr/blob/solr-trunk/contrib/adsabs/src/test/org/adsabs/lucene/BenchmarkAuthorSearch.java

 The benchmark should probably not be called 'benchmark', do you think it
 may be too simplistic? Can we expect some bad surprises somewhere?


I think maybe a few surprises: since it extends LuceneTestCase and uses
RandomIndexWriter, newSearcher and so on, the benchmark results can be
confusing.

This stuff is fantastic to use for tests, but for benchmarks it may cause
confusion.

For example you might run it and it gets the SimpleText codec, maybe wraps the
IndexSearcher with slow things like ParallelReader, and maybe you get
horrific merge parameters, and so on.


Re: Skewed IDF in multi lingual index

2012-11-26 Thread Robert Muir
Hi again Markus. Sorry for the slow reply here.

I'm confused: are you saying the score goes negative? Are you sure there is
no 3.x segments? Can you check that docCount is not -1? Do you happen to
have a test, can you share your modified similarity, or give more details?

I just want to make sure there isn't a bug in lucene here (we verify this
statistic currently in checkindex and other places, but there is always the
possibility)

On Mon, Nov 12, 2012 at 7:39 AM, Markus Jelsma
markus.jel...@openindex.io wrote:

 I'd like to add that multiplicative boosting on very scarce properties,
 e.g. you want to boost on a boolean value of which there are only very few,
 causes a problem in scoring when using docCount instead of maxDoc. If
 docCount is one IDF will be ~0.3, with the fieldWeight you'll end up with a
 score below 0. Because of this the product of all multiplicative boosts
 will be lower than the product of boosts similar boosts, lowering the
 document in rank instead of boosting it.

 -Original message-
  From:Markus Jelsma markus.jel...@openindex.io
  Sent: Fri 09-Nov-2012 10:23
  To: solr-user@lucene.apache.org
  Subject: RE: Skewed IDF in multi lingual index
 
  Robert, Tom,
 
  That's it indeed! Using maxDoc as numerator opposed to docCount yields
 very skewed results for an unevenly distributed multi-lingual index. We
 have one language dominating the other twenty so the dominating language
 contains no rare terms compared to the others.
 
  We're now checking results using docCount and it seems alright. I do
 have to get used to the fact that document scores are now roughly 1000
 times higher than before but i'm already very happy with
 CollectionStatistics and will see if all works well.
 
  Any other tips to share?
 
  Thanks,
  Markus
 
 
 
  -Original message-
   From:Robert Muir rcm...@gmail.com
   Sent: Thu 08-Nov-2012 17:44
   To: solr-user@lucene.apache.org
   Subject: Re: Skewed IDF in multi lingual index
  
   Hi Markus: how are the languages distributed across documents?
  
   Imagine I have a text_en field and a text_fr field. Lets say I have
   100 documents, 95 are english and only 5 are french.
   So the text_en field is populated 95% of the time, and the text_fr 5%
   of the time.
  
   But the default IDF computation doesnt look at things this way: it
   always uses '100' as maxDoc. So in such a situation, any terms against
   text_fr are rare :)
  
   The first thing i would look at, is treating this situation as merging
   results from a english index with 95 docs and a french index with 5
   docs.
   So I would consider overriding the two idfExplain methods (term and
   phrase) to use CollectionStatistics.docCount() instead of
   CollectionStatistics.maxDoc()
   The former would be 95 for the english field (instead of 100), and 5
   for the french field (instead of 100).
  
   I dont think this will solve all your problems: but it might help.
  
   Note: you must ensure your index is fully upgraded to 4.0 to try this
   statistic, otherwise it will return -1 if you have any 3.x segments in
   your index.
  
   On Thu, Nov 8, 2012 at 11:13 AM, Markus Jelsma
   markus.jel...@openindex.io wrote:
Hi,
   
We're testing a large multi lingual index with _LANG fields for each
 language and using dismax to query them all. Users provide, explicit or
 implicit, language preferences that we use for either additive or
 multiplicative boosting on the language of the document. However, additive
 boosting is not adequate because it cannot overcome the extremely high IDF
 values for the same word in another language so regardless of the the
 preference, foreign documents are returned. Multiplicative boosting solves
 this problem but has the other downside as it doesn't allow us with
 standard qf=field^boost to prefer documents in another language above the
 preferred language because the multiplicative is so strong. We do use the
 def function (boost=def(query($qq),.3)) to prevent one boost query to
 return 0 and thus a product of 0 for all boost queries. But it doesn't help
 that much
   
This all comes down to IDF differences between the languages, even
 common words such as country names like `india` show large differences in
 IDF. Is here anyone with some hints or experiences to share about skewed
 IDF in such an index?
   
Thanks,
Markus
  
 



Re: Error loading class solr.CJKBigramFilterFactory

2012-11-14 Thread Robert Muir
On Wed, Nov 14, 2012 at 8:12 AM, Frederico Azeiteiro
frederico.azeite...@cision.com wrote:
 To do some further testing I installed SOLR 3.5.0 using the default Jetty
 server.

 When tried to start SOLR using the same schema I get:



 SEVERE: org.apache.solr.common.SolrException: Error loading class
 'solr.CJKBigramFilterFactory'

This filter was added in 3.6, so it's expected that it wouldn't be found.


Re: Error loading class solr.CJKBigramFilterFactory

2012-11-14 Thread Robert Muir
I'm sure. I added it to 3.6 ;)

You must have something funky with your tomcat configuration, like an
exploded war with different versions of jars or some other form of jar
hell.

On Wed, Nov 14, 2012 at 9:32 AM, Frederico Azeiteiro
frederico.azeite...@cision.com wrote:
 Are you sure about that?

 We have it working on:

 Solr Specification Version: 3.5.0.2011.11.22.14.54.38
 Solr Implementation Version: 3.5.0 1204988 - simon - 2011-11-22 14:54:38
 Lucene Specification Version: 3.5.0
 Lucene Implementation Version: 3.5.0 1204988 - simon - 2011-11-22 14:46:51
 Current Time: Wed Nov 14 17:30:07 WET 2012
 Server Start Time:Wed Nov 14 11:40:36 WET 2012

 ??

 Thanks,
 Frederico


 -Mensagem original-
 De: Robert Muir [mailto:rcm...@gmail.com]
 Enviada: quarta-feira, 14 de Novembro de 2012 16:28
 Para: solr-user@lucene.apache.org
 Assunto: Re: Error loading class solr.CJKBigramFilterFactory

 On Wed, Nov 14, 2012 at 8:12 AM, Frederico Azeiteiro 
 frederico.azeite...@cision.com wrote:
 Fo make some further testing I installed SOLR 3.5.0 using default
 Jetty server.

 When tried to start SOLR using the same schema I get:



 SEVERE: org.apache.solr.common.SolrException: Error loading class
 'solr.CJKBigramFilterFactory'

 This filter was added in 3.6, so its expected that it wouldnt be found.


Re: Does ICUFoldingFilterFactory make CJKWidthFilterFactory unnecessary?

2012-11-14 Thread Robert Muir
Yes, its a subset
On Nov 14, 2012 1:18 PM, Shawn Heisey s...@elyograg.org wrote:

 I am using ICUFoldingFilterFactory in my Solr schema.  Now I am looking at
 adding CJKBigramFilterFactory, and I've noticed that it often goes with
 CJKWidthFilterFactory.  Here are the relevant Javadocs for my question:

  http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html
  http://lucene.apache.org/core/4_0_0/analyzers-icu/org/apache/lucene/analysis/icu/ICUFoldingFilter.html

 The descriptions of these two classes suggest that if I already have
 ICUFoldingFilter, I do not need CJKWidthFilter.  Do I have that right or
 wrong?

 Thanks,
 Shawn




Re: URL parameters to use FieldAnalysisRequestHandler

2012-11-13 Thread Robert Muir
I think the UI uses this behind the scenes, as in there is no more
analysis.jsp like before?

So maybe try using something like burpsuite and just using the
analysis UI in your browser to see what requests it's sending.
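
If I remember the parameter names right (worth verifying against whatever the
admin UI actually sends), the handler expects analysis.fieldname or
analysis.fieldtype plus analysis.fieldvalue, rather than name/q, e.g.:

  .../analysis/field?analysis.fieldname=title&analysis.fieldvalue=fire-fly&analysis.query=fire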

On Tue, Nov 13, 2012 at 11:00 AM, Tom Burton-West tburt...@umich.edu wrote:
 Hello,

 I  would like to send a request to the FieldAnalysisRequestHandler.  The
 javadoc lists the parameter names such as analysis.field, but sending those
 as URL parameters does not seem to work:

 mysolr.umich.edu/analysis/field?analysis.name=title&q=fire-fly

 leaving out the analysis doesn't work either:

 mysolr.umich.edu/analysis/field?name=title&q=fire-fly

 No matter what field I specify, the analysis returned is for the default
 field. (See repsonse excerpt below)

 Is there a page somewhere that shows the correct syntax for sending get
 requests to the FieldAnalysisRequestHandler?

 Tom

 
 <lst name="analysis">
 <lst name="field_types"/>
 <lst name="field_names">
 <lst name="ocr">


Re: customize solr search/scoring for performance

2012-11-12 Thread Robert Muir
Whenever I look at solr users' stacktraces for disjunctions, I always
notice they get BooleanScorer2.

Is there some reason for this, or is it not intentional (e.g. maybe an
in-order collector is always being used when it's possible, at least in
simple cases, to allow for out-of-order hits)?

When I examine test contributions from clover reports (e.g.
https://builds.apache.org/job/Lucene-Solr-Clover-4.x/49/clover-report/),
I notice that only lucene tests, and solr spellchecking tests actually
hit BooleanScorer's collect. All other solr tests hit BooleanScorer2.

If it's possible to allow an out-of-order collector in some common
cases (e.g. large disjunctions w/ minShouldMatch generated by solr
queryparsers), it could be a nice performance improvement.

On Mon, Nov 12, 2012 at 3:48 PM, jchen2000 jchen...@yahoo.com wrote:
 The following was generated from jvisualvm. Seems like the perf is related to
 scoring a lot. Any idea/pointer on how to customize that part?

 http://lucene.472066.n3.nabble.com/file/n4019850/profilingResult.png





Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters

2012-11-08 Thread Robert Muir
On Wed, Nov 7, 2012 at 11:45 AM, Daniel Brügge
daniel.brue...@googlemail.com wrote:
 Hi,

 i am running a SolrCloud cluster with the 4.0.0 version. I have a stopwords
 file
 which is in the correct encoding.

What makes you think that?

Note: "Because I can read it" is not the correct answer.

Ensure any of your stopwords files etc. are in UTF-8. This is often
different from the encoding your computer uses by default if you open
a file, start typing in it, and press save.


Re: Skewed IDF in multi lingual index

2012-11-08 Thread Robert Muir
Hi Markus: how are the languages distributed across documents?

Imagine I have a text_en field and a text_fr field. Lets say I have
100 documents, 95 are english and only 5 are french.
So the text_en field is populated 95% of the time, and the text_fr 5%
of the time.

But the default IDF computation doesn't look at things this way: it
always uses '100' as maxDoc. So in such a situation, any terms against
text_fr are rare :)

The first thing i would look at, is treating this situation as merging
results from a english index with 95 docs and a french index with 5
docs.
So I would consider overriding the two idfExplain methods (term and
phrase) to use CollectionStatistics.docCount() instead of
CollectionStatistics.maxDoc()
The former would be 95 for the english field (instead of 100), and 5
for the french field (instead of 100).

I don't think this will solve all your problems: but it might help.

Note: you must ensure your index is fully upgraded to 4.0 to try this
statistic, otherwise it will return -1 if you have any 3.x segments in
your index.
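
A hedged sketch of what that override might look like against the 4.0
Similarity APIs (treat this as an illustration to adapt, not a drop-in):

import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class PerFieldDocCountSimilarity extends DefaultSimilarity {
  @Override
  public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
    // use the number of documents that actually have this field, falling back
    // to maxDoc when docCount is unavailable (-1, e.g. with 3.x segments)
    long docCount = collectionStats.docCount();
    long numDocs = docCount == -1 ? collectionStats.maxDoc() : docCount;
    long docFreq = termStats.docFreq();
    float idf = idf(docFreq, numDocs);
    return new Explanation(idf, "idf(docFreq=" + docFreq + ", docCount=" + numDocs + ")");
  }
  // the phrase variant, idfExplain(CollectionStatistics, TermStatistics[]),
  // would need the same treatment
}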

On Thu, Nov 8, 2012 at 11:13 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Hi,

 We're testing a large multi lingual index with _LANG fields for each language 
 and using dismax to query them all. Users provide, explicit or implicit, 
 language preferences that we use for either additive or multiplicative 
 boosting on the language of the document. However, additive boosting is not 
 adequate because it cannot overcome the extremely high IDF values for the 
 same word in another language so regardless of the the preference, foreign 
 documents are returned. Multiplicative boosting solves this problem but has 
 the other downside as it doesn't allow us with standard qf=field^boost to 
 prefer documents in another language above the preferred language because the 
 multiplicative is so strong. We do use the def function 
 (boost=def(query($qq),.3)) to prevent one boost query to return 0 and thus a 
 product of 0 for all boost queries. But it doesn't help that much

 This all comes down to IDF differences between the languages, even common 
 words such as country names like `india` show large differences in IDF. Is 
 here anyone with some hints or experiences to share about skewed IDF in such 
 an index?

 Thanks,
 Markus


Re: Where can I find an example of a 4.0 contraction file?

2012-11-01 Thread Robert Muir
You have a character encoding issue: this is telling you the file is
not correctly encoded as UTF-8.
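
One quick way to check a suspect file (a standalone sketch, nothing
Solr-specific; a strict decoder throws on the first malformed byte sequence):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class CheckUtf8 {
  public static void main(String[] args) throws Exception {
    Reader r = new InputStreamReader(new FileInputStream(args[0]),
        Charset.forName("UTF-8").newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT));
    char[] buf = new char[8192];
    while (r.read(buf) != -1) {
      // a MalformedInputException here means the file is not valid UTF-8
    }
    r.close();
    System.out.println("valid UTF-8");
  }
}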

On Thu, Nov 1, 2012 at 6:11 PM, dm_tim dm_...@yahoo.com wrote:
 I should have mentioned I tried that. I get the following exception:
 SEVERE: Unable to create core: core0
 java.lang.RuntimeException: java.nio.charset.MalformedInputException: Input
 length = 1

 Any other suggestions?

 Regards,

 Tim





Re: Unable to build trunk

2012-10-31 Thread Robert Muir
You will have to use 'find' on your .ivy2!
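
Something along these lines should locate any stale lock files, assuming the
default cache location (check the list before deleting anything):

  find ~/.ivy2 -name '*.lck'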

On Wed, Oct 31, 2012 at 6:32 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Hi,

 Where is that lock file located? I triggered it again (in another contrib)
 and will trigger it again in the future and don't want to remove my ivy cache
 each time :)

 Thanks


 -Original message-
 From:Robert Muir rcm...@gmail.com
 Sent: Tue 30-Oct-2012 15:14
 To: solr-user@lucene.apache.org
 Subject: Re: Unable to build trunk

 Its not wonky. you just have to ensure you have nothing else (like
 some IDE, or build somewhere else) using ivy, then its safe to remove
 the .lck file there.

 I turned on this locking so that it hangs instead of causing cache
 corruption, but ivy only has simplelockfactory so if you ^C at the
 wrong time, it might leave a .lck file.

 On Tue, Oct 30, 2012 at 9:27 AM, Erick Erickson erickerick...@gmail.com 
 wrote:
  Not sure if it's relevant, but sometimes the ivy caches are wonky. Try
  deleting (on OS X) ~/.ivy2 recursively and building again? Of course
  your next build will download a bunch of jars...
 
  FWIW,
  Erick
 
  On Tue, Oct 30, 2012 at 5:38 AM, Markus Jelsma
  markus.jel...@openindex.io wrote:
  Hi,
 
  Since yesterday we're unable to build trunk and also a clean check out 
  from trunk. We can compile the sources but not the example or dist.
 
  It hangs on resolve and after a while prints the following:
 
  resolve:
 
  [ivy:retrieve]
  [ivy:retrieve] :: problems summary ::
  [ivy:retrieve]  WARNINGS
  [ivy:retrieve]  module not found: 
  com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
  [ivy:retrieve]   local: tried
  [ivy:retrieve]
  /home/markus/.ivy2/local/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.0.4/ivys/ivy.xml
  [ivy:retrieve]-- artifact 
  com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4!randomizedtesting-runner.jar:
  [ivy:retrieve]
  /home/markus/.ivy2/local/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.0.4/jars/randomizedtesting-runner.jar
  [ivy:retrieve]   shared: tried
  [ivy:retrieve]
  /home/markus/.ivy2/shared/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.0.4/ivys/ivy.xml
  [ivy:retrieve]-- artifact 
  com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4!randomizedtesting-runner.jar:
  [ivy:retrieve]
  /home/markus/.ivy2/shared/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.0.4/jars/randomizedtesting-runner.jar
  [ivy:retrieve]   public: tried
  [ivy:retrieve]
  http://repo1.maven.org/maven2/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.0.4/randomizedtesting-runner-2.0.4.pom
  [ivy:retrieve]   sonatype-releases: tried
  [ivy:retrieve]
  http://oss.sonatype.org/content/repositories/releases/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.0.4/randomizedtesting-runner-2.0.4.pom
  [ivy:retrieve]   working-chinese-mirror: tried
  [ivy:retrieve]
  http://mirror.netcologne.de/maven2/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.0.4/randomizedtesting-runner-2.0.4.pom
  [ivy:retrieve]-- artifact 
  com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4!randomizedtesting-runner.jar:
  [ivy:retrieve]
  http://mirror.netcologne.de/maven2/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.0.4/randomizedtesting-runner-2.0.4.jar
  [ivy:retrieve]  ::
  [ivy:retrieve]  ::  UNRESOLVED DEPENDENCIES ::
  [ivy:retrieve]  ::
  [ivy:retrieve]  :: 
  com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4: not 
  found
  [ivy:retrieve]  ::
  [ivy:retrieve]  ERRORS
  [ivy:retrieve]  impossible to acquire lock for 
  com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
  [ivy:retrieve]  impossible to acquire lock for 
  com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
  [ivy:retrieve]  impossible to acquire lock for 
  com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
  [ivy:retrieve]  impossible to acquire lock for 
  com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
  [ivy:retrieve]  impossible to acquire lock for 
  com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
  [ivy:retrieve]  impossible to acquire lock for 
  com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
  [ivy:retrieve]  impossible to acquire lock for 
  com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
  [ivy:retrieve]  impossible to acquire lock for 
  com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
  [ivy:retrieve]  impossible to acquire lock for 
  com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
  [ivy:retrieve]
  [ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE 

Re: Unable to build trunk

2012-10-30 Thread Robert Muir
It's not wonky. You just have to ensure you have nothing else (like
some IDE, or a build somewhere else) using ivy; then it's safe to remove
the .lck file there.

I turned on this locking so that it hangs instead of causing cache
corruption, but ivy only has simplelockfactory so if you ^C at the
wrong time, it might leave a .lck file.

On Tue, Oct 30, 2012 at 9:27 AM, Erick Erickson erickerick...@gmail.com wrote:
 Not sure if it's relevant, but sometimes the ivy caches are wonky. Try
 deleting (on OS X) ~/.ivy2 recursively and building again? Of course
 your next build will download a bunch of jars...

 FWIW,
 Erick

 On Tue, Oct 30, 2012 at 5:38 AM, Markus Jelsma
 markus.jel...@openindex.io wrote:
 Hi,

 Since yesterday we're unable to build trunk and also a clean check out from 
 trunk. We can compile the sources but not the example or dist.

 It hangs on resolve and after a while prints the following:

 resolve:

 [ivy:retrieve]
 [ivy:retrieve] :: problems summary ::
 [ivy:retrieve]  WARNINGS
 [ivy:retrieve]  module not found: 
 com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
 [ivy:retrieve]   local: tried
 [ivy:retrieve]
 /home/markus/.ivy2/local/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.0.4/ivys/ivy.xml
 [ivy:retrieve]-- artifact 
 com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4!randomizedtesting-runner.jar:
 [ivy:retrieve]
 /home/markus/.ivy2/local/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.0.4/jars/randomizedtesting-runner.jar
 [ivy:retrieve]   shared: tried
 [ivy:retrieve]
 /home/markus/.ivy2/shared/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.0.4/ivys/ivy.xml
 [ivy:retrieve]-- artifact 
 com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4!randomizedtesting-runner.jar:
 [ivy:retrieve]
 /home/markus/.ivy2/shared/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.0.4/jars/randomizedtesting-runner.jar
 [ivy:retrieve]   public: tried
 [ivy:retrieve]
 http://repo1.maven.org/maven2/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.0.4/randomizedtesting-runner-2.0.4.pom
 [ivy:retrieve]   sonatype-releases: tried
 [ivy:retrieve]
 http://oss.sonatype.org/content/repositories/releases/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.0.4/randomizedtesting-runner-2.0.4.pom
 [ivy:retrieve]   working-chinese-mirror: tried
 [ivy:retrieve]
 http://mirror.netcologne.de/maven2/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.0.4/randomizedtesting-runner-2.0.4.pom
 [ivy:retrieve]-- artifact 
 com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4!randomizedtesting-runner.jar:
 [ivy:retrieve]
 http://mirror.netcologne.de/maven2/com/carrotsearch/randomizedtesting/randomizedtesting-runner/2.0.4/randomizedtesting-runner-2.0.4.jar
 [ivy:retrieve]  ::
 [ivy:retrieve]  ::  UNRESOLVED DEPENDENCIES ::
 [ivy:retrieve]  ::
 [ivy:retrieve]  :: 
 com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4: not found
 [ivy:retrieve]  ::
 [ivy:retrieve]  ERRORS
 [ivy:retrieve]  impossible to acquire lock for 
 com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
 [ivy:retrieve]  impossible to acquire lock for 
 com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
 [ivy:retrieve]  impossible to acquire lock for 
 com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
 [ivy:retrieve]  impossible to acquire lock for 
 com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
 [ivy:retrieve]  impossible to acquire lock for 
 com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
 [ivy:retrieve]  impossible to acquire lock for 
 com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
 [ivy:retrieve]  impossible to acquire lock for 
 com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
 [ivy:retrieve]  impossible to acquire lock for 
 com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
 [ivy:retrieve]  impossible to acquire lock for 
 com.carrotsearch.randomizedtesting#randomizedtesting-runner;2.0.4
 [ivy:retrieve]
 [ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

 BUILD FAILED
 /home/markus/src/solr/trunk/solr/build.xml:336: The following error occurred 
 while executing this line:
 /home/markus/src/solr/trunk/solr/common-build.xml:345: The following error 
 occurred while executing this line:
 /home/markus/src/solr/trunk/solr/common-build.xml:388: The following error 
 occurred while executing this line:
 /home/markus/src/solr/trunk/lucene/common-build.xml:316: impossible to 
 resolve dependencies:
 resolve failed - see output for details

 Total time: 18 minutes 19 seconds

 As you can 

Re: Improving performance for use-case where large (200) number of phrase queries are used?

2012-10-24 Thread Robert Muir
On Wed, Oct 24, 2012 at 11:09 AM, Aaron Daubman daub...@gmail.com wrote:
 Greetings,

 We have a solr instance in use that gets some perhaps atypical queries
 and suffers from poor (2 second) QTimes.

 Documents (~2,350,000) in this instance are mainly comprised of
 various descriptive fields, such as multi-word (phrase) tags - an
 average document contains 200-400 phrases like this across several
 different multi-valued field types.

 A custom QueryComponent has been built that functions somewhat like a
 very specific MoreLikeThis. A seed document is specified via the
 incoming query, its terms are retrieved, boosted both by query
 parameters as well as fields within the document that specify term
 weighting, sorted by this custom boosting, and then a second query is
 crafted by taking the top 200 (sorted by the custom boosting)
 resulting field values paired with their fields and searching for
 documents matching these 200 values.

a few more ideas:
* use shingles e.g. to turn two-word phrases into single terms (how
long is your average phrase?); a sketch of this follows after the list.
* in addition to the above, maybe for phrases with > 2 terms, consider
just a boolean conjunction of the shingled phrases instead of a real
phrase query: e.g. more like this -> (more_like AND like_this). This
would have some false positives.
* use a more aggressive stopwords list for your MorePhrasesLikeThis.
* reduce this number 200, and instead work harder to prune out which
phrases are the most descriptive from the seed document, e.g. based
on some heuristics like their frequency or location within that seed
document, so your query isn't so massive.
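
For the shingles idea, a minimal sketch (my own example, assuming the 4.0
analysis APIs) of an analyzer that emits only two-word shingles, so a phrase
like "more like this" indexes as the terms "more like" and "like this":

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class BigramShingleAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
    ShingleFilter shingles = new ShingleFilter(source, 2, 2); // bigrams only
    shingles.setOutputUnigrams(false); // keep just the shingles themselves
    return new TokenStreamComponents(source, shingles);
  }
}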


Re: ICUTokenizer ArrayIndexOutOfBounds

2012-10-17 Thread Robert Muir
Calling reset() is a mandatory part of the consumer lifecycle before
calling incrementToken(), see:

https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html

A lot of people don't consume these correctly; that's why these
tokenizers now try to throw exceptions if you do it wrong, rather than
silently producing wrong results.

If you really want to test that your consumer code (queryparser,
whatever) is doing this correctly, test your code with
MockTokenizer/MockAnalyzer in the test-framework package. This has a
little state machine with a lot more checks.
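
Applied to the snippet quoted below, the full workflow looks roughly like this
(a sketch of the documented consumer lifecycle):

import java.io.StringReader;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ConsumeTokens {
  public static void main(String[] args) throws Exception {
    ICUTokenizer tokenizer = new ICUTokenizer(new StringReader("some text to tokenize"));
    CharTermAttribute termAtt = tokenizer.getAttribute(CharTermAttribute.class);
    tokenizer.reset();                       // mandatory before the first incrementToken()
    while (tokenizer.incrementToken()) {
      System.out.println(termAtt.toString());
    }
    tokenizer.end();                         // records the final offset state
    tokenizer.close();
  }
}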

On Wed, Oct 17, 2012 at 6:56 AM, Shane Perry thry...@gmail.com wrote:
 Hi,

 I've been playing around with using the ICUTokenizer from 4.0.0.
 Using the code below, I was receiving an ArrayIndexOutOfBounds
 exception on the call to tokenizer.incrementToken().  Looking at the
 ICUTokenizer source, I can see why this is occuring (usableLength
 defaults to -1).

 ICUTokenizer tokenizer = new ICUTokenizer(myReader);
 CharTermAttribute termAtt = 
 tokenizer.getAttribute(CharTermAttribute.class);

 while(tokenizer.incrementToken())
 {
 System.out.println(termAtt.toString());
 }

 After poking around a little more, I found that I can just call
 tokenizer.reset() (initializes usableLength to 0) right after
 constructing the object
 (org.apache.lucene.analysis.icu.segmentation.TestICUTokenizer does a
 similar step in its super class).  I was wondering if someone could
 explain why I need to call tokenizer.reset() prior to using the
 tokenizer for the first time.

 Thanks in advance,

 Shane


[ANNOUNCE] Apache Solr 4.0 released.

2012-10-12 Thread Robert Muir
October 12 2012, Apache Solr™ 4.0 available.
The Lucene PMC is pleased to announce the release of Apache Solr 4.0.

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search.  Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the
search and navigation features of many of the world's largest internet
sites.

Solr 4.0 is available for immediate download at:
   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of details.

Solr 4.0 Release Highlights:

The largest set of features goes by the development code-name
SolrCloud and involves bringing easy scalability to Solr.  See
http://wiki.apache.org/solr/SolrCloud for more details.
* Distributed indexing designed from the ground up for near real-time
(NRT) and NoSQL features such as realtime-get, optimistic locking, and
durable updates.
* High availability with no single points of failure.
* Apache Zookeeper integration for distributed coordination and
cluster metadata and configuration storage.
* Immunity to split-brain issues due to Zookeeper's Paxos distributed
consensus protocols.
* Updates sent to any node in the cluster and are automatically
forwarded to the correct shard and replicated to multiple nodes for
redundancy.
* Queries sent to any node automatically perform a full distributed
search across the cluster with load balancing and fail-over.
* A collection management API.
* Smart SolrJ client (CloudSolrServer) that knows to send documents
only to the shard leaders

Solr 4.0 includes more NoSQL features for those using Solr as a
primary data store:
* Update durability – A transaction log ensures that even uncommitted
documents are never lost.
* Real-time Get – The ability to quickly retrieve the latest version
of a document, without the need to commit or open a new searcher
* Versioning and Optimistic Locking – combined with real-time get,
this allows read-update-write functionality that ensures no
conflicting changes were made concurrently by other clients.
* Atomic updates - the ability to add, remove, change, and increment
fields of an existing document without having to send in the complete
document again.

Many additional improvements include:
* New spatial field types with polygon support.
* Pivot Faceting – Multi-level or hierarchical faceting where the top
constraints for one field are found for each top constraint of a
different field.
* Pseudo-fields – The ability to alias fields, or to add metadata
along with returned documents, such as function query values and
results of spatial distance calculations.
* A spell checker implementation that can work directly from the main
index instead of creating a sidecar index.
* Pseudo-Join functionality – The ability to select a set of documents
based on their relationship to a second set of documents.
* Function query enhancements including conditional function queries
and relevancy functions.
* New update processors to facilitate modifying documents prior to indexing.
* A brand new web admin interface, including support for SolrCloud and
improved error reporting
* Numerous bug fixes and optimizations.

Noteworthy changes since 4.0-BETA:
* New spatial field types with polygon support.
* Various Admin UI improvements.
* SolrCloud related performance optimizations in writing the
transaction log, PeerSync recovery, Leader election, and ClusterState
caching.
* Numerous bug fixes and optimizations.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases.  It is possible that the mirror you
are using may not have replicated the release yet.  If that is the
case, please try another mirror.  This also goes for Maven access.

Happy searching,

Apache Lucene/Solr Developers


Re: Using additional dictionary with DirectSolrSpellChecker

2012-10-10 Thread Robert Muir
On Wed, Oct 10, 2012 at 9:02 AM, O. Klein kl...@octoweb.nl wrote:
 I don't want to tweak the threshold. For majority of cases it works fine.

 It's for cases where term has low frequency but is spelled correctly.

 If you lower the threshold you would also get incorrect spelled terms as
 suggestions.


Yeah, there is no real magic here when the corpus contains typos. This
existing docFreq heuristic was just borrowed from the old index-based
spellchecker.

I do wonder if using # of occurrences (totalTermFreq) instead of # of
documents with the term (docFreq) would improve the heuristic.

In all cases I think if you want to also integrate a dictionary or
something, it seems like this could somehow be done with the
File-based spellchecker?


Re: Indexing in Solr: invalid UTF-8

2012-09-25 Thread Robert Muir
On Tue, Sep 25, 2012 at 2:02 PM, Patrick Oliver Glauner
patrick.oliver.glau...@cern.ch wrote:
 Hi
 Thanks. But I see that 0xd835 is missing in this list (see my exceptions).

 What's the best way to get rid of all of them in Python? I am new to unicode 
 in Python but I am sure that this use case is quite frequent.


I don't really know python either: so I could be wrong here but are
you just taking these binary .PDF and .DOC files and treating them as
UTF-8 text and sending them to Solr?

If so, I don't think that will work very well. Maybe instead try
parsing these binary files with something like Tika to get at the
actual content and send that? (it seems some people have developed
python integration for this, e.g.
http://redmine.djity.net/projects/pythontika/wiki)


Re: SOLR memory usage jump in JVM

2012-09-20 Thread Robert Muir
On Thu, Sep 20, 2012 at 3:09 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:

 By the way while looking for upgrading to JDK7, the release notes say under 
 section
 known issues about the PorterStemmer bug:
 ...The recommended workaround is to specify -XX:-UseLoopPredicate on the 
 command line.
 Is this still not fixed, or won't fix?

How in the world can we fix it?

Oracle released a broken java version: there's nothing we can do about
that. Go take it up with them.

-- 
lucidworks.com


Re: Solr - Lucene Debuging help

2012-09-10 Thread Robert Muir
On Mon, Sep 10, 2012 at 4:43 PM, BadalChhatbar badal...@yahoo.com wrote:
 Steve,

 Those document tips didn't help.

 errors I am getting are like (_TestUtil cannot be resolved).



Did you do these two steps:
1. ant eclipse
2. refresh your project

-- 
lucidworks.com


Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Robert Muir
On Fri, Sep 7, 2012 at 2:19 PM, Tom Burton-West tburt...@umich.edu wrote:
 Thanks Robert,

 I'll have to spend some time understanding the default codec for Solr 4.0.
 Did I miss something in the changes file?

http://lucene.apache.org/core/4_0_0-BETA/

see the file formats section, especially
http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termdictionary

(since blocktree covers term dictionary and terms index)


  I'll be digging into the default codec docs and testing sometime in the next
 week or two (with a 2 billion term index).  If I understand it well enough,
 I'll be happy to draft some changes up for either the wiki or the Solr
 example solrconfig.xml file.

Right, I think we should remove these parameters.


 Does this mean that the default codec will reduce memory use for the terms
 index enough so I don't need to use either of these settings to deal with
 my > 2 billion term indexes?

Probably. I don't know enough about your terms or how much RAM you have
to say for sure.

If not, just customize blocktree's params with a CodecFactory in Solr,
or even pick another implementation (FixedGap, VariableGap, whatever).

The interval/divisor stuff is mostly only useful if you are not
reindexing from scratch: e.g. if you are going to plop your 3.x index
into 4.x, then you should set
those to whatever you were using before, since it will be using
PreflexCodec to read those.

-- 
lucidworks.com


[ANNOUNCE] Apache Solr 4.0-beta released.

2012-08-14 Thread Robert Muir
14 August 2012, Apache Solr™ 4.0-beta available
The Lucene PMC is pleased to announce the release of Apache Solr 4.0-beta.

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search, dynamic clustering, database
integration, rich document (e.g., Word, PDF) handling, and geospatial search.
Solr is highly scalable, providing fault tolerant distributed search
and indexing,
and powers the search and navigation features of many of the world's
largest internet sites.

Solr 4.0-beta is available for immediate download at:
   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html?ver=4.0b

See the CHANGES.txt file included with the release for a full list of
details.

Highlights of changes since 4.0-alpha:

  * Added a Collection management API for Solr Cloud.

  * Solr Admin UI now clearly displays failures related to
initializing SolrCores

  * Updatable documents can create a document if it doesn't already exist,
or you can force that the document must already exist.

  * Full delete-by-query support for Solr Cloud.

  * Default to NRTCachingDirectory for improved near-realtime performance.

  * Improved Solrj client performance with Solr Cloud: updates are
only sent to leaders by default.

  * Various other API changes, optimizations and bug fixes.

This is a beta for early adopters. The guarantee for this beta release
is that the index
format will be the 4.0 index format, supported through the 5.x series
of Lucene/Solr, unless there
is a critical bug (e.g. that would cause index corruption) that would
prevent this.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Happy searching,

Lucene/Solr developers


Re: how to retrieve total token count per collection/index

2012-08-09 Thread Robert Muir
On Thu, Aug 9, 2012 at 10:20 AM, tech.vronk t...@vronk.net wrote:
 Hello,

 I wonder how to figure out the total token count in a collection (per
 index), i.e. the size of a corpus/collection measured in tokens.


You want to use this statistic, which tells you number of tokens for
an indexed field:
http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/index/Terms.html#getSumTotalTermFreq%28%29
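
For example, summed over the fields you care about (a small sketch; note the
statistic is -1 if term frequencies were omitted for the field):

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.store.FSDirectory;

public class TokenCount {
  public static void main(String[] args) throws Exception {
    // args[0] = index directory, args[1] = field name, e.g. "text"
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File(args[0])));
    Terms terms = MultiFields.getTerms(reader, args[1]);
    System.out.println(terms == null ? 0 : terms.getSumTotalTermFreq());
    reader.close();
  }
}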

-- 
lucidimagination.com


Re: how to retrieve total token count per collection/index

2012-08-09 Thread Robert Muir
On Thu, Aug 9, 2012 at 4:24 PM, tech.vronk t...@vronk.net wrote:

 Is there any 3.6 equivalent for this, before I install and run 4.0?
 I can't seem to find a corresponding class (org.apache.lucene.index.Terms)
 in 3.6.


Unfortunately 3.6 does not carry this statistic; there is really no
clear delineation of 'field' in 3.x, all the terms are just a big
sorted list of field+term,
so there are no field-level statistics at all! These are new in 4.0.

-- 
lucidimagination.com


Re: Highlighting error InvalidTokenOffsetsException: Token oedipus exceeds length of provided text sized 11

2012-08-03 Thread Robert Muir
On Fri, Aug 3, 2012 at 12:38 AM, Justin Engelman jus...@smalldemons.com wrote:
 I have an autocomplete index that I return highlighting information for but
 am getting an error with certain search strings and fields on Solr 3.5.

try the 3.6 release:

* LUCENE-3642, SOLR-2891, LUCENE-3717: Fixed bugs in CharTokenizer,
n-gram tokenizers/filters,
  compound token filters, thai word filter, icutokenizer, pattern analyzer,
  wikipediatokenizer, and smart chinese where they would create
invalid offsets in
  some situations, leading to problems in highlighting.

-- 
lucidimagination.com


Re: Using Solr-319 with Solr 3.6.0

2012-08-03 Thread Robert Muir
On Fri, Aug 3, 2012 at 12:57 PM, Himanshu Jindal
himanshujin...@gmail.com wrote:
 filter class=solr.SynonymFilterFactory synonyms=synonyms_ja.txt
 ignoreCase=true expand=true
 tokenFactory=solr.JapaneseTokenizerFactory randomAttribute=randomValue/

I think you have a typo here, it should be tokenizerFactory, not tokenFactory

-- 
lucidimagination.com


Re: Memory leak?? with CloseableThreadLocal with use of Snowball Filter

2012-08-02 Thread Robert Muir
On Thu, Aug 2, 2012 at 3:13 AM, Laurent Vaills laurent.vai...@gmail.com wrote:
 Hi everyone,

 Is there any chance to get this backported for a 3.6.2?


Hello, I personally have no problem with it: but it's really
technically not a bugfix, just an optimization.

It also doesn't solve the actual problem if you have a tomcat
threadpool configuration recycling threads too fast. There will be
other performance problems.

-- 
lucidimagination.com


Re: Memory leak?? with CloseableThreadLocal with use of Snowball Filter

2012-08-01 Thread Robert Muir
On Tue, Jul 31, 2012 at 2:34 PM, roz dev rozde...@gmail.com wrote:
 Hi All

 I am using Solr 4 from trunk and using it with Tomcat 6. I am noticing that
 when we are indexing lots of data with 16 concurrent threads, Heap grows
 continuously. It remains high and ultimately most of the stuff ends up
 being moved to Old Gen. Eventually, Old Gen also fills up and we start
 getting into excessive GC problem.

Hi: I don't claim to know anything about how Tomcat manages threads,
but really you shouldn't have all these objects.

In general, snowball stemmers should be reused per-thread-per-field.
But if you have a lot of fields*threads, especially if there really is
high thread churn on Tomcat, then this could be bad with snowball:
see eks dev's comment on https://issues.apache.org/jira/browse/LUCENE-3841

I think it would be useful to see if you can tune Tomcat's threadpool
as he describes.

Separately: Snowball stemmers are currently really RAM-expensive for
stupid reasons.
Each one creates a ton of Among objects, e.g. an EnglishStemmer today
is about 8KB.

I'll regenerate these and open a JIRA issue: the snowball code
generator in their svn was improved recently, and each stemmer now
takes about 64 bytes instead (the Amongs are static and reused).

Still, this won't really solve your problem, because the analysis
chain could have other heavy parts in initialization, but it seems
good to fix.

As a workaround until then you can also just use the good old
PorterStemmer (PorterStemFilterFactory in Solr).
It's not exactly the same as using Snowball(English) but it's pretty
close and also much faster.
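
In Solr itself that swap is just a matter of changing the stemmer factory in
your field type (PorterStemFilterFactory instead of the Snowball one). In
straight Lucene terms, a rough sketch of the equivalent analyzer, assuming
trunk-era 4.x packages (the tokenizer choice here is arbitrary):

  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.core.LowerCaseFilter;
  import org.apache.lucene.analysis.en.PorterStemFilter;
  import org.apache.lucene.analysis.standard.StandardTokenizer;
  import org.apache.lucene.util.Version;

  public final class PorterEnglishAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
      Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
      // PorterStemFilter carries no big per-instance tables, so creating one
      // per field/thread is cheap compared to a Snowball EnglishStemmer
      return new TokenStreamComponents(source,
          new PorterStemFilter(new LowerCaseFilter(Version.LUCENE_40, source)));
    }
  }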

-- 
lucidimagination.com


Re: ICUCollation throws exception

2012-07-21 Thread Robert Muir
)
 Caused by: org.apache.solr.common.SolrException: Plugin init failure for
 [schema.xml] analyzer/filter: class org.apache.solr.schema.ICUCollationField
 at
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:168)
 at
 org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:356)
 at
 org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
 at
 org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
 at
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:142)
 ... 34 more
 Caused by: java.lang.ClassCastException: class
 org.apache.solr.schema.ICUCollationField
 at java.lang.Class.asSubclass(Class.java:3018)
 at
 org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:409)
 at
 org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:430)
 at
 org.apache.solr.util.plugin.AbstractPluginLoader.create(AbstractPluginLoader.java:86)
 at
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:142)
 ... 38 more

 Jul 16, 2012 5:27:48 PM org.apache.solr.core.CoreContainer create
 INFO: Creating SolrCore 'viaf' using instanceDir:
 /usr/local/swissbib/solr.versions/configs/current.home/viaf
 Jul 16, 2012 5:27:48 PM org.apache.solr.core.SolrResourceLoader init

 **end of Exception***


 2012/7/21 Robert Muir rcm...@gmail.com

 Can you include the entire exception? This is really necessary!

 On Tue, Jul 17, 2012 at 2:58 AM, Oliver Schihin
 oliver.schi...@unibas.ch wrote:
  Hello
 
  According to release notes from 4.0.0-ALPHA, SOLR-2396, I replaced
  ICUCollationKeyFilterFactory with ICUCollationField in our schema. But
 this
  throws an exception, see the following excerpt from the log:
  
  Jul 16, 2012 5:27:48 PM org.apache.solr.common.SolrException log
  SEVERE: null:org.apache.solr.common.SolrException: Plugin init failure
 for
  [schema.xml] fieldType alphaOnlySort: Pl
  ugin init failure for [schema.xml] analyzer/filter: class
  org.apache.solr.schema.ICUCollationField
  at
 
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:168)
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:359)
  
  The deprecated filter of ICUCollationKeyFilterFactory is working without
 any
  problem. This is how I did the schema (with the deprecated filter):
  
  <!-- field type for sort strings -->
  <fieldType name="alphaOnlySort" class="solr.TextField"
   sortMissingLast="true" omitNorms="true">
   <analyzer>
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.ICUCollationKeyFilterFactory"
       locale="de@collation=phonebook"
       strength="primary"
     />
   </analyzer>
  </fieldType>
  
 
  Do I have to replace jars in /contrib/analysis-extras/, or any other
 hints
  of what might be wrong in my install and configuration?
 
  Thanks a lot
  Oliver
 
 



 --
 lucidimagination.com




-- 
lucidimagination.com


Re: ICUCollation throws exception

2012-07-20 Thread Robert Muir
Can you include the entire exception? This is really necessary!

On Tue, Jul 17, 2012 at 2:58 AM, Oliver Schihin
oliver.schi...@unibas.ch wrote:
 Hello

 According to release notes from 4.0.0-ALPHA, SOLR-2396, I replaced
 ICUCollationKeyFilterFactory with ICUCollationField in our schema. But this
 throws an exception, see the following excerpt from the log:
 
 Jul 16, 2012 5:27:48 PM org.apache.solr.common.SolrException log
 SEVERE: null:org.apache.solr.common.SolrException: Plugin init failure for
 [schema.xml] fieldType alphaOnlySort: Pl
 ugin init failure for [schema.xml] analyzer/filter: class
 org.apache.solr.schema.ICUCollationField
 at
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:168)
 at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:359)
 
 The deprecated filter of ICUCollationKeyFilterFactory is working without any
 problem. This is how I did the schema (with the deprecated filter):
 
 <!-- field type for sort strings -->
 <fieldType name="alphaOnlySort" class="solr.TextField"
   sortMissingLast="true" omitNorms="true">
   <analyzer>
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.ICUCollationKeyFilterFactory"
       locale="de@collation=phonebook"
       strength="primary"
     />
   </analyzer>
 </fieldType>
 

 Do I have to replace jars in /contrib/analysis-extras/, or any other hints
 of what might be wrong in my install and configuration?

 Thanks a lot
 Oliver





-- 
lucidimagination.com


Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document

2012-07-19 Thread Robert Muir
On Thu, Jul 19, 2012 at 12:10 AM, Aaron Daubman daub...@gmail.com wrote:
 Greetings,

 I've been digging in to this for two days now and have come up short -
 hopefully there is some simple answer I am just not seeing:

 I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as
 identically as possible (given deprecations) and indexing the same document.

Why did you do this? If you want the exact same scoring, use the exact
same analysis.
This means specifying luceneMatchVersion = 2.9, and the exact same
analysis components (even if deprecated).

 I have taken the field values for the example below and run them
 through /admin/analysis.jsp on each solr instance. Even for the problematic
 docs/fields, the results are almost identical. For the example below, the
 t_tag values for the problematic doc:
 1.4.1: 162 values
 3.6.0: 164 values


This is why: you changed your analysis.

-- 
lucidimagination.com


Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document

2012-07-19 Thread Robert Muir
On Thu, Jul 19, 2012 at 11:11 AM, Aaron Daubman daub...@gmail.com wrote:

 Apologies if I didn't clearly state my goal/concern: I am not looking for
 the exact same scoring - I am looking to explain scoring differences.
  Deprecated components will eventually go away, time moves on, etc...
 etc... I would like to be able to run current code, and should be able to -
 the part that is sticking is being able to *explain* the difference in
 results.


OK: I totally missed that, sorry!

To explain why you see such a large difference:

The difference is that these length normalizations are computed at
index time and fit inside a *single byte* by default. This is to keep
RAM usage low for many documents and many fields with norms (since it's
#fieldsWithNorms * #documents bytes in RAM).
So this is lossy: basically you can think of there being only 256
possible values. So when you increased the number of terms only
slightly by changing your analysis, it happened to bump you over the
edge, rounding you up to the next value.

more information:
http://lucene.apache.org/core/3_6_0/scoring.html
http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

By the way, if you don't like this:
1. if you can still live with a single byte, maybe plug in your own
Similarity class into 3.6, overriding decodeNormValue/encodeNormValue.
For example, you could use a different SmallFloat configuration that
has less range but more precision for your use case (if your docs are
all short or whatever); see the sketch after this list.
2. otherwise, if you feel you need more than a single byte, check out
4.0-ALPHA: you aren't limited to a single byte there.
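
As a sketch of option 1 (assuming the 3.6 Similarity API and the alternate
SmallFloat encodings that ship with Lucene; the class name is made up):

  import org.apache.lucene.search.DefaultSimilarity;
  import org.apache.lucene.util.SmallFloat;

  public class FinerNormSimilarity extends DefaultSimilarity {
    // swap the default 3-mantissa-bit norm encoding for SmallFloat's
    // 5-mantissa-bit variant: less range, more precision for short docs
    @Override
    public byte encodeNormValue(float f) {
      return SmallFloat.floatToByte52(f);
    }
    @Override
    public float decodeNormValue(byte b) {
      return SmallFloat.byteToFloat52(b);
    }
  }

You would then reference this class from the similarity setting in schema.xml
and reindex, since the norms are baked in at index time.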

-- 
lucidimagination.com


Re: Solr 4.0 IllegalStateException: this writer hit an OutOfMemoryError; cannot commit

2012-07-10 Thread Robert Muir
On Tue, Jul 10, 2012 at 3:11 AM, Vadim Kisselmann
v.kisselm...@gmail.com wrote:
 Hi folks,
 my Test-Server with Solr 4.0 from trunk(version 1292064 from late
 february) throws this exception...

Can you run Lucene's CheckIndex tool on your index?

If that is clean, can you try a newer version?
This could be a number of things, including something already fixed.
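
If it helps, the invocation is roughly this (a sketch; adjust the jar name
and the index path to your install):

  java -cp lucene-core-<version>.jar org.apache.lucene.index.CheckIndex /path/to/solr/data/index

Run it read-only first; only pass -fix if you are prepared to lose any
segments it reports as corrupt.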



 auto commit error...:java.lang.IllegalStateException: this writer hit
 an OutOfMemoryError; cannot commit
 at 
 org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2650)
 at 
 org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2804)
 at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2786)
 at 
 org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:391)
 at org.apache.solr.update.CommitTracker.run(CommitTracker.java:197)
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
 at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)


Do you have another exception in your logs? To my knowledge, in all
cases where IndexWriter throws an OutOfMemoryError, the original
OutOfMemoryError is also rethrown (not just this IllegalStateException
noting that at some point it hit OOM).


 My Server has 24GB RAM, 8GB for JVM. I index round about 20 docs per
 seconds, my index is small with 10Mio docs. It runs
 about a couple of weeks and then suddenly i get this errors..
 I can't see any problems in VisualVM with my GC. It's all ok, memory
 consumption is about 6GB, no swapping, no i/o problems..it's all
 green:)
 What's going on on this machine?:)  My uncommitted docs are gone, right?

Yes, your commit failed.

-- 
lucidimagination.com


Re: problem adding new fields in DIH

2012-07-09 Thread Robert Muir
Hello,

This is because Solr's Codec implementation defers to the schema to
determine how the field should be indexed. When a core is reloaded,
the IndexWriter is not closed; the existing writer is kept around,
so you are basically indexing against the old version of the schema
from before the reload.

I feel like we should fix this, but I only have two ideas:
1. turn off per-field codec support by default, so that if you want to
e.g. set a field to use MemoryPostingsFormat or Pulsing, you must
explicitly enable a per-field codec configuration in solrconfig.xml.
This would parallel how Similarity works, and is probably ok since
this is pretty expert stuff. Then you would have no issues, but if
someone wanted per-field codec support they would have to make the
tradeoff that reloading a core still leaves them indexing with the old
configuration.
2. close and reopen the indexwriter on core reloads.

On Mon, Jul 9, 2012 at 3:36 PM, Brent Mills bmi...@uship.com wrote:
 We're having an issue when we add or change a field in the db-data-config.xml 
 and schema.xml files in solr.  Basically whenever I add something new to 
 index I add it to the database, then the data config, then add the field to 
 the schema to index, reload the core, and do a full import.  This has worked 
 fine until we upgraded to an iteration of 4.0 (we are currently on 4.0 
 alpha).  Now sometimes when we go through this process solr throws errors 
 about the field not being found.  The only way to fix this is to restart 
 tomcat and everything immediately starts working fine again.

 The interesting thing is that this is only a problem if the database is 
 returning a value for that field and only in the documents that have a value. 
  The field shows up in the schema browser in solr, it just has no data in it. 
  If I completely remove it from the database but leave it in the schema and 
 dataconfig files there is no issue.  Also of note, this is happening on 2 
 different machines.

 Here's the trace

 SEVERE: Exception while solr commit.
 java.lang.IllegalArgumentException: no such field test
 at 
 org.apache.solr.core.DefaultCodecFactory$1.getPostingsFormatForField(DefaultCodecFactory.java:49)
 at 
 org.apache.lucene.codecs.lucene40.Lucene40Codec$1.getPostingsFormatForField(Lucene40Codec.java:52)
 at 
 org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.addField(PerFieldPostingsFormat.java:94)
 at 
 org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:335)
 at 
 org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85)
 at org.apache.lucene.index.TermsHash.flush(TermsHash.java:117)
 at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
 at 
 org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:82)
 at 
 org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:480)
 at 
 org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:422)
 at 
 org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:554)
 at 
 org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2547)
 at 
 org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2683)
 at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2663)
 at 
 org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:414)
 at 
 org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:82)
 at 
 org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
 at 
 org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:919)
 at 
 org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:154)
 at 
 org.apache.solr.handler.dataimport.SolrWriter.commit(SolrWriter.java:107)
 at 
 org.apache.solr.handler.dataimport.DocBuilder.finish(DocBuilder.java:304)
 at 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:256)
 at 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
 at 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:399)
 at 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:380)




-- 
lucidimagination.com


Re: problem adding new fields in DIH

2012-07-09 Thread Robert Muir
Thanks again for reporting this Brent. I opened a JIRA issue:
https://issues.apache.org/jira/browse/SOLR-3610

On Mon, Jul 9, 2012 at 3:36 PM, Brent Mills bmi...@uship.com wrote:
 We're having an issue when we add or change a field in the db-data-config.xml 
 and schema.xml files in solr.  Basically whenever I add something new to 
 index I add it to the database, then the data config, then add the field to 
 the schema to index, reload the core, and do a full import.  This has worked 
 fine until we upgraded to an iteration of 4.0 (we are currently on 4.0 
 alpha).  Now sometimes when we go through this process solr throws errors 
 about the field not being found.  The only way to fix this is to restart 
 tomcat and everything immediately starts working fine again.

 The interesting thing is that this is only a problem if the database is 
 returning a value for that field and only in the documents that have a value. 
  The field shows up in the schema browser in solr, it just has no data in it. 
  If I completely remove it from the database but leave it in the schema and 
 dataconfig files there is no issue.  Also of note, this is happening on 2 
 different machines.

 Here's the trace

 SEVERE: Exception while solr commit.
 java.lang.IllegalArgumentException: no such field test
 at 
 org.apache.solr.core.DefaultCodecFactory$1.getPostingsFormatForField(DefaultCodecFactory.java:49)
 at 
 org.apache.lucene.codecs.lucene40.Lucene40Codec$1.getPostingsFormatForField(Lucene40Codec.java:52)
 at 
 org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.addField(PerFieldPostingsFormat.java:94)
 at 
 org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:335)
 at 
 org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85)
 at org.apache.lucene.index.TermsHash.flush(TermsHash.java:117)
 at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
 at 
 org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:82)
 at 
 org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:480)
 at 
 org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:422)
 at 
 org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:554)
 at 
 org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2547)
 at 
 org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2683)
 at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2663)
 at 
 org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:414)
 at 
 org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:82)
 at 
 org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
 at 
 org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:919)
 at 
 org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:154)
 at 
 org.apache.solr.handler.dataimport.SolrWriter.commit(SolrWriter.java:107)
 at 
 org.apache.solr.handler.dataimport.DocBuilder.finish(DocBuilder.java:304)
 at 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:256)
 at 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
 at 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:399)
 at 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:380)




-- 
lucidimagination.com


[ANNOUNCE] Apache Solr 4.0-alpha released.

2012-07-03 Thread Robert Muir
3 July 2012, Apache Solr™ 4.0-alpha available
The Lucene PMC is pleased to announce the release of Apache Solr 4.0-alpha.

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search, dynamic clustering, database
integration, rich document (e.g., Word, PDF) handling, and geospatial search.
Solr is highly scalable, providing fault tolerant distributed search
and indexing,
and powers the search and navigation features of many of the world's
largest internet sites.

Solr 4.0-alpha is available for immediate download at:
   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html?ver=4.0a

See the CHANGES.txt file included with the release for a full list of
details.

Solr 4.0-alpha Release Highlights:

The largest set of features goes by the development code-name “Solr
Cloud” and involves bringing easy scalability to Solr.  See
http://wiki.apache.org/solr/SolrCloud for more details.
 * Distributed indexing designed from the ground up for near real-time
(NRT) and NoSQL features such as realtime-get, optimistic locking, and
durable updates.
 * High availability with no single points of failure.
 * Apache Zookeeper integration for distributed coordination and
cluster metadata and configuration storage.
 * Immunity to split-brain issues due to Zookeeper's Paxos distributed
consensus protocols.
 * Updates sent to any node in the cluster and are automatically
forwarded to the correct shard and replicated to multiple nodes for
redundancy.
 * Queries sent to any node automatically perform a full distributed
search across the cluster with load balancing and fail-over.

Solr 4.0-alpha includes more NoSQL features for those using Solr as a
primary data store:
 * Update durability – A transaction log ensures that even uncommitted
documents are never lost.
 * Real-time Get – The ability to quickly retrieve the latest version
of a document, without the need to commit or open a new searcher
 * Versioning and Optimistic Locking – combined with real-time get,
this allows read-update-write functionality that ensures no
conflicting changes were made concurrently by other clients.
 * Atomic updates -  the ability to add, remove, change, and increment
fields of an existing document without having to send in the complete
document again.

There are many other features coming in Solr 4, such as
 * Pivot Faceting – Multi-level or hierarchical faceting where the top
constraints for one field are found for each top constraint of a
different field.
 * Pseudo-fields – The ability to alias fields, or to add metadata
along with returned documents, such as function query values and
results of spatial distance calculations.
 * A spell checker implementation that can work directly from the main
index instead of creating a sidecar index.
 * Pseudo-Join functionality – The ability to select a set of
documents based on their relationship to a second set of documents.
 * Function query enhancements including conditional function queries
and relevancy functions.
 * New update processors to facilitate modifying documents prior to indexing.
 * A brand new web admin interface, including support for SolrCloud.

This is an alpha release for early adopters. The guarantee for this
alpha release is that the index
format will be the 4.0 index format, supported through the 5.x series
of Lucene/Solr, unless there
is a critical bug (e.g. that would cause index corruption) that would
prevent this.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Happy searching,

Lucene/Solr developers


Re: Exception when optimizing index

2012-06-13 Thread Robert Muir
On Thu, Jun 7, 2012 at 5:50 AM, Rok Rejc rokrej...@gmail.com wrote:
   - java.runtime.name: OpenJDK Runtime Environment
   - java.runtime.version: 1.6.0_22-b22
...

 As far as I see from the JIRA issue I have the patch attached (as mentioned
 I have a trunk version from May 12). Any ideas?


It's not guaranteed that the patch will work around all hotspot bugs
related to http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5091921

Since you can reproduce, is it possible for you to re-test the
scenario with a newer JVM (e.g. 1.7.0_04) just to rule that out?

-- 
lucidimagination.com


Re: Solr1.4 and threads ....

2012-06-13 Thread Robert Muir
On Wed, Jun 13, 2012 at 4:38 PM, Benson Margulies bimargul...@gmail.com wrote:

 Does this suggest anything to anyone? Other than that we've
 misanalyzed the logic in the tokenizer and there's a way to make it
 burp on one thread?

It might suggest the different TokenStream instances refer to some
shared object that is not thread-safe: we had bugs like this before
(e.g. sharing a JDK collator is OK, but ICU ones are not thread-safe,
so you must clone them).

Because of this we beefed up our base analysis class
(http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/lucene/test-framework/src/java/org/apache/lucene/analysis/BaseTokenStreamTestCase.java)
to find thread safety bugs like this.

I recommend just grabbing the test-framework jar (we release it as an
artifact), extending that class, and writing a test like:
  public void testRandomStrings() throws Exception {
checkRandomData(random, analyzer, 10);
  }

(or use the one in the branch; it's even been improved since 3.6)
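
A fuller sketch of what such a test might look like (assuming the 3.6
test-framework jar on the classpath; MyAnalyzer is a stand-in for whatever
analyzer/tokenizer chain you suspect):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.BaseTokenStreamTestCase;

  public class TestMyAnalyzerRandomStrings extends BaseTokenStreamTestCase {
    public void testRandomStrings() throws Exception {
      Analyzer analyzer = new MyAnalyzer(); // hypothetical: your chain under test
      // pushes random text through the analyzer repeatedly, checking token
      // attributes, offsets, and reuse; more iterations = better odds of
      // shaking out a state-sharing bug
      checkRandomData(random, analyzer, 10000);
    }
  }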

-- 
lucidimagination.com


Re: per-fieldtype similarity not working

2012-06-08 Thread Robert Muir
On Fri, Jun 8, 2012 at 5:04 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Thanks Robert,

 The difference in scores is clear now so it shouldn't matter as queryNorm 
 doesn't affect ranking but coord does. Can you explain why coord is left out 
 now and why it is considered to skew results and why queryNorm skews results? 
 And which specific new ranking algorithms they confuse, BM25F?

I think it's easiest to compare the two TF normalization functions.
DefaultSimilarity really needs something like this (coord) because its
function (sqrt) grows very fast for a single term.
On the other hand, consider BM25's: tf/(tf+lengthNorm). It saturates
rather quickly for a single term, so when multiple terms are being
scored, huge numbers of occurrences of a single term won't dominate
the overall score.

You can see this visually here (give it a second to load, and imagine
documentLength = averageDocumentLength and k=1.2):
http://www.wolframalpha.com/input/?i=plot+sqrt%28x%29%2C+x%2F%28x%2B1.2%29%2C+x%3D1+to+100


 Also, i would expect the default SchemaSimilarityFactory to behave the same 
 as DefaultSimilarity this might raise some further confusion down the line.

That's OK: I'd rather the very expert case (per-field scoring) be
trickier than have a trap for people that try to use any algorithm
other than TFIDFSimilarity.

-- 
lucidimagination.com


Re: per-fieldtype similarity not working

2012-06-01 Thread Robert Muir
On Fri, Jun 1, 2012 at 5:13 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Thanks but I am clearly missing something? We declare the similarity in the
 fieldType just as in the example, and looking at the example again I don't see
 how it's being done differently. What am I missing and where do I miss it? :)


Hi Markus, check out the last line at the bottom:
 <!-- default similarity, defers to the fieldType -->
 <similarity class="solr.SchemaSimilarityFactory"/>

When this is set, it means IndexSearcher/IndexWriter use a
PerFieldSimilarityWrapper that delegates based on the Solr schema
fieldType.

Note this is just a simple ordinary similarity impl
(http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/search/similarities/SchemaSimilarityFactory.java),
you could also write your own that works differently.

-- 
lucidimagination.com


Re: per-fieldtype similarity not working

2012-06-01 Thread Robert Muir
On Fri, Jun 1, 2012 at 11:39 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Hi!


 Ah, it makes sense now! This globally configured similarity now returns a
 fieldType-defined similarity if available and, if not, the standard Lucene
 similarity. This would, I assume, mean that the two defined similarities
 below, without per-fieldType declared similarities, would always yield the
 same results?

Not true: note that two methods (coord and queryNorm) are not per-field
but global across the entire query tree.

By default these are disabled in the wrapper, as they respectively only
skew or confuse most modern scoring algorithms (e.g. all the new ranking
algorithms in Lucene 4).

So if you want to do per-field scoring where *all* of your sims are
vector-space, it could make sense to customize (e.g. subclass)
SchemaSimilarityFactory and do something useful for these methods.


-- 
lucidimagination.com


Re: per-fieldtype similarity not working

2012-05-31 Thread Robert Muir
On Thu, May 31, 2012 at 11:23 AM, Markus Jelsma
markus.jel...@openindex.io wrote:

 We simply declare the following in our fieldType:
 <similarity class="FQCN"/>


That's not enough; see the example:
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/schema-sim.xml


-- 
lucidimagination.com


Re: boost not showing up in Solr 3.6 debugQueries?

2012-05-17 Thread Robert Muir
On Thu, May 17, 2012 at 4:51 PM, Tom Burton-West tburt...@umich.edu wrote:

 But in Solr 3.6 I am not seeing the boost factor called out.

  On the other hand it looks like it may now be incoroporated in the
 queryNorm (Please see example below).

 Is there a bug in Solr 3.6 debugQueries?  Is there some new behavior
 regarding boosts and queryNorms? or am I missing something obvious?


Your queries are different: your first example is a simple term query;
the second example is a boolean query.

If you have a booleanquery(green frog) with a boost of 5, it
incorporates its boost into the query norm passed down to its
children.
So when leaf nodes normalize their weight, it includes all the boosts
from the parent hierarchy.
You can see what I mean if you look at BooleanWeight.normalize().

Because of how this is done, 3.x's explain confusingly only shows the
leaf node's explicit boost, since that's all it really knows.
To see what I mean, try something like booleanquery(green^2 frog^3)^5.

In 4.x these boosts are split apart from and kept separate from the
query norm, so we could actually improve the explanations here I
think.

-- 
lucidimagination.com


Re: Language analyzers

2012-05-16 Thread Robert Muir
On Wed, May 16, 2012 at 10:17 AM, anarchos78
rigasathanasio...@hotmail.com wrote:
 Hello,

 Is it possible to use two language analyzers for one fieldtype. Lets say
 Greek and English (for indexing and querying)


For Greek and English it's easy: they use totally different characters,
so none of their token filters will conflict with each other.
Just use StandardTokenizer and two stop filters (Greek and English),
two stemmers (Greek and English), and so on.

-- 
lucidimagination.com


Re: FrenchLightStemFilterFactory : normalizing tokens longer than 4 characters and having repeated characters in it

2012-05-16 Thread Robert Muir
On Wed, May 16, 2012 at 8:28 AM, Tanguy Moal tanguy.m...@gmail.com wrote:
 Any idea someone ?

 I think this is important since this could produce weird results on
 collections with numbers mixed in text.

I agree, I think we should just add '&& Character.isLetter(ch)' to the
undoublet check?

Thanks for bringing this up. Do you want to open a JIRA issue?

http://wiki.apache.org/solr/HowToContribute


-- 
lucidimagination.com


Re: apostrophe / ayn / alif

2012-05-15 Thread Robert Muir
On Tue, May 15, 2012 at 2:47 PM, Naomi Dushay ndus...@stanford.edu wrote:
 We are using the ICUFoldingFilterFactory with great success to fold 
 diacritics so searches with and without the diacritics get the same results.

 We recently discovered we have some Korean records that use an alif diacritic 
 instead of an apostrophe, and this diacritic is NOT getting folded.   Has 
 anyone experienced this for alif or ayn characters?   Do you have a solution?


What do you mean by an alif diacritic in Korean? Alif (ا) isn't a diacritic
and isn't used in Korean.

Or did you mean the Arabic dagger alif ( ٰ )? This is not a diacritic in
Unicode (though it's a combining mark).


-- 
lucidimagination.com


Re: Implementing multiterm chain for ICUCollationKeyFilterFactory

2012-05-03 Thread Robert Muir
On Thu, May 3, 2012 at 9:35 AM, OliverS oliver.schi...@unibas.ch wrote:
 Hello

 I read and tried a lot, but somehow I don't fully understand and it doesn't
 work. I'm working on solr 4.0 (latest trunk) and use
 ICUCollationKeyFilterFactory for my main field type. Now, wildcard queries
 don't work, even though ICUCollationKeyFilterFactory seems to be
 http://lucene.apache.org/solr/api/org/apache/solr/analysis/class-use/MultiTermAwareComponent.html

This filter implements that interface solely to support range queries
in collation order (in addition to sort), so that it has all the
Lucene functionality.

Wildcards and even prefix queries simply won't work, because these are
binary keys intended just for this purpose. If you want to do text-ish
queries like this, you need to use a text field.

-- 
lucidimagination.com


Re: Error with distributed search and Suggester component (Solr 3.4)

2012-05-02 Thread Robert Muir
On Wed, May 2, 2012 at 12:16 PM, Ken Krugler
kkrugler_li...@transpac.com wrote:

 What confuses me is that Suggester says it's based on SpellChecker, which 
 supposedly does work with shards.


It is based on the spellchecker APIs, but spellchecker's ranking is based
on simple comparators like string similarity, whereas suggesters use
weights.

When spellchecker merges from shards, it just merges all their top-N
into one set and recomputes this same distance stuff over again.

So a suggester can't possibly work correctly like this (forget about
any technical details): how can it make assumptions about the weights
you provided? If they were e.g. log() weights from your query logs,
then it needs to do log-summation across the shards, etc., for the
final combined weight to be correct. This is specific to how you
originally computed the weights you gave it. It certainly cannot be
recomputing anything like spellchecker does :)
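
Purely as an illustration of that point (this is not an existing API): if the
per-shard weights were log(queryCount), a correct merge for a suggestion seen
on several shards would have to sum in count space, roughly:

  // hypothetical helper, just to show the arithmetic
  static double combineLogWeights(double[] shardWeights) {
    double totalCount = 0;
    for (double w : shardWeights) {
      totalCount += Math.exp(w);   // back to count space
    }
    return Math.log(totalCount);   // combined weight on the same log scale
  }

whereas taking the max, or re-ranking by string distance the way the
spellchecker merge does, would give a different (wrong) answer for those
weights.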

Anyway, if you really want to do it, maybe
https://issues.apache.org/jira/browse/SOLR-2848 is helpful. The
background is that in 3.x there is really only one spellchecker impl
(AbstractLucene or something like that). I don't think distributed
spellcheck works with any other SpellChecker subclasses in 3.x; I
think it's wired to only work with the Abstract-Lucene ones.

When we added another subclass in 4.0, DirectSpellChecker, James saw that
it was broken here and cleaned up the APIs so that spellcheckers can
override this merge() operation. Unfortunately I forgot to commit
those refactorings James did (which let any spellchecker override
merge()ing) to the 3.x branch, but the ideas might be useful.

-- 
lucidimagination.com


Re: Error with distributed search and Suggester component (Solr 3.4)

2012-05-01 Thread Robert Muir
On Tue, May 1, 2012 at 6:48 PM, Ken Krugler kkrugler_li...@transpac.com wrote:
 Hi list,

 Does anybody know if the Suggester component is designed to work with shards?


I'm not really sure it is? They would probably have to override the
default merge implementation specified by SpellChecker.

But, all of the current suggesters pump out over 100,000 QPS on my
machine, so I'm wondering what the usefulness of this is?

And even if it were useful, merging results from different machines is
pretty inefficient; for suggest you would shard by term instead, so
that you need only contact a single host.


-- 
lucidimagination.com


Re: Language Identification

2012-04-23 Thread Robert Muir
On Mon, Apr 23, 2012 at 1:27 PM, Bai Shen baishen.li...@gmail.com wrote:
 I was under the impression that solr does Tika and the language identifier
 that Shuyo did.  The page at
 http://wiki.apache.org/solr/LanguageDetectionlists them both.

 processor 
 class=org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory
 processor 
 class=org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory

 Again, I'm just trying to understand why it was moved to solr.


Because it offers a number of features above Tika's implementation,
and is available under the Apache 2.0 License so we are free to do
that.

-- 
lucidimagination.com


Re: Special characters in synonyms.txt on Solr 3.5

2012-04-20 Thread Robert Muir
On Fri, Apr 20, 2012 at 12:10 PM, carl.nordenf...@bwinparty.com
carl.nordenf...@bwinparty.com wrote:
 Directly injecting the letter ö into synonyms like so:
 island, ön
 island, ön

 renders the following exception on startup (both lines renders the same 
 error):

 java.lang.RuntimeException: java.nio.charset.MalformedInputException: Input 
 length = 3
                             at 
 org.apache.solr.analysis.FSTSynonymFilterFactory.inform(FSTSynonymFilterFactory.java:92)
                             at 
 org.apache.solr.analysis.SynonymFilterFactory.inform(SynonymFilterFactory.java:50)

The synonyms file needs to be in UTF-8 encoding.



-- 
lucidimagination.com


Re: maxMergeDocs in Solr 3.6

2012-04-19 Thread Robert Muir
On Thu, Apr 19, 2012 at 11:54 AM, Burton-West, Tom tburt...@umich.edu wrote:
 Hello all,

 I'm getting ready to upgrade from Solr 3.4 to Solr 3.6 and I noticed that 
 maxMergeDocs is no longer in the example solrconfig.xml.
 Has maxMergeDocs been deprecated? Or does the TieredMergePolicy ignore it?

It's not applicable to TieredMergePolicy.

When TieredMergePolicy was added, some previous global options were
'interpreted' for backwards compatibility:
useCompoundFile(X) -> setUseCompoundFile(X)
mergeFactor(X) -> setMaxMergeAtOnce(X) AND setSegmentsPerTier(X)

However, in my opinion there is an easier, less confusing, more
systematic approach: don't set these 'global' params, but just specify
what you want directly on TieredMergePolicy:

For example for TieredMergePolicy, look at the javadocs of
TieredMergePolicy here:
http://lucene.staging.apache.org/core/3_6_0/api/core/org/apache/lucene/index/TieredMergePolicy.html

you would simply configure it like:

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnceExplicit">19</int>
  <int name="segmentsPerTier">9</int>
  <double name="noCFSRatio">1.0</double>
</mergePolicy>

This will invoke setMaxMergeAtOnceExplicit(19), setSegmentsPerTier(9),
and setNoCFSRatio(1.0). You can do the same thing with any of the
TieredMergePolicy setters you see in the Lucene javadocs.

-- 
lucidimagination.com


Re: [Solr 4.0] what is stored in .tim index file format?

2012-04-17 Thread Robert Muir
This is the term dictionary for 4.0's default codec (which currently
uses the BlockTree implementation).

.tim is the on-disk portion of the terms (similar in function to .tis
in previous releases)
.tip is the in-memory terms index (similar in function to .tii in
previous releases)

On Tue, Apr 17, 2012 at 6:37 AM, Lyuba Romanchuk
lyuba.romanc...@gmail.com wrote:
 Hi,

 I have index ~31G where
 27% of the index size is .fdt files (8.5G)
 20% - .fdx files (6.2G)
 37% - .frq files (11.6G)
 16% - .tim files (5G)

 I didn't manage to find the description for .tim files. Can you help me
 with this?

 Thank you.
 Best regards,
 Lyuba



-- 
lucidimagination.com


[ANNOUNCE] Apache Solr 3.6 released

2012-04-12 Thread Robert Muir
12 April 2012, Apache Solr™ 3.6.0 available
The Lucene PMC is pleased to announce the release of Apache Solr 3.6.0.

Solr is the popular, blazing fast open source enterprise search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search, dynamic clustering, database
integration, rich document (e.g., Word, PDF) handling, and geospatial search.
Solr is highly scalable, providing distributed search and index replication,
and it powers the search and navigation features of many of the world's
largest internet sites.

This release contains numerous bug fixes, optimizations, and
improvements, some of which are highlighted below.  The release
is available for immediate download at:
   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html (see
note below).

See the CHANGES.txt file included with the release for a full list of
details.

Solr 3.6.0 Release Highlights:

 * New SolrJ client connector using Apache Http Components http client
   (SOLR-2020)

 * Many analyzer factories are now multi-term query aware, allowing for things
   like field-type-aware lowercasing when building prefix & wildcard queries.
   (SOLR-2438)

 * New Kuromoji morphological analyzer tokenizes Japanese text, producing
   both compound words and their segmentation. (SOLR-3056)

 * Range Faceting (Dates & Numbers) is now supported in distributed search
   (SOLR-1709)

 * HTMLStripCharFilter has been completely re-implemented, fixing many bugs
   and greatly improving the performance (LUCENE-3690)

 * StreamingUpdateSolrServer now supports the javabin format (SOLR-1565)

 * New LFU Cache option for use in Solr's internal caches. (SOLR-2906)

 * Memory performance improvements to all FST based suggesters (SOLR-2888)

 * New WFSTLookupFactory suggester supports finer-grained ranking for
   suggestions. (LUCENE-3714)

 * New options for configuring the amount of concurrency used in distributed
   searches (SOLR-3221)

 * Many bug fixes

Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases.  It is possible that the mirror you are using may not
have replicated the release yet.  If that is the case, please try another
mirror.  This also goes for Maven access.

Happy searching,

Lucene/Solr developers

