RE: umlauts / diacritic expansion

2019-04-16 Thread Markus Jelsma
Hello Michael, For the case of normalizing ü to ue, take a look at the German normalizer [1]. Regards, Markus [1] https://lucene.apache.org/core/7_6_0/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html -Original message- > From:Ralf Heyde > Sent:
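
To illustrate the idea, here is a minimal plain-Java sketch of umlaut folding. It is not Lucene's GermanNormalizationFilter; it only approximates the heuristics the linked Javadoc describes (ä, ö, ü folded to a, o, u; ß to ss; ae and oe to a and o; ue to u unless it follows a vowel or q), so that both spellings of a word end up as the same term. The class and method names are hypothetical.

```java
import java.util.Locale;

// Simplified sketch of umlaut folding; NOT Lucene's GermanNormalizationFilter.
final class UmlautFoldingSketch {

    // True for characters after which "ue" is kept as-is (vowels and 'q').
    private static boolean isVowelOrQ(char c) {
        return "aeiouq".indexOf(c) >= 0;
    }

    // Hypothetical helper: lowercases a German word and folds it to a common form.
    static String fold(String word) {
        String s = word.toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            char prev = i > 0 ? s.charAt(i - 1) : ' ';
            char next = i + 1 < s.length() ? s.charAt(i + 1) : ' ';
            switch (c) {
                case '\u00e4': out.append('a'); break;   // a-umlaut
                case '\u00f6': out.append('o'); break;   // o-umlaut
                case '\u00fc': out.append('u'); break;   // u-umlaut
                case '\u00df': out.append("ss"); break;  // sharp s
                case 'a':
                case 'o':
                    out.append(c);
                    if (next == 'e') i++;                // "ae" -> "a", "oe" -> "o"
                    break;
                case 'u':
                    out.append('u');
                    // "ue" -> "u", except after a vowel or 'q' (e.g. "quelle", "bauer")
                    if (next == 'e' && !isVowelOrQ(prev)) i++;
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this sketch, both "Müller" and "Mueller" fold to "muller", while "Quelle" keeps its "ue". In a real analysis chain the Lucene filter itself would be used instead.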

RE: 8.0.0 ClassCastException in ValueSource

2019-03-27 Thread Markus Jelsma
t: Re: 8.0.0 ClassCastException in ValueSource > > Hi Markus, > > Thanks for reporting this. It looks like a side-effect of the Scorable > refactoring, can you open a JIRA issue? > > On Wed, Mar 20, 2019 at 5:01 PM Markus Jelsma > wrote: > > > > Hello, > >

8.0.0 ClassCastException in ValueSource

2019-03-20 Thread Markus Jelsma
Hello, Upgraded to Lucene and Solr 8.0 and ran all our unit tests, this one popped up: Caused by: java.lang.ClassCastException: org.apache.lucene.queries.function.ValueSource$ScoreAndDoc cannot be cast to org.apache.lucene.search.Scorer at

RE: Query-of-Death Lucene/Solr 7.6

2019-02-08 Thread Markus Jelsma
Hello, I think i tracked it further down to LUCENE-8589 or SOLR-12243. When i leave Solr's edismax pf parameter empty, everything runs fast. When all fields are configured for pf, the node dies. I am now unsure whether i am on the right list, or if i should move to Solr's. Please let me

Query-of-Death Lucene/Solr 7.6

2019-02-08 Thread Markus Jelsma
Hello, While working on SOLR-12743, using 7.6 on two nodes and 7.2.1 on the remaining four, we stumbled upon a situation where the 7.6 nodes quickly succumb when a 'Query-of-Death' is issued; 7.2.1 up to 7.5 are all unaffected (tested and confirmed). Following Smiley's suggestion i used

RE: An example for creating SynonymMap Object?

2018-10-15 Thread Markus Jelsma
regards > > > > On 10/15/18 3:28 PM, Markus Jelsma wrote: > > Hello Baris, > > > > Check out the filter factory and the map parser for a more low level > > example: > >

RE: An example for creating SynonymMap Object?

2018-10-15 Thread Markus Jelsma
Hello Baris, Check out the filter factory and the map parser for a more low level example: https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymGraphFilterFactory.java

RE: Lucene same search result for worlds with and without spaces

2018-06-20 Thread Markus Jelsma
Hi Egorlex, Set the tokenSeparator to "" and ShingleFilter will concatenate all shingles without whitespace. Keep in mind, this will greatly increase the size of the index so it might not be a good idea to concatenate all pairs of words. If you are looking for finding "similarissues" with
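
As a rough illustration of what an empty token separator does, here is a plain-Java sketch that builds word shingles and joins them with a configurable separator. It is not Lucene's ShingleFilter; the class and method names are hypothetical, and in Lucene the equivalent setting would be the filter's tokenSeparator.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of shingle generation; NOT Lucene's ShingleFilter.
final class ShingleSketch {

    // Hypothetical helper: returns all shingles of `size` consecutive tokens,
    // joined with `separator` (an empty string concatenates the words).
    static List<String> shingles(List<String> tokens, int size, String separator) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(separator, tokens.subList(i, i + size)));
        }
        return out;
    }
}
```

For the tokens "similar", "issues", "found" and a shingle size of 2, an empty separator yields "similarissues" and "issuesfound". Every adjacent word pair becomes an extra term, which is why the index grows.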

RE: Rewrite SynonymQuery to support payloads

2018-05-24 Thread Markus Jelsma
12238 > > Patch and Pull Request is attached but it has not been reviewed yet. > Give it a look, and then we can continue the discussion here! > let me know if you feel your requirement is different ! > > Cheers > > On Wed, May 23, 2018 at 11:41 AM, Markus

Rewrite SynonymQuery to support payloads

2018-05-23 Thread Markus Jelsma
Hello, To support payloads we rewrite SynonymQuery to a pair of SpanTerm queries which we then can wrap in the PayloadScoreQuery. This is not the right way to do this because if both clauses match, both are also scored.  We could try to rewrite SynonymQuery to a SpanOrQuery but i suppose that

Multiple languages, boosting and, stemming and KeywordRepeat

2018-05-14 Thread Markus Jelsma
Hello, First, apologies for the weird subject line, and apologies for cross-posting, but last week it got no replies on the Solr user mailing list. We index many languages and search over all those languages at once, but boost the language of the user's preference. To differentiate between

RE: German decompounding/tokenization with Lucene?

2017-09-16 Thread Markus Jelsma
? > > Send a pull request. :) > > Uwe > > Am 16. September 2017 12:42:30 MESZ schrieb Markus Jelsma > <markus.jel...@openindex.io>: > >Hello Uwe, > > > >Thanks for getting rid of the compounds. The dictionary can be smaller, > >it still has

RE: German decompounding/tokenization with Lucene?

2017-09-16 Thread Markus Jelsma
Hello Uwe, Thanks for getting rid of the compounds. The dictionary can be smaller; it still has about 1500 duplicates. It is also unsorted. Regards, Markus -Original message- > From:Uwe Schindler > Sent: Saturday 16th September 2017 12:16 > To:

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
re/org/apache/lucene/analysis/tokenattributes/TypeAttribute.html > [2] : > https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html > > Il giorno mer 14 giu 2017 alle ore 23:33 Markus Jelsma < > markus.jel...@openindex.io> ha scrit

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
an a number in there you have to > provide your own decoders and the like to make sense of your > payload > > Best, > Erick (Erickson, not Hatcher) > > On Wed, Jun 14, 2017 at 2:22 PM, Markus Jelsma > <markus.jel...@openindex.io> wrote: > > Hello Erik, > >

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
June 2017 23:03 > To: java-user@lucene.apache.org > Subject: Re: Using POS payloads for chunking > > Markus - how are you encoding payloads as bitsets and use them for scoring? > Curious to see how folks are leveraging them. > > Erik > > > On Jun 14, 2

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
Hello, We use POS-tagging too, and encode the tags as payload bitsets for scoring, which is, as far as i know, the only possibility with payloads. So, instead of encoding them as payloads, why not index your treebank's POS tags as tokens on the same position, like synonyms? If you do that, you can
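
A minimal plain-Java sketch of the "tags as tokens on the same position" idea follows. It does not use Lucene's TokenStream API; it only shows the resulting layout, where each tag shares the position of its word, as a synonym would. The class name, method name and the term@position notation are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of stacking POS tags on word positions; NOT a Lucene TokenFilter.
final class PosTokenSketch {

    // Hypothetical helper: emits each word followed by its tag at the same
    // position, written as "term@position". Assumes words and tags have equal length.
    static List<String> stack(String[] words, String[] tags) {
        List<String> out = new ArrayList<>();
        for (int pos = 0; pos < words.length; pos++) {
            out.add(words[pos] + "@" + pos); // the word advances the position
            out.add(tags[pos] + "@" + pos);  // the tag stays on the same position
        }
        return out;
    }
}
```

In a real token filter the tag token would be emitted with a position increment of 0, which is how stacked synonyms are represented.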

RE: Term no longer matches if PositionLengthAttr is set to two

2017-05-04 Thread Markus Jelsma
Ok, we decided not to implement PositionLengthAttribute for now: either it is badly applied (how could one even misapply that attribute?), or Solr's QueryBuilder has a weird way of dealing with it, or... well. Thanks, Markus -Original message- > From:Markus Jelsma

RE: Term no longer matches if PositionLengthAttr is set to two

2017-05-01 Thread Markus Jelsma
Hello again, apologies for cross-posting and having to get back to this unsolved problem. Initially i thought this was a problem i have with, or in, Lucene. Maybe not, so is this a problem in Solr? Is there anyone who has seen this problem before? Many thanks, Markus -Original message- >

Term no longer matches if PositionLengthAttr is set to two

2017-04-25 Thread Markus Jelsma
Hello, We have a decompounder and recently implemented the PositionLengthAttribute in it and set it to 2 for a two-word compound such as drinkwater (drinking water in Dutch). The decompounder runs both at index- and query-time on Solr 6.5.0. The problem is, q=content_nl:drinkwater no longer

RE: Lucene

2017-02-08 Thread Markus Jelsma
Hello - you are on the wrong list, this is Lucene java user, not the Solr user mailing list. But this is what you are looking for: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika https://wiki.apache.org/solr/ExtractingRequestHandler First is

RE: question

2017-01-16 Thread Markus Jelsma
Yes, they should be the same unless the field is indexed with shingles; in that case order matters. Markus -Original message- > From:Julius Kravjar > Sent: Monday 16th January 2017 18:20 > To: java-user@lucene.apache.org > Subject: question > > May I have

Offset bug in WordDelimiterFilter?

2016-12-06 Thread Markus Jelsma
Hello - i noticed something peculiar running Lucene/Solr 6.3.0. The plural vaccinatieprogramma's should have a startOffset of 0 and an endOffset of 21 when passed through WordDelimiterFilter and/or stemmers, but it doesn't, slightly messing up highlighted terms. wdf = new

Range query on date field

2016-11-24 Thread Markus Jelsma
Hi - i seem to be having trouble correctly executing a range query on a date field. The following Solr document is indexed via a unit test followed by a commit:       view     test_key     2013-01-09T17:11:40Z   I can retrieve the document simply wrapping term queries in a boolean query

Upgrade 6.2.x Char* API's

2016-09-21 Thread Markus Jelsma
Hello - upgrading one of our libraries to 6.2.0 failed due to LUCENE-7318. This is fixed nicely on 6.2.1, many thanks for that! Upgrading to 6.2.1, however, still raises compile errors. I haven't seen any notice of this in CHANGES.txt or its API changes section for both 6.2.x versions. Any

RE: LowerCaseFilter gone in 6.2.0

2016-08-31 Thread Markus Jelsma
; H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message- > > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > > Sent: Wednesday, August 31, 2016 11:08 AM > > To: java-user@lucene.apache.org >

LowerCaseFilter gone in 6.2.0

2016-08-31 Thread Markus Jelsma
Hello - i'm upgrading a project that uses Lucene to 6.2.0 and get the compile error that LowerCaseFilter does not exist. And, so it seems, the JavaDoc is gone too. I've checked CHANGES.txt and there is no mention of it, not even in the API changes section. Any ideas? Thanks, Markus

RE: BlendedTermQuery causing negative IDF?

2016-04-19 Thread Markus Jelsma
are developed for indices in which stop > words are eliminated. > Therefore, most of the term-weighting models have problems scoring common > terms. > By the way, DFI model does a decent job when handling common terms. > > Ahmet > > > > On Tuesday, April 19, 2016 4:48 PM,

BlendedTermQuery causing negative IDF?

2016-04-19 Thread Markus Jelsma
Hello, I just made a Solr query parser for BlendedTermQuery on Lucene 6.0 using BM25 similarity and i have a very simple unit test to see if something is working at all. But to my surprise, one of the results has a negative score, caused by a negative IDF because docFreq is higher than
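
The arithmetic behind a negative IDF can be sketched in a few lines of plain Java. The formula below is the BM25 IDF as, to my knowledge, Lucene's BM25Similarity computes it; treating the truncated "higher than" as referring to the document count is an assumption, as are the class and method names.

```java
// Sketch of the BM25 IDF arithmetic; NOT a copy of Lucene's BM25Similarity.
final class Bm25IdfSketch {

    // idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    // Assumption: docFreq may exceed docCount when statistics are blended.
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5d) / (docFreq + 0.5d));
    }
}
```

With consistent statistics (docFreq at most docCount) the fraction is positive and the logarithm stays above zero. Once docFreq exceeds docCount, the fraction turns negative, the argument of the logarithm drops below 1, and the IDF becomes negative. For example, docFreq = 2 and docCount = 1 gives ln(0.8).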

RE: Problem with porter stemming

2016-03-14 Thread Markus Jelsma
Hi - if you don't want specific words passed through a stemmer, you need to supply a CharArraySet with exclusions as the second argument to its constructor. Markus -Original message- > From:Dwaipayan Roy > Sent: Monday 14th March 2016 15:31 > To:

RE: Jira issue for possibly transient resource issue, or a Lucene or JVM bug?

2016-01-21 Thread Markus Jelsma
issue, or a Lucene or > JVM bug? > > LUCENE-6970 > > On Thu, Jan 21, 2016 at 4:07 PM, Markus Jelsma <markus.jel...@openindex.io> > wrote: > > > Hi - we get the above issue as well some times. I've noticed Lucene-dev > > mails on this issue [1] but i

RE: propagate Query.rewrite call to super.rewrite after 5.4 upgrade

2015-12-17 Thread Markus Jelsma
mentation needs to rewrite to a BoostQuery. You can do that by > prepending the following to your rewrite(IndexReader) implementation: > > if (getBoost() != 1f) { return super.rewrite(reader); } > > > Le jeu. 17 déc. 2015 à 13:23, Markus Jelsma <markus.jel...@openindex.io>

propagate Query.rewrite call to super.rewrite after 5.4 upgrade

2015-12-17 Thread Markus Jelsma
Hi, Apologies for the cross post. We have a class overriding SpanPositionRangeQuery. It is similar to a SpanFirst query but it is capable of adjusting the boost value with regard to distance. With the 5.4 upgrade the unit tests suddenly threw the following exception: Query class

LUCENE-5388 AbstractMethodError

2014-01-30 Thread Markus Jelsma
Hi, Apologies for cross posting; i got no response on the Solr list. We have a development environment running trunk but have custom analyzers and token filters built on 4.6.1. Now the constructors have changed somewhat and stuff breaks. Here's a consumer trying to get a TokenStream from an

RE: LUCENE-5388 AbstractMethodError

2014-01-30 Thread Markus Jelsma
Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Thursday, January 30, 2014 10:50 AM To: java-user@lucene.apache.org Subject: LUCENE-5388 AbstractMethodError Hi, Apologies for cross posting; i got no response on the Solr list. We have a development

RE: LUCENE-5388 AbstractMethodError

2014-01-30 Thread Markus Jelsma
://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Thursday, January 30, 2014 12:52 PM To: java-user@lucene.apache.org Subject: RE: LUCENE-5388 AbstractMethodError Hi Uwe, The bug occurred only

Coordination factor disabled for BM25 and other new scoring models

2013-08-22 Thread Markus Jelsma
Hi, I know it is recommended to disable the coordination factor when using models other than the default TFIDFSimilarity. Out of curiosity i'd like to know the motivation behind it, but it is not explained anywhere, not even in LUCENE-2959, the patches, wiki, PDFs or whatever. So, anyone here

Final token filters

2013-08-19 Thread Markus Jelsma
Hi, This has likely been discussed before but i couldn't find it. Why are most token filters final, or why are most or all of their members private and/or final? It is impossible to customize token filters by extending them; instead we need to copy code around. How do you customize for example some bits

RE: read past EOF when merge

2012-11-05 Thread Markus Jelsma
code that uses Directory for replication. - Mark On Nov 2, 2012, at 6:53 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, For what it's worth, we have seen similar issues with Lucene/Solr from this week's trunk. The issue manifests itself when it wants to replicate

RE: read past EOF when merge

2012-11-02 Thread Markus Jelsma
Hi, For what it's worth, we have seen similar issues with Lucene/Solr from this week's trunk. The issue manifests itself when it wants to replicate. The servers have not been taken offline and did not crash when this happened. 2012-10-30 16:12:51,061 WARN [solr.handler.ReplicationHandler] -

RE: read past EOF when merge

2012-11-02 Thread Markus Jelsma
No, this is not using NFS but EXT3 on SSD. Thanks -Original message- From:Michael McCandless luc...@mikemccandless.com Sent: Fri 02-Nov-2012 16:22 To: java-user@lucene.apache.org Subject: Re: "read past EOF" when merge On Fri, Nov 2, 2012 at 6:53 AM, Markus Jelsma

RE: Highlighter IOOBE with modified HyphenationCompoundWordTokenFilter

2012-10-05 Thread Markus Jelsma
Matthijs li...@selckin.be Sent: Thu 04-Oct-2012 15:55 To: java-user@lucene.apache.org Subject: Re: Highlighter IOOBE with modified HyphenationCompoundWordTokenFilter And to include the code On Thu, Oct 4, 2012 at 3:52 PM, Markus Jelsma markus.jel...@openindex.io wrote: I forgot to add

Highlighter IOOBE with modified HyphenationCompoundWordTokenFilter

2012-10-04 Thread Markus Jelsma
Hi, I've modified the HyphenationCompoundWordTokenFilter to emit fewer subtokens because the original filter can emit all kinds of subtokens that have a very different meaning on their own. I've modified it so no overlapping subtokens are emitted and no subtokens are emitted that can be found

RE: Highlighter IOOBE with modified HyphenationCompoundWordTokenFilter

2012-10-04 Thread Markus Jelsma
I forgot to add that this is with today's build of trunk. -Original message- From:Markus Jelsma markus.jel...@openindex.io Sent: Thu 04-Oct-2012 15:42 To: java-user@lucene.apache.org Subject: Highlighter IOOBE with modified HyphenationCompoundWordTokenFilter Hi, I've modified

Re: what's the status of droids project(http://incubator.apache.org/droids/)?

2011-08-23 Thread Markus Jelsma
You should ask on the Droids list but there's some activity in Jira. And did you consider Apache Nutch? On Tuesday 23 August 2011 10:17:50 Li Li wrote: hi all I am interested in vertical crawler. But it seems this project is not very active. Its last update time is 11/16/2009

Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Markus Jelsma
[X] ASF Mirrors (linked in our release announcements or via the Lucene website) [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.) [X] I/we build them from source via an SVN/Git checkout. [] Other (someone in your company mirrors them internally or via a downstream