RE: run in eclipse error

2017-10-17 Thread Mike Sokolov
Checkstyle has a OneTopLevelClass rule that would enforce this. On October 17, 2017 3:45:01 AM EDT, Uwe Schindler wrote: >Hi, > >this has nothing to do with the Java version. I generally ignore this >Eclipse-failure as I only develop in Eclipse, but run from command >line. The

Re: FunctionValues vs DoubleValuesSource

2017-10-13 Thread Mike Sokolov
Oh thanks Alan that's a good suggestion, but I already wrote max and sum double values sources since it was easy enough. If you think that's a good approach I could post a patch. On October 13, 2017 3:57:30 AM EDT, Alan Woodward wrote: >Hi, > >Yes, moving stuff over to

Re: Accent insensitive search for greek characters

2017-09-27 Thread Mike Sokolov
These are only used in classical Greek I think, which probably explains why they are not covered by the simpler filter. On September 27, 2017 9:48:37 AM EDT, Ahmet Arslan wrote: >I may be wrong about ASCIIFoldingFilter. Please go with the >ICUFoldingFilter. >Ahmet >On

Re: Small Vocabulary

2012-08-06 Thread Mike Sokolov
There was some interesting work done on optimizing queries including very common words (stop words) that I think overlaps with your problem. See this blog post http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2 from the Hathi Trust. The upshot in a

Re: find meaningful words through Lucene

2012-06-27 Thread Mike Sokolov
Maybe high frequency terms that are not evenly distributed throughout the corpus would be a better definition. Discriminative terms. I'm sure there is something in the machine learning literature about unsupervised clustering that would help here. But I don't know what it is :) -Mike On

Re: Fast way to get the start of document

2012-06-25 Thread Mike Sokolov
with an extra 1st page field for the too-huge documents. -Paul -Original Message- From: Mike Sokolov [mailto:soko...@ifactory.com] Sent: Saturday, June 23, 2012 7:16 PM To: java-user@lucene.apache.org Cc: Jack Krupansky Subject: Re: Fast way to get the start of document I got the sense

Re: Fast way to get the start of document

2012-06-23 Thread Mike Sokolov
whether to highlight. -Mike Sokolov On 6/23/2012 6:17 PM, Jack Krupansky wrote: Simply have two fields, full_body and limited_body. The former would index but not store the full document text from Tika (the content metadata.) The latter would store but not necessarily index the first 10K or so

Re: filter by term frequency

2012-06-17 Thread Mike Sokolov
://wiki.apache.org/solr/FunctionQuery#tf Lucene does have FunctionQuery, ValueSource, and TermFreqValueSource. See: http://lucene.apache.org/solr/api/org/apache/solr/search/function/FunctionQuery.html -- Jack Krupansky -Original Message- From: Mike Sokolov Sent: Saturday, June 16, 2012 2

filter by term frequency

2012-06-16 Thread Mike Sokolov
I imagine this is a question that comes up from time to time, but I haven't been able to find a definitive answer anywhere, so... I'm wondering whether there is some type of Lucene query that filters by term frequency. For example, suppose I want to find all documents that have exactly 2
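To make the question concrete, here is a minimal plain-Python sketch of filtering by term frequency. It is not a Lucene query and does not use any Lucene API. It assumes the goal is to keep documents in which a given term occurs an exact number of times; the function name, the lowercase-and-split tokenization, and the sample documents are all hypothetical.

```python
from collections import Counter


def docs_with_term_frequency(docs, term, count):
    """Return ids of documents in which `term` occurs exactly `count` times.

    `docs` maps a document id to its text. Tokenization here is a naive
    lowercase + whitespace split, which is an assumption for illustration.
    """
    matches = []
    for doc_id, text in docs.items():
        term_freq = Counter(text.lower().split())[term]
        if term_freq == count:
            matches.append(doc_id)
    return matches


if __name__ == "__main__":
    # Hypothetical sample documents.
    sample = {
        "doc1": "lucene search lucene",
        "doc2": "lucene",
        "doc3": "search index",
    }
    print(docs_with_term_frequency(sample, "lucene", 2))
```

A real index would read the frequency from postings rather than re-tokenizing the text on every request; the sketch only pins down the filtering semantics.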

Re: Approaches/semantics for arbitrarily combining boolean and proximity search operators?

2012-05-16 Thread Mike Sokolov
It sounds to me as if there could be a market for a new kind of query that would implement: A w/5 (B and C) in the way that people understand it to mean - the same A near both B and C, not just any A. Maybe it's too hard to implement using rewrites into existing SpanQueries? In terms of the

surround parser match-all query

2012-05-06 Thread Mike Sokolov
Does anybody know how to express a MatchAllDocsQuery in surround query parser language? I've tried * and() but those don't parse. I looked at the grammar and I don't think there is a way. Please let us all know if you know otherwise! Thanks -Mike Sokolov

Re: surround parser match-all query

2012-05-06 Thread Mike Sokolov
...@ifactory.com wrote: does anybody know how to express a MatchAllDocsQuery in surround query parser language? I've tried * and() but those don't parse. I looked at the grammar and I don't think there is a way. Please let us all know if you know otherwise! Thanks -Mike Sokolov

Re: surround parser match-all query

2012-05-06 Thread Mike Sokolov
in surround query parser language? I've tried * and() but those don't parse. I looked at the grammar and I don't think there is a way. Please let us all know if you know otherwise! Thanks -Mike Sokolov

Re: surround parser match-all query

2012-05-06 Thread Mike Sokolov
know if it would be worth the trouble. It turns out in my very specific case I have a term that appears in every document in a particular field, so I am just using a search for that at the moment. -Mike On 5/6/2012 8:04 PM, Mike Sokolov wrote: I think what I have in mind would be purely

Re: Retrieving offsets

2012-01-19 Thread Mike Sokolov
I think you have hit on all the best solutions. The Jira issues you mentioned do indeed hold out some promising solutions here, but they are a ways away, requiring some significant re-plumbing and I'm not sure there is a lot of attention being paid to that at the moment. You should vote for

Re: Lucene 4.0 Index Format Finalization Timetable

2011-12-07 Thread Mike Sokolov
My personal view, as a bystander with no more information than you, is that one has to assume there will be further index format changes before a 4.0 release. This is based on the number of changes in the last 9 months, and the amount of activity on the dev list. For us the implication is we

Re: Advanced NearSpanQuery

2011-07-13 Thread Mike Sokolov
Can you wrap a SpanNearQuery around a DisjunctionSumQuery with minNrShouldMatch=8? -Mike On 07/13/2011 08:53 AM, Jeroen Lauwers wrote: Hi, I was wondering if anyone could help me on this: I want to search for: 1. a set of words (eg. 10) 2. only a couple of words may come in

Re: Advanced NearSpanQuery

2011-07-13 Thread Mike Sokolov
me in the right direction? Jeroen -Original Message- From: Mike Sokolov [mailto:soko...@ifactory.com] Sent: Wednesday, 13 July 2011 15:23 To: java-user@lucene.apache.org Cc: Jeroen Lauwers Subject: Re: Advanced NearSpanQuery Can you wrap a SpanNearQuery around a DisjunctionSumQuery

highlighting performance

2011-06-20 Thread Mike Sokolov
Our apps use highlighting, and I expect that highlighting is an expensive operation since it requires processing the text of the documents, but I ran a test and was surprised just how expensive it is. I made a test index with three fields: path, modified, and contents. I made the index using

Re: Sharding Techniques

2011-05-10 Thread Mike Sokolov
Down to basics, Lucene searches work by locating terms and resolving documents from them. For standard term queries, a term is located by a process akin to binary search. That means that it uses log(n) seeks to get the term. Let's say you have 10M terms in your corpus. If you stored that in a
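The log(n) claim for 10M terms can be checked with a small plain-Python sketch. This is an ordinary binary search over a sorted list, used only to illustrate the arithmetic; it is not Lucene's term dictionary, and the function names are hypothetical.

```python
def binary_search_probes(sorted_terms, target):
    """Binary search that also counts how many entries were examined.

    Returns (index, probes); index is -1 if the target is absent.
    """
    lo, hi = 0, len(sorted_terms) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_terms[mid] == target:
            return mid, probes
        if sorted_terms[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes


def max_probes(n):
    """Worst-case probes for binary search over n sorted items.

    For n > 0 this equals floor(log2(n)) + 1, which is n.bit_length().
    """
    return n.bit_length()


if __name__ == "__main__":
    # 10M terms, the corpus size used in the message above.
    print(max_probes(10_000_000))
```

For 10,000,000 terms the worst case is 24 probes, since 2^23 < 10,000,000 < 2^24. If each probe cost one disk seek, that would be the upper bound on seeks per term lookup in this simplified model.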

Re: QueryValidator

2011-05-05 Thread Mike Sokolov
It's an idea - sorry I don't have an implementation I can share easily; it's embedded in our application code and not easy to refactor. I'm not sure where this would fit in the solr architecture; maybe some subclass of SearchHandler? I guess the query rewriter would need to be aware of which

Re: new to lucene, non standard index

2011-05-05 Thread Mike Sokolov
Are the tokens unique within a document? If so, why not store a document for every doc/token pair with fields: id (doc#/token#) doc-id (doc#) token weight1 weight2 frequency Then search for token, sort by weight1, weight2 or frequency. If the token matches are unique within a document you
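A minimal plain-Python sketch of the one-record-per-doc/token idea follows. It models the suggested fields (id, doc-id, token, weight1, weight2, frequency) as a dataclass and sorts matches by a chosen field. It is not Lucene code; the class and function names, the descending sort order, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass

SORTABLE_FIELDS = ("weight1", "weight2", "frequency")


@dataclass(frozen=True)
class TokenRecord:
    """One record per (document, token) pair, as suggested above."""
    doc_id: str
    token: str
    weight1: float
    weight2: float
    frequency: int

    @property
    def id(self):
        # Composite key in the doc#/token# style.
        return f"{self.doc_id}/{self.token}"


def search(records, token, sort_by):
    """Return doc ids whose record matches `token`, highest `sort_by` first."""
    if sort_by not in SORTABLE_FIELDS:
        raise ValueError(f"cannot sort by {sort_by!r}")
    hits = [r for r in records if r.token == token]
    hits.sort(key=lambda r: getattr(r, sort_by), reverse=True)
    return [r.doc_id for r in hits]


if __name__ == "__main__":
    # Hypothetical sample values.
    records = [
        TokenRecord("doc1", "foo", weight1=0.2, weight2=0.5, frequency=5),
        TokenRecord("doc2", "foo", weight1=0.9, weight2=0.1, frequency=3),
    ]
    print(search(records, "foo", "frequency"))
    print(search(records, "foo", "weight1"))
```

Because each record carries exactly one token, the same token can rank documents differently depending on which field drives the sort.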

Re: new to lucene, non standard index

2011-05-05 Thread Mike Sokolov
that contain foo, but I want them sorted by frequency. Then, I would have doc1, doc2. Now, I want to search for all the documents that contain foon, but I want them sorted by weight1. Then, I would have doc2, doc1 Does that clarify? On May 5, 2011, at 3:01 PM, Mike Sokolov wrote

proposed change to CharTokenizer

2010-10-14 Thread Mike Sokolov
Background: I've been trying to enable hit highlighting of XML documents in such a way that the highlighting preserves the well-formedness of the XML. I thought I could get this to work by implementing a CharFilter that extracts text from XML (somewhat like HTMLStripCharFilter, except I am