Re: Lucene same search result for words with and without spaces

2018-06-26 Thread Ahmet Arslan
Hi Egorlex,

The shingle filter won't turn "similarissues" into "similar issues", but it can do 
the reverse.
It works like a sliding window. Think about what the indexed tokens would be if you 
set the token separator to "".
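For illustration, a rough sketch of an index-time analyzer that glues adjacent words
together with an empty separator (the chain itself is just an assumption about your setup;
classes are from org.apache.lucene.analysis.standard, .core and .shingle):

Analyzer shingleAnalyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream stream = new LowerCaseFilter(source);
    ShingleFilter shingles = new ShingleFilter(stream, 2, 2); // emit 2-word shingles
    shingles.setTokenSeparator("");   // "similar issues" also yields the token "similarissues"
    // unigrams are still emitted by default, so "similar" and "issues" remain searchable
    return new TokenStreamComponents(source, shingles);
  }
};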

Ahmet








On Wednesday, June 20, 2018, 12:42:22 PM GMT+3, egorlex  
wrote: 





Thanks for the reply!

sorry, could you help a little, according to example

"given the phrase “Shingles is a viral disease”, a shingle filter might
produce:

Shingles is
is a
a viral
viral disease
"

I do not quite understand how this ShingleFilter can turn "similarissues"
into "similar issues" 


Thanks!



--
Sent from: http://lucene.472066.n3.nabble.com/Lucene-Java-Users-f532864.html

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Lucene same search result for words with and without spaces

2018-06-19 Thread Ahmet Arslan
Hi Egorlex,

ShingleFilter could be used to achieve your goal.

Ahmet







On Tuesday, June 19, 2018, 8:06:46 PM GMT+3, egorlex  wrote: 





Hi,

I need help with Lucene.

How can I achieve the same search results for words with and without spaces?

For example request "similar issues" and "similarissues" must return all
Similar Issues.

Thanks.



--
Sent from: http://lucene.472066.n3.nabble.com/Lucene-Java-Users-f532864.html

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





Re: Case Insensitive Search for StringField

2018-05-25 Thread Ahmet Arslan
 Hi,
A string_ci type could be constructed from: KeywordTokenizer + LowerCaseFilter 
+ maybe a TrimFilter.
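A sketch with CustomAnalyzer (note this means indexing the field as an analyzed
TextField instead of a StringField):

return CustomAnalyzer.builder()
.withTokenizer("keyword")      // keep the whole value as a single token
.addTokenFilter("lowercase")
.addTokenFilter("trim")
.build();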
Ahmet
On Friday, May 25, 2018, 1:50:19 PM GMT+3, Chellasamy G <chellasam...@zohocorp.com> wrote:

Hi Team,





Kindly help me out with this problem.





Thanks,

Satyan





 On Wed, 23 May 2018 15:01:39 +0530 Chellasamy G <chellasam...@zohocorp.com> wrote 

Hi,

Thanks for the reply.

Actually I need to implement it for StringField, which is non-analyzed. So, if I
am not wrong, I can't add an analyzer for a StringField.

My scenario is similar to the one discussed in the thread below:

https://discuss.elastic.co/t/es-5-0-case-insensitive-search-for-keyword-fields/64111/10

Could you please let me know how to do the same thing in Lucene.

Thanks,
Satyan

 On Wed, 23 May 2018 12:09:31 +0530 Adrien Grand <jpou...@gmail.com> wrote 

Hi Satyan,

You need to add a LowerCaseFilter to your analysis chain. The way to
do it depends on how you are building your analyzer today (pre-built
analyzer, extending Analyzer, or using CustomAnalyzer). This will preserve
the original case in field values because lowercasing will only be applied
to the content of the inverted index, not the stored fields where hits are
fetched from.

Le mer. 23 mai 2018 à 08:36, Chellasamy G <chellasam...@zohocorp.com> a
écrit :

> Hi,
>
> I can't find any way to perform case insensitive search on StringField.
> Please help me out.
>
> i.e. If the field value is "Flying Robots", then the phrases "flying
> robots", "fLying RObots" etc. should match the value.
>
> I also need the original case of the field value to be preserved in the
> search results.
>
> Thanks,
> Satyan

Re: Custom Similarity

2018-02-08 Thread Ahmet Arslan


Hi Roy,


In order to activate payloads during scoring, you need to do two separate 
things at the same time:
* use a payload-aware query type: org.apache.lucene.queries.payloads.*
* use a payload-aware similarity

Here is an old post that might inspire you :  
https://lucidworks.com/2009/08/05/getting-started-with-payloads/
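For the indexing side, a rough sketch (the analyzer chain is an assumption;
DelimitedPayloadTokenFilter turns a trailing "|<float>" on each token into a payload):

// index-time: a token written as "lucene|0.75" gets the float 0.75 attached as its payload
Analyzer payloadAnalyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    TokenStream stream = new DelimitedPayloadTokenFilter(source, '|', new FloatEncoder());
    return new TokenStreamComponents(source, stream);
  }
};
// query-time: wrap a SpanTermQuery in one of the payload-aware queries from
// org.apache.lucene.queries.payloads (e.g. PayloadScoreQuery with a PayloadFunction),
// so the stored weights actually reach the scoring code; the exact constructors
// differ between Lucene versions.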


Ahmet



On Saturday, January 27, 2018, 5:43:36 PM GMT+3, Dwaipayan Roy 
 wrote: 





Thanks for your replies. But I am still not sure how to do it. Can you please
provide me with an example code snippet, or a link to some page where I can
find one?

Thanks..

On Tue, Jan 16, 2018 at 3:28 PM, Dwaipayan Roy 
wrote:

> ​I want to make a scoring function that will score the documents by the
> following function:
> given Q = {q1, q2, ... }
> score(D,Q) =
>    for all qi:
>      SUM of {
>          LOG { weight_1(qi) + weight_2(qi) + weight_3(qi) }
>      }
>
> I have stored weight_1, weight_2 and weight_3 for all term of all
> documents as payload, with payload delimiter = | (pipe) during indexing.
>
> However, I am not sure on how to integrate all the weights during
> retrieval. I am sure that I have to @Override some score() but not sure
> about the exact class.
>
> Please help me here.
> ​
> Best,
> Dwaipayan..​

>
>


-- 
Dwaipayan Roy.


Re: To get the term-freq

2017-11-17 Thread Ahmet Arslan
Hi,

I am also interested in the answer to this question. 
I wonder whether the term freq. function query would work here.
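A minimal sketch of such an f(docid, term), assuming the Lucene 5.x/6.x MultiFields
API (reader, field, word and luceneDocId are whatever you already have):

Term term = new Term(field, word);
PostingsEnum postings = MultiFields.getTermDocsEnum(reader, field, term.bytes());
int tf = 0;
if (postings != null && postings.advance(luceneDocId) == luceneDocId) {
    tf = postings.freq();   // number of occurrences of 'word' in that document
}
// advance() jumps straight to the first doc >= luceneDocId, so this avoids
// iterating the whole posting list for every lookup.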

Ahmet





On Friday, November 17, 2017, 10:32:23 AM GMT+3, Dwaipayan Roy 
 wrote: 





​Hi,

I want to get the term frequency of a given term t in a given document with
lucene docid say d.
Formally, I need a function say f() that takes two arguments: 1.
lucene-docid d, 2. term t, and returns the number of time t occurs in d.

I know of one solution, that is, traversing the whole document using
TermsEnum iterator, but it is taking a lot of time. I want a solution that
works fast.

Dwaipayan.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: get begin/end of matched terms

2017-10-21 Thread Ahmet Arslan

Hi Nicolas,

With the SpanQuery family, it is possible to retrieve spans (index/position 
information).

Also, you may find luwak relevant. 

https://github.com/flaxsearch/luwak
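A rough sketch of pulling match positions out of a SpanQuery, assuming Lucene 6.x
(field and term names are made up):

SpanTermQuery sq = new SpanTermQuery(new Term("content", "lucene"));
SpanWeight weight = sq.createWeight(searcher, false);          // false: no scores needed
for (LeafReaderContext ctx : searcher.getIndexReader().leaves()) {
    Spans spans = weight.getSpans(ctx, SpanWeight.Postings.POSITIONS);
    if (spans == null) continue;                                // term absent in this segment
    while (spans.nextDoc() != Spans.NO_MORE_DOCS) {
        while (spans.nextStartPosition() != Spans.NO_MORE_POSITIONS) {
            int start = spans.startPosition();                  // token positions of the match
            int end = spans.endPosition();
            // map positions to character offsets via term vectors or re-analysis if needed
        }
    }
}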


Ahmet







On Sunday, October 22, 2017, 1:16:01 AM GMT+3, Nicolas Paris 
 wrote: 





Hi


I am looking for a way to get the index of matched terms. Right now, I
haven't found any trivial solution. I've found the highlighter code [1],
which looks like it does the job. Before I start coding that myself, someone
may already have done it and could point me in the right direction.

Thanks

[1] 
https://github.com/apache/lucene-solr/blob/0971fe691aa9446ab6f4442b6d79ae1c81e31594/lucene/highlighter/src/java/org/apache/lucene/search/highlight/Highlighter.java


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Accent insensitive search for greek characters

2017-09-27 Thread Ahmet Arslan
 I may be wrong about ASCIIFoldingFilter. Please go with the ICUFoldingFilter.
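A minimal sketch (assumes the lucene-analyzers-icu module is on the classpath;
ICUFoldingFilter also case-folds):

Analyzer greekFolding = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream stream = new ICUFoldingFilter(source); // folds diacritics, incl. Greek accents
    return new TokenStreamComponents(source, stream);
  }
};
// Use the same analyzer at index and query time so accented and unaccented
// forms end up as the same terms.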
Ahmet
On Wednesday, September 27, 2017, 3:47:01 PM GMT+3, Chitra 
<chithu.r...@gmail.com> wrote:  
 
Hi Ahmet, thank you so much for the reply.

I have tried it, but it seems ASCIIFoldingFilter does not support Greek accented 
characters; it only supports Latin-like accented characters. Am I missing 
anything?



Chitra



On Wed, Sep 27, 2017 at 5:47 PM, Ahmet Arslan <iori...@yahoo.com> wrote:



Hi,
Yes ICUFoldingFilter or ASCIIFoldingFilter could be used.
ahmet 

 
 
 On Wednesday, September 27, 2017, 1:54:43 PM GMT+3, Chitra 
<chithu.r...@gmail.com> wrote: 





Hi,
                In Lucene, I want to search greek characters(with accent
insensitive) by removing or replacing accent marks with similar characters.

Example: we are trying to convert  Greek Extended characters
<http://www.unicode.org/charts/PDF/U1F00.pdf> to basic Greek Unicode
<http://www.unicode.org/charts/PDF/U0370.pdf> for providing accent
insensitive search...


Kindly suggest the better solution to achieve this...? Does
ICUFoldingFilter solve my use-case?

-- 
Regards,
Chitra





-- 
Regards,Chitra

Re: Accent insensitive search for greek characters

2017-09-27 Thread Ahmet Arslan


Hi,
Yes ICUFoldingFilter or ASCIIFoldingFilter could be used.
ahmet 

 
 
 On Wednesday, September 27, 2017, 1:54:43 PM GMT+3, Chitra 
 wrote: 





Hi,
                In Lucene, I want to search greek characters(with accent
insensitive) by removing or replacing accent marks with similar characters.

Example: we are trying to convert  Greek Extended characters
 to basic Greek Unicode
 for providing accent
insensitive search...


Kindly suggest the better solution to achieve this...? Does
ICUFoldingFilter solve my use-case?

-- 
Regards,
Chitra



Re: Re: What is the fastest way to loop over all documents in an index?

2017-09-05 Thread Ahmet Arslan

Hi Ishan,

I saw following loop is suggested for this task in the stack overflow.

for (int i=0; i<reader.maxDoc(); i++)

How can we confirm that internal Lucene IDs are sequential numbers from 0 to 
maxDoc()-1?

I thought that they are arbitrary integers.

Ahmet



On Tuesday, September 5, 2017, 7:54:31 AM GMT+3, Ishan Chattopadhyaya 
<ichattopadhy...@gmail.com> wrote: 





Maybe IndexReader#document(), looping over docids is the best here?
http://lucene.apache.org/core/6_6_0/core/org/apache/lucene/index/IndexReader.html#document-int-
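Something like the sketch below (internal doc ids run from 0 to maxDoc()-1; deleted
documents keep their ids until segments are merged and are flagged in the live-docs bitset):

Bits liveDocs = MultiFields.getLiveDocs(reader);   // null when the index has no deletions
for (int i = 0; i < reader.maxDoc(); i++) {
    if (liveDocs != null && !liveDocs.get(i)) {
        continue;                                  // skip deleted documents
    }
    Document doc = reader.document(i);             // stored fields of document i
    // ... process doc ...
}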

On Tue, Sep 5, 2017 at 7:57 AM, Ahmet Arslan <iori...@yahoo.com.invalid>
wrote:

> Hi Jean,
>
> I am also interested answers to this question. I need this feature too.
> Currently I am using a hack.
> I create an artificial field (with an artificial token) attached to every
> document.
>
> I traverse all documents using the code snippet given in my previous
> related question. (no one answered to it)
>
> http://lucene.472066.n3.nabble.com/PostingsEnum-for-
> documents-that-does-not-contain-a-term-td4349482.html
> I found EverythingEnum class in the Lucene50PostingsReader.java, but I
> couldn't figure out how to use it.
> So, I do not know if this class is for the task, but its name looks
> promising.
> Thanks,Ahmet
>
>
>
> On Tuesday, September 5, 2017, 3:09:37 AM GMT+3, Jean Claude van Johnson <
> vanjohnsonjeancla...@gmail.com> wrote:
>
>
>
>
>
> Hi there,
>
> I have a use case where I need to iterate over all documents in an index
> from time to time.
> It seems that the MatchAllDocsQuery is what I should use for this, however
> it creates a bunch of Objects (Score etc) that I don’t really need.
>
> My question to you is:
>
> What is the fastest way to loop over all documents in an index?
> Is it looping over all possible doc id’s (+filtering out deleted
> documents)?
>
> Thank you very much.
>
> Best regards
> Claude
>


Re: What is the fastest way to loop over all documents in an index?

2017-09-04 Thread Ahmet Arslan
Hi Jean,

I am also interested answers to this question. I need this feature too. 
Currently I am using a hack.
I create an artificial field (with an artificial token) attached to every 
document. 

I traverse all documents using the code snippet given in my previous related 
question. (no one answered to it)

http://lucene.472066.n3.nabble.com/PostingsEnum-for-documents-that-does-not-contain-a-term-td4349482.html
I found EverythingEnum class in the Lucene50PostingsReader.java, but I couldn't 
figure out how to use it.
So, I do not know if this class is for the task, but its name looks promising.
Thanks,Ahmet



On Tuesday, September 5, 2017, 3:09:37 AM GMT+3, Jean Claude van Johnson 
 wrote: 





Hi there,

I have a use case where I need to iterate over all documents in an index from 
time to time.
It seems that the MatchAllDocsQuery is what I should use for this, however it 
creates a bunch of Objects (Score etc) that I don’t really need.

My question to you is: 

What is the fastest way to loop over all documents in an index?
Is it looping over all possible doc id’s (+filtering out deleted documents)?

Thank you very much.

Best regards
Claude


Re: Occur.FILTER clarification

2017-08-11 Thread Ahmet Arslan
Hi Adrien,
Thank you for the explanation.
Here is what I have ended up with.
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder
.add(new MatchAllDocsQuery(), BooleanClause.Occur.FILTER)
.add(new TermQuery(term), BooleanClause.Occur.MUST_NOT);

Ahmet

On Friday, August 11, 2017, 3:58:25 PM GMT+3, Adrien Grand <jpou...@gmail.com> 
wrote:


FILTER does the opposite of MUST_NOT.

Regarding scoring, putting the query in a FILTER or MUST_NOT clause is good
enough since such clauses do not need scores. You do not need to add an
additional ConstantScoreQuery wrapper.

Le mar. 8 août 2017 à 23:06, Ahmet Arslan <iori...@yahoo.com.invalid> a
écrit :


> Hi all,
> I am trying to access document length statistics of the documents that do
> not contain a given term.
> I have written the following piece of code:
>
> BooleanQuery.Builder builder = new BooleanQuery.Builder();
> builder.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST)
>        .add(new TermQuery(term), BooleanClause.Occur.FILTER);
> ScoreDoc[] hits = searcher.search(new ConstantScoreQuery(builder.build()),
> Integer.MAX_VALUE).scoreDocs;
> Javadoc says FILTER is like MUST except that these clauses do not
> participate in scoring.
> I was expecting FILTER to behave like MUST_NOT, no? (FILTER-clause is
> excluded from the result set)
>
> Also, to disable scoring altogether, is it enough to wrap the final boolean
> query in a ConstantScoreQuery? Or should the individual clauses be wrapped too?
> MatchAllDocsQuery is already a constant-score query, right?
> Thanks,
> Ahmet
>

Occur.FILTER clarification

2017-08-08 Thread Ahmet Arslan
Hi all,
I am trying to access document length statistics of the documents that do not 
contain a given term.
I have written the following piece of code:

BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST)
       .add(new TermQuery(term), BooleanClause.Occur.FILTER);
ScoreDoc[] hits = searcher.search(new ConstantScoreQuery(builder.build()), 
Integer.MAX_VALUE).scoreDocs;
Javadoc says FILTER is like MUST except that these clauses do not participate 
in scoring.
I was expecting FILTER to behave like MUST_NOT, no? (FILTER-clause is excluded 
from the result set)

Also, to disable scoring altogether, is it enough to wrap the final boolean query 
in a ConstantScoreQuery? Or should the individual clauses be wrapped too?
MatchAllDocsQuery is already a constant-score query, right?
Thanks,
Ahmet


Re: How to fetch documents for which field is not defined

2017-08-07 Thread Ahmet Arslan
How about Solr's exists function query? How does it work? Function queries are 
now part of Lucene (org.apache.lucene.queries.function), right?
Ahmet


On Sunday, July 16, 2017, 11:19:40 AM GMT+3, Trejkaz  
wrote:


On Sat, Jul 15, 2017 at 8:12 PM, Uwe Schindler  wrote:
> That is the "Solr" answer. But it is slow like hell.
>
> In Lucene there is a natove query named FieldValueQuery already for this.
> It requires DocValues enabled for the field.
>
> IMHO, the best and fastest variant (also to Solr users) is to add a separate
> multivalued string field named 'fieldnames' where you index all field named
> that have a value. After that you can query on this using the field name.
> Elasticsearch is doing the field name approach for exists/not exists by 
> default.

The catch is, you usually have to analyse a field to determine whether
it has a value. Apparently Elasticsearch's field existence query does
not do this, so it considers blank text to be a value, which is not
the same as what the user expected when they did the query.

We *were* using FieldValueQuery, but since moving to Lucene 6 we have
stopped using uninverting reader, so that option doesn't cover all
fields, and fields like "content" aren't really practical to put in

DocValues...


The approach to add a fieldnames field works, but is fiddly at
indexing-time, because now you have to use TokenStream for all fields,
so that you can read one token from each field to test whether there
is one before you add the whole document. I guess it's at least easier
to understand how it works at query-time.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


PostingsEnum for documents that do not contain a term

2017-08-07 Thread Ahmet Arslan
Hi,
I am traversing the posting list of a given term/word using the following code, 
accessing/processing the term frequency and document length.
Term term = new Term(field, word);
PostingsEnum postingsEnum = MultiFields.getTermDocsEnum(reader, field, term.bytes());

if (postingsEnum == null) return word + "(stopword)";

while (postingsEnum.nextDoc() != PostingsEnum.NO_MORE_DOCS) {
    final int freq = postingsEnum.freq();
    final long numTerms = norms.get(postingsEnum.docID());
    ...
}

Now I want to traverse the remaining documents (documents that do *not* 
contain the term/word).

What is the best way to accomplish this?
Thanks,Ahmet 

Re: How to fetch documents for which field is not defined

2017-07-15 Thread Ahmet Arslan
Hi,
As an alternative, function queries can also be used; the exists function may be 
more intuitive:
q={!func}not(exists(field3))
On Saturday, July 15, 2017, 1:01:04 PM GMT+3, Rajnish kamboj 
<rajnishk7.i...@gmail.com> wrote:


Ok, I will check.

On Sat, 15 Jul 2017 at 3:26 PM, Ahmet Arslan <iori...@yahoo.com> wrote:

> Hi,
>
> Yes, here it is:  q=+*:* -field3:[* TO *]
>
> Ahmet
>
> On Saturday, July 15, 2017, 8:16:00 AM GMT+3, Rajnish kamboj <
> rajnishk7.i...@gmail.com> wrote:
>
>
> Hi
> Does Lucene provide any API to fetch documents for which a field is not
> defined.
>
> Example
> Document1 : field1=value1, field2=value2,field3=value3
>
> Document2 : field1=value4, field2=value4
>
> I want a query to get documents for which field3 is not defined. In example
> it should return Document2.
>
> Regards
> Rajnish
>

Re: How to fetch documents for which field is not defined

2017-07-15 Thread Ahmet Arslan
Hi,
Yes, here it is:  q=+*:* -field3:[* TO *]
Ahmet
On Saturday, July 15, 2017, 8:16:00 AM GMT+3, Rajnish kamboj 
 wrote:


Hi
Does Lucene provide any API to fetch documents for which a field is not
defined.

Example
Document1 : field1=value1, field2=value2,field3=value3

Document2 : field1=value4, field2=value4

I want a query to get documents for which field3 is not defined. In example
it should return Document2.

Regards
Rajnish

Re: Penalize the fact that the searched term is within a word

2017-06-08 Thread Ahmet Arslan
Hi,
You can completely ban within-a-word search by simply using WhitespaceTokenizer, 
for example. By the way, it is all about how you tokenize/analyze your text. 
Once you have decided, you can create two versions of a single field using 
different analyzers. This allows you to assign different weights to those fields 
at query time.
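For example (a rough sketch; field names and boost factors are made up): index the name
both as "name" (StandardAnalyzer, which splits 'abcd-2' into 'abcd' and '2') and as
"name_exact" (whitespace tokenizer + lowercase, which keeps 'abcd-2' whole), then boost
the exact field at query time:

Query within = new TermQuery(new Term("name", "abcd"));        // also matches 'abcd-2'
Query whole  = new TermQuery(new Term("name_exact", "abcd"));  // whole tokens only
Query q = new BooleanQuery.Builder()
    .add(within, BooleanClause.Occur.SHOULD)
    .add(new BoostQuery(whole, 5.0f), BooleanClause.Occur.SHOULD) // whole-word match wins
    .build();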
Ahmet


On Thursday, June 8, 2017, 2:56:37 PM GMT+3, Jacek Grzebyta 
 wrote:


Hi,

Apologies for repeating question from IRC room but I am not sure if that is
alive.

I have no idea about how lucene works but I need to modify some part in
rdf4j project which depends on that.

I need to use Lucene to create a mapping file based on text searching, and I
found the following problem. Take a term 'abcd' which is mapped to node
'abcd-2' whereas node 'abcd' exists. The issue is that Lucene searches for the
term, finds it in both nodes 'abcd' and 'abcd-2', and gives them the same
score. My question is: how can I modify the scoring to penalise the fact that
the searched term is part of a longer word, and give a higher score if it is
itself a whole word?

Visually It looks like that:

node 'abcd':
  - name: abcd

total score = LS /lucene score/ * 2.0 /name weight/



node 'abcd-2':
  - name: abcd-2
  - alias1: abcd-h
  - alias2: abcd-k9

total score = LS * 2.0 + LS * 0.5 /alias1 score/ + LS * 0.1 /alias2 score/

I gave different weights for properties. "Name" has the the highest weight
but "alias" has some small weight as well. In total the score for a node is
a sum of all partial score * weight. Finally 'abcd-2' gets a higher score
than 'abcd'.

thanks,
Jacek

Re: A question over TokenFilters

2017-04-21 Thread Ahmet Arslan
Hi,
LimitTokenCountFilter is used to index only the first n tokens. Maybe it can inspire 
you.
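Each TokenStream instance is used by a single thread at a time (the Analyzer reuses one
per thread), so plain instance fields are fine as long as you clear them in reset().
A minimal sketch (class and field names are made up):

public final class LastNTermsFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final Deque<String> lastTerms = new ArrayDeque<>();
  private final int n;

  public LastNTermsFilter(TokenStream input, int n) {
    super(input);
    this.n = n;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // ... inspect lastTerms here to decide what to do with the current token ...
    lastTerms.addLast(termAtt.toString());
    if (lastTerms.size() > n) {
      lastTerms.removeFirst();
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    lastTerms.clear();   // the instance is reused for the next document/stream
  }
}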

Ahmet
On Friday, April 21, 2017, 6:20:11 PM GMT+3, Edoardo Causarano 
 wrote:
Hi all.

I’m relatively new to Lucene, so I have a couple questions about writing custom 
filters.

The way I understand it, one would extend 
org.apache.lucene.analysis.TokenFilter and override #incrementToken to examine 
the current token provided by a stream token producer.

I’d like to write some logic that considers the last n seen tokens therefore I 
need to access this context as the filter chain is scanning the stream.

Can anyone point to an example of such a construct? 

Also, how would I access and update this context keeping multithreading in 
mind? Actually, what is the threading model of a TokenStream? Can anyone point 
out a good summary of it?

TIA


Best,
Edoardo
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: How to get the index last modification date ?

2017-04-08 Thread Ahmet Arslan

Hi Jean,
How about the LukeRequestHandler? Much of the information displayed on the admin 
screen comes from it: https://wiki.apache.org/solr/LukeRequestHandler

Ahmet
On Sunday, April 9, 2017, 2:21:38 AM GMT+3, Jean-Claude Dauphin 
 wrote:
Hello,

I need to check the index's last modification date, so that I only count the
number of indexed terms if the index has changed.

Any idea or suggestion on how to do this.

Thank you in advance.

Best wishes,

-- 
Jean-Claude Dauphin

Re: How to customize the delimiters used by the WordDelimiterFilter in Lucene?

2017-03-18 Thread Ahmet Arslan
Hi,

Maybe look at the factory class to see how the types argument is handled?

Ahmet


On Friday, March 17, 2017 11:05 PM, "pha...@mailbox.org"  
wrote:



Hi,


I am trying to index words like 'e-mail' as 'email', 'e mail' and 'e-mail' with 
Lucene 4.4.0.


Lucene's WordDelimiterFilter should be ideal for this. However, it treats 
every(?) non-alphanumeric character as a delimiter. So, terms like 'C++' are 
transformed to 'C', which is not what I want.


Apparently, Solr allows to specify custom delimiters. But how can I do it in 
Lucene?


I have looked into the documentation and the 'byte[] charTypeTable' parameter 
in the Constructor looked promising. But it seems to have no effect if I 
specify some delimiters in a charTypeTable.


Thank you!


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: search any field name having a specific value

2017-03-17 Thread Ahmet Arslan
Hi,

You can retrieve the list of field names using LukeRequestHandler.

Ahmet


On Friday, March 17, 2017 9:53 PM, Cristian Lorenzetto 
 wrote:



It permits searching in a predefined list of fields that you have to know
in advance. In my case I don't know the field name.
Maybe WildcardQuery?


2017-03-17 19:30 GMT+01:00 Corbin, J.D. :

> ​You might take a look at MultiFieldQueryParser.  I believe it allows you
> to search multiple index fields at the same time.
>
>
>
> J.D. Corbin
>
> Senior Research Engineer
>
> Advanced Computing & Data Science Lab
>
> 3075 W. Ray Road
> Suite 200
> Chandler, AZ 85226-2495
> USA
>
>
> M: (303) 912-0958
>
> E: jd.cor...@pearson.com
>
> Pearson
>
> Always Learning
> Learn more at www.pearson.com 
>
> On Fri, Mar 17, 2017 at 11:05 AM, Cristian Lorenzetto <
> cristian.lorenze...@gmail.com> wrote:
>
> > it is possible create a query searching any document containing any field
> > having value == X?
> >
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: any analyzer will keep punctuation?

2017-03-08 Thread Ahmet Arslan
Hi,

Please find wdftypes.txt in the source tree for an example. 
It is an argument of word delimiter filter factory.
Also see hashtag example: https://issues.apache.org/jira/browse/SOLR-2059
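A hedged sketch of wiring that up with CustomAnalyzer (the config directory, the types
file name and its contents are assumptions; see wdftypes.txt in the source tree for the
exact "char => TYPE" syntax):

return CustomAnalyzer.builder(Paths.get("/path/to/conf"))
.withTokenizer("whitespace")
.addTokenFilter("worddelimiter",
    "types", "wdftypes.txt",       // e.g. map '#' and '@' to ALPHA so they are kept
    "generateWordParts", "1",
    "preserveOriginal", "1")
.addTokenFilter("lowercase")
.build();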

Ahmet



On Wednesday, March 8, 2017 6:22 AM, Yonghui Zhao <zhaoyong...@gmail.com> wrote:
Hi Ahmet,

Thanks for your reply, but I didn't quite get your idea.
I want to get an analyzer like the standard analyzer but with the punctuation
customized.
I think one way is customizing an analyzer with a customized tokenizer
like StandardTokenizer.
In my tokenizer I would have to rewrite StandardTokenizerImpl, which seems a little
complicated.
I don't understand how "a customised word delimiter filter factory" works
in a tokenizer.



2017-03-06 22:26 GMT+08:00 Ahmet Arslan <iori...@yahoo.com>:

> Hi Zhao,
>
> WhiteSpace tokeniser followed by a customised word delimiter filter
> factory would be solution.
> Please see types attribute of the word delimiter filter for customising
> characters.
>
> ahmet
>
>
>
> On Monday, March 6, 2017 12:22 PM, Yonghui Zhao <zhaoyong...@gmail.com>
> wrote:
Yes, the whitespace analyzer will keep punctuation, but it only breaks words at
spaces.
>
>
> I didn’t explain my requirement clearly.
>
I want an analyzer like the standard analyzer but which keeps some configured
punctuation.
>
>
> 2017-03-06 18:03 GMT+08:00 Ahmet Arslan <iori...@yahoo.com.invalid>:
>
> > Hi,
> >
> > Whitespace analyser/tokenizer for example.
> >
> > Ahmet
> >
> >
> >
> > On Monday, March 6, 2017 10:21 AM, Yonghui Zhao <zhaoyong...@gmail.com>
> > wrote:
> > Lucene standard anlyzer will remove almost all punctuation.
> > In some cases, we want to keep some punctuation, for example in music
> > search, some singer name and album name could be a punctuation.
> >
> > Is there any analyzer that we can customized punctuation to be removed?
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: any analyzer will keep punctuation?

2017-03-06 Thread Ahmet Arslan
Hi Zhao,

A whitespace tokenizer followed by a customised word delimiter filter factory 
would be a solution.
Please see the types attribute of the word delimiter filter for customising 
characters.

ahmet



On Monday, March 6, 2017 12:22 PM, Yonghui Zhao <zhaoyong...@gmail.com> wrote:
Yes, the whitespace analyzer will keep punctuation, but it only breaks words at
spaces.


I didn’t explain my requirement clearly.

I want an analyzer like the standard analyzer but which keeps some configured
punctuation.


2017-03-06 18:03 GMT+08:00 Ahmet Arslan <iori...@yahoo.com.invalid>:

> Hi,
>
> Whitespace analyser/tokenizer for example.
>
> Ahmet
>
>
>
> On Monday, March 6, 2017 10:21 AM, Yonghui Zhao <zhaoyong...@gmail.com>
> wrote:
> Lucene standard anlyzer will remove almost all punctuation.
> In some cases, we want to keep some punctuation, for example in music
> search, some singer name and album name could be a punctuation.
>
> Is there any analyzer that we can customized punctuation to be removed?
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: any analyzer will keep punctuation?

2017-03-06 Thread Ahmet Arslan
Hi,

Whitespace analyser/tokenizer for example.

Ahmet



On Monday, March 6, 2017 10:21 AM, Yonghui Zhao  wrote:
The Lucene standard analyzer will remove almost all punctuation.
In some cases, we want to keep some punctuation; for example in music
search, some singer names and album names can contain punctuation.

Is there any analyzer where we can customize which punctuation is removed?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: term frequency in solr

2017-01-05 Thread Ahmet Arslan
Hi,

I guess you are working with the default techproducts example.
Can you try using the terms request handler: 
query.setRequestHandler("terms")
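A sketch of the SolrJ side (the handler path is an assumption; the techproducts
configset ships a /terms handler):

SolrQuery query = new SolrQuery();
query.setRequestHandler("/terms");      // route the request to the TermsComponent handler
query.setTerms(true);
query.addTermsField("name");
QueryResponse rsp = new QueryRequest(query).process(solr);
TermsResponse termsResponse = rsp.getTermsResponse();   // null if no terms section came back
for (TermsResponse.Term t : termsResponse.getTerms("name")) {
    System.out.println(t.getTerm() + " -> " + t.getFrequency());
}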

Ahmet


On Friday, January 6, 2017 1:19 AM, huda barakat <eng.huda.bara...@gmail.com> 
wrote:
Thank you for the fast reply. I added the query to the code but it's still not working:


import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;

public class App3 {
public static void main(String[] args) throws Exception {
String urlString = "http://localhost:8983/solr/techproducts";
SolrClient solr = new HttpSolrClient.Builder(urlString).build();

SolrQuery query = new SolrQuery();
query.setQuery("*:*");
query.setTerms(true);
query.addTermsField("name");
SolrRequest req = new QueryRequest(query);
QueryResponse rsp = req.process(solr);

System.out.println(rsp);

System.out.println("numFound: " + rsp.getResults().getNumFound());

TermsResponse termResp =rsp.getTermsResponse();
List terms = termResp.getTerms("name");
System.out.print("size="+ terms.size());
}
}
///
I got this error:

numFound: 32
Exception in thread "main" java.lang.NullPointerException
at testPkg.App3.main(App3.java:30)


On 5 January 2017 at 18:25, Ahmet Arslan <iori...@yahoo.com.invalid> wrote:

> Hi,
>
> I think you are missing the main query parameter? q=*:*
>
> By the way, you may get more responses on the solr-user mailing list.
>
> Ahmet
>
>
> On Wednesday, January 4, 2017 4:59 PM, huda barakat <
> eng.huda.bara...@gmail.com> wrote:
> Please help me with this:
>
>
> I have this code which return term frequency from techproducts example:
>
> 
> /
> import java.util.List;
>
> import org.apache.solr.client.solrj.SolrClient;
> import org.apache.solr.client.solrj.SolrQuery;
> import org.apache.solr.client.solrj.SolrRequest;
> import org.apache.solr.client.solrj.impl.HttpSolrClient;
> import org.apache.solr.client.solrj.request.QueryRequest;
> import org.apache.solr.client.solrj.response.QueryResponse;
> import org.apache.solr.client.solrj.response.TermsResponse;
>
> public class test4 {
> public static void main(String[] args) throws Exception {
> String urlString = "http://localhost:8983/solr/techproducts";
> SolrClient solr = new HttpSolrClient.Builder(urlString).build();
>
> SolrQuery query = new SolrQuery();
> query.setTerms(true);
> query.addTermsField("name");
> SolrRequest req = new QueryRequest(query);
> QueryResponse rsp = req.process(solr);
>
> System.out.println(rsp);
>
> System.out.println("numFound: " + rsp.getResults().getNumFound());
>
> TermsResponse termResp =rsp.getTermsResponse();
> List terms = termResp.getTerms("name");
> System.out.print("size="+ terms.size());
> }
> }
> 
> /
>
> the result is 0 records I don't know why?? this is what I got:
>
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
> details.
> {responseHeader={status=0,QTime=0,params={terms=true,
> terms.fl=name,wt=javabin,version=2}},response={
> numFound=0,start=0,docs=[]}}
> numFound: 0
> Exception in thread "main" java.lang.NullPointerException
> at solr_test.solr.test4.main(test4.java:29)
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: term frequency in solr

2017-01-05 Thread Ahmet Arslan
Hi,

I think you are missing the main query parameter? q=*:*

By the way, you may get more responses on the solr-user mailing list.

Ahmet


On Wednesday, January 4, 2017 4:59 PM, huda barakat 
 wrote:
Please help me with this:


I have this code which return term frequency from techproducts example:

/
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;

public class test4 {
public static void main(String[] args) throws Exception {
String urlString = "http://localhost:8983/solr/techproducts";
SolrClient solr = new HttpSolrClient.Builder(urlString).build();

SolrQuery query = new SolrQuery();
query.setTerms(true);
query.addTermsField("name");
SolrRequest req = new QueryRequest(query);
QueryResponse rsp = req.process(solr);

System.out.println(rsp);

System.out.println("numFound: " + rsp.getResults().getNumFound());

TermsResponse termResp =rsp.getTermsResponse();
List terms = termResp.getTerms("name");
System.out.print("size="+ terms.size());
}
}
/

the result is 0 records I don't know why?? this is what I got:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
details.
{responseHeader={status=0,QTime=0,params={terms=true,terms.fl=name,wt=javabin,version=2}},response={numFound=0,start=0,docs=[]}}
numFound: 0
Exception in thread "main" java.lang.NullPointerException
at solr_test.solr.test4.main(test4.java:29)

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Email id tokenizer (actual email id & multiple terms)

2016-12-20 Thread Ahmet Arslan
Hi,

You can index the whole address in a separate field. 
Otherwise, how would you handle the positions of the split tokens?

By the way, the speed of phrase search may be just fine, so consider trying that first.

Ahmet


On Tuesday, December 20, 2016 5:15 PM, suriya prakash  
wrote:
Hi,

I am using the standard analyzer and want to split the token for the email id
"luc...@gmail.com" into "lucene", "gmail", "com" and "luc...@gmail.com" in a single
pass.

I have already changed the jflex grammar to split the email id into separate words
(lucene, gmail, com). But we need to do phrase search, which will not be efficient.
So I want to index both the actual email id and the split words.

Can you please help me to achieve this, or let me know whether phrase
search is efficient enough for this case?


Regards,
Suriya

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ComplexPhraseQueryParser with wildcards

2016-12-20 Thread Ahmet Arslan
Hi Otmar,

A single term inside quotes is meaningless. A phrase query should have at least 
two terms in it, shouldn't it?

What is your intention with such a "john*" query?

Ahmet


On Tuesday, December 20, 2016 4:56 PM, Otmar Caduff  wrote:



Hi,

I have an index with a single document with a field "field" and textual
content "johnny peters" and I am using
org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser to
parse the query:
   field: (john* peter)
When searching with this query, I am getting the document as expected.
However with this query:
   field: ("john*" "peter")
I am getting the following exception:
Exception in thread "main" java.lang.IllegalArgumentException: Unknown
query type "org.apache.lucene.search.PrefixQuery" found in phrase query
string "john*"
at
org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:268)
at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:278)
at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:836)
at
org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:886)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:535)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:744)
at
org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:460)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:489)
at ComplexQueryTest.main(ComplexQueryTest.java:36)

Note: the exception is not thrown during the parse() method call, but
during the search() method call.

I don't see why the ComplexQueryParser can't handle this. Am I misusing it?
Or should I file a bug on Jira?

I'm on Lucene 5.5.1, but the situation looks the same on 6.3.0. Any help is
appreciated!

Otmar

The code to reproduce my issue:


import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class ComplexQueryTest {

public static void main(String[] args) throws Throwable {
RAMDirectory directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(new
StandardAnalyzer()));

Document doc1 = new Document();
doc1.add(new TextField("field", "johnny peters", Store.NO));
writer.addDocument(doc1);

writer.commit();
writer.close();

IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
ComplexPhraseQueryParser parser = new ComplexPhraseQueryParser("field", new
StandardAnalyzer());
TopDocs topDocs;

Query queryOk = parser.parse("field: (john* peters)");
topDocs = searcher.search(queryOk, 2);
System.out.println("found " + topDocs.totalHits + " docs");

Query queryFail = parser.parse("field: (\"john*\" \"peters\")");
topDocs = searcher.search(queryFail, 2); // -> throws the above
mentioned exception
System.out.println("found " + topDocs.totalHits + " docs");

}

}

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Best way to search by pages

2016-11-26 Thread Ahmet Arslan
How about keeping two indices: page index and document index.
Issue the query to the document index and list n documents.
For each document, list k pages fetched from page index.

Ahmet



On Saturday, November 26, 2016 12:16 PM, Joe MA  wrote:
Greetings,

   I am trying to use Lucene to search large documents, and return the pages
where a term(s) is matched.  For example, say I am indexing 500 auto
manuals, each with around 1000 pages each.  So if the user searched for
"Taurus" and  "flat" and "tire", a good result could be "2006 Ford Taurus
Manual: pages 100, 134, 650, 741".



My first approach was to index each page within each manual as a separate
document. This works to a degree, but you may miss hits where the terms are
separated on different pages: "flat" on page 100, "tire" on page 101.  Or not
every page would have "Taurus".  Not to mention you are indexing 500,000
individual pages as documents when you really only need to index 500 actual
documents (and aggregating the results is a hassle).



Now, my current approach is to index each document as a whole (so only 500
documents in the index), but I store term vectors and positions with the
content, so that I know the position of any search term hit ("tire" found in
document 32, at position 64,320).   To find the actual page, as I index the
content, I insert a 'page break' special code, such as
"LUCENE_PAGE_BREAKER", between each page.  Then when pulling my search hit
positions, I also pull the positions of all 499 (assuming 500 pages in a
document) page break terms and store in an array.  Then, step through the
array until the position of my search hit is less than the position of a
page breaker, and you know what page the hit occurred.



My question is:   This seems like such a common requirement.  Is there a
better way of doing this?  



Thanks - J

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Multi-field IDF

2016-11-17 Thread Ahmet Arslan
Hi Nicholas,

IDF, among others, is a measure of term specificity. If 'or' is not so usual in 
titles, then it has some discrimination power in that domain.

I think it's OK 'or' to get a high IDF value in this case.

Ahmet



On Thursday, November 17, 2016 9:09 PM, Nicolás Lichtmaier 
 wrote:
IDF measures the selectivity of a term. But the calculation is 
per-field. That can be bad for very short fields (like titles). One 
example of this problem: If I don't delete stop words, then "or", "and", 
etc. should be dealt with low IDF values, however "or" is, perhaps, not 
so usual in titles. Then, "or" will have a high IDF value and be treated 
as an important term. That's bad.

One solution I see is to modify the Similarity to have a global, or 
multi-field IDF value. This value would include in its calculation 
longer fields that has more "normal text"-like stats. However this is 
not trivial because I can't just add document-frequencies (I would be 
counting some documents several times if "or" is present in more than 
one field). I would need need to OR the bit-vectors that signal the 
presence of the term, right? Not trivial.

Has anyone encountered this issue? Has it been solved? Is my thinking wrong?

Should I also try the developers' list?

Thanks!

Nicolás.-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: How exclude empty fields?

2016-11-11 Thread Ahmet Arslan
Hi,

Match all docs query minus Promotion.endDate:[* TO *]
+*:* -Promotion.endDate:[* TO *]

Ahmet


On Friday, November 11, 2016 5:59 PM, voidmind  wrote:
Hi,

I have indexed content about Promotions with effectiveDate and endDate
fields for when the promotions start and end.

I want to query for expired promotions so I do have this criteria, which
works fine:

+Promotion.endDate:[210100 TO <variable containing yesterday's date>]

The issue I have is that some promotions are permanent so they don't have
an endDate set.

I tried doing:

( +Promotion.endDate:[210100 TO <variable containing yesterday's date>]
|| -Promotion.endDate:* )

But it doesn't seem to work because the promotions with no endDate are in
my results (empty endDate fields are not indexed apparently)

How would I exclude content that doesn't have an endDate set?

Thanks,
Alexandre Leduc

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Isn't fieldLength in BM25 supposed to be an integer?

2016-11-09 Thread Ahmet Arslan
Hi Mossaab,

Probably due to the encodeNormValue/decodeNormValue transformation of the 
document length.

Please see the aforementioned methods in BM25Similarity.java

Ahmet





On Wednesday, November 9, 2016 10:25 PM, Mossaab Bagdouri 
 wrote:
Hi,

On Lucene 6.2.1, I have the following explain output for a document that
contains two words. I'm wondering why the value of fieldLength is not 2.

A related question was posted on S.O. two years ago:
http://stackoverflow.com/questions/22194920

23.637165 = sum of:
  10.065297 = weight(title:googl in 401658357) [BM25Similarity], result of:
10.065297 = score(doc=401658357,freq=1.0 = termFreq=1.0
), product of:
  7.3866553 = idf(docFreq=414179, docCount=668609139)
  1.3626325 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
7.3254013 = avgFieldLength
2.56 = fieldLength
  13.571868 = weight(title:hangout in 401658357) [BM25Similarity], result
of:
13.571868 = score(doc=401658357,freq=1.0 = termFreq=1.0
), product of:
  9.960035 = idf(docFreq=31592, docCount=668609139)
  1.3626325 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
7.3254013 = avgFieldLength
2.56 = fieldLength

Regards,
Mossaab

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to add ASCIIFoldingFilter in ClassicAnalyzer

2016-10-11 Thread Ahmet Arslan
Hi,

I forgot to include : .addTokenFilter("asciifolding")
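Putting both messages together, the complete chain would be (keep kstem only if
you also want stemming):

return CustomAnalyzer.builder()
.withTokenizer("classic")
.addTokenFilter("classic")
.addTokenFilter("lowercase")
.addTokenFilter("asciifolding")
.addTokenFilter("kstem")
.build();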

Ahmet


On Tuesday, October 11, 2016 5:37 PM, Ahmet Arslan <iori...@yahoo.com> wrote:
Hi Kumaran,

Writing a custom analyzer is easier than it seems.

Please see how I added kstem to classic analyzer:

return CustomAnalyzer.builder()
.withTokenizer("classic")
.addTokenFilter("classic")
.addTokenFilter("lowercase")
.addTokenFilter("kstem")
.build();

Ahmet




On Tuesday, October 11, 2016 5:22 PM, Kumaran Ramasubramanian 
<kums@gmail.com> wrote:
Hi All,

  Is there any way to add ASCIIFoldingFilter over ClassicAnalyzer without
writing a new custom analyzer? Should I extend StopwordAnalyzerBase again?


I know that ClassicAnalyzer is final. Is there any special reason for making it
final? StandardAnalyzer was not final before.

public final class ClassicAnalyzer extends StopwordAnalyzerBase
>


--
Kumaran R

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to add ASCIIFoldingFilter in ClassicAnalyzer

2016-10-11 Thread Ahmet Arslan
Hi Kumaran,

Writing a custom analyzer is easier than it seems.

Please see how I added kstem to classic analyzer:

return CustomAnalyzer.builder()
.withTokenizer("classic")
.addTokenFilter("classic")
.addTokenFilter("lowercase")
.addTokenFilter("kstem")
.build();

Ahmet



On Tuesday, October 11, 2016 5:22 PM, Kumaran Ramasubramanian 
 wrote:
Hi All,

  Is there any way to add ASCIIFoldingFilter over ClassicAnalyzer without
writing a new custom analyzer? Should I extend StopwordAnalyzerBase again?


I know that ClassicAnalyzer is final. Is there any special reason for making it
final? StandardAnalyzer was not final before.

public final class ClassicAnalyzer extends StopwordAnalyzerBase
>


--
Kumaran R

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How can I list all the terms from a document?

2016-09-16 Thread Ahmet Arslan
Hi,

I thought the link/url below has the example code, no?


http://makble.com/what-is-term-vector-in-lucene

If not, in the source tree, under the tests folder, there should be some test 
cases for term vectors, which can be used as example code.

I guess the internal Lucene document id is used; it is easy to get when you have
a Lucene query, e.g.:
new TermQuery(new Term("fileName", "something"));
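A sketch of reading the counts (assumes the field was indexed with term vectors
enabled, e.g. FieldType.setStoreTermVectors(true); the field names are made up):

TopDocs hits = searcher.search(new TermQuery(new Term("fileName", "something")), 1);
int docId = hits.scoreDocs[0].doc;                      // internal Lucene doc id
Terms vector = reader.getTermVector(docId, "contents"); // null if no term vector was stored
TermsEnum termsEnum = vector.iterator();
BytesRef term;
while ((term = termsEnum.next()) != null) {
    // for a single-document term vector, totalTermFreq() is the count in this document
    System.out.println(term.utf8ToString() + " -> " + termsEnum.totalTermFreq());
}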

Ahmet



On Friday, September 16, 2016 4:09 PM, szzoli  wrote:
If I have a TermVector, is it possible to give it a filename, so that it
could enumerate through all the terms? (I need the number of occurrences in
the document, too.)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-can-I-list-all-the-terms-from-a-document-tp4294797p4296455.html

Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: How can I list all the terms from a document?

2016-09-13 Thread Ahmet Arslan
Hi,

First you need to enable term vectors at index time.
Then you can access terms and their statistics in a document.


http://makble.com/what-is-term-vector-in-lucene
Ahmet



On Tuesday, September 13, 2016 11:53 AM, szzoli  wrote:
Hi,

how can I use TermVectors ? I have read the API, but it is not clear to me.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-can-I-list-all-the-terms-from-a-document-tp4294797p4295922.html

Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Is it possible to search for a paragraph in Lucene?

2016-09-12 Thread Ahmet Arslan
Hi,

If you have some tool/mechanism to detect paragraph boundaries, yes it is 
possible to search for a paragraph.
But Lucene itself cannot detect sentence/paragraph boundaries for you.
There are other libraries for this.

Ahmet



On Monday, September 12, 2016 1:06 PM, szzoli  wrote:
Hi All,

Is it possible to search for a paragraph in Lucene? 

Thx
Zoli



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-it-possible-to-search-for-a-paragraph-in-Lucene-tp4295705.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: How can I list all the terms from a document?

2016-09-07 Thread Ahmet Arslan
Hi,

TermVectors perhaps?

Ahmet



On Tuesday, September 6, 2016 4:21 PM, szzoli  wrote:
Hi All, 

How can I list all the terms from a document? I also need the counts of each
term per document.
I use Lucene 6.2. I found some solutions for older versions, but these didn't
work with 6.2.

Thank you in advance.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-can-I-list-all-the-terms-from-a-document-tp4294797.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Doc length nomalization in Lucene LM

2016-07-22 Thread Ahmet Arslan


Hi,

Yes, as you discovered, there is some precision loss during the encode/decode 
process.

Ahmet


On Friday, July 22, 2016 1:59 PM, Dwaipayan Roy <dwaipayan@gmail.com> wrote:
Thanks for your reply. But I still have some doubts.

From your answer, I think you mean to say that the document length is just
saved in byte format for less memory consumption. But while debugging, I
found that the doc length that is passed to score() is 2621.44, whereas the
actual doc length is 2355.

I am confused. Please help.

On Fri, Jul 22, 2016 at 1:46 PM, Ahmet Arslan <iori...@yahoo.com> wrote:

> Hi Roy,
>
> It is about storing the document length into a byte (to use less memory).
> Please edit the source code to avoid this encode/decode thing:
>
> /**
> * Encodes the document length in a lossless way
> */
> @Override
> public long computeNorm(FieldInvertState state) {
> return state.getLength() - state.getNumOverlap();
> }
>
> @Override
> public float score(int doc, float freq) {
> // We have to supply something in case norms are omitted
> return ModelBase.this.score(stats, freq,
> norms == null ? 1L : norms.get(doc));
> }
>
> @Override
> public Explanation explain(int doc, Explanation freq) {
> return ModelBase.this.explain(stats, doc, freq,
> norms == null ? 1L : norms.get(doc));
> }
>
>
>
> On Thursday, July 21, 2016 6:06 PM, Dwaipayan Roy <dwaipayan@gmail.com>
> wrote:
>
>
>
> ​Hello,
>
> In *SimilarityBase.java*, I can see that the length of the document is
> getting normalized by using the function *decodeNormValue()*. But I can't
> understand how the normalizations is done. Can you please help? Also, is
> there any way to avoid this doc-length normalization, to use the raw
> doc-length (as used in LM-JM Zhai et al. SIGIR-2001)?
>
> Thanks..
>
> P.S. I am using Lucene 4.10.4

>



-- 
Dwaipayan Roy.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Doc length nomalization in Lucene LM

2016-07-22 Thread Ahmet Arslan
Hi Roy,

It is about storing the document length into a byte (to use less memory).
Please edit the source code to avoid this encode/decode thing:

/**
* Encodes the document length in a lossless way
*/
@Override
public long computeNorm(FieldInvertState state) {
return state.getLength() - state.getNumOverlap();
}

@Override
public float score(int doc, float freq) {
// We have to supply something in case norms are omitted
return ModelBase.this.score(stats, freq,
norms == null ? 1L : norms.get(doc));
}

@Override
public Explanation explain(int doc, Explanation freq) {
return ModelBase.this.explain(stats, doc, freq,
norms == null ? 1L : norms.get(doc));
}



On Thursday, July 21, 2016 6:06 PM, Dwaipayan Roy  
wrote:



​Hello,

In *SimilarityBase.java*, I can see that the length of the document is
getting normalized by using the function *decodeNormValue()*. But I can't
understand how the normalizations is done. Can you please help? Also, is
there any way to avoid this doc-length normalization, to use the raw
doc-length (as used in LM-JM Zhai et al. SIGIR-2001)?

Thanks..

P.S. I am using Lucene 4.10.4

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Help Relevance Feedback (Rocchio) with lucene

2016-06-28 Thread Ahmet Arslan
Hi Andres,

While there can be other ways, in general term vectors are used to extract 
"important terms" from top-k documents returned by the initial query.
Please see getTopTerms() method in 
http://www.cortecostituzionale.it/documenti/news/advancedluceneeu_69.pdf

Ahmet


On Tuesday, June 28, 2016 6:27 PM, Andres Fernando Wilches Riano 
 wrote:
Hello

I want to implement rocchio with lucene. Somebody has idea how to do it?

Thanks.

-- 
Atentamente,


*Andrés Fernando Wilches Riaño*
Ingeniero de Sistemas y Computación
Estudiante de Maestría en Ingeniería de Sistemas y Computación
Asistente Docente
Universidad Nacional de Colombia

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Favoring Terms Occurring in Close Proximity

2016-06-27 Thread Ahmet Arslan
Hi Daniel,

Solr has (e)dismax just for the propose you described.
https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser

Please see pf pf2 pf3 parameters 

Ahmet


On Monday, June 27, 2016 3:55 PM, Daniel Bigham <dani...@wolfram.com> wrote:
Hi Ahmet, 

Yes, thanks... that did come to mind and is the strategy I'm playing with. 

However, if you are giving a user a plain text field and using the Lucene query 
parser, it doesn't create optional clauses for boosting purposes. 

Does this imply that anyone wanting to use Lucene in conjunction with an input 
field needs to write a custom query parser if they want reasonable results? 


- On Jun 24, 2016, at 12:25 PM, Ahmet Arslan <iori...@yahoo.com.INVALID> 
wrote: 

> Hi Daniel,

> You can add optional clauses to your query for boosting purposes.

> for example,

> temperate OR climates OR "temperate climates"~5^100

> ahmet

> On Friday, June 24, 2016 5:07 PM, Daniel Bigham <dani...@wolfram.com> wrote:
> Something significant that I've noticed about using the default Lucene
> query parser is that if your user enters a query like:

> "temperate climates"

> ... it will get turned into an OR query:

> temperate OR climates

> This means that a document that contains the literal substring
> "temperate climates" will be on equal footing with a document that
> contains "temperate emotions may go a long way to keeping the peace as
> we continue to discuss climate change".

> So far as I know, your typical search engine definitely does not ignore
> the relative positions of terms.

> And so my question is -- how do people typically deal with this when
> using Lucene? What is wanted is a query that desires search terms to be
> close together, but failing that, is ok with the terms simply occurring
> in the document.

> And again -- the ultimate desire isn't just to construct a Query object
> to accomplish that, but to hook things up in such a way that a user can
> enter a query in an input box and have the system take their flat string
> and turn it into an intelligent query that acts somewhat like today's
> modern search engines in terms of wanting terms to be close to each other.

> This is such a "basic" use case of a search system that I'm tempted to
> think there must be well worn paths for doing this in Lucene.

> Thanks,
> Daniel

> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Favoring Terms Occurring in Close Proximity

2016-06-24 Thread Ahmet Arslan
Hi Daniel,

You can add optional clauses to your query for boosting purposes.

for example, 

temperate OR climates OR "temperate climates"~5^100

ahmet
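
The same idea built programmatically, as a sketch for recent Lucene versions (5.3+); the "Plaintext" field name just follows the example above:

    BooleanQuery.Builder b = new BooleanQuery.Builder();
    b.add(new TermQuery(new Term("Plaintext", "temperate")), BooleanClause.Occur.SHOULD);
    b.add(new TermQuery(new Term("Plaintext", "climates")), BooleanClause.Occur.SHOULD);
    // optional clause: not required to match, but documents where the words
    // occur within a slop of 5 (in order) get a large boost
    PhraseQuery phrase = new PhraseQuery(5, "Plaintext", "temperate", "climates");
    b.add(new BoostQuery(phrase, 100f), BooleanClause.Occur.SHOULD);
    Query query = b.build();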


On Friday, June 24, 2016 5:07 PM, Daniel Bigham  wrote:
Something significant that I've noticed about using the default Lucene 
query parser is that if your user enters a query like:

"temperate climates"

... it will get turned into an OR query:

temperate OR climates

This means that a document that contains the literal substring 
"temperate climates" will be on equal footing with a document that 
contains "temperate emotions may go a long way to keeping the peace as 
we continue to discuss climate change".

So far as I know, your typical search engine definitely does not ignore 
the relative positions of terms.

And so my question is -- how do people typically deal with this when 
using Lucene?  What is wanted is a query that desires search terms to be 
close together, but failing that, is ok with the terms simply occurring 
in the document.

And again -- the ultimate desire isn't just to construct a Query object 
to accomplish that, but to hook things up in such a way that a user can 
enter a query in an input box and have the system take their flat string 
and turn it into an intelligent query that acts somewhat like today's 
modern search engines in terms of wanting terms to be close to each other.

This is such a "basic" use case of a search system that I'm tempted to 
think there must be well worn paths for doing this in Lucene.

Thanks,
Daniel

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Preprocess input text before tokenizing

2016-06-24 Thread Ahmet Arslan
Hi Jaime,

Please see o.a.l.analysis.custom.CustomAnalyzer.builder() to create custom 
analyzers using a builder-style API.

Ahmet
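
A small sketch of what that can look like (the factory names are the standard SPI names; the mapping file name is made up and, with the no-argument builder(), is resolved from the classpath; use builder(Path) to load from a config directory):

    Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addCharFilter("mapping", "mapping", "my-mappings.txt")  // applied before the tokenizer
        .addTokenFilter("lowercase")
        .build();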


On Friday, June 24, 2016 10:54 AM, Jaime <j.par...@estructure.es> wrote:
Thank you very much, that seems to solve my issue.

However, I find this a little cumbersome. I need to filter the text 
before any tokenizing takes place, so I have to implement a filtered 
version of every analyzer I'm using (StandardAnalyzer and 
SpanishAnalyzer and a custom analyzer right now).

If I need to support another analyzer in the future (a very plausible 
possibility) I will need to create another version of that analyzer. 
Whenever any of those analyzer is changed, I will need to manually apply 
the changes.

Isn't there a better way to do this?

El 23/06/2016 a las 20:28, Ahmet Arslan escribió:
> Hi,
>
> Zero or more CharFilter(s) is the way to manipulate text before the tokenizer.
> I think init reader is the method you want to plug char filters.
> https://github.com/apache/lucene-solr/blob/master/lucene/analysis/morfologik/src/java/org/apache/lucene/analysis/uk/UkrainianMorfologikAnalyzer.java
>
> Ahmet
>
> On Thursday, June 23, 2016 6:47 PM, Jaime <j.par...@estructure.es> wrote:
> Hello,
>
> I want to change the input text before tokenizing. I think I just need
> to use some characters as word separators, and maybe remove some others
> completely.
>
> I was planning to use MappingCharFilterFactory to replace some chars
> with " " and others with "", but I feel like I'm not in the right track.
>
> First, I've implemented a custom analyzer to use my custom tokenizer. My
> idea was to inherit from StandardTokenizer and, in setReader, calling
> MappingCharFilterFactory.create(reader) from within.
>
> However, setReader is final, so I can't override it.
>
> Is there a better way to do this?
> In any case, how should I use MappingCharFilter in case I really needed it?
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

-- 
Jaime Pardos
ESTRUCTURE MEDIA SYSTEMS, S.L.
Avda. de Madrid nº 120 nave 10, 28500, Arganda del Rey, MADRID,
j.par...@estructure.es
910088429
  




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Preprocess input text before tokenizing

2016-06-23 Thread Ahmet Arslan
Hi,

Zero or more CharFilter(s) is the way to manipulate text before the tokenizer.
I think init reader is the method you want to plug char filters.
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/morfologik/src/java/org/apache/lucene/analysis/uk/UkrainianMorfologikAnalyzer.java

Ahmet
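
A rough sketch of that for recent Lucene versions (the mappings themselves are only examples):

    NormalizeCharMap.Builder mapBuilder = new NormalizeCharMap.Builder();
    mapBuilder.add("_", " ");   // treat underscore as a word separator
    mapBuilder.add("-", "");    // remove hyphens completely
    final NormalizeCharMap charMap = mapBuilder.build();

    Analyzer analyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream sink = new LowerCaseFilter(source);
            return new TokenStreamComponents(source, sink);
        }
        @Override
        protected Reader initReader(String fieldName, Reader reader) {
            // the char filter rewrites the text before the tokenizer ever sees it
            return new MappingCharFilter(charMap, reader);
        }
    };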

On Thursday, June 23, 2016 6:47 PM, Jaime  wrote:
Hello,

I want to change the input text before tokenizing. I think I just need 
to use some characters as word separators, and maybe remove some others 
completely.

I was planning to use MappingCharFilterFactory to replace some chars 
with " " and others with "", but I feel like I'm not in the right track.

First, I've implemented a custom analyzer to use my custom tokenizer. My 
idea was to inherit from StandardTokenizer and, in setReader, calling 
MappingCharFilterFactory.create(reader) from within.

However, setReader is final, so I can't override it.

Is there a better way to do this?
In any case, how should I use MappingCharFilter in case I really needed it?


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to prevent WordDelimiterFilter tokenize the string with underscore?

2016-06-15 Thread Ahmet Arslan
Hi,

You can supply custom types. 
please see WordDelimiterFilterFactory and wdfftypes.txt for an example.

ahmet
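
For example, a types file that stops '_' from being treated as a delimiter can contain a single line, and the factory loads it through its "types" argument. The sketch below is approximate (file name, class name and tokenizer variable are invented; depending on the version you may also have to pass a "luceneMatchVersion" argument):

    # wdfftypes.txt
    _ => ALPHA

    Map<String, String> args = new HashMap<>();
    args.put("generateWordParts", "1");
    args.put("splitOnCaseChange", "1");
    args.put("preserveOriginal", "1");
    args.put("types", "wdfftypes.txt");
    WordDelimiterFilterFactory factory = new WordDelimiterFilterFactory(args);
    factory.inform(new ClasspathResourceLoader(MyAnalyzer.class)); // resolves wdfftypes.txt
    TokenStream filtered = factory.create(whitespaceTokenizer);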


On Wednesday, June 15, 2016 10:32 PM, Xiaolong Zheng  
wrote:
Hi,

How can I prevent WordDelimiterFilter tokenize the string with underscore,
e.g. word_with_underscore.

I am using WordDelimiterFilter to create my own Camel Case analyzer, I was
using the configuration flag:

flags |= GENERATE_WORD_PARTS;
flags |= SPLIT_ON_CASE_CHANGE;
flags |= PRESERVE_ORIGINAL;


But I realize that one of the side effect for using the
SPLIT_ON_CASE_CHANGE is it also tokenize the string with underscore.

I am wondering how can I prevent it to tokenize the string with underscores?




Sincerely,

--Xiaolong

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Cache Lucene based index.

2016-05-22 Thread Ahmet Arslan
Hi Singhal,

Maybe MemoryIndex or RAMDirectory?

Ahmet
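
A bare-bones sketch of the RAMDirectory route (the document list and names are placeholders); the important part is to build the index once and keep the reader/searcher object around, instead of re-indexing per request:

    Directory dir = new RAMDirectory();   // the index lives on the heap
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
        for (Document doc : documents) {
            writer.addDocument(doc);
        }
    }
    IndexReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);  // cache this, e.g. in a HashMap keyed by data set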



On Saturday, May 21, 2016 1:42 PM, Prateek Singhal  
wrote:
You can consider that I want to store the lucene index in some sort of
temporary memory or a HashMap so that I do not need to index the documents
every time as it is a costly operation. I can directly return the lucene
index from that HashMap and use it to answer my queries.

Just want to know if I can access the lucene index object which lucene has
created so that I can cache it.



On Sat, May 21, 2016 at 3:46 PM, Uwe Schindler  wrote:

> Hi,
>
> What do you mean with "cache"?
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> > -Original Message-
> > From: Prateek Singhal [mailto:prateek.b...@gmail.com]
> > Sent: Saturday, May 21, 2016 11:27 AM
> > To: java-user@lucene.apache.org
> > Subject: Cache Lucene based index.
> >
> > Hi Lucene lovers,
> >
> > I have a use-case where I want to *create a lucene based index* of
> multiple
> > documents and then *want to cache that index*.
> >
> > Can anyone suggest if this is possible ?
> > And which *type of cache* will be most efficient for this use case.
> >
> > Also if you can provide me with any *example *of the same then it will be
> > really very helpful.
> >
> > Thanks.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Regards,

Prateek Singhal
Software Development Engineer @ Amazon.com

"Believe in yourself and you can do unbelievable things."

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Query Grammar

2016-05-16 Thread Ahmet Arslan
Hi Taher,

Please find and see QueryParser.jj file in the source tree.

You can find all operators such as && || AND OR !.

Ahmet


On Sunday, May 15, 2016 1:57 PM, Taher Galal  wrote:
Hi All,

I was just checking the query grammer found in the java docs of the query
parser :

Query  ::= ( Clause )*
Clause ::= ["+", "-"] [<TERM> ":"] ( <TERM> | "(" Query ")" )


This is what is documented, but I can't see any place that shows the
operators such as AND and OR. For example, how can they be added?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Simple Similarity Implementation to Count the Number of Hits

2016-05-12 Thread Ahmet Arslan
Hi Luis,


Thats an interesting question. Can you share your similarity?
I suspect you return 1 expect Similarity#coord method.

Not sure, but for phrase queries one may need to modify
ExactPhraseScorer/SloppyPhraseScorer, etc.

ahmet

On Thursday, May 12, 2016 5:41 AM, Luís Filipe Nassif  
wrote:



Hi,

In the past (lucene 4) I have tried to implement a simple Similarity to
only count the number of occurrences (term frequencies) in the documents,
ignoring norms, doc frequencies, boosts... It worked for some queries like
term and wildcard queries, but not for others, like phrase and range
queries. Phrase query scores were being squared, eg, a phrase query with 2
terms was returning score 4 and a phrase query with 3 terms was returning
score 9, for a document with only one occurrence of the phrase.

Does someone have a working example or guideline for that implementation?

Thank you,
Luis

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Query Expansion for Synonyms

2016-04-28 Thread Ahmet Arslan
Hi Daniel,

Since you are restricting inOrder=true and proximity=0 in the top level query, 
there is no problem in your particular example.

If you weren't restricting, injecting synonyms with a plain OR sometimes causes
'query drift': the injection/addition of one term changes the result list drastically.

When there is a big term statistics (document frequency, collection frequency, 
etc) difference between the injected term and the original term, there can be 
unexpected results.

BlendedTermQuery and SynonymQuery implementations could be used.

Ahmet
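
For single-token synonyms, a SynonymQuery sketch looks like this (Lucene 6.x; field and terms are made up). It scores the listed terms as if they were one term, which avoids the statistics-driven drift described above. Multi-word synonyms such as "united states" still need the SpanNear/SpanOr construction from your example:

    Query us = new SynonymQuery(new Term("Plaintext", "us"),
                                new Term("Plaintext", "usa"));
    BooleanQuery.Builder b = new BooleanQuery.Builder();
    b.add(us, BooleanClause.Occur.MUST);
    b.add(new TermQuery(new Term("Plaintext", "1888")), BooleanClause.Occur.MUST);
    Query query = b.build();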

On Thursday, April 28, 2016 6:26 PM, Daniel Bigham  wrote:
I'm investigating various ways of supporting synonyms in Lucene.

One such approach that looks potentially interesting is to do a kind of 
"query expansion".

For example, if the user searches for "us 1888", one might expand the 
query as follows:

 SpanNearQuery query =
 new SpanNearQuery(
 new SpanQuery[]
 {
 new SpanOrQuery(
 new SpanTermQuery(new Term("Plaintext", "us")),
 new SpanNearQuery(
 new SpanQuery[]
 {
 new SpanTermQuery(new Term("Plaintext", "united")),
 new SpanTermQuery(new Term("Plaintext", "states"))
 },
 0,
 true
 )
 ),
 new SpanTermQuery(new Term("Plaintext", "1888"))
 },
 0,
 true
 );

A couple of questions:

- Is this approach in use within the community?
- Are there "gotchas" with this approach that make it undesirable?

I've done a few quick tests wrt query performance on a test index and 
found that a query can indeed take 10x longer if enough synonyms are 
used, but if the baseline search time is around 1 ms, then 10 ms is 
still plenty fast enough. (that said, my test was on a 70 MB index, so 
my 10 ms might turn into something nasty with a 7 GB index)

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Evaluate if a document satisfies a query

2016-04-25 Thread Ahmet Arslan
Hi,

MemoryIndex is used for that purpose.

Please see :

https://github.com/flaxsearch/luwak

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html

http://lucene.apache.org/core/6_0_0/memory/index.html?org/apache/lucene/index/memory/MemoryIndex.html
Ahmet
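
A minimal MemoryIndex sketch (field name, text, analyzer and the query variable are placeholders):

    MemoryIndex index = new MemoryIndex();
    index.addField("body", "the quick brown fox jumps over the lazy dog", new StandardAnalyzer());
    float score = index.search(query);   // 0.0f means the document does not satisfy the query
    boolean matches = score > 0.0f;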




On Monday, April 25, 2016 5:04 PM, Andres de la Peña  
wrote:
Hi all,

Is it possible to evaluate if a document satisfies a query? Of course it
can be done indexing the document in a RAMIndex and querying it, but I
wonder if it is possible to do it in a more efficient way.

Thanks,

-- 
Andrés de la Peña

Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 828 6473 // www.stratio.com // *@stratiobd
*

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: BlendedTermQuery causing negative IDF?

2016-04-19 Thread Ahmet Arslan
Thanks Doug for letting us know that Lucene's BM25 avoids negative IDF values.
I didn't know that. 

Markus, out of curiosity, why do you need BlendedTermQuery?
I know SynonymQuery is now part of the query parser base; I think they do similar 
things?

Ahmet
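
For reference, the difference boils down to this (df = docFreq, N = docCount; see the BM25Similarity source linked in Doug's mail below):

    classic BM25 idf : log( (N - df + 0.5) / (df + 0.5) )        // negative once df > N/2
    Lucene  BM25 idf : log( 1 + (N - df + 0.5) / (df + 0.5) )    // never negative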




On Tuesday, April 19, 2016 5:33 PM, Doug Turnbull 
<dturnb...@opensourceconnections.com> wrote:
Lucene's BM25 avoids negatives scores for this by adding 1 inside the log
term of BM25's IDF

Compare this:
https://github.com/apache/lucene-solr/blob/5e5fd662575105de88d8514b426bccdcb4c76948/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L71

to the Wikipedia article's BM25 IDF
https://en.wikipedia.org/wiki/Okapi_BM25

Markus another thing to add is that when Elasticsearch uses
BlendedTermQuery, they add a lot of invariants that must be true. For
example the fields must share the same analyzer. You may need to research
what else happens in Elasticsearch outside BlendedTermQuery to get this
behavior to work.

Another testing philosophy point: when I do this kind of work I like to
isolate the Lucene behavior separate from the Solr behavior. I might
suggest creating a Lucene unit test to validate your assumptions around
BlendedTermQuery. Just to help isolate the issues. Here's Lucene's tests
for BlendedTermQuery as a basis

https://github.com/apache/lucene-solr/blob/5e5fd662575105de88d8514b426bccdcb4c76948/lucene/core/src/test/org/apache/lucene/search/TestBlendedTermQuery.java









On Tue, Apr 19, 2016 at 10:16 AM Ahmet Arslan <iori...@yahoo.com.invalid>
wrote:

>
>
> Hi Markus,
>
> It is a known property of BM25. It produces negative scores for common
> terms.
> Most of the term-weighting models are developed for indices in which stop
> words are eliminated.
> Therefore, most of the term-weighting models have problems scoring common
> terms.
> By the way, DFI model does a decent job when handling common terms.
>
> Ahmet
>
>
>
> On Tuesday, April 19, 2016 4:48 PM, Markus Jelsma <
> markus.jel...@openindex.io> wrote:
> Hello,
>
> I just made a Solr query parser for BlendedTermQuery on Lucene 6.0 using
> BM25 similarity and i have a very simple unit test to see if something is
> working at all. But to my surprise, one of the results has a negative
> score, caused by a negative IDF because docFreq is higher than docCount for
> that term on that field. Here are the test documents:
>
> assertU(adoc("id", "1", "text", "rare term"));
> assertU(adoc("id", "2", "text_nl", "less rare term"));
> assertU(adoc("id", "3", "text_nl", "rarest term"));
> assertU(commit());
>
> My query parser creates the following Lucene query:
> BlendedTermQuery(Blended(text:rare text:term text_nl:rare text_nl:term))
> which looks fine to me. But this is what i am getting back for issueing
> that query on the above set of documents, the third document is the one
> with a negative score.
>
> 
>   
> response docs (id / score):
>   3 / 0.1805489
>   2 / 0.14785346
>   1 / -0.004004207
>
> rawquerystring / querystring: {!blended fl=text,text_nl}rare term
> parsedquery: BlendedTermQuery(Blended(text:rare text:term text_nl:rare text_nl:term))
> parsedquery_toString: Blended(text:rare text:term text_nl:rare text_nl:term)
> explain:
> 
> 0.1805489 = max plus 0.01 times others of:
>   0.1805489 = weight(text_nl:term in 2) [], result of:
> 0.1805489 = score(doc=2,freq=1.0 = termFreq=1.0
> ), product of:
>   0.18232156 = idf(docFreq=2, docCount=2)
>   0.9902773 = tfNorm, computed from:
> 1.0 = termFreq=1.0
> 1.2 = parameter k1
> 0.75 = parameter b
> 2.5 = avgFieldLength
> 2.56 = fieldLength
> 
> 
> 0.14785345 = max plus 0.01 times others of:
>   0.14638956 = weight(text_nl:rare in 1) [], result of:
> 0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
> ), product of:
>   0.18232156 = idf(docFreq=2, docCount=2)
>   0.8029196 = tfNorm, computed from:
> 1.0 = termFreq=1.0
> 1.2 = parameter k1
> 0.75 = parameter b
> 2.5 = avgFieldLength
> 4.0 = fieldLength
>   0.14638956 = weight(text_nl:term in 1) [], result of:
> 0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
> ), product of:
>   0.18232156 = idf(docFreq=2, docCount=2)
>   0.8029196 = tfNorm, computed from:
> 1.0 = termFreq=1.0
> 1.2 = parameter k1
> 0.75 = parameter b
> 2.5 = avgFieldLength
> 4.0 = fieldLength
> 
> 
> -0.004004207 = max plus 0.01 times others of:
>   -0.20021036 = weight(text:rare in 0) [], result of:
> -0.20

Re: BlendedTermQuery causing negative IDF?

2016-04-19 Thread Ahmet Arslan
Hi Again,

For those who are interested, I uploaded BM25's Term Frequency graph [0] for 
some common and content-bearing words.


[0] http://2.1m.yt/PgUEcZ.png

Ahmet




On Tuesday, April 19, 2016 5:16 PM, Ahmet Arslan <iori...@yahoo.com.INVALID> 
wrote:


Hi Markus,

It is a known property of BM25. It produces negative scores for common terms.
Most of the term-weighting models are developed for indices in which stop words 
are eliminated.
Therefore, most of the term-weighting models have problems scoring common terms.
By the way, DFI model does a decent job when handling common terms.

Ahmet



On Tuesday, April 19, 2016 4:48 PM, Markus Jelsma <markus.jel...@openindex.io> 
wrote:
Hello,

I just made a Solr query parser for BlendedTermQuery on Lucene 6.0 using BM25 
similarity and i have a very simple unit test to see if something is working at 
all. But to my surprise, one of the results has a negative score, caused by a 
negative IDF because docFreq is higher than docCount for that term on that 
field. Here are the test documents:

assertU(adoc("id", "1", "text", "rare term"));
assertU(adoc("id", "2", "text_nl", "less rare term"));
assertU(adoc("id", "3", "text_nl", "rarest term"));
assertU(commit());

My query parser creates the following Lucene query: 
BlendedTermQuery(Blended(text:rare text:term text_nl:rare text_nl:term)) which 
looks fine to me. But this is what i am getting back for issueing that query on 
the above set of documents, the third document is the one with a negative score.


  
response docs (id / score):
  3 / 0.1805489
  2 / 0.14785346
  1 / -0.004004207

rawquerystring / querystring: {!blended fl=text,text_nl}rare term
parsedquery: BlendedTermQuery(Blended(text:rare text:term text_nl:rare text_nl:term))
parsedquery_toString: Blended(text:rare text:term text_nl:rare text_nl:term)
explain:

0.1805489 = max plus 0.01 times others of:
  0.1805489 = weight(text_nl:term in 2) [], result of:
0.1805489 = score(doc=2,freq=1.0 = termFreq=1.0
), product of:
  0.18232156 = idf(docFreq=2, docCount=2)
  0.9902773 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
2.5 = avgFieldLength
2.56 = fieldLength


0.14785345 = max plus 0.01 times others of:
  0.14638956 = weight(text_nl:rare in 1) [], result of:
0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
  0.18232156 = idf(docFreq=2, docCount=2)
  0.8029196 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
2.5 = avgFieldLength
4.0 = fieldLength
  0.14638956 = weight(text_nl:term in 1) [], result of:
0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
  0.18232156 = idf(docFreq=2, docCount=2)
  0.8029196 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
2.5 = avgFieldLength
4.0 = fieldLength


-0.004004207 = max plus 0.01 times others of:
  -0.20021036 = weight(text:rare in 0) [], result of:
-0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
  -0.22314355 = idf(docFreq=2, docCount=1)
  0.89722675 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
2.0 = avgFieldLength
2.56 = fieldLength
  -0.20021036 = weight(text:term in 0) [], result of:
-0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
  -0.22314355 = idf(docFreq=2, docCount=1)
  0.89722675 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
2.0 = avgFieldLength
2.56 = fieldLength


What am i doing wrong? Or did i catch a bug?

Thanks,
Markus

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: BlendedTermQuery causing negative IDF?

2016-04-19 Thread Ahmet Arslan


Hi Markus,

It is a known property of BM25. It produces negative scores for common terms.
Most of the term-weighting models are developed for indices in which stop words 
are eliminated.
Therefore, most of the term-weighting models have problems scoring common terms.
By the way, DFI model does a decent job when handling common terms.

Ahmet



On Tuesday, April 19, 2016 4:48 PM, Markus Jelsma  
wrote:
Hello,

I just made a Solr query parser for BlendedTermQuery on Lucene 6.0 using BM25 
similarity and i have a very simple unit test to see if something is working at 
all. But to my surprise, one of the results has a negative score, caused by a 
negative IDF because docFreq is higher than docCount for that term on that 
field. Here are the test documents:

assertU(adoc("id", "1", "text", "rare term"));
assertU(adoc("id", "2", "text_nl", "less rare term"));
assertU(adoc("id", "3", "text_nl", "rarest term"));
assertU(commit());

My query parser creates the following Lucene query: 
BlendedTermQuery(Blended(text:rare text:term text_nl:rare text_nl:term)) which 
looks fine to me. But this is what i am getting back for issueing that query on 
the above set of documents, the third document is the one with a negative score.


  
response docs (id / score):
  3 / 0.1805489
  2 / 0.14785346
  1 / -0.004004207

rawquerystring / querystring: {!blended fl=text,text_nl}rare term
parsedquery: BlendedTermQuery(Blended(text:rare text:term text_nl:rare text_nl:term))
parsedquery_toString: Blended(text:rare text:term text_nl:rare text_nl:term)
explain:

0.1805489 = max plus 0.01 times others of:
  0.1805489 = weight(text_nl:term in 2) [], result of:
0.1805489 = score(doc=2,freq=1.0 = termFreq=1.0
), product of:
  0.18232156 = idf(docFreq=2, docCount=2)
  0.9902773 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
2.5 = avgFieldLength
2.56 = fieldLength


0.14785345 = max plus 0.01 times others of:
  0.14638956 = weight(text_nl:rare in 1) [], result of:
0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
  0.18232156 = idf(docFreq=2, docCount=2)
  0.8029196 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
2.5 = avgFieldLength
4.0 = fieldLength
  0.14638956 = weight(text_nl:term in 1) [], result of:
0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
  0.18232156 = idf(docFreq=2, docCount=2)
  0.8029196 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
2.5 = avgFieldLength
4.0 = fieldLength


-0.004004207 = max plus 0.01 times others of:
  -0.20021036 = weight(text:rare in 0) [], result of:
-0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
  -0.22314355 = idf(docFreq=2, docCount=1)
  0.89722675 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
2.0 = avgFieldLength
2.56 = fieldLength
  -0.20021036 = weight(text:term in 0) [], result of:
-0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
  -0.22314355 = idf(docFreq=2, docCount=1)
  0.89722675 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
2.0 = avgFieldLength
2.56 = fieldLength


What am i doing wrong? Or did i catch a bug?

Thanks,
Markus

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Custom indexing

2016-04-18 Thread Ahmet Arslan


Hi,

Please try letter tokenizer, it should cover your example.

Ahmet

On Monday, April 18, 2016 3:02 PM, PK C <tech.kumar...@gmail.com> wrote:



Hi,

   Thank you very much for your quick responses.

Jack Krupansky,

The main use case is searching in file names. For example, lucene.txt,
lucene_new.txt, lucene_1_new.txt. If I use 'lucene', I need to get all 3
files. With 'new' I need to get the last two files. Please note that the Standard
analyzer/tokenizer of Lucene 3.6 is not giving us the results with
tokenization of "." and "_". Are you referring to later versions than 3.6?

Ahmet,

1. Not sure if LetterTokenizer helps with the above example of having
numbers and letters in file names.
2. WordDelimiterFilter does not seem to be available in Lucene 3.6
3. MappingCharFilter is what I am already using, overriding the initReader
method in my CustomAnalyzer (Source copied from StandardAnalyzer (final
class)). Is this a good way to make use of final class StandardAnalyzer
with some custom changes ? Or composition is better ?

Thank you again,
Best Regards


On Tue, Apr 12, 2016 at 8:45 PM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> The standard analyzer/tokenizer should do a decent job of splitting on dot,
> hyphen, and underscore, in addition to whitespace and other punctuation.
>
> Can you post some specific test cases you are concerned with? (You should
> always run some test cases.)
>
> -- Jack Krupansky
>
> On Tue, Apr 12, 2016 at 10:35 AM, Ahmet Arslan <iori...@yahoo.com.invalid>
> wrote:
>
> > Hi Chamarty,
> >
> > Well, there are a lot of options here.
> >
> > 1) Use LetterTokenizer
> > 2) Use WordDelimiterFilter combined with WhitespaceTokenizer
> > 3) Use MappingCharFilter to replace those characters with spaces
> > .
> > .
> > .
> >
> > Ahmet
> >
> >
> > On Tuesday, April 12, 2016 3:58 PM, PrasannaKumar Chamarty <
> > tech.kumar...@gmail.com> wrote:
> >
> >
> >
> > Hi,
> >
> > What is the best way (in terms of maintenance required with new lucene
> > releases) to allow splitting of words on "." and "_" for indexing ? Thank
> > you.
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Custom indexing

2016-04-12 Thread Ahmet Arslan
Hi Chamarty,

Well, there are a lot of options here.

1) Use LetterTokenizer
2) Use WordDelimiterFilter combined with WhitespaceTokenizer
3) Use MappingCharFilter to replace those characters with spaces
.
.
.

Ahmet


On Tuesday, April 12, 2016 3:58 PM, PrasannaKumar Chamarty 
 wrote:



Hi,

What is the best way (in terms of maintenance required with new lucene
releases) to allow splitting of words on "." and "_" for indexing ? Thank
you.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Regarding the Lucene Proximity Search

2016-04-04 Thread Ahmet Arslan
Hi,

If you are writing your queries programmatically (without using a query 
parser), nested proximity is possible with the SpanQuery family. Actually there 
exists a surround query parser for this. Please see 
o.a.lucene.queryparser.surround.parser.QueryParser

Proximity search uses position information. You can restrict how many other 
terms can exist between query terms.

Ahmet
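
A rough sketch of the surround parser (from memory, so treat the exact API as approximate; W is ordered proximity, N is unordered, the number is the allowed distance, and the operators can be nested):

    SrndQuery parsed = org.apache.lucene.queryparser.surround.parser.QueryParser
            .parse("3W(lucene, 2N(proximity, search))");
    Query query = parsed.makeLuceneQueryField("body", new BasicQueryFactory(1024));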


On Monday, April 4, 2016 4:01 PM, lokesh mittal  
wrote:



Hi

I want to know how the proximity search in lucene works? Does lucene
supports the nested proximity search?

Thanks
Lokesh

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Subset Matching

2016-03-25 Thread Ahmet Arslan
Hi Otmar,

For this requirement, you need to create an additional field containing the 
number of words/terms in the field.


For example.

field : blue pill
length = 2


query : if you take the blue pill
length  : 6


Please see my previous responses on the same topic:
http://search-lucene.com/m/eHNluYPa11VSxlf1=Re+search+for+documents+where+all+words+of+field+present+in+the+query

http://search-lucene.com/m/eHNl9Yu6V1xx3rp=Re+Match+All+terms+in+indexed+field+value

I know they are Solr responses, but function queries exist in Lucene as well,
as far as I know.

Ahmet
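
At index time that can be as simple as the sketch below (field names are invented); at query time a function/frange-style query can then compare the stored length against the number of query terms that matched:

    Document doc = new Document();
    doc.add(new TextField("color", "blue pill", Field.Store.YES));
    doc.add(new NumericDocValuesField("color_len", 2));   // number of terms in the "color" value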
On Friday, March 25, 2016 11:20 AM, Otmar Caduff  wrote:



Hi all
In Lucene, I know of the possibility of Occur.SHOULD, Occur.MUST and the
“minimum should match” setting on the boolean query.

Now, when querying, I want to
- (1)  match the documents which either contain all the terms of the query
(Occur.MUST for all terms would do that) or,
- (2)  if all terms for a given field of a document are a subset of the
query terms, that document should match as well.

Any clue on how to accomplish this?

Otmar

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Problem with porter stemming

2016-03-14 Thread Ahmet Arslan
Hi Dwaipayan,

Another way is to use KeywordMarkerFilter. Stemmer implementations respect this 
attribute.
If you want to supply your own mappings, StemmerOverrideFilter could be 
used as well.

ahmet
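
Since EnglishAnalyzer already puts a SetKeywordMarkerFilter in front of its stemmer, the easiest route is probably its stem-exclusion constructor, roughly like this (the word lists are placeholders; Lucene 4.x additionally needs a Version argument):

    CharArraySet stopWords = new CharArraySet(Arrays.asList("the", "a", "of"), true); // your own stop list
    CharArraySet noStem    = new CharArraySet(Arrays.asList("news"), true);
    Analyzer analyzer = new EnglishAnalyzer(stopWords, noStem);  // "news" is passed through unstemmed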


On Monday, March 14, 2016 4:31 PM, Dwaipayan Roy  
wrote:



​I am using EnglishAnalyzer with my own stopword list. EnglishAnalyzer uses
the porter stemmer (snowball) to stem the words. But using the
EnglishAnalyzer, I am getting erroneous result for 'news'. 'news' is
getting stemmed into 'new'.

Any help would be appreciated.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Top terms relevance from specific documents ?

2016-01-27 Thread Ahmet Arslan
Hi Yannick,

More like this (mlt) stuff does this already.
It extracts "interesting terms" from top N documents.
I don't remember exactly, but this feature may require "term vectors" to be stored.

Ahmet
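
A small sketch of pulling those terms out directly with MoreLikeThis (assumes an open IndexReader/Analyzer and the docId of the document of interest; the field names follow the mail below, the thresholds are arbitrary):

    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setAnalyzer(analyzer);
    mlt.setFieldNames(new String[] { "Title", "Summary", "Description" });
    mlt.setMinTermFreq(1);   // keep even terms that occur only once
    mlt.setMinDocFreq(1);
    String[] interesting = mlt.retrieveInterestingTerms(docId);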



On Wednesday, January 27, 2016 10:41 AM, Yannick Martel  
wrote:
Le Tue, 15 Dec 2015 17:56:05 +0100,
Yannick Martel  a écrit :

> Hi !
> 
> I am using (Java) Lucene for data indexation, and I want to produce
> kind of tags cloud for specific data.
> 
> I've found HighFreqTerms to get a top list of terms from *all
> documents* (if I have understood correctly) (by the way, I had to override it
> to be able to filter on several fields instead of only one).
> 
> But, it does not really match with my need : I'd like to get the most
> repeated terms in a single (or several specific) document(s).
> For exemple, considering a document with Terms "Title", "Summary",
> "Description", I try to get the count of each terms (excluding stop
> words from Analyzer).
> 
> I cannot find process to do that : I searched among TopFieldCollector,
> or other collector, but seems it just give document scores :/
> 
> Find documentation is not easy I think, cause lot of questions/answers
> are either not corresponding my need, or with old version (3.x for
> example), and I'm feeling lost in all of this...
> 
> 
> Hopping someone could guide me well.
> 
> Regards,
> 

Hello,

After more than one month with no response, should I conclude what I
want is not possible with Lucene ?


Regards,

-- 
Yannick Martel


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to escape URL at indexing time

2015-12-27 Thread Ahmet Arslan
Hi Daniel,

The exception you have posted is a parse exception. 
It occurs during querying, not indexing.

There are some special characters that are part of query parsing syntax.
You need to escape them.

Ahmet
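
Two sketches of that (the analyzer choice is only an example): either escape the value before handing it to the parser, or skip the parser entirely for exact-match StringFields:

    Query q1 = new QueryParser("id", new KeywordAnalyzer())
            .parse("id:" + QueryParser.escape("http://www.yahoo.com"));

    Query q2 = new TermQuery(new Term("id", "http://www.yahoo.com"));   // no parsing, no escaping needed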




On Sunday, December 27, 2015 10:53 PM, Daniel Valdivia 
 wrote:
Hi

I'm trying to index documents that have a URL in some field, however as soon as 
I try to index a URL like "http://yahoo.com" I get this error:

org.apache.lucene.queryparser.classic.ParseException: Cannot parse 
'id:'http://www.yahoo.com'': Encountered " ":" ": "" at line 1, column 8.

I asume I need to escape the URL, but not sure if encoding the URL is the right 
way to go.

my indexing code:

Document doc = new Document();

doc.add(new StringField("id", url, Field.Store.YES));
doc.add(new StringField("domain", domain, Field.Store.NO));
doc.add(new StringField("title", pageTitle, Field.Store.NO));
doc.add(new TextField("body", pageBody, Field.Store.NO));
w.addDocument(doc);

Any ideas on how I can avoid the parsing issue?

I’m using Lucene 5.4.0

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Jensen–Shannon divergence

2015-12-13 Thread Ahmet Arslan
Hi Shay,

I suggest you to extend o.a.l.search.similarities.SimilarityBase.
All you need to implement a score() method. After all fancy names (language 
models, etc), a similarity is a function of seven salient statistics. It is 
actually six: avgFieldLength can derived from other two (numberOfFieldTokens 
divided by numberOfDocuments)

Seven Statistics come from,
Corpus statistics : numberOfDocuments, numberOfFieldTokens, avgFieldLength
Term statistics: totalTermFreq and docFreq
About the document being scored : within document term frequency (freq) and 
document length (docLen)

If you can express your ranking method in terms of these seven variables, you 
are ready to go. For example my Dirichlet LM model implementation is nothing 
but :

return log2(1 + (tf / (c * (termFrequency / numberOfTokens)))) + log2(c / 
(docLength + c));

If you need additional statistics, number of unique terms in a document for 
example, you need to calculate it by your self and embed it to the index 
(possibly using DocValues). During scoring, you can retrieve it.

Personally, I am curious about your similarity. If possible, please let the community 
know about its effectiveness.

Please also see Robert's write-up : 
http://lucidworks.com/blog/2011/09/12/flexible-ranking-in-lucene-4/

Thanks,
Ahmet
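
Putting the pieces together, a bare-bones subclass might look like the sketch below (the class name and the mu constant are arbitrary; log2() and BasicStats come from SimilarityBase):

    public class MyDirichletLM extends SimilarityBase {
        private final double mu = 2000;
        @Override
        protected float score(BasicStats stats, float freq, float docLen) {
            // p(term | collection) from the corpus statistics listed above
            double pC = (double) stats.getTotalTermFreq() / stats.getNumberOfFieldTokens();
            return (float) (log2(1 + freq / (mu * pC)) + log2(mu / (docLen + mu)));
        }
        @Override
        public String toString() {
            return "MyDirichletLM(mu=" + mu + ")";
        }
    }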


On Sunday, December 13, 2015 6:28 PM, will martin  wrote:
Sorry it was early.

If you go looking on the web, you can find, as I did, reputable work on 
implementing Dirichlet Language Models. However, at this hour you might get 
answers here. Extrapolating others' work into a Lucene implementation is only 
slightly different from getting answers here. imo

g'luck



> On Dec 13, 2015, at 10:55 AM, Shay Hummel  wrote:
> 
> Hi
> 
> I am sorry but I didn't understand your answer. Can you please elaborate?
> 
> Shay
> 
> On Sun, Dec 13, 2015 at 3:41 PM will martin  wrote:
> 
>> expand your due diligence beyond wikipedia:
>> i.e.
>> 
>> http://ciir.cs.umass.edu/pubfiles/ir-464.pdf
>> 
>> 
>> 
>>> On Dec 13, 2015, at 8:30 AM, Shay Hummel  wrote:
>>> 
>>> LMDiricletbut its feasibilit
>> 
> -- 
> Regards,
> Shay Hummel


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Position and Range Information

2015-12-11 Thread Ahmet Arslan
Hi,

Yes, TextField includes positions.

Ahmet



On Friday, December 11, 2015 5:40 PM, Douglas Kunzma  
wrote:
All -

I'm using a TextField and a BufferedReader to add text to a Lucene Document
object.
Can I still get all of the matches in a Document including the position
information and start and end offset using Lucene 5.3.1?

Thanks,
Doug

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: dynamic pruning (WAND) supported ??

2015-12-03 Thread Ahmet Arslan
Hi Zong,


I don't think Lucene has this. People usually need all candidate documents to 
be scored.
They sometimes sort by price, popularity, etc, sometimes combined with document 
relevancy scores. 

However, with time limited collector, closest thing could be: 
https://issues.apache.org/jira/browse/LUCENE-2482

Ahmet
On Thursday, December 3, 2015 7:53 AM, search engine 
 wrote:



Hi,

Does Lucene have any dynamic pruning mechanism in place now to make posting
scoring more efficient?

thanks,
Zong

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene classpath

2015-12-03 Thread Ahmet Arslan
Hi,

Maybe the Windows path separator is messing things up.
Can you try copying the jars to the current working directory and re-try:
java -classpath lucene-demo-5.3.1.jar;lucene-core-5.3.1.jar
Ahmet



On Thursday, December 3, 2015 11:57 PM, jerrittpace  
wrote:
I am trying to set the classpath for the lucene jars

I have tried many different variations of the following:

java -classpath
C:\Users\User5\Documents\lucene\lucene-5.3.1\demo\lucene-demo-5.3.1.jar;C:\Users\User5\Documents\lucene\lucene-5.3.1\core\lucene-core-5.3.1.jar
org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}/src

That code is similar to the commands I have found in the various examples
i've found on the web addressing this sort of issue.

I also tried

java -classpath
C:\Users\User5\Documents\lucene\lucene-5.3.1\demo\lucene-demo-5.3.1.jar;C:\Users\User5\Documents\lucene\lucene-5.3.1\core\lucene-core-5.3.1.jar
org.apache.lucene.demo.IndexFiles -docs
{C:\Users\User5\Documents\lucene\lucene-5.3.1}/src

The lucene site at https://lucene.apache.org/core/2_9_4/demo.html basically
just says "Put both of these files in your Java CLASSPATH."

I am not able to solve this concern as of now, and so I would really
appreciate any help I can find to steer me in the right direction to solve
this problem.

Thank you in advance for your help!!




--
View this message in context: 
http://lucene.472066.n3.nabble.com/lucene-classpath-tp4243489.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Access query length inside similarity

2015-11-03 Thread Ahmet Arslan
Hi,

I only use BooleanQuery with TermQuery clauses.

I found the following methods that seem relevant to my need.
There is a variable named maxOverlap, which is the total number of terms in the 
query.

BooleanScorer's constructor has maxCoord variable
Similarity#coord
BooleanWeight#coord


How can I pass the query length (maxOverlap/maxCoord) into the 
Similarity.SimScorer#score method?

Any help on this is really appreciated.

Thanks,
Ahmet



On Tuesday, October 27, 2015 10:27 AM, Ahmet Arslan <iori...@yahoo.com> wrote:
Hi,

How can I access length of the query (number of words in the query) inside a 
SimilarityBase implementation?

P.S. I am implementing multi-aspect TF [1] for an experimental study.
So it does not have to be fast/optimized as production code.

[1] http://dl.acm.org/citation.cfm?doid=2484028.2484070

Thanks,
Ahmet

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Access query length inside similarity

2015-10-27 Thread Ahmet Arslan
Hi,

How can I access length of the query (number of words in the query) inside a 
SimilarityBase implementation?

P.S. I am implementing multi-aspect TF [1] for an experimental study.
So it does not have to be fast/optimized as production code.

[1] http://dl.acm.org/citation.cfm?doid=2484028.2484070

Thanks,
Ahmet

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Dubious stuff spotted in LowerCaseFilter

2015-10-22 Thread Ahmet Arslan
Hi Uwe,

What is the meaning of "the Unicode Policeman" ?


Thanks,
Ahmet

On Thursday, October 22, 2015 2:59 PM, Uwe Schindler  wrote:



Hi,


> >> Setting aside the fact that Character.toLowerCase is already dubious
> >> in some locales (e.g. Turkish),
> >
> > This is not true. Character.toLowerCase() works locale-independent.
> > It is only String.toLowerCase that works using default locale.

So you mean the opposite. You wanted to have it locale-dependent. That’s 
already possible: LowercaseFilter is documented to only use default unicode 
folding, no locale specific stuff. If you have a turkish lucene field, you need 
to do locale-specific analysis anyways (e.g. use TukishAnalyzer). This one uses 
TurkishLowercaseFilter. Having both variant as synonyms needs more work, but 
out of the scope of this mail thread.

> Yet if you have a field like "title" and the user and system are Turkish, the
> user would expect their locale to apply, yet LowerCaseFilter will not handle
> that. So whereas it is "safe" for English hard-coded strings, it isn't safe 
> for all
> fields you might index in general.

That's documented like that!

> Dawid's response shows, though, that at least for the time being, there is
> nothing to worry about. Hopefully Unicode will never add a code point which
> lowercases to one with less code units (or I guess changes one of the lower
> ones to lowercase to more than one...)

There was a discussion about that in JIRA already at the time of rewriting 
LowercaseFilter to allow supplementary characters outside the BMP. I have to look up the 
issue, but I am quite sure that the Unicode Policeman did a lot of research 
and found some statement in the Unicode spec that the upper and lowercase letters 
are always in the same block. I will try to look this up.


Uwe


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Learning to Rank algorithms in Lucene

2015-08-18 Thread Ahmet Arslan
Hi Ajinkya,

I don't think there exists any production-ready LtR-Lucene/Solr setup.

LtR simply re-ranks the top N (typically 1000) documents. 
Fetching the top N documents is what we do today with Lucene.

There is an API for re-rank in Lucene/Solr but no LtR support yet.
https://cwiki.apache.org/confluence/display/solr/Query+Re-Ranking

Here are the difficulties/problems :

* LtR requires training data (probably labelled by humans)
* It is hard to decide the feature set. Also it differs from system to system.
* Query-dependent features must be calculated for the top N documents at 
query/retrieval time, which may be slow.

Today, function queries are generally used to combine recency, popularity, 
star rating, product/document quality, price, etc. into the scoring function.
This approach is unsupervised therefore requires no training data.

Ahmet



On Tuesday, August 18, 2015 10:34 AM, Ajinkya Kale kaleajin...@gmail.com 
wrote:
Are there any existing packages/examples or prior experience on using
Learning to Rank (or Machine Learned Ranking) algorithms as custom
Scorer/Ranker for lucene or solr ?
How do people deploy Learning to Rank models with Lucene backends ?

--ajinkya

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Using lucene queries to search StringFields

2015-06-19 Thread Ahmet Arslan
Hi,

Why don't you create your query with API?

Term term = new Term("B", "1 2");
Query query = new TermQuery(term);

Ahmet



On Friday, June 19, 2015 9:31 AM, Gimantha Bandara giman...@wso2.com wrote:
Correction..

second time I used the following code to test. Then I got the above
IllegalStateException issue.

w = new QueryParser(null, new WhitespaceAnalyzer()).parse("B:\"1 2\"");

not the below one.

w = new QueryParser(null, new WhitespaceAnalyzer()).parse("\"B:1 2\"");

Can someone point out the correct way to query for StringFields?

Thanks,

On Thu, Jun 18, 2015 at 2:12 PM, Gimantha Bandara giman...@wso2.com wrote:

 Hi all,

 I have created lucene documents like below.

 Document doc = new Document();
 doc.add(new TextField("A", "1", Field.Store.YES));
 doc.add(new StringField("B", "1 2 3", Field.Store.NO));
 doc.add(new TextField("Publish Date", "2010", Field.Store.NO));
 indexWriter.addDocument(doc);

 doc = new Document();
 doc.add(new TextField("A", "2", Field.Store.YES));
 doc.add(new StringField("B", "1 2", Field.Store.NO));
 doc.add(new TextField("Publish Date", "2010", Field.Store.NO));
 indexWriter.addDocument(doc);

 doc = new Document();
 doc.add(new TextField("A", "3", Field.Store.YES));
 doc.add(new StringField("B", "1", Field.Store.NO));
 doc.add(new TextField("Publish Date", "2012", Field.Store.NO));
 indexWriter.addDocument(doc);

 Now I am using the following code to test the StringField behavior.

 Query w = null;
 try {
 w = new QueryParser(null, new WhitespaceAnalyzer()).parse("B:1 2");
 } catch (ParseException e) {
 e.printStackTrace();
 }
 TopScoreDocCollector collector = TopScoreDocCollector.create(100,
 true);
 searcher.search(w, collector);
 ScoreDoc[] hits = collector.topDocs(0).scoreDocs;
 Document indexDoc;
 for (ScoreDoc doc : hits) {
 indexDoc = searcher.doc(doc.doc);
 System.out.println(indexDoc.get("A"));
 }

 Above code should print only the second document's 'A' value as it is the
 only one where 'B' has value '1 2'. But it returns the 3rd document. So I
 tried using double quotation marks for 'B' value as below.

 w = new QueryParser(null, new WhitespaceAnalyzer()).parse("\"B:1 2\"");

 It gives the following error.

 Exception in thread main java.lang.IllegalStateException: field B was
 indexed without position data; cannot run PhraseQuery (term=1)
 at
 org.apache.lucene.search.PhraseQuery$PhraseWeight.scorer(PhraseQuery.java:277)
 at org.apache.lucene.search.Weight.bulkScorer(Weight.java:131)
 at
 org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:618)
 at
 org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)   Is
 my searching query wrong? (Note: I am using whitespace analyzer everywhere)

 --
 Gimantha Bandara
 Software Engineer
 WSO2. Inc : http://wso2.com
 Mobile : +94714961919





-- 
Gimantha Bandara
Software Engineer
WSO2. Inc : http://wso2.com
Mobile : +94714961919

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Tf and Df in lucene

2015-06-15 Thread Ahmet Arslan
Hi Hummel,

regarding df,

Term term = new Term(field, word);
TermStatistics termStatistics = searcher.termStatistics(term, 
TermContext.build(reader.getContext(), term));
System.out.println(query + "\t totalTermFreq \t" + 
termStatistics.totalTermFreq());
System.out.println(query + "\t docFreq \t" + termStatistics.docFreq());

regarding tf,

Term term = new Term(field, word);
Bits bits = MultiFields.getLiveDocs(reader);
PostingsEnum postingsEnum = MultiFields.getTermDocsEnum(reader, bits, field, 
term.bytes());

if (postingsEnum == null) return;

int max = 0;
while (postingsEnum.nextDoc() != PostingsEnum.NO_MORE_DOCS) {
    final int freq = postingsEnum.freq();
    int docID = postingsEnum.docID();
}


Ahmet




On Monday, June 15, 2015 9:12 AM, Shay Hummel shay.hum...@gmail.com wrote:
Hi

I was wondering, what is the easiest way to get the term frequency of a
term t in document d, namely tf(t,d) ?
In the same spirit - what is the easieast way the get the document
frequency of a term in the collection, i.e. how many contain the term t,
namely df(t) ?

Regards,
Shay

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Tf and Df in lucene

2015-06-15 Thread Ahmet Arslan
Hi,

If you are interested in summed up tf values of multiple terms, 
I suggest to extend SimilarityBase class to return raw freq as score.

float score(BasicStats stats, float freq, float docLen){
return freq;
}

When you use this similarity and search with a three-term query, the scores will be the summed 
tf values. You can also extract additional info from the explain feature.

Ahmet




On Monday, June 15, 2015 5:50 PM, Shay Hummel shay.hum...@gmail.com wrote:
Hi Ahmet

Thank you for the reply.
Can the term reflect a multi word expression?
For example:
I want to find the term frequency / document frequency of "united states"
(two terms) or "free speech zones" (three terms).

Shay


On Mon, Jun 15, 2015 at 4:55 PM Ahmet Arslan iori...@yahoo.com.invalid
wrote:

 Hi Hummel,

 regarding df,

 Term term = new Term(field, word);
 TermStatistics termStatistics = searcher.termStatistics(term,
 TermContext.build(reader.getContext(), term));
 System.out.println(query + \t totalTermFreq \t  +
 termStatistics.totalTermFreq());
 System.out.println(query + \t docFreq \t  + termStatistics.docFreq());

 regarding tf,

 Term term = new Term(field, word);
 Bits bits = MultiFields.getLiveDocs(reader);
 PostingsEnum postingsEnum = MultiFields.getTermDocsEnum(reader, bits,
 field, term.bytes());

 if (postingsEnum == null) return;

 int max = 0;
 while (postingsEnum.nextDoc() != PostingsEnum.NO_MORE_DOCS) {
 final int freq = postingsEnum.freq();
 int docID = postingsEnum.docID();}


 Ahmet




 On Monday, June 15, 2015 9:12 AM, Shay Hummel shay.hum...@gmail.com
 wrote:
 Hi

 I was wondering, what is the easiest way to get the term frequency of a
 term t in document d, namely tf(t,d) ?
 In the same spirit - what is the easieast way the get the document
 frequency of a term in the collection, i.e. how many contain the term t,
 namely df(t) ?

 Regards,
 Shay

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IllegalArgumentException: docID must be = 0 and maxDoc=48736112 (got docID=2147483647)

2015-05-30 Thread Ahmet Arslan
Hi Robert,

Great info. I prevented corner cases in similarities 
that were producing NaN or Negative Infinity scores.

All is well with -ea now.

Thanks,
Ahmet



On Friday, May 29, 2015 3:32 PM, Robert Muir rcm...@gmail.com wrote:
Hi Ahmet,

Its due to the use of sentinel values by your collector in its
priority queue by default.

TopScoreDocCollector warns about this, and if you turn on assertions
(-ea) you will hit them in your tests:

* pbNOTE/b: The values {@link Float#NaN} and
* {@link Float#NEGATIVE_INFINITY} are not valid scores.  This
* collector will not properly collect hits with such
* scores.
*/
public abstract class TopScoreDocCollector extends TopDocsCollectorScoreDoc {

I don't think a fix is simple, I only know of the following ideas:
* somehow sneaky use of NaN as sentinels instead of -Inf, to allow
-Inf to be used. It seems a bit scary!
* remove the sentinels optimization. I am not sure if collectors could
easily have the same performance without them.

To me, such scores seem always undesirable and only bugs, and the
current assertions are a good tradeoff.


On Fri, May 29, 2015 at 8:18 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote:
 Hello List,

 When a similarity returns NEGATIVE_INFINITY, hits[i].doc becomes 2147483647.
 Thus, exception is thrown in the following code:

 for (int i = 0; i  hits.length; i++) {
 int docId = hits[i].doc;
 Document doc = searcher.doc(docId);
 }

 I know it is awkward to return infinity (it comes from log(0)), but the exception 
 looks equally
 awkward and uninformative.

 Do you think is this something improvable? Can we do better handling here?

 Thanks,
 Ahmet

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



IllegalArgumentException: docID must be = 0 and maxDoc=48736112 (got docID=2147483647)

2015-05-29 Thread Ahmet Arslan
Hello List,

When a similarity returns NEGATIVE_INFINITY, hits[i].doc becomes 2147483647.
Thus, exception is thrown in the following code:

for (int i = 0; i < hits.length; i++) {
int docId = hits[i].doc;
Document doc = searcher.doc(docId);
}

I know it is awkward to return infinity (it comes from log(0)), but the exception
looks equally awkward and uninformative.

Do you think this is something improvable? Can we do better error handling here?
 
Thanks,
Ahmet

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



access query term in similarity calcuation

2015-05-23 Thread Ahmet Arslan
Hi,

I have a number of similarity implementation that extends SimilarityBase.
I need to learn which term I am scoring inside the method :
abstract float score(BasicStats stats, float freq, float docLen); 

What is the easiest way to access the query term that I am scoring in 
similarity class?

Thanks,
Ahmet

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




intersection of two posting lists

2015-05-08 Thread Ahmet Arslan
Hello All,

I am traversing the posting list of a single term with the following code (not sure if
there is a better way).
Now I need to handle/aggregate multiple terms: traverse the intersection of
multiple posting lists and obtain the summed freq() of those terms per document.
What is the easiest way to obtain these statistics? Is there an API/method to
do that?

Term term = new Term(field, word);
Bits bits = MultiFields.getLiveDocs(reader);
PostingsEnum postingsEnum = MultiFields.getTermDocsEnum(reader, bits, field, 
term.bytes());

while (postingsEnum.nextDoc() != PostingsEnum.NO_MORE_DOCS) {
postingsEnum.freq();
postingsEnum.docID()}

Thanks,
Ahmet

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Phrase query given a word

2015-04-23 Thread Ahmet Arslan
Hi,

May be LUCENE-5317 relevant?

Ahmet

On Thursday, April 23, 2015 8:33 PM, Shashidhar Rao 
raoshashidhar...@gmail.com wrote:
Hi,

I have a large text and from that I need to calculate the top frequencies
of words,
say 'Driving' occurs the most.

Now, I need to find phrases containing 'Driving' in the given text and the
frequency count of each phrase. The phrase could be three words where
'Driving' is in the middle, or the first word with two words after it,
or the last word with two words before it.

I would appreciate it if someone could post the source code. I am using
Lucene 4.10;
please let me know if this is possible.

Currently, I am doing a brute force but that isn't helping me much.

Please help

Thanks
sd

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Text dependent analyzer

2015-04-17 Thread Ahmet Arslan
Hi Hummel,

There was an effort to bring open-nlp capabilities to Lucene: 
https://issues.apache.org/jira/browse/LUCENE-2899

Lance was working on it to keep it up-to-date. But, it looks like it is not 
always best to accomplish all things inside Lucene.
I personally would do the sentence detection outside of Lucene.
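For example, with OpenNLP it is just a few lines before the text ever reaches the analyzer (a sketch; the model path is illustrative):

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

InputStream modelIn = new FileInputStream("en-sent.bin"); // the model file path is just an example
SentenceModel model = new SentenceModel(modelIn);
SentenceDetectorME detector = new SentenceDetectorME(model);
String[] sentences = detector.sentDetect(documentText);   // then analyze/index each sentence separately
modelIn.close();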

By the way, I remember there was a way to consume the whole upstream token stream.

I think it was consuming all input and injecting one concatenated huge 
term/token.

KeywordTokenizer has similar behaviour. It injects a single token.
http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/analysis/KeywordAnalyzer.html

Ahmet


On Wednesday, April 15, 2015 3:12 PM, Shay Hummel shay.hum...@gmail.com wrote:
Hi Ahment,
Thank you for the reply,
That's exactly what I am doing. At the moment, to index a document, I break
it into sentences, and each sentence is analyzed (lemmatizing, stopword
removal etc.)
Now, what I am looking for is a way to create an analyzer (a class which
extends lucene's analyzer). This analyzer will be used for index and query
processing. It (like the English analyzer) will receive the text and
produce tokens.
The API of Analyzer requires implementing createComponents, which does not depend
on the text being analyzed. This fact is problematic since, as you know, the
OpenNlp sentence breaking depends on the text it gets (OpenNlp uses the
model files to provide spans of each sentence and then break them).
Is there a way around it?

Shay


On Wed, Apr 15, 2015 at 3:50 AM Ahmet Arslan iori...@yahoo.com.invalid
wrote:

 Hi Hummel,

 You can perform sentence detection outside of the solr, using opennlp for
 instance, and then feed them to solr.

 https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect

 Ahmet




 On Tuesday, April 14, 2015 8:12 PM, Shay Hummel shay.hum...@gmail.com
 wrote:
 Hi
 I would like to create a text dependent analyzer.
 That is, *given a string*, the analyzer will:
 1. Read the entire text and break it into sentences.
 2. Each sentence will then be tokenized, have possessives removed, be lowercased,
 have terms marked, and be stemmed.

 The second part is essentially what happens in the English analyzer
 (createComponents). However, this does not depend on the text it receives -
 which is the first part of what I am trying to do.

 So ... How can it be achieved?

 Thank you,

 Shay Hummel

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Text dependent analyzer

2015-04-14 Thread Ahmet Arslan
Hi Hummel,

You can perform sentence detection outside of the solr, using opennlp for 
instance, and then feed them to solr.
https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect

Ahmet




On Tuesday, April 14, 2015 8:12 PM, Shay Hummel shay.hum...@gmail.com wrote:
Hi
I would like to create a text dependent analyzer.
That is, *given a string*, the analyzer will:
1. Read the entire text and break it into sentences.
2. Each sentence will then be tokenized, have possessives removed, be lowercased,
have terms marked, and be stemmed.

The second part is essentially what happens in the English analyzer
(createComponents). However, this does not depend on the text it receives -
which is the first part of what I am trying to do.

So ... How can it be achieved?

Thank you,

Shay Hummel

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: CachingTokenFilter tests fail when using MockTokenizer

2015-03-23 Thread Ahmet Arslan
Hi Spyros,

Not 100% sure but I think you should override reset method.

@Override
public void reset() throws IOException {
super.reset();

cachedInput = null;
}

Ahmet


On Monday, March 23, 2015 1:29 PM, Spyros Kapnissis ska...@yahoo.com.INVALID 
wrote:
Hello, 
We have a couple of custom token filters that use CachingTokenFilter 
internally. However, when we try to test them with MockTokenizer so that we can 
have these nice TokenStream API checks that it provides, the tests fail with: 
java.lang.AssertionError: end() called before incrementToken() returned false!

Here is a link with a unit test to reproduce the issue:
https://gist.github.com/spyk/c783c72689410070811b
Do we misuse CachingTokenFilter? Or is it an issue of MockTokenizer when used
with CachingTokenFilter?
Thanks!
Spyros

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Would Like to contribute to Lucene

2015-03-19 Thread Ahmet Arslan
Hi Gimanta,

Not sure about the lucene internals, but here are some pointers :

http://find.searchhub.org/document/a81b4c9af49c3d0f

http://find.searchhub.org/?q=contribute#%2Fp%3Alucene%2Fs%3Aemail


Ahmet



On Thursday, March 19, 2015 3:58 PM, Gimantha Bandara giman...@wso2.com wrote:
Any clue on where to start from?

On Fri, Mar 13, 2015 at 11:24 AM, Gimantha Bandara giman...@wso2.com
wrote:

 Hi all,

 I am willing to contribute to Lucene project. I have already been
 referring to Lucene in Action 2nd edition recently. But I think it is
 outdated. It is based on lucene 3.0.x I guess. Even through online
 resources, it is very hard to learn the internals of lucene because of the
 lack of up-to-date resources. Can someone recommend a recently released
 book on lucene internals or has someone planned to write one? What would be
 the starting point if I need to learn the internals of Lucene?

 Thanks,

 --
 Gimantha Bandara
 Software Engineer
 WSO2. Inc : http://wso2.com
 Mobile : +94714961919





-- 
Gimantha Bandara
Software Engineer
WSO2. Inc : http://wso2.com
Mobile : +94714961919

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: understanding the norm encode and decode

2015-03-05 Thread Ahmet Arslan


Hi András,

Thats a good catch! Do you want to correct that javadoc mistake and create a 
patch?
https://wiki.apache.org/lucene-java/HowToContribute

If you don't have a jira account, anyone can create it.
https://issues.apache.org/jira/browse/lucene

Ahmet


On Thursday, March 5, 2015 11:15 AM, András Péteri 
apet...@b2international.com wrote:
Sorry, I also got it wrong in the previous message. :) It goes 0.89f
-> 123 -> 0.875f.

On Thu, Mar 5, 2015 at 10:08 AM, András Péteri
apet...@b2international.com wrote:
 Hi Andrew,

 If you are using Lucene 3.6.1, you can take a look at the method which
 creates a single byte value out of the received float using bit
 manipulation at [1]. There is also a 256-element decoder table in
 Similarity, where each byte corresponds to a decoded float value
 computed by [2].

 The first method encodes 0.89f to byte 123. 123 is decoded to 0.85f
 via the second method, so it seems that the documentation is incorrect
 in this regard.

 [1] 
 https://github.com/apache/lucene-solr/blob/lucene_solr_3_6_1/lucene/core/src/java/org/apache/lucene/util/SmallFloat.java#L75
 [2] 
 https://github.com/apache/lucene-solr/blob/lucene_solr_3_6_1/lucene/core/src/java/org/apache/lucene/util/SmallFloat.java#L88

 On Thu, Mar 5, 2015 at 3:45 AM, wangdong hrdxwa...@gmail.com wrote:
 thank you for your discussion.

 I am a junior user of Lucene, so I am not familiar with some of the deeper concepts
 you mentioned.
 My question is simple: I just want to know how to get 0.75 from
 decode(encode(0.89)) in the official documentation.

 why not 0.875?   (0.875 = 0.5 + 0.25 + 0.125)

 thanks
 andrew

 在 2015/3/4 22:54, Adrien Grand 写道:

 Norms and doc values are indeed using the same API. However
 implementations differ a bit (eg. norms are stored in memory and use
 different compression schemes).

 The precision loss is up to the similarity. You could write a
 similarity impl which keeps full float precision, but scoring being
 fuzzy anyway this would multiply your memory needs for norms by 4
 while not really improving the quality of the scores of your
 documents. This precision loss is the right trade-off for most
 use-cases.

 On Wed, Mar 4, 2015 at 3:04 PM, Ahmet Arslan iori...@yahoo.com.invalid
 wrote:

 Hi Adrien,

 I read somewhere that norms are stored using docValues.
 In my understanding, docvalues can store lossless float values.
 So the question is, why are still several decode/encode methods exist in
 similarity implementations?
 Intuitively switching to docvalues for norms should prevent precision
 loss thing.

 Ahmet


 On Wednesday, March 4, 2015 3:22 PM, Adrien Grand jpou...@gmail.com
 wrote:
 Hi,

 Floats require 32 bits but norms are encoded on a single byte. So
 there is a precision loss when encoding float values into a single
 byte. In your example, 0.75 and 0.89 are sufficiently close to each
 other so that they are encoded to the same byte.


 On Wed, Mar 4, 2015 at 4:48 AM, wangdong hrdxwa...@gmail.com wrote:

 I read the article about the scoring section in lucene as follows:

  Encoding and decoding of the resulting float norm in a single byte are done
  by the static methods of the class Similarity: encodeNorm()
  http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/search/Similarity.html#encodeNorm%28float%29
  and decodeNorm()
  http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/search/Similarity.html#decodeNorm%28byte%29.
  Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
  e.g. decode(encode(0.89)) = 0.75. At scoring (search) time, this norm is
  brought into the score of the document as *norm(t, d)*, as shown by the
  formula in Similarity
  http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/search/Similarity.html.

  I cannot understand the formula decode(encode(0.89)) = 0.75.
  How can I get the 0.75 from the left-hand side?

  Can anyone help me?
  Thanks in advance!

 andrew



 --
 Adrien

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





 --
 András



-- 
Péteri András


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: getting number of terms in a document/field

2015-02-08 Thread Ahmet Arslan
Hi,

Sorry for my ignorance, how do I obtain an AtomicReader from an IndexReader?

I figured out the following code, but it gives me a list of atomic readers.

for (AtomicReaderContext context : reader.leaves()) {

NumericDocValues docValues = context.reader().getNormValues(field);

if (docValues != null) 
normValue = docValues.get(docID);
}

I implemented the custom similarity you advised by merging TFIDFSimilarity and
DefaultSimilarity.
The computeNorm(FieldInvertState state) method was final in TFIDFSimilarity, so I
just couldn't extend it.
I was able to retrieve those long values from a single-segment index, but I
didn't like this solution,
because I am experimenting with different similarity implementations.

It looks like there is no easy way to access
FieldInvertState.length() and index this value into an independent
NumericDocValues field, say numTerms, other than the norms.


I think I will compute the lengths of the fields myself.
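
For reference, the doc-values route would look roughly like this (a sketch against the 4.10 API; the field name numTerms and the countTokens() helper are only illustrative):

// index time: store the field length yourself
Document doc = new Document();
doc.add(new TextField("body", text, Field.Store.NO));
doc.add(new NumericDocValuesField("numTerms", countTokens(text))); // countTokens() is a hypothetical helper

// search time: read it back per document
NumericDocValues numTerms = MultiDocValues.getNumericValues(reader, "numTerms");
if (numTerms != null) {
    long fieldLength = numTerms.get(docID);
}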

Thanks,
Ahmet


On Friday, February 6, 2015 5:31 PM, Michael McCandless 
luc...@mikemccandless.com wrote:
On Fri, Feb 6, 2015 at 8:51 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote:
 Hi Michael,

 Thanks for the explanation. I am working with a TREC dataset,
 since it is static, I set size of that array experimentally.

 I followed the DefaultSimilarity#lengthNorm method a bit.

 If default similarity and no index time boost is used,
 I assume that norm equals to  1.0 / Math.sqrt(numTerms).

 First option is somehow obtain pre-computed norm value and apply reverse 
 operation to obtain numTerms.
 numTerms = (1/norm)^2  This will be an approximation because norms are stored 
 in a byte.
 How do I access that norm value for a given docid and a field?

See the AtomicReader.getNormValues method.

 Second option, I store numTerms as a separate field, like any other organic 
 fields.
 Do I need to calculate it by myself? Or can I access above already computed 
 numTerms value during indexing?

 I think I will follow second option.
 Is there a pointer where reading/writing a DocValues based field example is 
 demostrated?

You could just make your own Similarity impl, that encodes the norm
directly as a length?  It's a long so you don't have to compress if
you don't want to.

That custom Similarity is passed FieldInvertState which contains the
number of tokens in the current field, so you can just use that
instead of computing it yourself.


Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


 





Re: getting number of terms in a document/field

2015-02-06 Thread Ahmet Arslan
Hi Michael,

Thanks for the explanation. I am working with a TREC dataset;
since it is static, I set the size of that array experimentally.

I followed the DefaultSimilarity#lengthNorm method a bit.

If default similarity and no index time boost is used, 
I assume that norm equals to  1.0 / Math.sqrt(numTerms).

The first option is to somehow obtain the pre-computed norm value and apply the reverse
operation to obtain numTerms:
numTerms = (1/norm)^2. This will be an approximation because norms are stored
in a byte.
How do I access that norm value for a given docid and a field?

The second option is to store numTerms as a separate field, like any other organic
field.
Do I need to calculate it myself? Or can I access the already-computed
numTerms value during indexing?

I think I will follow the second option.
Is there a pointer to an example demonstrating reading/writing a DocValues-based
field?

Thanks,
Ahmet


On Friday, February 6, 2015 11:08 AM, Michael McCandless 
luc...@mikemccandless.com wrote:
How will you know how large to allocate that array?  The within-doc
term freq can in general be arbitrarily large...

Lucene does not directly store the total number of terms in a
document, but it does store it approximately in the doc's norm value.
Maybe you can use that?  Alternatively, you can store this statistic
yourself, e.g as a doc value.

Mike McCandless

http://blog.mikemccandless.com



On Thu, Feb 5, 2015 at 7:24 PM, Ahmet Arslan iori...@yahoo.com.invalid wrote:
 Hello Lucene Users,

 I am traversing all documents that contains a given term with following code :

 Term term = new Term(field, word);
 Bits bits = MultiFields.getLiveDocs(reader);
 DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, bits, field, 
 term.bytes());

 while (docsEnum.nextDoc() != DocsEnum.NO_MORE_DOCS) {

 array[docsEnum.freq()]++;

 // how to retrieve term count for this document?
x(docsEnum.docID(), field);


 }

 How can I get field term count values for these documents using Lucene 4.10.3?

 Is above code OK for traversing posting list of term?

 Thanks,
 Ahmet

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: disabling all scoring?

2015-02-05 Thread Ahmet Arslan
Hi Rob,

Maybe you can wrap your query in a ConstantScoreQuery?
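Something along these lines (a minimal sketch):

Query wrapped = new ConstantScoreQuery(originalQuery); // every hit gets the same score, no per-doc scoring work
searcher.search(wrapped, myCollector);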

ahmet


On Thursday, February 5, 2015 9:17 AM, Rob Audenaerde 
rob.audenae...@gmail.com wrote:
Hi all,

I'm doing some analytics with a custom Collector on a fairly large number
of search results (roughly 100,000 - all the hits returned from a query). I need
to retrieve them by a query (so using search), but I don't need any scoring,
nor do I need to keep the documents in any order.

When profiling the application, I saw that for my tests, my entire search
takes about 2.4 seconds, and BulkScorer takes 0.4 seconds. So I figured
that without scoring, I would be able to chop off 0.4 seconds (+- 17% speed
increase). That seems reasonable.

What would be the best approach to disable all the 'search-goodies' and
just pass the results as fast as possible into my Collector?

Thanks for your insights.

-Rob

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



getting number of terms in a document/field

2015-02-05 Thread Ahmet Arslan
Hello Lucene Users,

I am traversing all documents that contain a given term with the following code:

Term term = new Term(field, word);
Bits bits = MultiFields.getLiveDocs(reader);
DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, bits, field, 
term.bytes());

while (docsEnum.nextDoc() != DocsEnum.NO_MORE_DOCS) {

array[docsEnum.freq()]++;

// how to retrieve term count for this document?
   x(docsEnum.docID(), field); 


}

How can I get field term count values for these documents using Lucene 4.10.3?

Is the above code OK for traversing the posting list of a term?

Thanks,
Ahmet

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Analyzer: Access to document?

2015-02-04 Thread Ahmet Arslan
Hi Ralf,

Does following code fragment work for you?

/**
* Modified from : 
http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/analysis/package-summary.html
*/
public List<String> getAnalyzedTokens(String text) throws IOException {

final List<String> list = new ArrayList<>();
try (TokenStream ts = analyzer().tokenStream(field, new StringReader(text))) {

final CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
ts.reset(); // Resets this stream to the beginning. (Required)
while (ts.incrementToken())
list.add(termAtt.toString());

ts.end();   // Perform end-of-stream operations, e.g. set the final offset.
}
return list;
}





On Wednesday, February 4, 2015 2:45 PM, Ralf Bierig ralf.bie...@gmail.com 
wrote:
Hi all,

an Analyzer has access to content on a per-field level by overwriting 
this method:

protected TokenStreamComponents createComponents(String fieldName, 
Reader reader);

Is it possible to get to the document? I want to collect the text 
content from the entire document within my analyzer to be processed by 
an external component.

Best,
Ralf

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: AW: LowercaseFilter, preserveOriginal?

2015-01-27 Thread Ahmet Arslan
Hi Clemens,

Please see : https://issues.apache.org/jira/browse/LUCENE-5620

Ahmet



On Tuesday, January 27, 2015 10:56 AM, Clemens Wyss DEV clemens...@mysign.ch 
wrote:
 I very much want preserveOriginal=true when applying the
ASCIIFoldingFilter for (German) suggestions
Must revise my statement, as I just noticed that the original token is just
appended to the stream/token, e.g.
chamaleon chamäeleon
And suggest returns the two, whereas I'd like to have the original only ...


-Ursprüngliche Nachricht-
Von: Clemens Wyss DEV [mailto:clemens...@mysign.ch] 
Gesendet: Dienstag, 27. Januar 2015 09:08
An: java-user@lucene.apache.org
Betreff: LowercaseFilter, preserveOriginal?

Why does the LowercaseFilter, as opposed to the ASCIIFoldingFilter, have no
preserveOriginal argument?

I very much want preserveOriginal=true when applying the ASCIIFoldingFilter for
(German) suggestions


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Looking for docs that have certain fields empty (an/or not set)

2015-01-07 Thread Ahmet Arslan
Hi Clemens,

Since you are a lucene user, you might be interested in Uwe's response on a 
similar topic : 
http://find.searchhub.org/document/abb73b45a48cb89e

Ahmet


On Wednesday, January 7, 2015 6:30 PM, Erick Erickson erickerick...@gmail.com 
wrote:
Should be, but it's a bit confusing because the query syntax is not
pure boolean,
so there's no set to take away the docs with entries in field 1, you need the
match all docs bit, i.e.
*:* -field1:[* TO *]

(That's asterisk:asterisk -field1:[* TO *] in case the silly list
interprets the asterisks
as markup)

There's some special magic in filter query processing to handle this case, but
not in the main query parser.
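
For plain Lucene code, a sketch of the programmatic equivalent (4.x API):

BooleanQuery q = new BooleanQuery();
q.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
q.add(new TermRangeQuery("field1", null, null, true, true), BooleanClause.Occur.MUST_NOT); // excludes docs with any term in field1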

Best,
Erick

On Wed, Jan 7, 2015 at 8:14 AM, Clemens Wyss DEV clemens...@mysign.ch wrote:
 Say I wanted to find documents which have no content in field1 (or 
 documents that have no field 'field1'), wouldn't that be the following query?
 -field1:[* TO *]

 Thanks for you help
 Clemens

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IndexSearcher.setSimilarity thread-safety

2015-01-05 Thread Ahmet Arslan
Hi Barry,

Thanks for chiming in. Then the javadocs need correction, right?

multiple threads can call any of its methods, concurrently

Ahmet


On Monday, January 5, 2015 3:28 PM, Barry Coughlan b.coughl...@gmail.com 
wrote:
Just had a glance at the IndexSearcher code.

Changing the similarity would not cause any failures. However the change
may not be immediately seen by all threads because the variable is
non-volatile (I'm open to correction on that...).

If you need multiple threads to have different Similarity implementations
then you will need separate IndexSearcher instances. You can use a single
IndexReader for the IndexSearchers

Barry


On Mon, Jan 5, 2015 at 1:10 PM, Ahmet Arslan iori...@yahoo.com.invalid
wrote:



 anyone?



 On Thursday, December 25, 2014 4:42 PM, Ahmet Arslan
 iori...@yahoo.com.INVALID wrote:
 Hi all,

 Javadocs says IndexSearcher instances are completely thread safe, meaning
 multiple threads can call any of its
 methods, concurrently

 Is this true for setSimilarity() method?

 What happens when every thread uses different similarity implementations?

 Thanks,
 Ahmet

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IndexSearcher.setSimilarity thread-safety

2015-01-05 Thread Ahmet Arslan


Thanks for the explanation, Uwe.


On Monday, January 5, 2015 7:30 PM, Uwe Schindler u...@thetaphi.de wrote:
Hi,

The documentation may be a bit incorrect, but in general it means: 
IndexSearcher is thread safe in regards to searching. Getters/Setters are 
generally not thread safe for most classes. The documentation is mainly to 
prevent people from synchronizing any external calls, because this would be a 
disaster to do!

About your problem: Please use a new IndexSearcher for each different 
similarity. IndexSearcher is a very cheap object (it is just a wrapper around 
the IndexReader), so it is only important to keep the IndexReader open. But for 
simplification, I would personally create a new IndexSearcher instance for 
every search request (...and I always do this).
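
In code, that pattern is simply (a sketch; BM25Similarity is just an example):

// shared, long-lived, expensive to open
IndexReader reader = DirectoryReader.open(directory);

// cheap, created per request (or per thread), each with its own Similarity
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(new BM25Similarity());
TopDocs hits = searcher.search(query, 10);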

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Barry Coughlan [mailto:b.coughl...@gmail.com]
 Sent: Monday, January 05, 2015 3:40 PM
 To: java-user@lucene.apache.org; Ahmet Arslan
 Subject: Re: IndexSearcher.setSimilarity thread-safety
 
 Hi Ahmet,
 
 The IndexSearcher is thread-safe, it's just that the similarity field is 
 shared
 between threads. I think that to most people it is implied that the 
 similarity is
 not thread-local, as this would be surprising behavior.
 
 Ideally the similarity field would not be mutable to indicate this, but I
 suppose this would make the constructors very awkward.
 
 Barry
 
 On Mon, Jan 5, 2015 at 2:02 PM, Ahmet Arslan iori...@yahoo.com.invalid
 wrote:
 
  Hi Barry,
 
  Thanks for chiming in. Then javadocs needs correction, right?
 
  multiple threads can call any of its methods, concurrently
 
  Ahmet
 
 
  On Monday, January 5, 2015 3:28 PM, Barry Coughlan
  b.coughl...@gmail.com
  wrote:
  Just had a glance at the IndexSearcher code.
 
  Changing the similarity would not cause any failures. However the
  change may not be immediately seen by all threads because the variable
  is non-volatile (I'm open to correction on that...).
 
  If you need multiple threads to have different Similarity
  implementations then you will need separate IndexSearcher instances.
  You can use a single IndexReader for the IndexSearchers
 
  Barry
 
 
  On Mon, Jan 5, 2015 at 1:10 PM, Ahmet Arslan
  iori...@yahoo.com.invalid
  wrote:
 
  
  
   anyone?
  
  
  
   On Thursday, December 25, 2014 4:42 PM, Ahmet Arslan
   iori...@yahoo.com.INVALID wrote:
   Hi all,
  
   Javadocs says IndexSearcher instances are completely thread safe,
  meaning
   multiple threads can call any of its methods, concurrently
  
   Is this true for setSimilarity() method?
  
   What happens when every thread uses different similarity
 implementations?
  
   Thanks,
   Ahmet
  
   
   - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-user-h...@lucene.apache.org

  
   
   - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-user-h...@lucene.apache.org
  
  
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene query with additional clause field not null

2014-12-01 Thread Ahmet Arslan
Hi Sascha,

Generally RangeQuery is used for that, e.g. fieldName:[* TO *]
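
Programmatically, the same thing is (a sketch):

// matches every document that has at least one term in fieldName
Query notNull = new TermRangeQuery("fieldName", null, null, true, true);
// typically added as a MUST clause next to the rest of the query in a BooleanQuery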

Ahmet


On Monday, December 1, 2014 9:44 PM, Sascha Janz sascha.j...@gmx.net wrote:
Hi,



is there a chance to add an additional clause to a query for a field that
should not be null ? 



greetings 

sascha

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to improve the performance in Lucene when query is long?

2014-11-11 Thread Ahmet Arslan
Hi Harry,

Maybe you can use the BooleanQuery#setMinimumNumberShouldMatch method. What
happens when you set it to half of the number of terms?
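
Something like this (a sketch; the field name is illustrative):

BooleanQuery bq = new BooleanQuery();
for (String t : queryTerms) {
    bq.add(new TermQuery(new Term("poi", t)), BooleanClause.Occur.SHOULD);
}
bq.setMinimumNumberShouldMatch(queryTerms.size() / 2); // e.g. at least half of the terms must match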

ahmet


On Tuesday, November 11, 2014 8:35 AM, Harry Yu 502437...@qq.com wrote:
Hi everyone,



I have been using Lucene to build a POI searching & geocoding system. After
testing, I found that when the query is long (above 10 terms), searching becomes
too slow, close to 1s. I think the bottleneck is that I used OR to generate my
BooleanQuery: it matches plenty of candidate documents, and it also takes too much
time to score and rank them.

I changed to AND to generate my BooleanQuery, but that decreases the accuracy of
the hits. So I want to find a solution that reduces the candidate documents
without decreasing accuracy in this situation.

Thanks for your help!



--
Harry Yu
Institute of Remote Sensing and Geographic Information System.
School of Earth and Space Sciences, Peking University;
Beijing, China, 100871;
Email: 502437...@qq.com OR harryyu1...@163.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-11 Thread Ahmet Arslan
Hi,

With that analyser, your searches (for the same word with different capitalisation)
could return different results.

Ahmet


On Tuesday, November 11, 2014 6:57 PM, Martin O'Shea app...@dsl.pipex.com 
wrote:
In the end I edited the code of the StandardAnalyzer and the
SnowballAnalyzer to disable the calls to the LowerCaseFilter. This seems to
work.

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: 10 Nov 2014 15 19
To: java-user@lucene.apache.org
Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in
Lucene 3.0.2

Hi,

Regarding Uwe's warning, 

NOTE: SnowballFilter expects lowercased text. [1]

[1]
https://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/anal
ysis/snowball/SnowballFilter.html



On Monday, November 10, 2014 4:43 PM, Uwe Schindler u...@thetaphi.de wrote:
Hi,

 Uwe
 
 Thanks for the reply. Given that SnowBallAnalyzer is made up of a 
 series of filters, I was thinking about something like this where I 
 'pipe' output from one filter to the next:
 
 standardTokenizer =new StandardTokenizer (...); standardFilter = new 
 StandardFilter(standardTokenizer,...);
 stopFilter = new StopFilter(standardFilter,...); snowballFilter = new 
 SnowballFilter(stopFilter,...);
 
 But ignore LowerCaseFilter. Does this make sense?

Exactly. Create a clone of SnowballAnalyzer (from Lucene source package) in
your own package and remove LowercaseFilter. But be aware, it could be that
snowball needs lowercased terms to correctly do stemming!!! I don't know
about this filter, I just want to make you aware.

The same applies to stop filter, but this one allows to handle that: You
should make stop-filter case insensitive (there is a boolean to do this):
StopFilter(boolean enablePositionIncrements, TokenStream input, Set<?>
stopWords, boolean ignoreCase)

Uwe

 Martin O'Shea.
 -Original Message-
 From: Uwe Schindler [mailto:u...@thetaphi.de]
 Sent: 10 Nov 2014 14 06
 To: java-user@lucene.apache.org
 Subject: RE: How to disable LowerCaseFilter when using 
 SnowballAnalyzer in Lucene 3.0.2
 
 Hi,
 
 In general, you cannot change Analyzers, they are examples and can 
 be seen as best practise. If you want to modify them, write your own 
 Analyzer subclass which uses the wanted Tokenizers and TokenFilters as 
 you like. You can for example clone the source code of the original 
 and remove LowercaseFilter. Analyzers are very simple, there is no 
 logic in them, it's just some configuration (which Tokenizer and 
 which TokenFilters). In later Lucene 3 and Lucene 4, this is very 
 simple: You just need to override createComponents in Analyzer class and
add your configuration there.
 
 If you use Apache Solr or Elasticsearch you can create your analyzers 
 by XML or JSON configuration.
 
 Uwe
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 
  -Original Message-
  From: Martin O'Shea [mailto:m.os...@dsl.pipex.com]
  Sent: Monday, November 10, 2014 2:54 PM
  To: java-user@lucene.apache.org
  Subject: How to disable LowerCaseFilter when using SnowballAnalyzer 
  in Lucene 3.0.2
 
  I realise that 3.0.2 is an old version of Lucene but if I have Java 
  code as
  follows:
 
 
 
  int nGramLength = 3;
 
  Set<String> stopWords = new HashSet<String>();

  stopWords.add("the");

  stopWords.add("and");

  ...

  SnowballAnalyzer snowballAnalyzer = new
  SnowballAnalyzer(Version.LUCENE_30,
  "English", stopWords);
 
  ShingleAnalyzerWrapper shingleAnalyzer = new 
  ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
 
 
 
  Which will generate the frequency of ngrams from a particular a 
  string of text without stop words, how can I disable the 
  LowerCaseFilter which forms part of the SnowBallAnalyzer? I want to 
  preserve the case of the ngrams generated so that I can perform 
  various counts according to the presence / absence of upper case
characters in the ngrams.
 
 
 
  I am something of a Lucene newbie. And I should add that upgrading 
  the version of Lucene is not an option here.
 
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org






 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Document Term matrix

2014-11-11 Thread Ahmet Arslan
Hi,

Mahout and Carrot2 can cluster the documents from a Lucene index.
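For the document-term matrix itself, one starting point is the term vector API (a sketch; it assumes the field was indexed with term vectors enabled, and the field name "content" is only an example):

Terms vector = reader.getTermVector(docId, "content"); // null if no term vector was stored
if (vector != null) {
    TermsEnum te = vector.iterator(null);
    BytesRef term;
    while ((term = te.next()) != null) {
        long tfInDoc = te.totalTermFreq(); // frequency of this term in this document
        System.out.println(term.utf8ToString() + " " + tfInDoc);
    }
}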

ahmet



On Tuesday, November 11, 2014 10:37 PM, Elshaimaa Ali 
elshaimaa@hotmail.com wrote:
Hi All,
I have a Lucene index built with Lucene 4.9 for 584 text documents, I need to 
extract a Document-term matrix, and Document Document similarity matrix 
in-order to use it to cluster the documents. My questions:1- How can I extract 
the matrix and compute the similarity between documents in Lucene.2- Is there 
any java based code that can cluster the documents from Lucene index.
RegardsShaimaa 

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-10 Thread Ahmet Arslan
Hi,

Regarding Uwe's warning, 

NOTE: SnowballFilter expects lowercased text. [1]

[1] 
https://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/snowball/SnowballFilter.html



On Monday, November 10, 2014 4:43 PM, Uwe Schindler u...@thetaphi.de wrote:
Hi,

 Uwe
 
 Thanks for the reply. Given that SnowBallAnalyzer is made up of a series of
 filters, I was thinking about something like this where I 'pipe' output from
 one filter to the next:
 
 standardTokenizer =new StandardTokenizer (...); standardFilter = new
 StandardFilter(standardTokenizer,...);
 stopFilter = new StopFilter(standardFilter,...); snowballFilter = new
 SnowballFilter(stopFilter,...);
 
 But ignore LowerCaseFilter. Does this make sense?

Exactly. Create a clone of SnowballAnalyzer (from Lucene source package) in 
your own package and remove LowercaseFilter. But be aware, it could be that 
snowball needs lowercased terms to correctly do stemming!!! I don't know about 
this filter, I just want to make you aware.

The same applies to stop filter, but this one allows to handle that: You should 
make stop-filter case insensitive (there is a boolean to do this):
StopFilter(boolean enablePositionIncrements, TokenStream input, Set<?>
stopWords, boolean ignoreCase)
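
For 3.0.x that could look roughly like this (an untested sketch; note the warning above that SnowballFilter expects lowercased input):

Analyzer caseSensitiveSnowball = new Analyzer() {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(Version.LUCENE_30, reader);
        result = new StandardFilter(result);
        // no LowerCaseFilter here
        result = new StopFilter(true, result, stopWords, true); // ignoreCase = true
        return new SnowballFilter(result, "English");
    }
};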

Uwe

 Martin O'Shea.
 -Original Message-
 From: Uwe Schindler [mailto:u...@thetaphi.de]
 Sent: 10 Nov 2014 14 06
 To: java-user@lucene.apache.org
 Subject: RE: How to disable LowerCaseFilter when using SnowballAnalyzer in
 Lucene 3.0.2
 
 Hi,
 
 In general, you cannot change Analyzers, they are examples and can be
 seen as best practise. If you want to modify them, write your own Analyzer
 subclass which uses the wanted Tokenizers and TokenFilters as you like. You
 can for example clone the source code of the original and remove
 LowercaseFilter. Analyzers are very simple, there is no logic in them, it's 
 just
 some configuration (which Tokenizer and which TokenFilters). In later
 Lucene 3 and Lucene 4, this is very simple: You just need to override
 createComponents in Analyzer class and add your configuration there.
 
 If you use Apache Solr or Elasticsearch you can create your analyzers by XML
 or JSON configuration.
 
 Uwe
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 
  -Original Message-
  From: Martin O'Shea [mailto:m.os...@dsl.pipex.com]
  Sent: Monday, November 10, 2014 2:54 PM
  To: java-user@lucene.apache.org
  Subject: How to disable LowerCaseFilter when using SnowballAnalyzer in
  Lucene 3.0.2
 
  I realise that 3.0.2 is an old version of Lucene but if I have Java
  code as
  follows:
 
 
 
  int nGramLength = 3;
 
   Set<String> stopWords = new HashSet<String>();

   stopWords.add("the");

   stopWords.add("and");

   ...

   SnowballAnalyzer snowballAnalyzer = new
   SnowballAnalyzer(Version.LUCENE_30,
   "English", stopWords);
 
  ShingleAnalyzerWrapper shingleAnalyzer = new
  ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
 
 
 
  Which will generate the frequency of ngrams from a particular a string
  of text without stop words, how can I disable the LowerCaseFilter
  which forms part of the SnowBallAnalyzer? I want to preserve the case
  of the ngrams generated so that I can perform various counts according
  to the presence / absence of upper case characters in the ngrams.
 
 
 
  I am something of a Lucene newbie. And I should add that upgrading the
  version of Lucene is not an option here.
 
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: analyzers for Thai, Telugu, Vietnamese, Korean, Urdu,...

2014-11-09 Thread Ahmet Arslan
Hi,

Thai has this for example : 
org.apache.lucene.analysis.th.ThaiAnalyzer
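
For example (a one-line sketch for 4.7.x):

Analyzer thai = new ThaiAnalyzer(Version.LUCENE_47);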

Ahmet


On Saturday, November 8, 2014 12:48 PM, Olivier Binda 
olivier.bi...@wanadoo.fr wrote:
Hello

What should I use for analysing languages like Thai, Telugu, Vietnamese, 
Korean, Urdu ?
The StandardAnalyzer ? The ICUAnalyzer ?

It doesn't look like they have dedicated analyzers (I'm using Lucene 
4.7.2 on Android)

Best regards,
Olivier





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: custom token filter generates empty tokens

2014-10-09 Thread Ahmet Arslan
Hi G.Long,

You can use TrimFilter+LengthFilter to remove empty/whitespace tokens.
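
If chaining extra filters is inconvenient, an alternative (just a sketch of that variant, not the TrimFilter/LengthFilter route) is to skip empty tokens inside your own filter:

@Override
public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
        // ... strip the unwanted characters into termAtt as before ...
        if (termAtt.length() > 0) {
            return true;  // emit only non-empty tokens
        }
        // token became empty: skip it and ask upstream for the next one
    }
    return false;
}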


Ahmet

On Thursday, October 9, 2014 5:54 PM, G.Long jde...@gmail.com wrote:
Hi :)

I wrote a custom token filter which removes special characters. 
Sometimes, all characters of the token are removed so the filter 
produces an empty token. I would like to remove this token from the 
token stream, but I'm not sure how to do that.

Is there something missing in my custom token filter or do I need to 
chain another custom token filter to remove empty tokens?

Regards :)

ps:

this is the code of my custom filter :

public class SpecialCharFilter extends TokenFilter {

 private final CharTermAttribute termAtt = 
addAttribute(CharTermAttribute.class);

 protected SpecialCharFilter(TokenStream input) {
 super(input);
 }

 @Override
 public boolean incrementToken() throws IOException {

 if (!input.incrementToken()) {
 return false;
 }

 final char[] buffer = termAtt.buffer();
 final int length = termAtt.length();
 final char[] newBuffer = new char[length];

 int newIndex = 0;
 for (int i = 0; i < length; i++) {
 if (!isFilteredChar(buffer[i])) {
 newBuffer[newIndex] = buffer[i];
 newIndex++;
 }
 }

 String term = new String(newBuffer);
 term = term.trim();
 char[] characters = term.toCharArray();
 termAtt.setEmpty();
 termAtt.copyBuffer(characters, 0, characters.length);

 return true;
 }
}

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Two-pass TokenFilter

2014-08-24 Thread Ahmet Arslan
Hi,

Can you elaborate more on what you mean by "I need to know all tokens in
advance"?

Ahmet


On Wednesday, August 20, 2014 6:48 PM, Christian Beil 
christian.a.b...@gmail.com wrote:
Hey guys,

I need a TokenFilter that filters some tokens like the FilteringTokenFilter.
The problem is, in order to do the filtering I need to know all tokens in
advance.

I thought I'll adapt the CachingTokenFilter in order to collect all tokens
in the first pass.
In the second pass it can use this information to filter the tokens.

Or is there a better solution to do this?

Thanks,
Christian


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Relevancy tests

2014-06-12 Thread Ahmet Arslan
Hi,

Relevance judgments are labor-intensive and expensive. Some Information 
Retrieval forums (TREC, CLEF, etc.) provide these golden sets, but they are not 
public.

http://rosenfeldmedia.com/books/search-analytics/ talks about how to create a 
golden set for your top n queries.


Also, there are some works describing how to tune the parameters of a search system 
using click-through data.



On Thursday, June 12, 2014 8:47 PM, Ivan Brusic i...@brusic.com wrote:
Perhaps more of an NLP question, but are there any tests regarding
relevance for Lucene? Given an example corpus of documents, what are the
golden sets for specific queries? The Wikipedia dump is used as a
benchmarking tool for both indexing and querying in Lucene, but there are
no metrics in terms of precision.

The Open Relevance project was closed yesterday (
http://lucene.apache.org/openrelevance/), which is what prompted me to ask
this question. Was the sub-project closed because others have found
alternate solutions?

Relevancy is of course extremely context-dependent and subjective, but my
hope is that there is an example catalog somewhere with defined golden sets.

Cheers,

Ivan


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


