[ANNOUNCE] Apache Lucene 8.3.1 released

2019-12-03 Thread Ishan Chattopadhyaya
## 3 December 2019, Apache Lucene™ 8.3.1 available

The Lucene PMC is pleased to announce the release of Apache Lucene 8.3.1.

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for
nearly any application that requires full-text search, especially
cross-platform.

This release contains numerous bug fixes, optimizations, and
improvements, some of which are highlighted below. The release is
available for immediate download at:

  

### Lucene 8.3.1 Release Highlights:

  * Bugfix: MultiTermIntervalsSource.visit() was not calling back to its visitor

Please read CHANGES.txt for a full list of changes:

  

Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases. It is possible that the mirror you are using may not have
replicated the release yet. If that is the case, please try another mirror.
This also applies to Maven access.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Need suggestions on implementing a custom query (offload R-tree filter to fully in-memory) on Lucene-8.3

2019-12-03 Thread 小鱼儿
Background: I need to implement document indexing and search for POIs
(points of interest) in an LBS (location-based services) scenario. A POI
has a name, an address, and a location (LatLonPoint), and I want to combine
a text query with a geospatial 2D range filter.

The problem is this: when I first built a native in-memory index, which
uses a simple BitSet as the doc ID set type and the STRtree class from the
well-known JTS library, I got 20 ms / 1000 qps with 18,000 POIs on my
laptop (Windows 7 x64, using the mmap codec). But when I use Lucene 8.3 to
implement the same functionality (with LatLonPoint.newDistanceQuery, which
seems to use the default BKD tree index), I only get 150 ms / 130 qps,
which is a severe degradation.

So my idea is: can I write a custom filter query that builds a fully
in-memory R-tree index to speed up the spatial 2D range filter? I would
need access to Lucene's internal DocIdSet class so that I can do a fast
merge with no scoring. I hope this will improve query performance.

Any suggestions?
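
For readers following the thread, the "fast merge with no scoring" described
above can be sketched in plain Java as a bitwise AND of two doc ID sets. This
is only an illustration of the poster's in-memory baseline, not Lucene code:
the class name DocIdSetMerge and the method intersect are hypothetical, and the
two input sets are assumed to come from a text index and an R-tree lookup
respectively.

```java
import java.util.BitSet;

/** Hypothetical helper: merges text matches with spatial-filter matches, no scoring. */
final class DocIdSetMerge {

    /** Returns a new BitSet holding the doc IDs present in both inputs; inputs are not modified. */
    static BitSet intersect(BitSet textMatches, BitSet spatialMatches) {
        BitSet result = (BitSet) textMatches.clone();
        result.and(spatialMatches); // keep only doc IDs that pass both the text query and the geo filter
        return result;
    }

    public static void main(String[] args) {
        BitSet text = new BitSet();
        text.set(1); text.set(4); text.set(7);   // docs matching the text query
        BitSet geo = new BitSet();
        geo.set(4); geo.set(5); geo.set(7);      // docs inside the spatial range
        System.out.println(intersect(text, geo)); // prints {4, 7}
    }
}
```

The intersection touches one machine word per 64 doc IDs, which is why this
kind of merge is cheap when both sides are already materialized as bit sets.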


Re: Multi-IDF for a single term possible?

2019-12-03 Thread Ameer Albahem
IDF is a simple measure to calculate. So, if building a separate index for
each user is not an ideal solution, I suggest calculating these statistics
up front: maintain them for each user, then use them at query time.

At search time, you use these stats in your ranking. One possible way is to
write a Similarity wrapper that reads the needed information from a hash
map.
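
A minimal sketch of the bookkeeping side of this idea, in plain Java with no
Lucene dependency. The class PerUserIdfTable and its methods are hypothetical
names introduced here for illustration; wiring the values into a Lucene
Similarity is not shown. The formula used is the smoothed form
log((docCount + 1) / (docFreq + 1)) + 1, which to my knowledge is the one
Lucene's ClassicSimilarity uses, but treat that as an assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-user statistics table kept in hash maps. */
final class PerUserIdfTable {
    // Number of documents indexed for each user.
    private final Map<String, Long> docCountByUser = new HashMap<>();
    // For each user: term -> number of that user's documents containing the term.
    private final Map<String, Map<String, Long>> docFreqByUser = new HashMap<>();

    /** Records one document for the user; uniqueTerms holds the document's distinct terms. */
    void addDocument(String user, Set<String> uniqueTerms) {
        docCountByUser.merge(user, 1L, Long::sum);
        Map<String, Long> docFreq = docFreqByUser.computeIfAbsent(user, u -> new HashMap<>());
        for (String term : uniqueTerms) {
            docFreq.merge(term, 1L, Long::sum);
        }
    }

    /** Smoothed IDF computed only from the given user's documents. */
    double idf(String user, String term) {
        long docCount = docCountByUser.getOrDefault(user, 0L);
        Map<String, Long> docFreqs =
                docFreqByUser.getOrDefault(user, Collections.<String, Long>emptyMap());
        long docFreq = docFreqs.getOrDefault(term, 0L);
        return Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0;
    }
}
```

With such a table, the same term can receive a different IDF for each user,
depending only on that user's documents. The table has to be updated on every
add and delete to stay consistent with the index.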

Regards
Ameer





Re: Multi-IDF for a single term possible?

2019-12-03 Thread Ravikumar Govindarajan
>
> it is enough to give each its own field.
>

I kind of over-simplified the problem at hand. Apologies.

DOC_TYPE is just one aspect of the problem. The other is that this is
actually a shared index with multiple users (100-3000 users per index).
There are many hundreds of such shared indexes in our cluster.

Search happens per-user & it doesn't make sense to have a single IDF. We
are ideally looking at some lucene extensions/tricks to store & retrieve
IDF in  pairs.

> Is there any reason why you are not storing each DOC_TYPE in its own index?


There are some common fields across all DOC_TYPES (e.g. content/attachment
et al.), and to provide unified search for a user, we colocate them in a
single index.

--
Ravi



Re: Multi-IDF for a single term possible?

2019-12-03 Thread Diego Ceccarelli (BLOOMBERG/ LONDON)
Hi Ravi,
Can you give more details on how you store an entity in Lucene? What is a
doc type?
What fields do you have?

Cheers





Re: Multi-IDF for a single term possible?

2019-12-03 Thread Robert Muir
it is enough to give each its own field.

On Tue, Dec 3, 2019 at 7:57 AM Adrien Grand  wrote:

> Is there any reason why you are not storing each DOC_TYPE in its own index?


Re: Multi-IDF for a single term possible?

2019-12-03 Thread Adrien Grand
Is there any reason why you are not storing each DOC_TYPE in its own index?



-- 
Adrien




Multi-IDF for a single term possible?

2019-12-03 Thread Ravikumar Govindarajan
Hello,

We are using TF-IDF for scoring (Yet to migrate to BM25). Different
entities (DOC_TYPES) are crunched & stored together in a single index.

When it comes to IDF, I find that there is a single value computed across
documents & stored as part of TermStats, whereas our documents are not
homogeneous. So, a single IDF value doesn't work for us

We would like to compute IDF for each  pair, store it &
later use the paired-IDF values during query time. Is something like this
possible via Codecs or other mechanisms?

Any help is much appreciated

--
Ravi