Re: ANN search current state

2020-07-17 Thread Tommaso Teofili
Would it make sense to create a separate Lucene module for ANN search?
We could then experiment with the different approaches and compare them
across the same benchmarks.

On Thu, 16 Jul 2020 at 23:14, Ali Akhtar  wrote:

> I’m a bit of a layman in this area, but if we are talking about formats for
> vectors, I vote for the one used by FastAI word vectors. It’s pretty easy
> to work with.
>
> If we are talking about the same / similar things; if not, just ignore me
> 
>
> On Thu, 16 Jul 2020 at 7:06 PM, Michael Sokolov 
> wrote:
>
> > We have some prototype implementations in the issues you found.  If
> > you want to try out the approaches in those issues, you could build
> > Lucene from source and patch it, but there is no release containing
> > KNN/vector support. We're still working to establish consensus on what
> > the best way forward is. I think the most fruitful thing we can do at
> > the moment is establish a format for storing and accessing vectors
> > that will support different approaches since there is such a rich
> > variety of algorithms and approaches in this area. The last issue you
> > pointed to is focused on the format.
> >
> > On Wed, Jul 15, 2020 at 11:20 AM Alex K  wrote:
> > >
> > > Hi Mikhail,
> > >
> > > I'm not sure about the state of ANN in lucene proper. Very interested
> to
> > > see the response from others.
> > > I've been doing some work on ANN for an Elasticsearch plugin:
> > > http://elastiknn.klibisz.com/
> > > I think it's possible to extract my custom queries and modeling code so
> > > that it's elasticsearch-agnostic and can be used directly in Lucene
> apps.
> > > However I'm much more familiar with Elasticsearch's APIs and
> > usage/testing
> > > patterns than I am with raw Lucene, so I'd likely need to get some help
> > > from the Lucene community.
> > > Please LMK if that sounds interesting to anyone.
> > >
> > > - Alex
> > >
> > >
> > >
> > > On Wed, Jul 15, 2020 at 11:11 AM Mikhail 
> > wrote:
> > >
> > > >
> > > > Hi,
> > > >
> > > >I want to incorporate semantic search in my project, which
> > uses
> > > > Lucene. I want to use sentence embeddings and ANN (approximate
> nearest
> > > > neighbor) search. I found the related Lucene issues:
> > > > https://issues.apache.org/jira/browse/LUCENE-9004 ,
> > > > https://issues.apache.org/jira/browse/LUCENE-9136 ,
> > > > https://issues.apache.org/jira/browse/LUCENE-9322 . I see that there
> > > is some related work and there are related PRs. What is the current state of
> > this
> > > > functionality?
> > > >
> > > > --
> > > > Thanks,
> > > > Mikhail
> > > >
> > > >
> >
> >
> >
>


Re: Optimizing a boolean query for 100s of term clauses

2020-06-25 Thread Tommaso Teofili
Hi Alex,

I had worked on a similar problem directly on Lucene (within the Anserini
toolkit) using LSH fingerprints of tokenized feature vector values.
You can find the code at [1] and some information on the Anserini documentation
page [2] and in a short preprint [3].
As a side note, my current thinking is that it would be very cool if we
could leverage Lucene's N-dimensional point support by properly reducing the
dimensionality of the original vectors; however, that is hard to do without
losing important information.

My 2 cents,
Tommaso

[1] :
https://github.com/castorini/anserini/tree/master/src/main/java/io/anserini/ann
[2] :
https://github.com/castorini/anserini/blob/master/docs/approximate-nearestneighbor.md
[3] : https://arxiv.org/abs/1910.10208
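
For readers who want to see the shape of the approach described above, here is a
minimal, self-contained sketch (not the Anserini code; class and field names are
invented) of the general "hashes as terms" idea. It uses sign-random-projection
LSH: each band of random hyperplanes yields one term, the terms are indexed in a
keyword field, and candidate neighbours are retrieved with a disjunctive
BooleanQuery, to be re-ranked exactly afterwards.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class LshTermsSketch {

      private final float[][][] planes; // [band][bitsPerBand][dims] random hyperplanes

      public LshTermsSketch(int bands, int bitsPerBand, int dims, long seed) {
        Random rnd = new Random(seed);
        planes = new float[bands][bitsPerBand][dims];
        for (float[][] band : planes)
          for (float[] plane : band)
            for (int i = 0; i < dims; i++) plane[i] = (float) rnd.nextGaussian();
      }

      /** One term per band: the band id plus the sign bits of the projections. */
      public List<String> hash(float[] v) {
        List<String> terms = new ArrayList<>();
        for (int b = 0; b < planes.length; b++) {
          StringBuilder sb = new StringBuilder().append(b).append('_');
          for (float[] plane : planes[b]) {
            float dot = 0;
            for (int i = 0; i < v.length; i++) dot += plane[i] * v[i];
            sb.append(dot >= 0 ? '1' : '0');
          }
          terms.add(sb.toString());
        }
        return terms;
      }

      public void index(IndexWriter writer, float[] vector) throws Exception {
        Document doc = new Document();
        for (String t : hash(vector)) {
          doc.add(new StringField("lsh", t, Field.Store.NO)); // one term per hash
        }
        writer.addDocument(doc);
      }

      public TopDocs candidates(IndexSearcher searcher, float[] query, int k) throws Exception {
        // documents sharing more hash terms with the query score higher;
        // with many hundreds of clauses, mind Lucene's maximum clause count
        BooleanQuery.Builder b = new BooleanQuery.Builder();
        for (String t : hash(query)) {
          b.add(new TermQuery(new Term("lsh", t)), BooleanClause.Occur.SHOULD);
        }
        return searcher.search(b.build(), k); // candidates; exact re-ranking would follow
      }
    }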





On Wed, 24 Jun 2020 at 19:47, Alex K  wrote:

> Hi Toke. Indeed a nice coincidence. It's an interesting and fun problem
> space!
>
> My implementation isn't specific to any particular dataset or access
> pattern (i.e. infinite vs. subset).
> So far the plugin supports exact L1, L2, Jaccard, Hamming, and Angular
> similarities with LSH variants for all but L1.
> My exact implementation is generally faster than the approximate LSH
> implementation, hence the thread.
> You make a good point that this is valuable by itself if you're able to
> filter down to a small subset of docs.
> I put a lot of work into optimizing the vector serialization speed and the
> exact query execution.
> I imagine with my current implementation there is some breaking point where
> LSH becomes faster than exact, but so far I've tested with ~1.2M
> ~300-dimensional vectors and exact is still faster, especially when
> parallelized across many shards.
> So speeding up LSH is the current engineering challenge.
>
> Are you using Elasticsearch or Lucene directly?
> If you're using ES and have the time, I'd love some feedback on my plugin.
> It sounds like you want to compute hamming similarity on your bitmaps?
> If so that's currently supported.
> There's an example here:
> http://demo.elastiknn.klibisz.com/dataset/mnist-hamming?queryId=64121
>
> Also I've compiled a small literature review on some related research here:
>
> https://docs.google.com/document/d/14Z7ZKk9dq29bGeDDmBH6Bsy92h7NvlHoiGhbKTB0YJs/edit
> *Fast and Exact NNS in Hamming Space on Full-Text Search Engines* describes
> some clever tricks to speed up Hamming similarity.
> *Large Scale Image Retrieval with Elasticsearch* describes the idea of
> using the largest absolute magnitude values instead of the full vector.
> Perhaps you've already read them but I figured I'd share.
>
> Cheers
> - AK
>
>
>
> On Wed, Jun 24, 2020 at 8:44 AM Toke Eskildsen  wrote:
>
> > On Tue, 2020-06-23 at 09:50 -0400, Alex K wrote:
> > > I'm working on an Elasticsearch plugin (using Lucene internally) that
> > > allows users to index numerical vectors and run exact and approximate
> > > k-nearest-neighbors similarity queries.
> >
> > Quite a coincidence. I'm looking into the same thing :-)
> >
> > >   1. When indexing a vector, apply a hash function to it, producing
> > > a set of discrete hashes. Usually there are anywhere from 100 to 1000
> > > hashes.
> >
> > Is it important to have "infinite" scaling with inverted index or is it
> > acceptable to have a (fast) sequential scan through all documents? If
> > the use case is to combine the nearest neighbour search with other
> > filters, so that the effective search-space is relatively small, you
> > could go directly to computing the Euclidian distance (or whatever you
> > use to calculate the exact similarity score).
> >
> > >   4. As the BooleanQuery produces results, maintain a fixed-size
> > > heap of its scores. For any score exceeding the min in the heap, load
> > > its vector from the binary doc values, compute the exact similarity,
> > > and update the heap.
> >
> > I did something quite similar for a non-Lucene based proof of concept,
> > except that I delayed the exact similarity calculation and over-
> > collected on the heap.
> >
> > Fleshing that out: Instead of producing similarity hashes, I extracted
> > the top-X strongest signals (entries in the vector) and stored them as
> > indexes from the raw vector, so the top-3 signals from [10, 3, 6, 12,
> > 1, 20] are [0, 3, 5]. The query was similar to your "match as many as
> > possible", just with indexes instead of hashes.
> >
> > >- org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of
> > > runtime)
> >
> > This sounds strange. How large is your queue? Object-based priority
> > queues tend to become slow when they get large (100K+ values).
> >
> > > Maybe I could optimize this by implementing a custom query or scorer?
> >
> > My plan for a better implementation is to use an autoencoder to produce
> > a condensed representation of the raw vector for a document. In order
> > to do so, a network must be trained on (ideally) the full corpus, so it
> > will require a bootstrap process and will probably work poorly if
> > incoming 
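
As an aside (not code from the thread): a tiny plain-Java sketch of the "top-X
strongest signals" extraction described above, where
topSignals(new float[]{10, 3, 6, 12, 1, 20}, 3) returns [0, 3, 5].

    import java.util.Arrays;
    import java.util.Comparator;

    public final class TopSignals {

      /** Returns the indexes of the x largest entries of the vector, in ascending index order. */
      public static int[] topSignals(float[] vector, int x) {
        Integer[] idx = new Integer[vector.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // order candidate positions so that the strongest entries come first
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> vector[i]).reversed());
        int[] top = new int[Math.min(x, idx.length)];
        for (int i = 0; i < top.length; i++) top[i] = idx[i];
        Arrays.sort(top); // report positions in ascending order, as in the example above
        return top;
      }
    }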

Re: [VOTE] Lucene logo contest

2020-06-17 Thread Tommaso Teofili
PMC vote: option C (current)

On Wed, 17 Jun 2020 at 07:58, Ignacio Vera Sequeiros
 wrote:

> PMC vote:  option A
>
> On Wed, Jun 17, 2020 at 7:36 AM Jeroen Lauwers 
> wrote:
>
> > A. Definitely.
> >
> > Sent from my phone
> >
> > > On 17 Jun 2020 at 03:46, Jason Gerlowski 
> > wrote the following:
> > >
> > > Option "A"
> > >
> > >> On Tue, Jun 16, 2020 at 8:37 PM Man with No Name
> > >>  wrote:
> > >>
> > >> A, clean and modern.
> > >>
> > >>> On Mon, Jun 15, 2020 at 6:08 PM Ryan Ernst  wrote:
> > >>>
> > >>> Dear Lucene and Solr developers!
> > >>>
> > >>> In February a contest was started to design a new logo for Lucene
> [1].
> > That contest concluded, and I am now (admittedly a little late!) calling
> a
> > vote.
> > >>>
> > >>> The entries are labeled as follows:
> > >>>
> > >>> A. Submitted by Dustin Haver [2]
> > >>>
> > >>> B. Submitted by Stamatis Zampetakis [3] Note that this has several
> > variants. Within the linked entry there are 7 patterns and 7 color
> > palettes. Any vote for B should contain the pattern number, like B1 or
> B3.
> > If a B variant wins, we will have a followup vote on the color palette.
> > >>>
> > >>> C. The current Lucene logo [4]
> > >>>
> > >>> Please vote for one of the three (or nine depending on your
> > perspective!) above choices. Note that anyone in the Lucene+Solr
> community
> > is invited to express their opinion, though only Lucene+Solr PMC members cast
> > binding votes (indicate non-binding votes in your reply, please). This
> vote
> > will close one week from today, Mon, June 22, 2020.
> > >>>
> > >>> Thanks!
> > >>>
> > >>> [1] https://issues.apache.org/jira/browse/LUCENE-9221
> > >>> [2]
> >
> https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
> > >>> [3]
> >
> https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf
> > >>> [4]
> > https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png
> > >>
> > >> --
> > >> Sent from Gmail for IPhone
> > >
> > >
> >
> >
> >
>


Re: German decompounding/tokenization with Lucene?

2017-09-16 Thread Tommaso Teofili
+1, some time ago I also used the decompounder mentioned by Dawid and was
satisfied back then.

Regards,
Tommaso
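
For readers landing on this thread: a minimal sketch of wiring Lucene's built-in
DictionaryCompoundWordTokenFilter, which Mike asks about in the quoted message
below. It is not Dawid's decompounder; the tiny inline dictionary is only for
illustration (a real one needs many thousands of entries), and package locations
vary slightly across Lucene versions (shown here as of 7.x).

    import java.util.Arrays;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;

    public class GermanDecompoundAnalyzer extends Analyzer {

      // toy dictionary of known word parts, matched case-insensitively
      private static final CharArraySet DICTIONARY =
          new CharArraySet(Arrays.asList("donau", "dampf", "schiff", "fahrt"), true);

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        // keeps the original compound token and adds any dictionary subwords found inside it
        TokenStream stream = new DictionaryCompoundWordTokenFilter(source, DICTIONARY);
        return new TokenStreamComponents(source, stream);
      }
    }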


On Sat, 16 Sep 2017 at 09:29, Dawid Weiss 
wrote:

> Hi Mike. Search the Lucene dev archives. I did write a decompounder with Daniel
> Naber. The quality was not ideal but perhaps better than nothing. Also,
> Daniel works on languagetool.org? They should have something in there.
>
> Dawid
>
> On Sep 16, 2017 1:58 AM, "Michael McCandless" 
> wrote:
>
> > Hello,
> >
> > I need to index documents with German text in Lucene, and I'm wondering
> how
> > people have done this in the past?
> >
> > Lucene already has a DictionaryCompoundWordTokenFilter ... is this what
> > people use?  Are there good, open-source friendly German dictionaries
> > available?
> >
> > Thanks,
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
>


Re: Using POS payloads for chunking

2017-06-14 Thread Tommaso Teofili
I think it'd be interesting to also investigate using TypeAttribute [1]
together with TypeTokenFilter [2].

Regards,
Tommaso

[1] :
https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/analysis/tokenattributes/TypeAttribute.html
[2] :
https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html
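
To make that suggestion concrete, here is a minimal, self-contained sketch (not
code from this thread; the map-based "tagger" is only a stand-in for a real POS
tagger): an upstream filter stores the POS tag in each token's TypeAttribute,
and TypeTokenFilter in white-list mode keeps only the wanted word classes.

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.TypeTokenFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    public class NounsOnlyAnalyzer extends Analyzer {

      // toy lookup standing in for a real POS tagger
      private static final Map<String, String> TAGS = new HashMap<>();
      static {
        TAGS.put("lucene", "NNP");
        TAGS.put("library", "NN");
        TAGS.put("love", "VBP");
      }
      private static final Set<String> KEEP = new HashSet<>(Arrays.asList("NN", "NNS", "NNP"));

      /** Copies a POS tag from the lookup map into the token's TypeAttribute. */
      static final class MapPosFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

        MapPosFilter(TokenStream in) {
          super(in);
        }

        @Override
        public boolean incrementToken() throws IOException {
          if (!input.incrementToken()) return false;
          typeAtt.setType(TAGS.getOrDefault(termAtt.toString(), "UNK"));
          return true;
        }
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream tagged = new MapPosFilter(source);
        TokenStream nounsOnly = new TypeTokenFilter(tagged, KEEP, true); // true = keep listed types
        return new TokenStreamComponents(source, nounsOnly);
      }
    }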

On Wed, 14 Jun 2017 at 23:33, Markus Jelsma <
markus.jel...@openindex.io> wrote:

> Hello Erick, no worries, I recognize you two.
>
> I will take a look at your references tomorrow. Although I am still fine
> with eight bits, I cannot spare any more than one. If Lucene allows us to
> pass longer bitsets to the BytesRef, it would be awesome and easy to encode.
>
> Thanks!
> Markus
>
> -Original message-
> > From:Erick Erickson 
> > Sent: Wednesday 14th June 2017 23:29
> > To: java-user 
> > Subject: Re: Using POS payloads for chunking
> >
> > Markus:
> >
> > I don't believe that payloads are limited in size at all. LUCENE-7705
> > was done in part because there _was_ a hard-coded 256 limit for some
> > of the tokenizers. Payloads (at least in recent versions) just have
> > some bytes after them, and (with LUCENE-7705) can be arbitrarily long.
> >
> > Of course if you put anything other than a number in there you have to
> > provide your own decoders and the like to make sense of your
> > payload
> >
> > Best,
> > Erick (Erickson, not Hatcher)
> >
> > On Wed, Jun 14, 2017 at 2:22 PM, Markus Jelsma
> >  wrote:
> > > Hello Erik,
> > >
> > > Using Solr (although most of the parts involved are Lucene), we have a CharFilter
> adding treebank tags to whitespace-delimited words using a delimiter;
> further on we get these tokens with the delimiter and the POS tag. It won't
> work with some Tokenizers, and if you put it before WDF it'll split, as you know.
> That TokenFilter is configured with a tab-delimited mapping config
> containing \t, and there the bitset is encoded as the payload.
> > >
> > > Our edismax extension rewrites queries to payload-supported
> equivalents; this is quite trivial, except for all those API changes in
> Lucene you have to put up with. Finally, a BM25 extension has, amongst
> other things, a mapping of bitset to score. Nouns get a bonus, prepositions and
> other useless pieces get a penalty, etc.
> > >
> > > Payloads are really great things to use! We also use them to distinguish
> between compounds and their subwords (among others, we supply Dutch- and German-
> speaking countries), and between stemmed and non-stemmed words. Although the
> latter also benefit from IDF statistics, payloads just help to control
> boosting more precisely regardless of your corpus.
> > >
> > > I still need to take a look at your recent payload QParsers for Solr
> and see how different, probably better, they are compared to our older
> implementations. Although we don't use a PayloadTermQParser equivalent for
> regular search, we do use it for scoring recommendations via delimited
> multi-valued fields. Payloads are versatile!
> > >
> > > The downside of payloads is that they are limited to 8 bits. Although
> we can easily fit our reduced treebank in there, we also use single bits to
> signal for compound/subword, and stemmed/unstemmed and some others.
> > >
> > > Hope this helps.
> > >
> > > Regards,
> > > Markus
> > >
> > > -Original message-
> > >> From:Erik Hatcher 
> > >> Sent: Wednesday 14th June 2017 23:03
> > >> To: java-user@lucene.apache.org
> > >> Subject: Re: Using POS payloads for chunking
> > >>
> > >> Markus - how are you encoding payloads as bitsets and using them for
> scoring?   Curious to see how folks are leveraging them.
> > >>
> > >>   Erik
> > >>
> > >> > On Jun 14, 2017, at 4:45 PM, Markus Jelsma <
> markus.jel...@openindex.io> wrote:
> > >> >
> > >> > Hello,
> > >> >
> > >> > We use POS-tagging too, and encode the tags as payload bitsets for
> scoring, which is, as far as I know, the only possibility with payloads.
> > >> >
> > >> > So, instead of encoding them as payloads, why not index your
> treebank POS-tags as tokens at the same position, like synonyms? If you do
> that, you can use spans and phrase queries to find chunks of multiple
> POS-tags.
> > >> >
> > >> > This would be the first approach I can think of. Treating them as
> regular tokens enables you to use regular search for them.
> > >> >
> > >> > Regards,
> > >> > Markus
> > >> >
> > >> >
> > >> >
> > >> > -Original message-
> > >> >> From:José Tomás Atria 
> > >> >> Sent: Wednesday 14th June 2017 22:29
> > >> >> To: java-user@lucene.apache.org
> > >> >> Subject: Using POS payloads for chunking
> > >> >>
> > >> >> Hello!
> > >> >>
> > >> >> I'm not particularly familiar with Lucene's search API (as I've
> been using
> > >> >> the library mostly as a dumb index rather than a search engine),
> but I am
> > >> >> almost certain that, using its payload 

Re: Possible to cause documents to be contiguous after forceMerge?

2016-11-16 Thread Tommaso Teofili
improved locality of "near" documents could be used to avoid loading some
segments during the retrieval phase for certain use cases (e.g. spatial
search).


On Wed, 16 Nov 2016 at 09:45, Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

http://shaierera.blogspot.com/2013/04/index-sorting-with-lucene.html

On Wed, Nov 16, 2016 at 11:15 AM, Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Can IndexSort help here?
> --
> From: Erick Erickson 
> Sent: 11/16/2016 9:29
> To: java-user 
> Subject: Re: Possible to cause documents to be contiguous after
> forceMerge?
>
> Well, codecs are pluggable so if you can show that you'd get
> an improvement (however you measure them) and that whatever
> you have in mind wouldn't penalize the general case you could
> submit it as a proposal/patch.
>
> Best,
> Erick
>
> On Tue, Nov 15, 2016 at 6:21 PM, Kevin Burton  wrote:
> > On Tue, Nov 15, 2016 at 6:16 PM, Erick Erickson  >
> > wrote:
> >
> >> You can make no assumptions about locality in terms of where separate
> >> documents land on disk. I suppose if you have the whole corpus at index
> >> time you
> >> could index these "similar" documents contiguously. T
> >>
> >
> > Wow.. that's shockingly frightening. There are a ton of optimizations if
> > you can trick the underlying content store into performing locality.
> >
> > Not trying to be overly negative so another way to phrase it is that at
> > least there's room for improvement !
> >
> >
> >> My base question is why you'd care about compressing 500G. Disk space
> >> is so cheap that the expense of trying to control this dwarfs any
> >> imaginable
> >> $avings, unless you're talking about a lot of 500G indexes. In other
> words
> >> this seems like an
> >> XY problem, you're asking about compressing when you are really
> concerned
> >> with something else.
> >>
> >
> > 500GB per day... additionally, disk is cheap, but IOPS are not. The more
> we
> > can keep in ram and on SSD the better.
> >
> > And we're trying to get as much in RAM then SSD as possible... plus we
> have
> > about 2 years of content.  It adds up ;)
> >
> > Kevin
> >
> > --
> >
> > We’re hiring if you know of any awesome Java Devops or Linux Operations
> > Engineers!
> >
> > Founder/CEO Spinn3r.com
> > Location: *San Francisco, CA*
> > blog: http://burtonator.wordpress.com
> > … or check out my Google+ profile
> > 
>
>
>


Re: POS tagging in Lucene

2016-10-19 Thread Tommaso Teofili
I think it might be helpful to handle POS tags as TypeAttributes so that
the input and output texts would be cleaner and you can still filter and
retrieve tokens by type (e.g. with TypeTokenFilter).

My 2 cents,
Tommaso
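
As a hypothetical illustration of that idea for the bracketed input format
discussed below (e.g. "love[VBP]"): a small TokenFilter sketch that moves the
bracketed tag into the token's TypeAttribute and leaves the plain word in the
term, so downstream filters (lowercasing, stop words, stemming) see clean text
and a TypeTokenFilter can later keep or drop tokens by tag.

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    public final class BracketedPosAsTypeFilter extends TokenFilter {

      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

      public BracketedPosAsTypeFilter(TokenStream in) {
        super(in);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        String term = termAtt.toString();
        int open = term.lastIndexOf('[');
        if (open > 0 && term.endsWith("]")) {
          typeAtt.setType(term.substring(open + 1, term.length() - 1)); // e.g. "VBP"
          termAtt.setLength(open);                                      // keep just "love"
        }
        return true;
      }
    }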


On Wed, 19 Oct 2016 at 11:56, Niki Pavlopoulou 
wrote:

> Hi Steve,
>
> thank you for your answer. I created a custom Lucene Analyser in the end.
> Just to clarify on what I mean, Lucene works perfectly for pure words, but
> since it does not support POS tagging some workaround needs to be done for
> the analysis of tokens with POS tags. For example:
>
> Input without POS tags: "I love Lucene's library. It is perfect."
> Output: List(love, lucene, library, perfect)
>
> Input with POS tags: "I[PRP] love[VBP] Lucene's[NNP] library[NN] It[PRP]
> is[VBZ] perfect[JJ]"
> Output: List(i[prp], love[vbp], lucene's[nnp], library[nn], it[prp],
> is[vbz], perfect[jj])
> *Desired output*: List(love[vbp], lucene[nnp], library[nn], perfect[jj])
>
> If one does the POS tagging after the analysis, then the tags might be
> wrong as the right syntax has been lost. This is why the POS tagging needs
> to happen early on and then the analysis to take place.
>
> Regards,
> Niki.
>
> On 18 October 2016 at 19:59, Steve Rowe  wrote:
>
> > Hi Niki,
> >
> > > On Oct 18, 2016, at 7:27 AM, Niki Pavlopoulou  wrote:
> > >
> > > Hi all,
> > >
> > > I am using Lucene and OpenNLP for POS tagging. I would like to support
> > > biGrams with POS tags as well. For example, I would like something like
> > > that:
> > >
> > > Input: (I[PRP], am[VBP], using[VBG], Lucene[NNP])
> > > Output: (I[PRP] am[VBP], am[VBP] using[VBG], using[VBG] Lucene[NNP])
> > >
> > > The problem above is that I do not have "pure" tokens, like "I", "am"
> > etc.,
> > > so the analysis could be wrong if I add the POS tags as an input in
> > Lucene.
> > > Is there a way to solve this, apart from creating my custom Lucene
> > > analyser?
> >
> > To create your bigrams, check out ShingleFilter: <
> > http://lucene.apache.org/core/6_2_1/analyzers-common/org/
> > apache/lucene/analysis/shingle/ShingleFilter.html>
> >
> > I’m not sure what you mean by “the analysis could be wrong if I add the
> > POS tags as an input in Lucene” - can you give an example?
> >
> > You may be interested in the work-in-progress addition of OpenNLP
> > integration with Lucene here: <https://issues.apache.org/jira/browse/LUCENE-2899>
> >
> > --
> > Steve
> > www.lucidworks.com
> >
> >
> >
> >
>


Re: [blog post] Comparing Document Classification Functions of Lucene and Mahout

2014-03-08 Thread Tommaso Teofili
Yes, that's a good suggestion. I'll open a Jira issue for that soon.
Thanks,
Tommaso


2014-03-07 17:22 GMT+01:00 Koji Sekiguchi k...@r.email.ne.jp:

 Hi Tommaso,

 Thank you for your reply and tweet!


  Some useful points / suggestions come out of it, let's see if we can
 follow
  up :)

 Let's look at the simple one first. :-) Why don't we consider adding an Analyzer
 parameter
 to assignClass()?

 koji


 (14/03/07 17:18), Tommaso Teofili wrote:

 cool Koji, thanks a lot for sharing.
 Some useful points / suggestions come out of it, let's see if we can
 follow
 up :)

 Regards,
 Tommaso


 2014-03-07 3:30 GMT+01:00 Koji Sekiguchi k...@r.email.ne.jp:

  Hello,

 I just posted an article on Comparing Document Classification Functions
 of Lucene and Mahout.


 http://soleami.com/blog/comparing-document-classification-functions-of-
 lucene-and-mahout.html

 Comments are welcome. :)

 Thanks!

 koji
 --

 http://soleami.com/blog/comparing-document-classification-functions-of-
 lucene-and-mahout.html






 --
 http://soleami.com/blog/comparing-document-classification-functions-of-
 lucene-and-mahout.html





Re: [blog post] Comparing Document Classification Functions of Lucene and Mahout

2014-03-07 Thread Tommaso Teofili
cool Koji, thanks a lot for sharing.
Some useful points / suggestions come out of it, let's see if we can follow
up :)

Regards,
Tommaso


2014-03-07 3:30 GMT+01:00 Koji Sekiguchi k...@r.email.ne.jp:

 Hello,

 I just posted an article on Comparing Document Classification Functions
 of Lucene and Mahout.


 http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

 Comments are welcome. :)

 Thanks!

 koji
 --

 http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html





Re: [blog post] Automatically Acquiring Synonym Knowledge from Wikipedia

2013-05-29 Thread Tommaso Teofili
2013/5/29 Koji Sekiguchi k...@r.email.ne.jp

 Hi Rajesh,

 Thanks!
 I'm planning to open an NLP tool kit for Lucene, and the tool kit will
 include
 the following synonym library.


sounds nice, looking forward to it.

Tommaso



 koji


 (13/05/28 14:12), Rajesh Nikam wrote:

 Hello Koji,

 This seems a pretty useful post on how to create a synonyms file.
 Thanks a lot for sharing this!

 Have you shared the source code / jar for the same so that it could be used?

 Thanks,
 Rajesh



 On Mon, May 27, 2013 at 8:44 PM, Koji Sekiguchi k...@r.email.ne.jp
 wrote:

  Hello,

 Sorry for the cross post. I just wanted to announce that I've written a blog
 post on
 how to create a synonyms.txt file automatically from Wikipedia:


 http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html

 Hope that the article gives someone a good experience!

 koji
 --

 http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html




 --
 http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html





Re: Reg Lucene Naive Bayesian classifier.

2013-01-15 Thread Tommaso Teofili
2013/1/15 VIGNESH S vigneshkln...@gmail.com

 Hi All,

 Thanks for your replies..

 Actually I am trying to classify email data into categories
 and also detect spam mails. I have tried clustering but it is not useful
 since we cannot control the categories.

 I am looking for a lightweight implementation which can be used on
 mobile devices on the client side.

 I thought the Lucene Naive Bayesian classifier would be useful...

 Please suggest whether classifying emails can be done using the
 Lucene Naive Bayesian classifier or any other Lucene classifiers.


You could actually use one of the existing ones (naive Bayes or nearest
neighbor) or even implement a new one (just implement the Classifier
interface [1]) if you already have enough labeled data in your index (one
field containing the mail text and one field containing the assigned category).
To use those, just call the Classifier#train method to train the classifier and
Classifier#assignClass to assign a class/category to a new text.
If your task is just spam detection, IMHO one of the above should be enough;
if you also have to assign different categories depending on proper
semantics, then I'd recommend using some other library which is more focused
on that purpose, like Apache Mahout, Apache OpenNLP, etc.

My 2 cents,
Tommaso


[1] :
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/classification/src/java/org/apache/lucene/classification/Classifier.java
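
A rough usage sketch of the train/assignClass flow described above, written
against the 4.x-era module linked at [1]. It is untested; the index path and the
field names ("body", "category") are invented, and exact signatures may differ
in other Lucene versions.

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.classification.ClassificationResult;
    import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.SlowCompositeReaderWrapper;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.Version;

    public class MailClassifierSketch {
      public static void main(String[] args) throws Exception {
        // open an existing index that holds labeled mails
        Directory dir = FSDirectory.open(new File("/path/to/mail-index"));
        AtomicReader reader = SlowCompositeReaderWrapper.wrap(DirectoryReader.open(dir));

        // train on the "body" (text) and "category" (label) fields
        SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier();
        classifier.train(reader, "body", "category", new StandardAnalyzer(Version.LUCENE_40));

        // classify a new, unseen text
        ClassificationResult<BytesRef> result =
            classifier.assignClass("cheap pills, click here to claim your prize");
        System.out.println(result.getAssignedClass().utf8ToString() + " score=" + result.getScore());

        reader.close();
      }
    }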




 Thanks and Regards
 Vignesh Srinivasan


 On Mon, Jan 14, 2013 at 7:23 PM, VIGNESH S vigneshkln...@gmail.com
 wrote:
  Hi,
 
  Anyone Used the Naive Bayesian Classifier?
 
  It will be really helpful if someone can post how to use the
  classifiers in Lucene ..
 
  --
  Thanks and Regards
  Vignesh Srinivasan
  9739135640



 --
 Thanks and Regards
 Vignesh Srinivasan
 9739135640





Re: Help needed Regarding classification of Text Data using Lucene..

2013-01-09 Thread Tommaso Teofili
Hi,

You can have a look at the (early-stage) Lucene classification module on
trunk [1]; see also a brief introduction given at the last ApacheCon EU [2].

Hope this helps,
Tommaso

[1] :
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/classification/
[2] :
http://www.slideshare.net/teofili/text-categorization-with-lucene-and-solr


2013/1/9 VIGNESH S vigneshkln...@gmail.com

 Hi,

 Can anyone suggest how I can use Lucene for text classification?

 --
 Thanks and Regards
 Vignesh Srinivasan





Re: ANN: UweSays Query Operator

2012-11-20 Thread Tommaso Teofili
that's nice!

Tommaso


2012/11/19 Uwe Schindler u...@thetaphi.de

 Lol!

 Many thanks for this support!

 Uwes



 Otis Gospodnetic otis.gospodne...@gmail.com wrote:

 Hi,
 
 Quick announcement for Uwe & Friends.
 
 UweSays is now a super-duper-special query operator over on
 http://search-lucene.com/ .  Now whenever you want to know what Uwe
 says
 about something just start the query with UweSays.
 
 Example:
   http://search-lucene.com/?q=UweSays+mmap
 
 It's not case sensitive, so you can lay off the shift key.
 There are some other similar Easter eggs in there if you want to hunt.
 
 Otis
 --
 Performance Monitoring - http://sematext.com/spm/index.html
 Search Analytics - http://sematext.com/search-analytics/index.html

 --
 Uwe Schindler
 H.-H.-Meier-Allee 63, 28213 Bremen
 http://www.thetaphi.de


Re: Lucene index on NFS

2012-10-02 Thread Tommaso Teofili
OK, that saves you from concurrency issues, but in my experience it is just
much slower than a local file system, so NFS can still be used but with some
tradeoff on performance.

My 2 cents,
Tommaso

2012/10/2 Jong Kim jong.luc...@gmail.com

 The setup is I have a home-grown server process that has exclusive access
 to the index files. All reads and writes are done through this server. No
 other process is reading the same index files whether it's local or over
 NFS.
 /Jong
 On Tue, Oct 2, 2012 at 8:56 AM, Ian Lea ian@gmail.com wrote:

  I agree that reliability/corruption is not an issue.
 
  I would also put it that performance is likely to suffer, but that's
  not certain.  A fast disk mounted over NFS can be quicker than a slow
  local disk.  And how much do you care about performance?  Maybe it
  would be fast enough over NFS to make the ease of deployment balance
  out lesser speed.
 
  What's the setup here?  Will you be writing to an index on local disk
  of server A and reading it, over NFS, from server B (and C and ...) or
  what?
 
  --
  Ian.
 
 
  On Tue, Oct 2, 2012 at 1:45 PM, Paul Libbrecht p...@hoplahup.net
 wrote:
   I doubt NFS is an unreliable file-system.
   Lucene uses normal random access to files and this has no reason to be
  unreliable unless bad things such as network drops happen (in which case
  you'd get direct failures or  timeouts rather than corruption). I've seen
  fairly large infrastructures being based on NFS and corruption is
 something
  I've never heard about.
  
   Note: no concurrent access to a lucene index, right?
  
   Paul
  
  
   On 2 Oct 2012 at 14:01, Jong Kim wrote:
  
   Thank you all for reply.
  
   So it sounds like it is a known fact that the performance would suffer
   rather significantly when the index files are accessed over NFS. But
 how
   about reliability and robustness (which seems even more important)?
  Isn't
   there any increased possibility for intermittent errors such as index
  file
   corruption (due to cache inconsistency, difference in delete
 semantics,
   etc.) when using NFS? Has anyone run into such trouble? Or is it
  strictly
   just a performance issue?
  
   /Jong
   On Tue, Oct 2, 2012 at 5:17 AM, Paul Libbrecht p...@hoplahup.net
  wrote:
  
   My experience in the Lucene 1.x times was a factor of at least four
 in
   writing to NFS and about two when reading from there. I'd discourage
  this
   as much as possible!
  
   (rsync is way more your friend for transporting and replication à la
  solr
   should also be considered)
  
   paul
  
  
   On 2 Oct 2012 at 11:10, Ian Lea wrote:
  
   You'll certainly need to factor in the performance of NFS versus
 local
   disks.
  
   My experience is that smallish low activity indexes work just fine
 on
   NFS, but large high activity indexes are not so good, particularly
 if
   you have a lot of modifications to the index.
  
   You may want to install a custom IndexDeletionPolicy.  See the
   javadocs for details with specific reference to NFS.
  
  
   --
   Ian.
  
   On Tue, Oct 2, 2012 at 3:21 AM, Vitaly Funstein 
 vfunst...@gmail.com
   wrote:
   How tolerant is your project of decreased search and indexing
   performance?
   You could probably write a simple test that compares search and
 write
   speeds of local and NFS-mounted indexes and make the decision based
  on
   the
   results.
  
   On Mon, Oct 1, 2012 at 3:06 PM, Jong Kim jong.luc...@gmail.com
  wrote:
  
   Hi,
  
   According to the Lucene In Action (Second Edition), the section
  2.11.2
   Accessing an index over a remote file system explains that there
  are
   issues related to accessing a Lucene index across remote file
 system
   including NFS.
  
   I'm particularly interested in NFS compatibility, and wondering if
   there has
   been any work done to solve or mitigate this problem. Has this
 issue
   been
   addressed? If not, are there some reliable work-arounds that make
  this
   possible at the expense of some sacrifice in other areas?
  
   Any information would be greatly appreciated, since my project
  heavily
   depends on the feasibility of this.
  
   Thanks
   /Jong
  
  
  
  
  
  
  
  
  
  
  
 

Re: Custom Payload Analyzer and Query

2012-02-07 Thread Tommaso Teofili
2012/2/6 Ian Lea ian@gmail.com

 Not sure if you got an answer to this or not.  Don't recall seeing one
 and gmail threading says not.

  Is the use of payloads I've described appropriate?

 Sounds OK to me, although I'm not sure why you can't store the
 metadata as a Document Field.

  Can I exclude/filter the matching terms based on the payload within a
 query itself ?

 I think not.  Could if the metadata was an indexed Field.


What you may do is initially put your metadata inside the token type, then
use the TypeTokenFilter to filter out some of them, then copy them into
the payloads using TypeAsPayloadTokenFilter, and search with
PayloadSpanUtil/PayloadTermQuery/etc.

HTH,
Tommaso
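
A minimal, self-contained sketch of that chain (not the poster's medical/UIMA
setup, and using current package names; 3.x-era signatures differ slightly).
StandardTokenizer already assigns a token type such as <ALPHANUM> or <NUM>,
TypeTokenFilter drops unwanted types, and TypeAsPayloadTokenFilter copies the
surviving type into the payload so it can be inspected at search time, e.g. via
PayloadSpanUtil or a payload-aware query. In the real use case the type would be
set from the UIMA annotations instead.

    import java.util.Collections;
    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.TypeTokenFilter;
    import org.apache.lucene.analysis.payloads.TypeAsPayloadTokenFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class TypeAsPayloadAnalyzer extends Analyzer {

      private static final Set<String> DROP = Collections.singleton("<NUM>"); // drop numeric tokens

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new TypeTokenFilter(source, DROP);  // stop-list mode: removes <NUM>
        stream = new TypeAsPayloadTokenFilter(stream);           // type string becomes the payload
        return new TokenStreamComponents(source, stream);
      }
    }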





 --
 Ian.


 On Mon, Jan 30, 2012 at 10:24 PM,  kt...@mmm.com wrote:
  I'm working on providing advanced searching for annotated Medical
  Documents (using UIMA).  In the context of an annotated document, I
  identify relevant medical terms, as well as the negation of certain
 terms.
   Following what I've read and seen in Lucene examples, I've been able to
  provide a search that takes into account the metadata contained in the
  payload.  Although very primitive, I've implemented a search which
 returns
  the payloads (using PayloadSpanUtil), and then excludes those terms where
  the payload doesn't meet the criteria.
 
  Is the use of payloads I've described appropriate?  Can I exclude/filter
  the matching terms based on the payload within a query itself ?   Are
  there any examples that do this?
 
  Cheers,
  Kyley





Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Tommaso Teofili
[X] ASF Mirrors (linked in our release announcements or via the Lucene
website)

[X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)

[] I/we build them from source via an SVN/Git checkout.

[] Other (someone in your company mirrors them internally or via a
downstream project)

2011/1/18 Grant Ingersoll gsing...@apache.org

 As devs of Lucene/Solr, due to the way ASF mirrors, etc. works, we really
 don't have a good sense of how people get Lucene and Solr for use in their
 application.  Because of this, there has been some talk of dropping Maven
 support for Lucene artifacts (or at least make them external).  Before we do
 that, I'd like to conduct an informal poll of actual users out there and see
 how you get Lucene or Solr.

 Where do you get your Lucene/Solr downloads from?

 [] ASF Mirrors (linked in our release announcements or via the Lucene
 website)

 [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)

 [] I/we build them from source via an SVN/Git checkout.

 [] Other (someone in your company mirrors them internally or via a
 downstream project)

 Please put an X in the box that applies to you.  Multiple selections are OK
 (for instance, if one project uses a mirror and another uses Maven)

 Please do not turn this thread into a discussion on Maven and its
 (de)merits; I simply want to know, informally, where people get their JARs
 from.  In other words, no discussion is necessary (we already have that
 going on d...@lucene.apache.org which you are welcome to join.)

 Thanks,
 Grant