Paul, for a pair-wise comparison, Cosine Similarity works quite well
for most purposes.
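As a minimal sketch in plain Java (independent of Lucene's APIs; class and method names here are illustrative), cosine similarity between two term-frequency vectors can be computed like this:

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {
    // Build a term-frequency map from whitespace-tokenized, lowercased text.
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // cos(a, b) = dot(a, b) / (|a| * |b|)
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer bv = b.get(e.getKey());
            if (bv != null) dot += e.getValue() * bv;
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine(termFreq("lucene search library"),
                                  termFreq("lucene search engine")));
    }
}
```

In practice you would fill the term-frequency maps from Lucene's stored term vectors rather than re-tokenizing the raw text.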
On Wed, Dec 3, 2014 at 10:45 AM, Paul Taylor wrote:
> On 03/12/2014 15:14, Barry Coughlan wrote:
>>
>> Hi Paul,
>>
>> I don't have much expertise in this area so hopefully others will answer,
>> but maybe this
To second Vitaly's suggestion: you should consider using Apache Solr
instead - it handles such issues OOTB.
On Fri, May 23, 2014 at 7:52 PM, Vitaly Funstein wrote:
> At the risk of sounding overly critical here, I would say you need to scrap
> your entire approach of building one small index per
You are probably better off working with Solr in a multi-user system,
and since you seem to be on .NET, use the SolrNet wrapper to call Solr
from your .NET app.
On Mon, Dec 23, 2013 at 1:37 AM, raju wrote:
> a high level I understand the lucence searcher will get the
> directory path and search the que
http://www.slideshare.net/teofili/text-categorization-with-lucene-and-solr
On Wed, Jan 9, 2013 at 5:46 AM, VIGNESH S wrote:
> Hi,
>
> can anyone suggest me how can i use lucene for text classification.
>
> --
> Thanks and Regards
> Vignesh Srinivasan
>
> -
A related thread on Stackoverflow:
http://stackoverflow.com/questions/3215029/nosql-mongodb-vs-lucene-or-solr-as-your-database/3216550#3216550
On Fri, May 18, 2012 at 10:44 AM, Konstantyn Smirnov
wrote:
> Hi all,
>
> apologies, if this question was already asked before.
>
> If I need to store a l
This book is your best buddy: http://www.manning.com/hatcher3/
On Fri, Mar 2, 2012 at 2:01 PM, rahul reddy wrote:
> Hi ,
>
>
> I'm new to Lucene.Can anyone tell me how can i start learning about it with
> the code base.
> I have knowledge of endeca search engine and have worked on it.
> So, if
You might want to post this on sites such as odesk.com, rentacoder.com,
guru.com, freelancer.com
On Mon, Feb 13, 2012 at 9:31 AM, SearchTech wrote:
> am currently working on a search engine based on lucene and have some
> issues because java is not my regular programming language, which ma
I had posted this earlier on this list, hope this provides some answers
http://engineering.socialcast.com/2011/05/realtime-search-solr-vs-elasticsearch/
On Wed, Nov 16, 2011 at 9:53 AM, Federico Fissore wrote:
> Peyman Faratin, il 16/11/2011 15:12, ha scritto:
>
> Hi
>>
>> A client is conside
Using Lucene as a recommendation engine.
On Sat, Oct 22, 2011 at 6:33 PM, Grant Ingersoll wrote:
>
> On Oct 22, 2011, at 6:03 PM, Sujit Pal wrote:
>
>> Hi Grant,
>>
>> Not sure if this qualifies as a "bet you didn't know", but one could use
>> Lucene term vectors to construct document vectors for
Look up payload feature.
On Aug 14, 2011 2:37 PM, "Saar Carmi" wrote:
> Hi
> Does Lucene support setting word confidence for every word in the
document,
> to influence the scoring?
> As suggested by MAVIS project, when indexing Speech Recognition text one
> need to take into account how confident
Alternatively, you could create a multivalued field whereby each
sentence is in the same document, but retrievable in order.
On Fri, Jul 22, 2011 at 11:10 AM, Glen Newton wrote:
> So to use Lucene-speak, each sentence is a document.
>
> I don't know how you are indexing and what code you are usi
https://issues.apache.org/jira/browse/LUCENE-1522
On Wed, May 25, 2011 at 3:46 PM, Leroy Stone wrote:
> document ("paragraphs") that contain my search phrase, rather than simply
> pointers to the whole document. in searching among applications based upon
> the Lucene, I have found only one that
Have you considered using document similarity metrics such as Cosine Similarity?
On Thu, Dec 30, 2010 at 6:05 AM, Amel Fraisse wrote:
> Hello,
>
> I am using Lucene for plagiarism detection.
>
> The goal is that: when I have a new document, I will check on the solr index
> if there is a document
> yes, but if they are typing away, they likely aren't also searching at
> the same time unless they have two keyboards and four hands... so why
> update anything in real time?
Presumably the OP meant user A was editing the doc while other users,
or a monitoring app, were searching said doc simultaneously.
There are multiple measures of similarity for documents: Cosine similarity
is a frequently used one.
On Sat, Nov 13, 2010 at 9:23 AM, Ciprian URSU wrote:
> Hi Guys,
>
>I just find out about Lucene; after reading the main things on wiki
> it seems to be a great tool, but I still didn't f
For Part-of-Speech (POS) identification you are better off looking at a tool
like OpenNLP or NLTK.
2010/10/30 Mário André
>
> Hi,
> I need a Java API that identify the grammatical terms in noun phrase (NP).
> Eg: I see the words when you are talking.
> See: Verb
> Words: Noun
> are: Verb
> talk
Hello, I am familiar with the SpanQuery construct and have set an upper slop limit.
1. But when I get the hit results, is there any way I can access the
actual slop and the span text itself in that particular hit.
2. Also it is possible to have multiple matches within the same
document. So how do I acc
On Tue, Jul 13, 2010 at 5:17 PM, Max Lynch wrote:
> Hi,
> I would like to continuously iterate over the documents in my lucene index
> as the index is updated. Kind of like a "stream" of documents. Is there a
> way I can achieve this?
>
> Would something like this be sufficient (untested):
>
>
I have used TagSoup to parse the HTML and get the elements of interest.
http://ccil.org/~cowan/XML/tagsoup/
On Mon, Jun 28, 2010 at 4:06 PM, Boris Aleksandrovsky
wrote:
> I was wondering if any of you know of any open-source solutions for general
> issues which arise in web crawling - how do yo
r Java objects) feels the
>>> same. I saw something from Grant about 2 months ago how Lucene is
>>> "nosql-ish".
>>>
>>> Otis
>>>
>>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>>> Lucene ecosystem search :
Based on your description, I would recommend Solr. It provides several
features such as spelling suggestion, faceting, etc. OOTB.
http://lucene.apache.org/solr/features.html
should answer all your questions.
On Mon, May 31, 2010 at 7:54 PM, Frank A wrote:
> Thanks a bunch.
>
> Since I'm already
You are certainly in the right place - Apache Solr (a search server
built using Lucene) provides what you are looking for out of the box.
On Mon, May 31, 2010 at 7:20 PM, Frank A wrote:
> Hello all,
> I'm considering Lucene for a specific application and am trying to ensure
> that it is the righ
There seems to be considerable buzz on the internets about document
oriented dbs such as MongoDB, CouchDB etc. I am at a loss as to what
are the principal differences between Lucene and the "DODBs". I could
very well use Lucene as any of the above (schema-free, document-oriented)
and perform similar que
Add the common terms such as "University", "School", "Medicine",
"Institute" etc. to stopwords list, so you are left with Stanford,
"Palo Alto" etc.
Then use Ahmet's suggestion of using a booleanquery
.setMinimumNumberShouldMatch() to (say) 75% of the query string
length.
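To make the "75%" concrete, here is a small plain-Java sketch of how you might derive the clause count to pass to BooleanQuery.setMinimumNumberShouldMatch() (the rounding policy and the floor of one are my assumptions, not anything Lucene mandates):

```java
public class MinShouldMatch {
    // Round the given fraction of the query's term count down to a whole
    // clause count, but always require at least one clause to match.
    static int minimumShouldMatch(int numTerms, double fraction) {
        return Math.max(1, (int) Math.floor(numTerms * fraction));
    }

    public static void main(String[] args) {
        System.out.println(minimumShouldMatch(4, 0.75));  // 3 of 4 terms
        System.out.println(minimumShouldMatch(1, 0.75));  // still 1
    }
}
```

The result would then be passed to the BooleanQuery built from the individual term clauses.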
Finally, if you wish to
or a CRC
On Sun, Jan 31, 2010 at 11:58 AM, Shashi Kant wrote:
> If all you want is to flag a document "dirty" you could hash the
> fields in the document and and check for an update.
>
>
>
> On Sun, Jan 31, 2010 at 11:51 AM, Robert Koberg wrote:
>> Hi,
>&g
If all you want is to flag a document "dirty" you could hash the
fields in the document and check for an update.
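A minimal sketch of the idea in plain Java (SHA-256 here is just one reasonable choice; the class and method names are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class DocDigest {
    // Hash the concatenated field values; a changed digest means the
    // document is "dirty" and needs re-indexing.
    static String digest(String... fieldValues) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (String v : fieldValues) {
            md.update(v.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator so ("ab","c") != ("a","bc")
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        String before = digest("title", "body text");
        String after  = digest("title", "body text (edited)");
        System.out.println(before.equals(after)); // false: document is dirty
    }
}
```

Store the digest alongside the document (e.g. in a stored field) and compare on each update pass.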
On Sun, Jan 31, 2010 at 11:51 AM, Robert Koberg wrote:
> Hi,
>
> Just coming back to Lucene after a few years.
>
> Is there some convenient way to compare Lucene Documents?
>
> I
Hi, if you want to search by substring (i.e. "lp" should return "lpg"
as a result) you should look at wildcards.
So a search for "lp*" (* is the wildcard character) would return lpg,
lpghxyz, lp12345 and so on...
On Thu, Jan 28, 2010 at 1:41 PM, andy green wrote:
>
> hello,
>
> I programmed wit
Couldn't you just mod the PorterStemmer class for your requirements?
(we did and provided it a list of ignore words & phrases specific to
our needs)
On Sat, Jan 9, 2010 at 4:00 AM, Jamie wrote:
> Hi All
>
> Is there another stemmer we can use that is perhaps not as aggressive as the
> Porter Stem
You can enable that by
QueryParser.setAllowLeadingWildcard( true )
On Mon, Dec 28, 2009 at 2:46 AM, liujb wrote:
>
> oh,my god,
>
> Query Parser Syntax
>
> CANNOT use a * or ? symbol as the first character of a search.
>
> that's mean I can't wrinte a search string like '*test'. this will be ca
Have you looked at BooleanQuery? Create individual TermQuery and OR them
using BooleanQuery.
On Thu, Dec 10, 2009 at 10:34 AM, comparis.ch - Roman Baeriswyl <
roman.baeris...@comparis.ch> wrote:
> Hi,
>
> I do have some indices where I need to get results based on a fixed number
> list (not a ran
This forum is probably not the best place to ask this question, since this
is a Lucene developers/users forum. If you want to write a tool, then this is
the place to be. If you want a ready-made tool, one I am aware of is
SearchMyFiles.exe from NirSoft.
http://www.nirsoft.net/utils/search_my_files.ht
Hi, I have not worked on a petascale (yet!) - mostly on the scale of tens of
terabytes - but I do think Lucene would be very helpful for such a use case. I
would indeed suggest partitioning the index by users (it seems the most
logical, straightforward way, and also offers the security of insulating one
user
Use this tool to examine the index: http://www.getopt.org/luke/
I would also suggest getting hold of a Lucene book such as Lucene In Action
(http://www.manning.com/hatcher2/) to get familiar with the basics of
Lucene.
On Mon, Nov 23, 2009 at 4:42 AM, DHIVYA M wrote:
> Sir,
>
> Am using lucene
Take a look at Bayesian text classification, which might be more
efficient for your needs. Google it.
There are several other text classification methods - depending on your
needs, you can dig into them.
On Mon, Nov 9, 2009 at 10:33 AM, lucenenew wrote:
>
> i want to classify sentences stored as s
Hi,
Apologies this is not a Lucene question per se...however I am in need
of some sage advice... We anticipate our indices to be very large (on
the order of a few TB each), and there will be 5-10 of them. Hence
we are looking at NAS or SANs with 15-20TB storage.
My questions:
1. Any recommend
Here is some code to help you along. This should leave the source
indices intact and merges them into a destination.
// the index to hold our merged index
IndexWriter iw = new IndexWriter(dest, new
StandardAnalyzer(), true);
String[] sourceIndices;
Chris & Erick's arguments are persuasive; however, we do live in an
imperfect world. Most of our users want to see the relative importance
of a result vis-à-vis the rest:
Relative Importance (%) = (d - dmin)/(dmax-dmin) * 100
Where dmax is the highest Lucene score (score of the top result) and dmin
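The min-max normalization above is straightforward to code. A small sketch (the handling of the degenerate dmax == dmin case, where the formula is undefined, is my own assumption):

```java
public class RelativeImportance {
    // Relative Importance (%) = (d - dmin) / (dmax - dmin) * 100
    static double relativeImportance(double d, double dmin, double dmax) {
        if (dmax == dmin) return 100.0; // single result, or all scores tied
        return (d - dmin) / (dmax - dmin) * 100.0;
    }

    public static void main(String[] args) {
        System.out.println(relativeImportance(8.0, 2.0, 8.0)); // top hit: 100.0
        System.out.println(relativeImportance(5.0, 2.0, 8.0)); // mid hit: 50.0
    }
}
```

Note the caveat from Chris and Erick still applies: the resulting percentages are only comparable within a single result set, not across queries.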
ext queries. But they will be expanded into:
>>> Multifield, Boolean.
>>> We are also expanding the original query using SynExpand of lucene. A
>>> simple
>>> query
>>> gets expanded to say a query of page size.
>>>
>>> And we are not sto
Prashant, I have had better luck with even larger sized indices on
similar platforms. Could you elaborate on what types of queries you are
running: MultiField? Boolean? combinations? etc. Also you might want
to remove unnecessary stored fields from the index and move them to a
relational db to squeeze
y and constrcuting vector space.
>
> - RB
>
>
> ----- Original Message
> From: Shashi Kant
> To: java-user@lucene.apache.org
> Sent: Tuesday, June 23, 2009 3:20:16 PM
> Subject: Re: Similarity
>
> I suspect what you are looking for is "Latent Semantics
I suspect what you are looking for is "Latent Semantics" - it can
algorithmically infer that "iPod~iPhone" or "Apple~Steve Jobs". Google for
"Latent Semantic Indexing" or "Latent Semantic Analysis" - you can apply
some of those approaches using the TermVectors in Lucene index.
Ontologies such as Wo
Thanks for the heads-up, Otis. I will give this some more thought, prototype
some, and possibly put in a proposal for the Apache Incubator.
Ye,
I am not aware of Sixearch, but there are several P2P applications
e.g. WiredReach, Grub, Neurogrid etc.
However, my idea is quite a bit different from the exis
Hi all,
I am writing to gauge the group's interest level in building a P2P
application using Lucene. Nothing fancy, just good old-fashioned P2P
search across one's social-network or work-network (very unlike
Gnutella, Kazaa etc.). The obvious business-case for this could be
many such as document s
Hi all,
I am working on an iPhone application where the Lucene index needs to
reside on-device (for multiple reasons). Has anyone figured out a way
to do that?
As you might know the iPhone contains SQLite - could an index be
embedded inside SQLite? or could it be resident separately as a file?
Th
ugh it more efficiently than grep?
>
> Thanks,
>
> Michael
>
> On Sun, Apr 12, 2009 at 7:53 PM, Shashi Kant wrote:
>
>> Not sure what the business-case for this is and why you cannot use
>> RegEx for this. But you could consider chopping up the document into
>>
Not sure what the business-case for this is and why you cannot use
RegEx for this. But you could consider chopping up the document into
(sub) documents and adding them to the Lucene index. For example, chop
by paragraph or line-break.
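A minimal sketch of the chopping step in plain Java (the blank-line convention for paragraph boundaries is an assumption; adjust the regex to your source format):

```java
import java.util.Arrays;
import java.util.List;

public class Chopper {
    // Split on blank lines so each paragraph can be indexed as its own
    // Lucene (sub-)document, making matches addressable by paragraph.
    static List<String> paragraphs(String doc) {
        return Arrays.asList(doc.trim().split("\\n\\s*\\n"));
    }

    public static void main(String[] args) {
        String doc = "First paragraph.\n\nSecond paragraph.\n\n\nThird.";
        System.out.println(paragraphs(doc).size()); // 3
    }
}
```

Each returned chunk would then be added as a separate Document, with a stored field linking it back to the parent file.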
HTH,
Shashi
On Sun, Apr 12, 2009 at 1:51 PM, wrote:
> Hi,
>
Hmm... not sure I would call Autonomy a "superb product". IMHO it is anything
but. In fact, it is what one calls bloat-ware. I have had some experience
with Autonomy and it is hardly something you should consider using unless
you are eager to shoot yourself in the foot. I fundamentally disagree with
P
Hi all, I have been trying to get the latest version of LuSQL from the
NRC.ca website but get 404s on the download links. I have written to the
webmaster, but anyone have the jar handy? Could I download from somewhere
else? or could you email it to me?
thanks,
Shashi
Thanks Chris, your suggestion is very appropriate and I am happy to share my
work with the Lucene community,
Regards,
Shashi
On Tue, Mar 24, 2009 at 7:15 PM, Chris Hostetter
wrote:
>
> : This is perfect, exactly what I was looking for. Thanks much Andrzej!
>
> if you code that up and it works o
This is perfect, exactly what I was looking for. Thanks much Andrzej!
On Mon, Mar 23, 2009 at 1:43 AM, Andrzej Bialecki wrote:
> Shashi Kant wrote:
>
>> Is there an "elegant" approach to partitioning a large Lucene index (~1TB)
>> into smaller sub-indexes other t
Is there an "elegant" approach to partitioning a large Lucene index (~1TB)
into smaller sub-indexes other than the obvious method of re-indexing into
partitions?
Any ideas?
Thanks,
Shashi
The BoW approach is simple and highly effective IMO. If you want to get a
bit fancy, you could also use a MultiField query in the combined index.
Another brute-force approach would be to hit all 3 indexes and see which
ones come back with the highest score(s).
On Mon, Mar 9, 2009 at 8:43 AM, Er
Liat, I think what Erick suggested was to use the TOKENIZED setting instead
of UN_TOKENIZED. For example your code should read something like:
Document doc = new Document();
doc.add(new Field(WordIndex.FIELD_WORLDS, "111 222 333", Field.Store.YES, *
Field.Index.TOKENIZED*));
Unless I am missing s
Yes, it is good to learn that Yonik, Erik et al. are also human beings. :-)
Thanks for all your contributions to Lucene/Solr, this list and the OSS
community in general.
Best,
Shashi
On Thu, Mar 5, 2009 at 11:36 AM, Erick Erickson wrote:
> Let's see, you guys generously contributed your time and
A simple solution would be to store the string "NULL" instead of null and
then query.
On Wed, Mar 4, 2009 at 1:26 PM, Chris Lu wrote:
> Allahbaksh,
>
> If you ONLY want to find all document with a particular field that is not
> null, you can loop through the TermEnum and TermDocs to find all th
Not sure what you are asking about, but you might want to take a look at
http://lucene.apache.org/java/2_4_0/api/contrib-surround/index.html
The Surround parser offers many features around the span query (which I
suspect is what you are looking for).
Shashi
On Mon, Mar 2, 2009 at 4:57 AM, shb w
Nada,
You might want to consider writing a custom tokenizer which will allow you to
generate tokens based on your needs (other than whitespace).
Another option would be to look at SpanQuery or SpanNearQuery which would help
with the kind of problem you are trying to solve (assuming I understand
Take a look at Solr - it should be able to handle the scale you describe. My
suggestion is not to partition indexes unless you absolutely have to.
- Original Message
From: "spr...@gmx.eu"
To: java-user@lucene.apache.org
Sent: Saturday, February 14, 2009 10:27:58 AM
Subject: RE: Multiple
Thanks Omar, I have looked at Prefuse.
What has been your experience with it given it is still in beta? any "gotchas"
we should look out for?
regards,
shashi
- Original Message
From: Omar Alonso
To: java-user@lucene.apache.org; Shashi Kant
Sent: Thursday, February 12,
Hi all,
Apologies for being slightly off-topic, we are looking at novel visualization
approaches for rendering results from Lucene queries. I was wondering if you
have any recommendations for visualization toolkits (Java) for displaying
heat-maps, dendrograms, cluster maps etc. (preferably free
Unless I am missing something, not sure I see the issue here. You can convert
to Base64 purely for indexing purposes and leave the original binary as-is.
- Original Message
From: Paul Feuer
To: Lucene User List ; Shashi Kant
Sent: Friday, January 30, 2009 10:12:33 AM
Subject: Re
Hi Paul, have you tried persisting the binaries in Base64 format and then
indexing them?
As you are aware, Base64 is a robust representation used in email attachments
for example.
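In modern Java the round-trip is a one-liner each way via java.util.Base64 (Java 8+); the class and method names below are otherwise illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BinaryAsText {
    // Encode arbitrary bytes as Base64 text so they can be stored in a
    // text field, and decode them back unchanged on retrieval.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    static byte[] decode(String stored) {
        return Base64.getDecoder().decode(stored);
    }

    public static void main(String[] args) {
        byte[] original = "Lucene".getBytes(StandardCharsets.UTF_8);
        String stored = encode(original);
        System.out.println(new String(decode(stored), StandardCharsets.UTF_8));
    }
}
```

Note the Base64 text itself is opaque to tokenizers - this makes the binary storable in a text field, not searchable by content.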
Thanks
Shashi
- Original Message
From: Paul Feuer
To: java-user@lucene.apache.org
Sent: Thursday, Janu
Amin,
Are you calling Close & Optimize after every addDocument?
I would suggest something like this
try
{
    // e.g. loop through a data reader, adding one document per row
    while (reader.next())
    {
        indexWriter.addDocument(document);
    }
}
finally
{
    commitAndOptimise(); // commit and optimize once, after the loop
}
HTH
Shashi