Re: How best to compare two sentences

2014-12-03 Thread Shashi Kant
Paul, for a pair-wise comparison, Cosine Similarity does pretty well for most purposes. On Wed, Dec 3, 2014 at 10:45 AM, Paul Taylor wrote: > On 03/12/2014 15:14, Barry Coughlan wrote: >> >> Hi Paul, >> >> I don't have much expertise in this area so hopefully others will answer, >> but maybe this
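
As a rough illustration of the pair-wise cosine comparison mentioned above, here is a minimal sketch in plain Java. It is not taken from the thread and does not use Lucene: the class name SentenceCosine, the raw term-count weighting, and the tokenization rule (lower-case, split on anything that is not a letter or digit) are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not a Lucene class): bag-of-words cosine similarity
// between two sentences, using raw term counts as vector weights.
final class SentenceCosine {

    // Count lower-cased tokens; splitting on non-letter/non-digit characters
    // is a simplifying assumption, not a real analyzer.
    static Map<String, Integer> termCounts(String sentence) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine of the angle between the two term-count vectors; 0.0 if either is empty.
    static double cosine(String a, String b) {
        Map<String, Integer> va = termCounts(a);
        Map<String, Integer> vb = termCounts(b);
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            dot += (double) e.getValue() * vb.getOrDefault(e.getKey(), 0);
        }
        double normA = norm(va);
        double normB = norm(vb);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    // Euclidean length of a term-count vector.
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int c : v.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }
}
```

Calling SentenceCosine.cosine(s1, s2) yields a value between 0 (no shared terms) and 1 (identical term distribution). A real setup would more likely weight terms by TF-IDF and reuse the analyzer applied at index time.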

Re: NewBie To Lucene || Perfect configuration on a 64 bit server

2014-05-23 Thread Shashi Kant
To second Vitaly's suggestion: you should consider using Apache Solr instead - it handles such issues OOTB. On Fri, May 23, 2014 at 7:52 PM, Vitaly Funstein wrote: > At the risk of sounding overly critical here, I would say you need to scrap > your entire approach of building one small index per

Re: Regarding Lucene.net

2013-12-23 Thread Shashi Kant
You are probably better off working with Solr in a multi-user system, and since you seem to be on .NET, use the SolrNet wrapper to call Solr from your .NET app. On Mon, Dec 23, 2013 at 1:37 AM, raju wrote: > a high level I understand the Lucene searcher will get the > directory path and search the que

Re: Help needed Regarding classification of Text Data using Lucene..

2013-01-09 Thread Shashi Kant
http://www.slideshare.net/teofili/text-categorization-with-lucene-and-solr On Wed, Jan 9, 2013 at 5:46 AM, VIGNESH S wrote: > Hi, > > can anyone suggest me how can i use lucene for text classification. > > -- > Thanks and Regards > Vignesh Srinivasan > > -

Re: Performance of storing data in Lucene vs other (No)SQL Databases

2012-05-21 Thread Shashi Kant
A related thread on Stackoverflow: http://stackoverflow.com/questions/3215029/nosql-mongodb-vs-lucene-or-solr-as-your-database/3216550#3216550 On Fri, May 18, 2012 at 10:44 AM, Konstantyn Smirnov wrote: > Hi all, > > apologies, if this question was already asked before. > > If I need to store a l

Re: NEW TO LUCENE

2012-03-05 Thread Shashi Kant
This book is your best buddy: http://www.manning.com/hatcher3/ On Fri, Mar 2, 2012 at 2:01 PM, rahul reddy wrote: > Hi , > > > I'm new to Lucene.Can anyone tell me how can i start learning about it with > the code base. > I have knowledge of endeca search engine and have worked on it. > So, if

Re: Paid Job: Looking for a developer to create a small java application to extract url's from .fdt files

2012-02-13 Thread Shashi Kant
You might want to post this on sites such as odesk.com, rentacoder.com, guru.com, freelancer.com On Mon, Feb 13, 2012 at 9:31 AM, SearchTech wrote: > am currently working on a search engine based on lucene and have some > issues because java is not my regular programming language, which ma

Re: ElasticSearch

2011-11-16 Thread Shashi Kant
I had posted this earlier on this list, hope this provides some answers http://engineering.socialcast.com/2011/05/realtime-search-solr-vs-elasticsearch/ On Wed, Nov 16, 2011 at 9:53 AM, Federico Fissore wrote: > Peyman Faratin, il 16/11/2011 15:12, ha scritto: > > Hi >> >> A client is conside

Re: Bet you didn't know Lucene can...

2011-10-22 Thread Shashi Kant
Using Lucene as a recommendation engine. On Sat, Oct 22, 2011 at 6:33 PM, Grant Ingersoll wrote: > > On Oct 22, 2011, at 6:03 PM, Sujit Pal wrote: > >> Hi Grant, >> >> Not sure if this qualifies as a "bet you didn't know", but one could use >> Lucene term vectors to construct document vectors for

Re: Word Confidence in Lucene

2011-08-14 Thread Shashi Kant
Look up the payload feature. On Aug 14, 2011 2:37 PM, "Saar Carmi" wrote: > Hi > Does Lucene support setting word confidence for every word in the document, > to influence the scoring? > As suggested by MAVIS project, when indexing Speech Recognition text one > need to take into account how confident

Re: Index one huge text file

2011-07-22 Thread Shashi Kant
Alternatively, you could create a multivalued field whereby each sentence is in the same document, but retrievable in order. On Fri, Jul 22, 2011 at 11:10 AM, Glen Newton wrote: > So to use Lucene-speak, each sentence is a document. > > I don't know how you are indexing and what code you are usi

Re: Passage retrieval with Lucene-based application

2011-05-25 Thread Shashi Kant
https://issues.apache.org/jira/browse/LUCENE-1522 On Wed, May 25, 2011 at 3:46 PM, Leroy Stone wrote: > document ("paragraphs") that contain my search phrase, rather than simply > pointers to the whole document. in searching among applications based upon > the Lucene, I have found only one that

Re: Using Lucene/Solr for Plagiarism detection

2010-12-30 Thread Shashi Kant
Have you considered using document similarity metrics such as Cosine Similarity? On Thu, Dec 30, 2010 at 6:05 AM, Amel Fraisse wrote: > Hello, > > I am using Lucene for plagiarism detection. > > The goal is that: when I have a new document, I will check on the solr index > if there is a document

Re: Using Lucene to search live, being-edited documents

2010-12-28 Thread Shashi Kant
> yes, but if they are typing away, they likely aren't also searching at > the same time unless they have two keyboards and four hands... so why > update anything in real time? Presumably the OP meant user A was editing the doc while other users, or a monitoring app, are searching said doc simulta

Re: Can I use Lucene for this?

2010-11-13 Thread Shashi Kant
There are multiple measures of similarity for documents: Cosine similarity is a frequently used one. On Sat, Nov 13, 2010 at 9:23 AM, Ciprian URSU wrote: > Hi Guys, > >I just find out about Lucene; after reading the main things on wiki > it seems to be a great tool, but I still didn't f

Re: Grammatical terms

2010-10-30 Thread Shashi Kant
For Part-of-Speech (POS) identification you are better off looking at a tool like OpenNLP or NLTK. 2010/10/30 Mário André > > Hi, > I need a Java API that identify the grammatical terms in noun phrase (NP). > Eg: I see the words when you are talking. > See: Verb > Words: Noun > are: Verb > talk

Span Query/Slop distance

2010-08-26 Thread Shashi Kant
Hello, I am familiar with the SpanQuery construct and set an upper Slop limit. 1. But when I get the hit results, is there any way I can access the actual slop and the span text itself in that particular hit. 2. Also it is possible to have multiple matches within the same document. So how do I acc

Re: Continuously iterate over documents in index

2010-07-13 Thread Shashi Kant
On Tue, Jul 13, 2010 at 5:17 PM, Max Lynch wrote: > Hi, > I would like to continuously iterate over the documents in my lucene index > as the index is updated.  Kind of like a "stream" of documents.  Is there a > way I can achieve this? > > Would something like this be sufficient (untested): > >  

Re: header/footer identification and general scraping tools

2010-06-28 Thread Shashi Kant
I have used TagSoup to parse the HTML and get the elements of interest. http://ccil.org/~cowan/XML/tagsoup/ On Mon, Jun 28, 2010 at 4:06 PM, Boris Aleksandrovsky wrote: > I was wondering if any of you know of any open-source solutions for general > issues which arise in web crawling - how do yo

Re: Is Lucene a "document oriented database"?

2010-06-01 Thread Shashi Kant
r Java objects) feels the >>> same.  I saw something from Grant about 2 months ago how Lucene is >>> "nosql-ish". >>> >>>  Otis >>> >>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch >>> Lucene ecosystem search :

Re: Lucene Newbie Questions

2010-05-31 Thread Shashi Kant
Based on your description, I would recommend Solr. It provides several features such as spelling suggestion, faceting etc. OOTB. http://lucene.apache.org/solr/features.html should answer all your questions. On Mon, May 31, 2010 at 7:54 PM, Frank A wrote: > Thanks a bunch. > > Since I'm already

Re: Lucene Newbie Questions

2010-05-31 Thread Shashi Kant
You are certainly in the right place - Apache Solr (a search server built using Lucene) provides what you are looking for out of the box. On Mon, May 31, 2010 at 7:20 PM, Frank A wrote: > Hello all, > I'm considering Lucene for a specific application and am trying to ensure > that it is the righ

Is Lucene a "document oriented database"?

2010-05-31 Thread Shashi Kant
There seems to be considerable buzz on the internets about document oriented dbs such as MongoDB, CouchDB etc. I am at a loss as to what the principal differences are between Lucene and the "DODBs". I could very well use Lucene as any of the above (schema-free, document-oriented) and perform similar que

Re: Lucene query with long strings

2010-03-24 Thread Shashi Kant
Add the common terms such as "University", "School", "Medicine", "Institute" etc. to the stopwords list, so you are left with "Stanford", "Palo Alto" etc. Then use Ahmet's suggestion of setting BooleanQuery.setMinimumNumberShouldMatch() to (say) 75% of the query string length. Finally, if you wish to
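
The first two steps above can be sketched in plain Java, without Lucene on the classpath. This is only an illustration under stated assumptions: the class LongNameQuerySketch and its methods are hypothetical, the stopword set contains the four terms named in the message plus "of" (added for the example), and "75% of the query string length" is read as 75% of the remaining term count, rounded up. The resulting integer is the kind of value one would hand to BooleanQuery.setMinimumNumberShouldMatch().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: strips domain-specific common terms from a long name
// query and derives a "minimum should match" count for the remaining terms.
final class LongNameQuerySketch {

    // Illustrative list only; "of" is an assumption added for the example.
    static final Set<String> COMMON_TERMS = new HashSet<>(
            Arrays.asList("university", "school", "medicine", "institute", "of"));

    // Lower-case the query, split on whitespace, drop the common terms.
    static List<String> significantTerms(String query) {
        List<String> kept = new ArrayList<>();
        for (String token : query.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !COMMON_TERMS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Number of optional clauses that must match: ratio * termCount, rounded up,
    // and at least 1 whenever there is at least one term.
    static int minimumShouldMatch(int termCount, double ratio) {
        if (termCount <= 0) {
            return 0;
        }
        return Math.max(1, (int) Math.ceil(termCount * ratio));
    }
}
```

For the query "Stanford University School of Medicine Palo Alto" the sketch keeps stanford, palo and alto; with a ratio of 0.75 all three would have to match, while a four-term query would tolerate one miss.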

Re: best way to compare Documents

2010-01-31 Thread Shashi Kant
or a CRC On Sun, Jan 31, 2010 at 11:58 AM, Shashi Kant wrote: > If all you want is to flag a document "dirty" you could hash the > fields in the document and check for an update. > > > > On Sun, Jan 31, 2010 at 11:51 AM, Robert Koberg wrote: >> Hi, >&g

Re: best way to compare Documents

2010-01-31 Thread Shashi Kant
If all you want is to flag a document "dirty" you could hash the fields in the document and check for an update. On Sun, Jan 31, 2010 at 11:51 AM, Robert Koberg wrote: > Hi, > > Just coming back to Lucene after a few years. > > Is there some convenient way to compare Lucene Documents? > > I
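
One way to read the hashing suggestion, sketched in plain Java: compute a fingerprint over the field names and values, store it alongside the document, and treat a differing fingerprint as "dirty". The class DocumentFingerprint, the choice of SHA-256, and the representation of a document as a Map of field name to value are assumptions for illustration; the thread does not specify any of them, and a CRC would be a lighter-weight alternative.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: fingerprints a document's fields so that a changed
// field value can be detected by comparing against a stored fingerprint.
final class DocumentFingerprint {

    // Hex-encoded SHA-256 over all (name, value) pairs in sorted name order,
    // so the result does not depend on map iteration order.
    // Field values are assumed to be non-null.
    static String fingerprint(Map<String, String> fields) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (Map.Entry<String, String> e : new TreeMap<>(fields).entrySet()) {
                update(digest, e.getKey());
                update(digest, e.getValue());
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 not available", ex);
        }
    }

    // Length-prefix each string so ("ab", "c") and ("a", "bc") hash differently.
    private static void update(MessageDigest digest, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        digest.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
        digest.update(bytes);
    }

    // A document is "dirty" when its current fingerprint differs from the stored one.
    static boolean isDirty(String storedFingerprint, Map<String, String> currentFields) {
        return !fingerprint(currentFields).equals(storedFingerprint);
    }
}
```

The fingerprint could itself be kept in a stored, non-analyzed field, so that a later comparison needs only one field lookup rather than a field-by-field comparison.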

Re: lucene search

2010-01-28 Thread Shashi Kant
Hi, if you want to search by substring (i.e. "lp" should return "lpg" as a result) you should look at wildcards. So a search for "lp*" (* is the wildcard character) would return lpg, lpghxyz, lp12345 and so on... On Thu, Jan 28, 2010 at 1:41 PM, andy green wrote: > > hello, > > I programmed wit

Re: Search query problem

2010-01-09 Thread Shashi Kant
Couldn't you just mod the PorterStemmer class for your requirements? (we did and provided it a list of ignore words & phrases specific to our needs) On Sat, Jan 9, 2010 at 4:00 AM, Jamie wrote: > Hi All > > Is there another stemmer we can use that is perhaps not as aggressive as the > Porter Stem

Re: CANNOT use a * or ? symbol as the first character of a search.

2009-12-28 Thread Shashi Kant
You can enable that by QueryParser.setAllowLeadingWildcard( true ) On Mon, Dec 28, 2009 at 2:46 AM, liujb wrote: > > oh,my god, > > Query Parser Syntax > > CANNOT use a * or ? symbol as the first character of a search. > > that's mean I can't wrinte a search string like '*test'. this will be ca

Re: "IN" Query for NumericFields

2009-12-10 Thread Shashi Kant
Have you looked at BooleanQuery? Create individual TermQuery and OR them using BooleanQuery. On Thu, Dec 10, 2009 at 10:34 AM, comparis.ch - Roman Baeriswyl < roman.baeris...@comparis.ch> wrote: > Hi, > > I do have some indices where I need to get results based on a fixed number > list (not a ran

Re: About Lucene ...

2009-12-02 Thread Shashi Kant
This forum is probably not the best place to ask this question, since this is a Lucene developers/users forum. If you want to write a tool, then this is the place to be. If you want a ready-made tool, one I am aware of is searchmyfiles.exe from Nirsoft. http://www.nirsoft.net/utils/search_my_files.ht

Re: Is Lucene a good choice for PB scale mailbox search?

2009-11-23 Thread Shashi Kant
Hi, I have not worked at petascale (yet!) - mostly on the scale of tens of terabytes - but I do think Lucene would be very helpful for such a use case. I would indeed suggest partitioning the index by users (seems the most logical, straightforward way, also offers the security of insulating one user
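
A minimal sketch of what "partitioning the index by users" could look like at the routing level, in plain Java. The message is cut off before any detail, so everything here is an assumption: the class UserPartitioner, the fixed partition count, the hash-based assignment, and the "part-N" directory naming are hypothetical. One index per user would be another valid reading.

```java
// Hypothetical router: maps a user id to one of a fixed number of index
// partitions, so all of a user's mail is indexed and searched in one place.
final class UserPartitioner {

    // Stable partition number in [0, partitionCount) derived from the user id.
    static int partitionFor(String userId, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(userId.hashCode(), partitionCount);
    }

    // Illustrative directory layout; the "part-N" naming is an assumption.
    static String indexDirectoryFor(String root, String userId, int partitionCount) {
        return root + "/part-" + partitionFor(userId, partitionCount);
    }
}
```

Because the mapping depends only on the user id and the partition count, indexing and searching code can both compute it independently; changing the partition count later would require re-indexing or a lookup table.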

Re: How to find the fields that are indexed?

2009-11-23 Thread Shashi Kant
Use this tool to examine the index: http://www.getopt.org/luke/ I would also suggest getting hold of a Lucene book such as Lucene In Action (http://www.manning.com/hatcher2/) to get familiar with the basics of Lucene. On Mon, Nov 23, 2009 at 4:42 AM, DHIVYA M wrote: > Sir, > > Am using lucene

Re: Lucene - Text Classification.

2009-11-09 Thread Shashi Kant
Take a look at Bayesian text classification, which might be more efficient for your needs. Google it. There are several other text classification methods - depending on your needs, you can dig into them. On Mon, Nov 9, 2009 at 10:33 AM, lucenenew wrote: > > i want to classify sentences stored as s

NAS/SAN Devices (slightly off-topic)

2009-10-04 Thread Shashi Kant
Hi, Apologies this is not a Lucene question per se... however I am in need of some sage advice... We anticipate our indices to be very large (on the order of a few TB each), and there will be 5-10 of them. Hence we are looking at NAS or SANs with 15-20TB storage. My questions: 1. Any recommend

Re: how can I merge indexes without deleting the original index?

2009-09-04 Thread Shashi Kant
Here is some code to help you along. This should leave the source indices intact and merge them into a destination. //the index to hold our merged index IndexWriter iw = new IndexWriter(dest, new StandardAnalyzer(), true); string[] sourceIndices;

Re: How to get the score in percentage

2009-08-22 Thread Shashi Kant
Chris & Erick's arguments are persuasive; however, we do live in an imperfect world. Most of our users want to see the relative importance of a result vis-a-vis the rest: Relative Importance (%) = (d - dmin)/(dmax - dmin) * 100, where dmax is the highest Lucene score (score of top result) and dm
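
The formula above is a min-max rescaling of the raw scores within one result set. A small plain-Java sketch, assuming that dmin is the lowest score in the same result set (the message is cut off before defining it) and that a result set with identical scores should show 100% for every hit; the class name RelativeImportanceSketch is hypothetical.

```java
// Hypothetical helper: rescales raw scores of one result set to 0..100 using
// (d - dmin) / (dmax - dmin) * 100.
final class RelativeImportanceSketch {

    static double[] toPercent(double[] scores) {
        double[] percent = new double[scores.length];
        if (scores.length == 0) {
            return percent;
        }
        double min = scores[0];
        double max = scores[0];
        for (double s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        double range = max - min;
        for (int i = 0; i < scores.length; i++) {
            // Assumption: when all scores are equal, report every hit as 100%.
            percent[i] = (range == 0.0) ? 100.0 : (scores[i] - min) / range * 100.0;
        }
        return percent;
    }
}
```

The result is only meaningful within a single result set: the top hit always shows 100% and the last hit 0%, regardless of how well either actually matches, which is the caveat behind the objections the message refers to.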

Re: How to improve search time?

2009-08-04 Thread Shashi Kant
ext queries. But they will be expanded into: >>> Multifield, Boolean. >>> We are also expanding the original query using SynExpand of lucene. A >>> simple >>> query >>> gets expanded to say a query of page size. >>> >>> And we are not sto

Re: How to improve search time?

2009-08-04 Thread Shashi Kant
Prashant, I have had better luck with even larger sized indices on similar platforms. Could you elaborate what types of queries you are running, Multifield? Boolean? combinations? etc. Also you might want to remove unnecessary stored fields from the index and move them to a relational db to squeeze

Re: Similarity

2009-06-23 Thread Shashi Kant
y and constructing vector space. > > - RB > > > ----- Original Message > From: Shashi Kant > To: java-user@lucene.apache.org > Sent: Tuesday, June 23, 2009 3:20:16 PM > Subject: Re: Similarity > > I suspect what you are looking for is "Latent Semantics

Re: Similarity

2009-06-23 Thread Shashi Kant
I suspect what you are looking for is "Latent Semantics" - it can algorithmically infer that "iPod~iPhone" or "Apple~Steve Jobs". Google for "Latent Semantic Indexing" or "Latent Semantic Analysis" - you can apply some of those approaches using the TermVectors in Lucene index. Ontologies such as Wo

Re: P2P Lucene

2009-06-05 Thread Shashi Kant
Thanks for the up Otis. I will give this some more thought, prototype some, and possibly put in a proposal for the Apache Incubator. Ye, I am not aware of Sixearch, but there are several P2P applications e.g. WiredReach, Grub, Neurogrid etc. However, my idea is quite a bit different from the exis

P2P Lucene

2009-06-04 Thread Shashi Kant
Hi all, I am writing to gauge the group's interest level in building a P2P application using Lucene. Nothing fancy, just good old-fashioned P2P search across one's social-network or work-network (very unlike Gnutella, Kazaa etc.). The obvious business-case for this could be many such as document s

Lucene index on iPhone

2009-05-06 Thread Shashi Kant
Hi all, I am working on an iPhone application where the Lucene index needs to reside on-device (for multiple reasons). Has anyone figured out a way to do that? As you might know the iPhone contains SQLite - could an index be embedded inside SQLite? or could it be resident separately as a file? Th

Re: Searching a single file

2009-04-13 Thread Shashi Kant
ugh it more efficiently than grep? > > Thanks, > > Michael > > On Sun, Apr 12, 2009 at 7:53 PM, Shashi Kant wrote: > >> Not sure what the business-case for this is and why you cannot use >> RegEx for this. But you could consider chopping up the document into >>

Re: Searching a single file

2009-04-12 Thread Shashi Kant
Not sure what the business-case for this is and why you cannot use RegEx for this. But you could consider chopping up the document into (sub) documents and adding them to the Lucene index. For example, chop by paragraph or line-break. HTH, Shashi On Sun, Apr 12, 2009 at 1:51 PM, wrote: > Hi, >
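
A minimal sketch of the chopping step, in plain Java: split the text on blank lines so that each paragraph can then be added to the index as its own (sub) document. The class ParagraphSplitter is hypothetical, and treating one or more blank lines as the paragraph boundary is an assumption; splitting per line would work the same way with a different delimiter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one large text into paragraph-sized chunks,
// each of which could be indexed as a separate document.
final class ParagraphSplitter {

    static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        // A paragraph break is a line break, optional whitespace, and another line break.
        for (String block : text.split("\\R\\s*\\R")) {
            String trimmed = block.trim();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }
}
```

Keeping the paragraph's position in the list (for example as a stored field) would let a hit be mapped back to its place in the original file.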

Re: Autonomy search technology

2009-04-03 Thread Shashi Kant
Hmm... not sure I would call Autonomy a "superb product". IMHO it is anything but. In fact, it is what one calls bloatware. I have had some experience with Autonomy and it is hardly something you should consider using unless you are eager to shoot yourself in the foot. I fundamentally disagree with P

LuSQL download link error?

2009-04-02 Thread Shashi Kant
Hi all, I have been trying to get the latest version of LuSQL from the NRC.ca website but get 404s on the download links. I have written to the webmaster, but anyone have the jar handy? Could I download from somewhere else? or could you email it to me? thanks, Shashi

Re: Index Partitioning

2009-03-25 Thread Shashi Kant
Thanks Chris, your suggestion is very appropriate and I am happy to share my work with the Lucene community, Regards, Shashi On Tue, Mar 24, 2009 at 7:15 PM, Chris Hostetter wrote: > > : This is perfect, exactly what I was looking for. Thanks much Andrzej! > > if you code that up and it works o

Re: Index Partitioning

2009-03-23 Thread Shashi Kant
This is perfect, exactly what I was looking for. Thanks much Andrzej! On Mon, Mar 23, 2009 at 1:43 AM, Andrzej Bialecki wrote: > Shashi Kant wrote: > >> Is there an "elegant" approach to partitioning a large Lucene index (~1TB) >> into smaller sub-indexes other t

Index Partitioning

2009-03-21 Thread Shashi Kant
Is there an "elegant" approach to partitioning a large Lucene index (~1TB) into smaller sub-indexes other than the obvious method of re-indexing into partitions? Any ideas? Thanks, Shashi

Re: Using Lucene for user query parsing

2009-03-09 Thread Shashi Kant
The BoW approach is simple and highly effective IMO. If you want to get a bit fancy, you could also use a MultiField query in the combined index. Another brute-force approach would be to hit all 3 indexes and see which ones come back with the highest score(s). On Mon, Mar 9, 2009 at 8:43 AM, Er

Re: IndexSearcher

2009-03-08 Thread Shashi Kant
Liat, I think what Erick suggested was to use the TOKENIZED setting instead of UN_TOKENIZED. For example your code should read something like: Document doc = new Document(); doc.add(new Field(WordIndex.FIELD_WORLDS, "111 222 333", Field.Store.YES, Field.Index.TOKENIZED)); Unless I am missing s

Re: public apology for company spam

2009-03-05 Thread Shashi Kant
Yes, it is good to learn that Yonik, Erik et al are also human-beings. :-) Thanks for all your contributions to Lucene/Solr, this list and the OSS community in general. Best, Shashi On Thu, Mar 5, 2009 at 11:36 AM, Erick Erickson wrote: > Let's see, you guys generously contributed your time and

Re: Optimum way to find all document without particular field

2009-03-04 Thread Shashi Kant
A simple solution would be to store the string "NULL" instead of null and then query. On Wed, Mar 4, 2009 at 1:26 PM, Chris Lu wrote: > Allahbaksh, > > If you ONLY want to find all document with a particular field that is not > null, you can loop through the TermEnum and TermDocs to find all th

Re: search by word offset

2009-03-02 Thread Shashi Kant
Not sure what you are asking about, but you might want to take a look at http://lucene.apache.org/java/2_4_0/api/contrib-surround/index.html The Surround parser offers many features around the span query (which I suspect is what you are looking for) Shashi On Mon, Mar 2, 2009 at 4:57 AM, shb w

Re: Lucene indexes

2009-02-24 Thread Shashi Kant
Nada, You might want to consider writing a custom tokenizer which will allow you to generate tokens based on your needs (other than whitespace). Another option would be to look at SpanQuery or SpanNearQuery which would help with the kind of problem you are trying to solve (assuming I understand

Re: Multiple indexes vs single index

2009-02-14 Thread Shashi Kant
Take a look at Solr - it should be able to handle the scale you describe. My suggestion is not to partition indexes unless you absolutely have to. - Original Message From: "spr...@gmx.eu" To: java-user@lucene.apache.org Sent: Saturday, February 14, 2009 10:27:58 AM Subject: RE: Multiple

Re: Visualization

2009-02-12 Thread Shashi Kant
Thanks Omar, I have looked at Prefuse. What has been your experience with it given it is still in beta? any "gotchas" we should look out for? regards, shashi - Original Message From: Omar Alonso To: java-user@lucene.apache.org; Shashi Kant Sent: Thursday, February 12,

Visualization

2009-02-12 Thread Shashi Kant
Hi all, Apologies for being slightly off-topic, we are looking at novel visualization approaches for rendering results from Lucene queries. I was wondering if you have any recommendations for visualization toolkits (Java) for displaying heat-maps, dendrograms, cluster maps etc. (preferably free

Re: indexing binary files?

2009-01-30 Thread Shashi Kant
ndler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message- > > From: Shashi Kant [mailto:shashi_k...@yahoo.com] > > Sent: Friday, January 30, 2009 3:32 PM > > To: java-user@lucene.apache.org > >

Re: indexing binary files?

2009-01-30 Thread Shashi Kant
Unless I am missing something, not sure I see the issue here. You can convert to Base64 purely for indexing purposes and leave the original binary as-is. - Original Message From: Paul Feuer To: Lucene User List ; Shashi Kant Sent: Friday, January 30, 2009 10:12:33 AM Subject: Re

Re: indexing binary files?

2009-01-30 Thread Shashi Kant
Hi Paul, have you tried persisting the binaries in Base64 format and then indexing them? As you are aware, Base64 is a robust representation used in email attachments for example. Thanks Shashi - Original Message From: Paul Feuer To: java-user@lucene.apache.org Sent: Thursday, Janu
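
For reference, a minimal round-trip sketch of the Base64 idea using the JDK's java.util.Base64 (available since Java 8, so newer than this 2009 thread). The wrapper class BinaryFieldCodec is hypothetical; the encoded string is what one would place in a stored field, and decoding recovers the original bytes unchanged.

```java
import java.util.Base64;

// Hypothetical wrapper: converts binary content to a Base64 string that can be
// kept in a text field, and back again.
final class BinaryFieldCodec {

    // Encode raw bytes as a standard (RFC 4648) Base64 string.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Decode a Base64 string produced by encode() back into the original bytes.
    static byte[] decode(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }
}
```

Base64 output is roughly a third larger than the input, so this trades index size for the convenience of handling the binary as text.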

Re: Search Problem

2009-01-03 Thread Shashi Kant
Amin, Are you calling Close & Optimize after every addDocument? I would suggest something like this try { while //this could be your looping through a data reader for example { indexWriter.addDocument(document); } } finally { commitAndOptimise() } HTH Shashi
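
The pattern in the pseudocode above (add every document inside the loop, commit and optimise once at the end) can be written out as a small self-contained Java sketch. The DocumentSink interface is a hypothetical stand-in for whatever wraps the IndexWriter in the original poster's code; it is not a Lucene type, and commitAndOptimise is simply the name used in the message.

```java
import java.util.List;

// Sketch of "add in a loop, commit once": the expensive commit/optimise step
// runs a single time after all documents, even if adding one of them fails.
final class BatchIndexingSketch {

    // Hypothetical stand-in for an index writer wrapper; not part of Lucene.
    interface DocumentSink {
        void add(String document);
        void commitAndOptimise();
    }

    static void indexAll(List<String> documents, DocumentSink sink) {
        try {
            for (String document : documents) {
                sink.add(document); // cheap per-document call
            }
        } finally {
            sink.commitAndOptimise(); // expensive call, executed exactly once
        }
    }
}
```

Compared with committing after every addDocument call, this keeps the per-document cost low and makes newly added documents visible to searchers in one step at the end of the batch.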