Re: linkedin group for lucene interest group

2008-01-09 Thread Otis Gospodnetic
Hm, propaganda! :) There is also a Lucene group on Simpy with lots of Lucene/search/IR resources - http://www.simpy.com/group/363 You'll see some familiar names from the list on the right side of the screen. Let me know if you want to join. Otis -- Sematext -- http://sematext.com/ -- Lucene

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Otis Gospodnetic
Ariel, I believe PDFBox is not the fastest thing and was built more to handle all possible PDFs than for speed (just my impression - Ben, PDFBox's author might still be on this list and might comment). Pulling data from NFS to index seems like a bad idea. I hope at least the indices are local

Re: how do I get my own TopDocHitCollector?

2008-01-09 Thread Antony Bowesman
Beard, Brian wrote: Question: The documents that I index have two id's - a unique document id and a record_id that can link multiple documents together that belong to a common record. I'd like to use something like TopDocs to return the first 1024 results that have unique record_id's, but I wil

RE: Empty lucene-similarity jars on maven mirrors

2008-01-09 Thread Steven A Rowe
Hi Sanjay, On 01/09/2008 at 3:02 PM, Sanjay Dahiya wrote: > lucene-similarity (2.1.0 and 2.2.0) jar files available on maven mirrors > don't contain any files. That's because the o.a.l.search.similar package (the sole contents of the contrib/similarity/ directory) has been empty as of the 2.1.0

Re: Design questions

2008-01-09 Thread Erick Erickson
You can do several things: Rather than index one doc per page, you could index a special token between pages. Say you index $ as the special token. So your index looks like this: last of page 1 first of page 2 last of page 2 first of page 3 and so on. Now, if you use
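A minimal sketch of the page-break-token idea described above, in plain Java with no Lucene dependency. It assumes the token positions of a hit are already known; the class name PageLocator, the method pageOf, and the sample tokens are hypothetical and only illustrate how counting "$" tokens before a hit recovers its page number.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: maps a token position to a page
// number by counting the page-break tokens that precede it.
class PageLocator {
    // Assumption: "$" is the special token indexed between pages.
    static final String PAGE_BREAK = "$";

    /** Returns the 1-based page that contains the token at tokenPosition. */
    static int pageOf(List<String> tokens, int tokenPosition) {
        int page = 1;
        for (int i = 0; i < tokenPosition && i < tokens.size(); i++) {
            if (PAGE_BREAK.equals(tokens.get(i))) {
                page++; // every break seen before the hit starts a new page
            }
        }
        return page;
    }

    public static void main(String[] args) {
        // Made-up token stream: two tokens per page, "$" between pages.
        List<String> tokens = Arrays.asList(
                "alpha", "beta", "$", "gamma", "delta", "$", "epsilon");
        System.out.println("gamma is on page " + pageOf(tokens, 3));
        System.out.println("epsilon is on page " + pageOf(tokens, 6));
    }
}
```

In a real index the positions would come from whatever mechanism reports match positions; this sketch only shows the counting step.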

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Antony Bowesman
Ariel wrote: The problem I have is that my application spends a lot of time indexing all the documents: indexing 10 GB of PDF documents takes about 2 days (I am using PDFBox to convert PDF to text). That is of course a lot of time; other applications based on Lucene, for instance ibm omni

Re: Highlighting + phrase queries

2008-01-09 Thread Mark Miller
It works exactly the same as the standard contrib Highlighter except that it tries not to highlight spurious results for a positional query. This is exact with Span queries, but more approximate for phrase queries. The approximation is pretty darn good, but let me know if you find a case that d

Design questions

2008-01-09 Thread spring
Hi, I have to index (tokenized) documents which may have very many pages, up to 10,000. I also have to know on which pages the search phrase occurs. I have to update some stored index fields for my document. The content is never changed. Thus I think I have to add one lucene document with the in

how do I get my own TopDocHitCollector?

2008-01-09 Thread Beard, Brian
Question: The documents that I index have two id's - a unique document id and a record_id that can link multiple documents together that belong to a common record. I'd like to use something like TopDocs to return the first 1024 results that have unique record_id's, but I will want to skip some o
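One way to picture the requirement above, as a sketch in plain Java rather than a Lucene collector: walk the hits in rank order and keep only the first hit seen for each record_id until the limit is reached. The Hit class, the topUnique method, and the sample data are hypothetical, and the sketch assumes the hits are already sorted by descending score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not a Lucene API: de-duplicates ranked hits
// by record id and keeps the best-scoring hit for each record.
class UniqueRecordCollector {
    static final class Hit {
        final String docId;
        final String recordId;
        final float score;

        Hit(String docId, String recordId, float score) {
            this.docId = docId;
            this.recordId = recordId;
            this.score = score;
        }
    }

    /** Assumes rankedHits is sorted by descending score. */
    static List<Hit> topUnique(List<Hit> rankedHits, int limit) {
        Set<String> seenRecords = new HashSet<>();
        List<Hit> result = new ArrayList<>();
        for (Hit hit : rankedHits) {
            if (result.size() >= limit) {
                break; // enough unique records collected
            }
            if (seenRecords.add(hit.recordId)) {
                result.add(hit); // first (best) hit for this record
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("d1", "r1", 0.9f),
                new Hit("d2", "r1", 0.8f),
                new Hit("d3", "r2", 0.7f));
        for (Hit h : topUnique(hits, 1024)) {
            System.out.println(h.docId + " (record " + h.recordId + ")");
        }
    }
}
```

The same filtering could in principle run inside a custom collector; here it is done over an already ranked list to keep the example self-contained.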

Re: Highlighting + phrase queries

2008-01-09 Thread Marjan Celikik
Mark Miller wrote: The contrib Highlighter doesn't know and highlights them all. Check out my patch here for position sensitive highlighting: https://issues.apache.org/jira/browse/LUCENE-794 OK, before trying it out, I would like to know whether the patch works for mixed queries, e.g. "a b" +c -d "

Re: Highlighting + phrase queries

2008-01-09 Thread Mark Miller
The contrib Highlighter doesn't know and highlights them all. Check out my patch here for position sensitive highlighting: https://issues.apache.org/jira/browse/LUCENE-794 Marjan Celikik wrote: Dear all, Let's assume I have a phrase query and a document which contains the phrase but also it co

Highlighting + phrase queries

2008-01-09 Thread Marjan Celikik
Dear all, Let's assume I have a phrase query and a document which contains the phrase but also contains separate occurrences of each query term. How does the highlighter know that it should only display fragments which contain phrases and not fragments which contain only the query words (not as

Empty lucene-similarity jars on maven mirrors

2008-01-09 Thread Sanjay Dahiya
Hi, the lucene-similarity (2.1.0 and 2.2.0) jar files available on Maven mirrors don't contain any files. http://mvnrepository.com/artifact/org.apache.lucene/lucene-similarity/2.2.0 Seems like a deployment config problem. -- ~sanjay

linkedin group for lucene interest group

2008-01-09 Thread John Wang
To join: http://www.linkedin.com/e/gis/49647/019FD71A8AEF -John

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Grant Ingersoll
There's also Nutch. However, 10GB isn't that big... Perhaps you can index where the docs/index lives, then just make the index available via NFS? Or, better yet, use rsync to replicate it like Solr does. -Grant On Jan 9, 2008, at 10:49 AM, Steven A Rowe wrote: Hi Ariel, On 01/09/2008 a

Re: Basic Named Entity Indexing

2008-01-09 Thread chris.b
solved it... i was using token.toString() instead of token.termText(); thanks for the help :)

RE: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Steven A Rowe
Hi Ariel, On 01/09/2008 at 8:50 AM, Ariel wrote: > Do you know other distributed-architecture applications that > use Lucene to index big amounts of documents? Apache Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit high

Re: Query processing with Lucene

2008-01-09 Thread Paul Elschot
On Tuesday 08 January 2008 22:49:18 Doron Cohen wrote: > This is done by Lucene's scorers. You should however start > in http://lucene.apache.org/java/docs/scoring.html, - scorers > are described in the "Algorithm" section. "Offsets" are used > by Phrase Scorers and by Span Scorer. That is for the

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Erick Erickson
<<< would like to find out why my application has this big delay to index>>> Well, then you have to measure. The first thing I'd do is pinpoint where the time is being spent. Until you have that answered, you simply cannot take any meaningful action. 1> don't do any of the indexing. No new Doc
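A rough sketch of the "measure first" advice above, in plain Java: accumulate wall-clock time per phase so that reading files, converting PDFs, and indexing can be compared. The PhaseTimer class, the phase names, and the placeholder work in main are all hypothetical; in a real run each Runnable would wrap the actual step.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: accumulates elapsed nanoseconds per named phase.
class PhaseTimer {
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    /** Runs the work and adds its elapsed time to the given phase. */
    void time(String phase, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Total nanoseconds recorded for a phase, or 0 if it never ran. */
    long totalNanos(String phase) {
        return nanosByPhase.getOrDefault(phase, 0L);
    }

    Map<String, Long> report() {
        return new LinkedHashMap<>(nanosByPhase);
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        // Placeholder work; substitute the real read, convert, and index steps.
        timer.time("read-file", () -> { });
        timer.time("pdf-to-text", () -> { });
        timer.time("index", () -> { });
        for (Map.Entry<String, Long> e : timer.report().entrySet()) {
            System.out.println(e.getKey() + ": " + (e.getValue() / 1_000_000) + " ms");
        }
    }
}
```

Disabling one phase at a time, as the message suggests, and comparing the totals shows which step dominates.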

Re: Basic Named Entity Indexing

2008-01-09 Thread chris.b
taking your example (text by John Bear, old.), the NGramAnalyzerWrapper creates the following tokens: text text by by by John John John Bear, Bear, Bear, old. I have managed to get rid of the error, but now it just doesn't add anything to the index :s I'm attaching the NGramAnalyzerWrapper and NG

Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Ariel
Hi: I have seen the post in http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and I am implementing a similar application in a distributed environment, a cluster of only 5 nodes. The operating system I use is Linux (CentOS), so I am using the NFS file system too to access the home director

Re: Bucketing (was Re: Wikia search goes live today)

2008-01-09 Thread Grant Ingersoll
Would be a nice contrib module, though... -Grant On Jan 9, 2008, at 5:30 AM, Andrzej Bialecki wrote: Otis Gospodnetic wrote: Sounds useful. I suppose this means one would have custom function for within-bucket-reordering? e.g. for a web search you might reorder based on the URL length if you

Re: Bucketing (was Re: Wikia search goes live today)

2008-01-09 Thread Andrzej Bialecki
Otis Gospodnetic wrote: Sounds useful. I suppose this means one would have custom function for within-bucket-reordering? e.g. for a web search you might reorder based on the URL length if you think shorter URLs are an indicator of Yes, that's precisely the idea. It combines the advantages of