Hm, propaganda! :)
There is also a Lucene group on Simpy with lots of Lucene/search/IR resources -
http://www.simpy.com/group/363
You'll see some familiar names from the list on the right side of the screen.
Let me know if you want to join.
Otis
--
Sematext -- http://sematext.com/ -- Lucene
Ariel,
I believe PDFBox is not the fastest thing and was built more to handle all
possible PDFs than for speed (just my impression - Ben, PDFBox's author might
still be on this list and might comment). Pulling data from NFS to index seems
like a bad idea. I hope at least the indices are local.
Beard, Brian wrote:
Question:
The documents that I index have two id's - a unique document id and a
record_id that can link multiple documents together that belong to a
common record.
I'd like to use something like TopDocs to return the first 1024 results
that have unique record_id's, but I will want to skip some o
Hi Sanjay,
On 01/09/2008 at 3:02 PM, Sanjay Dahiya wrote:
> lucene-similarity (2.1.0 and 2.2.0) jar files available on maven mirrors
> don't contain any files.
That's because the o.a.l.search.similar package (the sole contents of the
contrib/similarity/ directory) has been empty as of the 2.1.0 release.
You can do several things:
Rather than index one doc per page, you could index a special
token between pages. Say you index $ as the special
token. So your index looks like this:
... last of page 1 $ first of page 2 ... last of page 2 $ first of
page 3 ...
and so on. Now, if you use
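A minimal sketch of the separator-token idea, assuming a Lucene 2.x-era API, an already-open IndexWriter (writer), the page texts in a String array (pages), and an analyzer that preserves "$" as a token (e.g. WhitespaceAnalyzer; StandardAnalyzer would strip it) -- the names here are illustrative, not from the original mail:

    // Concatenate pages with a reserved separator token so page
    // boundaries survive as term positions in the index.
    StringBuilder sb = new StringBuilder();
    for (String page : pages) {
        sb.append(page).append(" $ ");
    }
    Document doc = new Document();
    doc.add(new Field("contents", sb.toString(),
                      Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);

    // The page of a match at term position p is then
    // 1 + (number of "$" tokens at positions < p), which you can
    // count by walking reader.termPositions(new Term("contents", "$")).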
Ariel wrote:
The problem I have is that my application spends a lot of time indexing all
the documents; indexing 10 GB of PDF documents takes about 2 days (to
convert PDF to text I am using PDFBox). That is of course a lot of time;
other applications based on Lucene, for instance IBM Omni
It works exactly the same as the standard contrib Highlighter except
that it tries not to highlight spurious results for a positional query.
This is exact with Span queries, but more approximate for phrase
queries. The approximation is pretty darn good, but let me know if you
find a case that d
Hi,
I have to index (tokenized) documents which may have very many pages, up to
10,000.
I also have to know on which pages the search phrase occurs.
I have to update some stored index fields for my document.
The content is never changed.
Thus I think I have to add one Lucene document with the in
Question:
The documents that I index have two id's - a unique document id and a
record_id that can link multiple documents together that belong to a
common record.
I'd like to use something like TopDocs to return the first 1024 results
that have unique record_id's, but I will want to skip some o
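One way to sketch this (hedged: the field names come from the question, "record_id" is assumed to be a stored field, and the over-fetch factor is a guess) is to over-fetch TopDocs and walk the hits in score order until 1024 unique record_id's are collected:

    // Over-fetch, then keep only the first occurrence of each
    // record_id, in score order. The over-fetch factor is a
    // heuristic; too small and you may not reach 1024 unique ids.
    TopDocs top = searcher.search(query, null, 10 * 1024);
    Set<String> seen = new HashSet<String>();
    List<ScoreDoc> unique = new ArrayList<ScoreDoc>();
    for (ScoreDoc sd : top.scoreDocs) {
        String rid = searcher.doc(sd.doc).get("record_id");
        if (seen.add(rid)) {
            unique.add(sd);
            if (unique.size() == 1024) break;
        }
    }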
Mark Miller wrote:
The contrib Highlighter doesn't know and highlights them all.
Check out my patch here for position sensitive highlighting:
https://issues.apache.org/jira/browse/LUCENE-794
OK, before trying it out, I would like to know whether the patch works for
mixed queries, e.g. "a b" +c -d "
The contrib Highlighter doesn't know and highlights them all.
Check out my patch here for position sensitive highlighting:
https://issues.apache.org/jira/browse/LUCENE-794
Marjan Celikik wrote:
Dear all,
Let's assume I have a phrase query and a document which contains the
phrase but also it co
Dear all,
Let's assume I have a phrase query and a document which contains the
phrase but also contains separate
occurrences of each query term. How does the highlighter know that it
should only display fragments which
contain the phrase and not fragments which contain only the query words
(not as a phrase)?
Hi
lucene-similarity (2.1.0 and 2.2.0) jar files available on maven mirrors
don't contain any files.
http://mvnrepository.com/artifact/org.apache.lucene/lucene-similarity/2.2.0
Seems like a deployment config problem.
--
~sanjay
To join:
http://www.linkedin.com/e/gis/49647/019FD71A8AEF
-John
There's also Nutch. However, 10GB isn't that big... Perhaps you can
index on the machine where the docs live, then just make the index available
via NFS? Or, better yet, use rsync to replicate it like Solr does.
-Grant
On Jan 9, 2008, at 10:49 AM, Steven A Rowe wrote:
Hi Ariel,
On 01/09/2008 a
Solved it... I was using token.toString() instead of token.termText();
thanks for the help :)
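For context, in the old (pre-2.9) Token API the two calls differ like this (a tiny illustrative snippet, not from the original mail):

    import org.apache.lucene.analysis.Token;

    // Pre-2.9 Token API: toString() is a debug rendering of the whole
    // token, termText() is just the term itself.
    Token token = new Token("lucene", 0, 6);
    String debug = token.toString();   // something like "(lucene,0,6)"
    String term  = token.termText();   // "lucene" -- what you usually want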
--
Hi Ariel,
On 01/09/2008 at 8:50 AM, Ariel wrote:
> Do you know of other distributed applications that
> use Lucene to index large amounts of documents?
Apache Solr is an open source enterprise search server based on the Lucene Java
search library, with XML/HTTP and JSON APIs, hit highlighting
On Tuesday 08 January 2008 22:49:18 Doron Cohen wrote:
> This is done by Lucene's scorers. You should, however, start
> with http://lucene.apache.org/java/docs/scoring.html - scorers
> are described in the "Algorithm" section. "Offsets" are used
> by Phrase Scorers and by the Span Scorer.
That is for the
<<< would like to find out why my application has this big
delay to index>>>
Well, then you have to measure. The first thing I'd do
is pinpoint where the time is being spent. Until you have
that answered, you simply cannot take any meaningful action.
1> don't do any of the indexing. No new Doc
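In that spirit, a crude timing harness to split extraction time from indexing time -- hypothetical names throughout: extractWithPdfBox() stands in for whatever PDFBox wrapper is in use, pdfFiles and writer are assumed to exist:

    // Time the two phases separately: PDF-to-text extraction vs.
    // Lucene indexing. Whichever dominates is where to optimize.
    long extractMs = 0, indexMs = 0;
    for (File pdf : pdfFiles) {
        long t0 = System.currentTimeMillis();
        String text = extractWithPdfBox(pdf);   // assumed PDFBox helper
        extractMs += System.currentTimeMillis() - t0;

        long t1 = System.currentTimeMillis();
        Document doc = new Document();
        doc.add(new Field("contents", text,
                          Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        indexMs += System.currentTimeMillis() - t1;
    }
    System.out.println("extract: " + extractMs
                       + " ms, index: " + indexMs + " ms");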
Taking your example (text by John Bear, old.), the NGramAnalyzerWrapper
creates the following tokens:
text
text by
by
by John
John
John Bear,
Bear,
Bear, old.
I have managed to get rid of the error, but now it just doesn't add anything
to the index :s
I'm attaching the NGramAnalyzerWrapper and NG
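Since the NGramAnalyzerWrapper itself isn't shown in the thread, here is a hedged from-scratch sketch of a filter producing the unigram/bigram sequence listed above, written against the old (pre-2.9) TokenStream.next() API:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Emits each unigram followed by the bigram it starts, matching the
    // order above: "text", "text by", "by", "by John", "John", ...
    public class WordBigramFilter extends TokenFilter {
        private Token lookahead; // next unigram, fetched early
        private Token bigram;    // bigram queued after the current unigram

        public WordBigramFilter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            if (bigram != null) {             // flush a queued bigram first
                Token t = bigram;
                bigram = null;
                return t;
            }
            Token current = (lookahead != null) ? lookahead : input.next();
            lookahead = null;
            if (current == null) return null; // end of stream
            lookahead = input.next();
            if (lookahead != null) {
                bigram = new Token(
                    current.termText() + " " + lookahead.termText(),
                    current.startOffset(), lookahead.endOffset());
            }
            return current;                   // unigram, then its bigram
        }
    }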
Hi:
I have seen the post in
http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and
I am implementing a similar application in a distributed environment, a
cluster of only 5 nodes. The operating system I use is Linux (CentOS),
so I am using the NFS file system too to access the home directory
Would be a nice contrib module, though...
-Grant
On Jan 9, 2008, at 5:30 AM, Andrzej Bialecki wrote:
Otis Gospodnetic wrote:
Sounds useful. I suppose this means one would have a custom function
for within-bucket reordering? E.g. for a web search you might reorder
based on the URL length if you
Otis Gospodnetic wrote:
Sounds useful. I suppose this means one would have a custom function
for within-bucket reordering? E.g. for a web search you might reorder
based on the URL length if you think shorter URLs are an indicator of
Yes, that's precisely the idea. It combines the advantages of