Re: Bucketing (was Re: Wikia search goes live today)

2008-01-09 Thread Andrzej Bialecki

Otis Gospodnetic wrote:

Sounds useful.  I suppose this means one would have custom function
for within-bucket-reordering? e.g. for a web search you might reorder
based on the URL length if you think shorter URLs are an indicator of



Yes, that's precisely the idea. It combines the advantages of simple 
(hence fast) scoring inside the IR system, with a complex (hence slow) 
reordering of a small sample of results, performed outside the IR system 
prior to delivering the results.




higher quality.  It also sounds like something that can easily sit
outside Lucene... or do you have something else in mind, such as a
mechanism to pass a reordering function in Lucene?


It should definitely be something outside Lucene - it's meant for cases 
that require more complex (or faster) ranking than is available through 
function queries. I only mentioned it here because it is simple to 
implement, yet produces useful results that are difficult to obtain through 
the usual means (similarity, boosting, even function queries).
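
For illustration only, a rough sketch of what that outside-the-IR-system step
could look like: run the normal (fast) search, take a small bucket of top hits,
and re-sort them with a more expensive secondary criterion such as URL length.
The "url" field name, the bucket size of 100 and the score granularity are my
own assumptions here, not part of the proposal:

ScoreDoc[] rerankTopBucket(final Searcher searcher, Query query) throws IOException {
    TopDocs top = searcher.search(query, null, 100);   // small sample of results
    ScoreDoc[] bucket = top.scoreDocs;
    Arrays.sort(bucket, new Comparator<ScoreDoc>() {
        public int compare(ScoreDoc a, ScoreDoc b) {
            // Primary key: coarse score bucket, higher first (the 0.1
            // granularity is arbitrary, just to illustrate the bucketing).
            int bucketA = (int) (a.score * 10);
            int bucketB = (int) (b.score * 10);
            if (bucketA != bucketB) return bucketB - bucketA;
            try {
                // Secondary key: within a bucket, prefer shorter URLs.
                return searcher.doc(a.doc).get("url").length()
                     - searcher.doc(b.doc).get("url").length();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    });
    return bucket;   // reordered bucket, ready to deliver
}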



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Ariel
Hi:
I have seen the post at
http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and
I am implementing a similar application in a distributed environment: a
cluster of only 5 nodes. The operating system I use is Linux (CentOS),
and I am using an NFS file system to access the home directory where the
documents to be indexed reside. I would like to know how much time an
application should take to index a large amount of documents, around 10 GB.
I use Lucene 2.2.0; every node has a dual Xeon 2.4 GHz CPU and 512 MB RAM,
LAN: 1 Gbit/s.

The problem I have is that my application spends a lot of time indexing all
the documents: the delay to index 10 GB of PDF documents is about 2 days (to
convert PDF to text I am using PDFBox), which is of course a lot of time.
Other applications based on Lucene, for instance IBM OmniFind, take only 5
hours to index the same amount of PDF documents. I would like to find out
why my application is so slow to index; any help is welcome.
Do you know of other distributed applications that use Lucene to
index large amounts of documents? How long do they take to index?
I hope you can help me.
Greetings


Re: Bucketing (was Re: Wikia search goes live today)

2008-01-09 Thread Grant Ingersoll

Would be a nice contrib module, though...

-Grant

On Jan 9, 2008, at 5:30 AM, Andrzej Bialecki wrote:


Otis Gospodnetic wrote:

Sounds useful.  I suppose this means one would have custom function
for within-bucket-reordering? e.g. for a web search you might reorder
based on the URL length if you think shorter URLs are an indicator of



Yes, that's precisely the idea. It combines the advantages of simple  
(hence fast) scoring inside the IR system, with a complex (hence  
slow) reordering of a small sample of results, performed outside the  
IR system prior to delivering the results.




higher quality.  It also sounds like something that can easily sit
outside Lucene... or do you have something else in mind, such as a
mechanism to pass a reordering function in Lucene?


It should definitely be something outside Lucene - it's meant for  
cases that require more complex ranking (or faster) than those  
available through function query. I only mentioned this here because  
it is simple to implement, yet produces useful results difficult to  
obtain through the usual means (similarity, boosting, even function  
query).



--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Basic Named Entity Indexing

2008-01-09 Thread chris.b

taking your example ("text by John Bear, old."), the NGramAnalyzerWrapper
creates the following tokens:
text
text by
by
by John
John
John Bear,
Bear,
Bear, old.

I have managed to get rid of the error, but now it just doesn't add anything
to the index :s
I'm attaching the NGramAnalyzerWrapper and NGramFilter which I am referring
to, as well as my own NamedEntityAnalyzer/TokenFilter, which may help you
understand better.
http://www.nabble.com/file/p14712313/rem.rar rem.rar 

-- 
View this message in context: 
http://www.nabble.com/Basic-Named-Entity-Indexing-tp14291880p14712313.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Query processing with Lucene

2008-01-09 Thread Paul Elschot
On Tuesday 08 January 2008 22:49:18 Doron Cohen wrote:
 This is done by Lucene's scorers. You should however start
 in http://lucene.apache.org/java/docs/scoring.html, - scorers
 are described in the Algorithm section. Offsets are used
 by Phrase Scorers and by Span Scorer.

That is for the case that offsets were meant to be positions
within a document.

It is also possible that offsets were meant in the sense of using
skipTo(doc) instead of next() on a Scorer. This is done during
query search when at least one term is required.
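
To make the skipTo() idea concrete, here is a sketch (not Lucene's actual scorer
code) of the same leapfrog intersection written against the public TermDocs API,
assuming 'reader' is an open IndexReader and the field/terms are just examples:

TermDocs x = reader.termDocs(new Term("text", "x"));
TermDocs y = reader.termDocs(new Term("text", "y"));
try {
    if (x.next() && y.next()) {
        while (true) {
            if (x.doc() == y.doc()) {
                // document contains both terms: this is where it would be scored
                if (!x.next()) break;
            } else if (x.doc() < y.doc()) {
                if (!x.skipTo(y.doc())) break;   // leap x forward to >= y's doc
            } else {
                if (!y.skipTo(x.doc())) break;   // leap y forward to >= x's doc
            }
        }
    }
} finally {
    x.close();
    y.close();
}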

Regards,
Paul Elschot


 
 Doron
 
 On Jan 8, 2008 11:24 PM, Marjan Celikik  [EMAIL PROTECTED] wrote:
 
  Doron Cohen wrote:
   Hi Marjan,
  
   Lucene process the query in what can be called
   one-doc-at-a-time.
  
   For the example query - x y - (not the phrase query x y) - all
   documents containing either x or y are considered a match.
  
   When processing the query - x y - the posting lists of these two
   index terms are traversed, and for each document met on the way,
   a score is computed (taking into account both terms), and collected.
   At the end of the traversal, usually best N collected docs are returned
  as
   search result. So, this is an exhaustive computation creating a union of
   the two posting. For the query - +x +y - in intersection rather than
   union is required, and the way Lucene does it is again to traverse
   the two posting lists, just that only documents seen in both lists
   are scored and collected. This allows to optimize the search,
   skipping large chunks of the posting lists, especially when
   one term is rarer than the other.
  
  Thank you for your answer.
 
  I am having trouble finding the function which traverses the documents
  such that they get scored. Can you
  please tell me where the posting lists (for a +x +y query) get
  intersected after they get read (by next() I guess)
  from the index?
 
  In particular, I am interested in how does Lucene get the new positions
  (offsets) of the documents seen
  in both posting lists, i.e. positions (in a document) for the query word
  x, and positions for the query word y.
 
  Thank you in advance!
 
  Marjan.
 
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Erick Erickson
"I would like to find out why my application has this big
delay to index"

Well, then you have to measure <G>. The first thing I'd do
is pinpoint where the time is being spent. Until you have
that answered, you simply cannot take any meaningful action.

1> Don't do any of the indexing. No new Documents, don't
add any fields, etc. This will just time the PDF parsing.
(I'd run this for a set number of documents rather than the
whole 10G). This'll tell you whether the issue is indexing or
PDFBox.

2> Perhaps try the above with local files rather than files
on the nfs mount.

3> Put back some of the indexing and measure each
step. For instance, create the new documents but don't
add them to the index.

4> Then go ahead and add them to the index.

The numbers you get for these measurements will tell
you a lot. At that point, perhaps folks will have more useful
suggestions.

The reason I'm being so unhelpful is that without lots more
detail, there's really nothing we can help with since there
are so many variables that it's just impossible to say
which one is the problem. For instance, is it a single
10G document and you're swapping like crazy? Are you
CPU bound or IO bound? Have you tried profiling your
process at all to find the choke points?
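
As a very rough starting point, here's what that measurement could look like
in code. The index path, field name and the PDFBox calls (0.7.x-era API, from
memory) are just illustrations, and you'd run it over a fixed sample of files
rather than the whole 10G:

void timeSample(File[] sampleFiles) throws IOException {
    IndexWriter writer = new IndexWriter("/local/index", new StandardAnalyzer(), true);
    long parseMs = 0, indexMs = 0;
    for (int i = 0; i < sampleFiles.length; i++) {
        long t0 = System.currentTimeMillis();
        FileInputStream in = new FileInputStream(sampleFiles[i]);
        PDDocument pdf = PDDocument.load(in);
        String text = new PDFTextStripper().getText(pdf);   // step 1: parsing only
        pdf.close();
        in.close();
        parseMs += System.currentTimeMillis() - t0;

        t0 = System.currentTimeMillis();
        Document doc = new Document();                       // steps 3/4: build + add
        doc.add(new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        indexMs += System.currentTimeMillis() - t0;
    }
    writer.close();
    System.out.println("parsing: " + parseMs + " ms, indexing: " + indexMs + " ms");
}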

Best
Erick


On Jan 9, 2008 8:50 AM, Ariel [EMAIL PROTECTED] wrote:

 Hi:
 I have seen the post in
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and
 I am implementing a similar application in a distributed enviroment, a
 cluster of nodes only 5 nodes. The operating system I use is Linux(Centos)
 so I am using nfs file system too to access the home directory where the
 documents to be indexed reside and I would like to know how much time an
 application spends to index a big amount of documents like 10 Gb ?
 I use lucene version 2.2.0, CPU processor xeon dual 2.4 Ghz 512 Mb in
 every
 nodes, LAN: 1Gbits/s.

 The problem I have is that my application spends a lot of time to index
 all
 the documents, the delay to index 10 gb of pdf documents is about 2 days
 (to
 convert pdf to text I am using pdfbox) that is of course a lot of time,
 others applications based in lucene, for instance ibm omnifind only takes
 5
 hours to index the same amount of pdfs documents. I would like to find out
 why my application has this big delay to index, any help is welcome.
 Dou you know others distributed architecture application that uses lucene
 to
 index big amounts of documents ? How long time it takes to index ?
 I hope yo can help me
 Greetings



RE: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Steven A Rowe
Hi Ariel,

On 01/09/2008 at 8:50 AM, Ariel wrote:
 Dou you know others distributed architecture application that
 uses lucene to index big amounts of documents ?

Apache Solr is an open source enterprise search server based on the Lucene Java 
search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, 
caching, replication, and a web administration interface. It runs in a Java 
servlet container such as Tomcat.

http://lucene.apache.org/solr/

Steve


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Basic Named Entity Indexing

2008-01-09 Thread chris.b

solved it... i was using token.toString() instead of token.termText();
thanks for the help :)
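
For anyone hitting the same thing, a minimal sketch of the difference inside a
TokenFilter's next(), using the Lucene 2.x Token API (the filter itself is
hypothetical; the point is only termText() vs. toString()):

public Token next() throws IOException {
    Token t = input.next();
    if (t == null) return null;
    String term = t.termText();   // the plain term text -- what belongs in the index
    // t.toString() is a debug-style rendering of the token rather than the bare
    // term, which is what tripped up the indexing described earlier in this thread.
    return new Token(term.toLowerCase(), t.startOffset(), t.endOffset(), t.type());
}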
-- 
View this message in context: 
http://www.nabble.com/Basic-Named-Entity-Indexing-tp14291880p14715727.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Grant Ingersoll
There's also Nutch.  However, 10GB isn't that big...  Perhaps you can  
index where the docs/index lives, then just make the index available  
via NFS?  Or, better yet, use rsync to replicate it like Solr does.


-Grant

On Jan 9, 2008, at 10:49 AM, Steven A Rowe wrote:


Hi Ariel,

On 01/09/2008 at 8:50 AM, Ariel wrote:

Dou you know others distributed architecture application that
uses lucene to index big amounts of documents ?


Apache Solr is an open source enterprise search server based on the  
Lucene Java search library, with XML/HTTP and JSON APIs, hit  
highlighting, faceted search, caching, replication, and a web  
administration interface. It runs in a Java servlet container such  
as Tomcat.


http://lucene.apache.org/solr/

Steve


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Empty lucene-similarity jars on maven mirrors

2008-01-09 Thread Sanjay Dahiya
Hi

lucene-similarity (2.1.0 and 2.2.0) jar files available on maven mirrors
don't contain any files:
http://mvnrepository.com/artifact/org.apache.lucene/lucene-similarity/2.2.0
Seems like a deployment config problem.

-- 
~sanjay


Highlighting + phrase queries

2008-01-09 Thread Marjan Celikik

Dear all,

Let's assume I have a phrase query and a document which contains the 
phrase but also contains separate occurrences of each query term. How 
does the highlighter know that it should only display fragments which 
contain the phrase, and not fragments which contain only the query words 
(not as a phrase)?


Thank you in advance!

Marjan.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Highlighting + phrase queries

2008-01-09 Thread Mark Miller

The contrib Highlighter doesn't know and highlights them all.

Check out my patch here for position sensitive highlighting:
https://issues.apache.org/jira/browse/LUCENE-794

Marjan Celikik wrote:

Dear all,

Let's assume I have a phrase query and a document which contain the 
phrase but also it contains separate
occurrences of each query term. How does the highlighter know that 
should only display fragments which
contain phrases and not fragments which contain only the query words 
(not as a phrase)?


Thank you in advance!

Marjan.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Highlighting + phrase queries

2008-01-09 Thread Marjan Celikik

Mark Miller wrote:

The contrib Highlighter doesn't know and highlights them all.

Check out my patch here for position sensitive highlighting:
https://issues.apache.org/jira/browse/LUCENE-794
OK, before trying it out, I would like to know: does the patch work for 
mixed queries, e.g. a b +c -d f g ?


Thanks!

Marjan.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



how do I get my own TopDocHitCollector?

2008-01-09 Thread Beard, Brian

Question:

The documents that I index have two id's - a unique document id and a
record_id that can link multiple documents together that belong to a
common record.

I'd like to use something like TopDocs to return the first 1024 results
that have unique record_id's, but I will want to skip some of the
returned documents that have the same record_id. We're using the
ParallelMultiSearcher. 

I read that I could use a HitCollector and throw an exception to get it
to stop, but is there a cleaner way?
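
Roughly what I have in mind with the exception approach is the sketch below
(field and variable names are illustrative, and a cached record_id lookup would
no doubt be faster than loading the stored field per hit):

void collectUniqueRecords(final Searcher searcher, Query query) throws IOException {
    final Set<String> seen = new HashSet<String>();
    class EnoughHitsException extends RuntimeException {}
    try {
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                try {
                    String recordId = searcher.doc(doc).get("record_id");
                    if (seen.add(recordId)) {
                        // ... keep doc/score for this newly seen record_id ...
                        if (seen.size() >= 1024) {
                            throw new EnoughHitsException();   // abort the search
                        }
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        });
    } catch (EnoughHitsException stop) {
        // expected: enough unique record_ids collected
    }
}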




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Design questions

2008-01-09 Thread spring
Hi,

I have to index (tokenized) documents which may have very many pages, up to 
10,000.
I also have to know on which pages the search phrase occurs.
I have to update some stored index fields for my document.
The content is never changed.

Thus I think I have to add one lucene document with the index fields and one 
lucene document per page.

Mapping
===

MyDocument
-ID
-Field 1-N
-Page 1-N


Lucene
-Lucene Document with ID, page number 0 and Field1 - N (stored fields)
-Lucene Document 1 with ID, page number 1 and tokenized content of Page 1
...
-Lucene Document N with ID, page number N and tokenized content of Page N

Delete of MyDocument -> IndexWriter#deleteDocuments(Term: ID=foo)

Update of stored index fields -> IndexWriter#updateDocument(Term: ID=foo, page 
number = 0)

Search with index and content.

Step 1: Search on stored index fields -> List of IDs
Step 2: Search on ID field (list from above OR'ed together) and content -> List 
of IDs and page numbers
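
In code, I imagine the indexing side of this would look roughly like the
following (field names ID / page / content / field1 and the surrounding
variables are just for illustration):

Document meta = new Document();
meta.add(new Field("ID", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
meta.add(new Field("page", "0", Field.Store.YES, Field.Index.UN_TOKENIZED));
meta.add(new Field("field1", field1Value, Field.Store.YES, Field.Index.UN_TOKENIZED));
writer.addDocument(meta);

for (int p = 1; p <= pages.length; p++) {
    Document pageDoc = new Document();
    pageDoc.add(new Field("ID", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
    pageDoc.add(new Field("page", Integer.toString(p), Field.Store.YES, Field.Index.UN_TOKENIZED));
    pageDoc.add(new Field("content", pages[p - 1], Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(pageDoc);
}

// Delete of MyDocument later on:
writer.deleteDocuments(new Term("ID", id));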

Does this work?

What drawbacks does this approach have?
Is there another way to achieve what I want?

Thank you.

P.S.

There are millions of documents with a page range from 1 to 10,000.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Antony Bowesman

Ariel wrote:


The problem I have is that my application spends a lot of time to index all
the documents, the delay to index 10 gb of pdf documents is about 2 days (to
convert pdf to text I am using pdfbox) that is of course a lot of time,
others applications based in lucene, for instance ibm omnifind only takes 5
hours to index the same amount of pdfs documents. I would like to find out


If you are using log4j, make sure you have the PDFBox log4j categories set to 
info or higher, otherwise it really slows things down (by a factor of 10), or make 
sure you are using the non-log4j version.  See 
http://sourceforge.net/forum/message.php?msg_id=3947448


Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Design questions

2008-01-09 Thread Erick Erickson
You can do several things:

Rather than index one doc per page, you could index a special
token between pages. Say you index "$" as the special
token. So your index looks like this:
last of page 1 $ first of page 2 ... last of page 2 $ first of
page 3

and so on. Now, if you used SpanNearQuery with a slop of 0, you would never
match across pages.

Now, you can call SpanNearQuery.getSpans() to get the offsets of all your
matches.
You can then correlate these to pages by using TermPositions (?) or a similar
interface and determine what pages you matched on.

This is not as expensive as it sounds, since you're not reading the
document,
just the indexes.
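
A hedged sketch of that approach (the field name "text", the example terms and
the 'reader' variable are illustrative): slop 0 keeps matches inside a page, and
getSpans() exposes the match positions so they can be mapped back to pages
afterwards.

SpanQuery[] clauses = new SpanQuery[] {
    new SpanTermQuery(new Term("text", "foo")),
    new SpanTermQuery(new Term("text", "bar"))
};
SpanNearQuery phrase = new SpanNearQuery(clauses, 0, true);   // slop 0, in order

Spans spans = phrase.getSpans(reader);
while (spans.next()) {
    int doc = spans.doc();
    int position = spans.start();
    // Map 'position' back to a page here, e.g. by walking the stored page-break
    // offsets for 'doc', or by counting "$" tokens up to this position.
}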

This is a possibility; I'd think it would be easier to keep track of if
there's a 1-to-1 correspondence between your documents in the two indexes.

As an aside, note that you don't *require* two separate indexes. There's no
requirement that all documents in an index have the same fields. So you
could
index your meta-data with an ID of, say, meta_doc_id and your page text
with text_id where these are your unique (external to Lucene) IDs. Then
you could delete with a term delete on meta_doc_id

So a meta-doc looks something like:
meta_doc_id:453
field1:
field2:
field3:

and the text doc (the one and only) would be
text_id:543
text: (all 10,000 pages with page delimiters, maybe (see below)).

You could even store all of the page offsets in your meta-data document
in a special field if you wanted, then lazy-load that field rather than
dynamically counting. You'd have to be careful that your offsets
corresponded to the data *after* it was analyzed, but that shouldn't
be too hard. You'd have to read this field before deleting the doc
and make sure it was stored with the replacement.

One caution: Lucene by default only indexes the first 10,000 tokens of
a field in a document. So be sure to bump this limit with
IndexWriter.setMaxFieldLength().
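
That limit is just a writer setting; something like this (with whatever
directory and analyzer you already use) raises it before adding the big docs:

IndexWriter writer = new IndexWriter(dir, analyzer, true);
writer.setMaxFieldLength(Integer.MAX_VALUE);   // the default is 10,000 terms per field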

If you stored all the offsets of page breaks, you wouldn't have to store the
special
token since you'd have no reason to have to count them later. Be aware that
you'd get a match for a phrase that spanned the last word of one page and
the first word of the next. Which may be good, but you'll have to decide
that. You
should be able to do this pretty easily with a custom Analyzer.

One more point: I once determined that the following two actions are
identical:
1> create one really big string with all the page data concatenated together
and then add it to a document

and

2> just add successive fragments to the same document. That is,

Document doc;
doc.add(new Field("text", all the text in all the pages...));

is just like

Document doc;
while (more pages) {
    doc.add(new Field("text", text for this page...));
}

I like this variant better.
And, since I'm getting random ideas anyway, here's another.
The position increment gap is the distance (measured in
terms) between two tokens. Let's claim that you have no
page with more than 10,000 (or whatever) tokens. Just
bump the position increment gap so the position jumps to the
next multiple of 10,000 at the start of each page. So, the
first term on the first page has an offset of 0, the first
term on the second page has an offset of 10,000, and the
first term on the third page has an offset of 20,000.

Now, with the SpanNearQuery trick from above, your
term position divided by 10,000 is also your page. This would
also NOT match across pages. Hmmm, I kind of like that
idea.


I guess my last question is "How often will a document change?"
The added complexity of keeping two documents per unique ID
may be unnecessary if your documents don't change all that often.

Anyway, all FWIW
Best
Erick

On Jan 9, 2008 4:39 PM, [EMAIL PROTECTED] wrote:

 Hi,

 I have to index (tokenized) documents which may have very much pages, up
 to 10.000.
 I also have to know on which pages the search phrase occurs.
 I have to update some stored index fields for my document.
 The content is never changed.

 Thus I think I have to add one lucene document with the index fields and
 one lucene document per page.

 Mapping
 ===

 MyDocument
 -ID
 -Field 1-N
 -Page 1-N


 Lucene
 -Lucene Document with ID, page number 0 and Field1 - N (stored fields)
 -Lucene Document 1 with ID, page number 1 and tokenized content of Page 1
 ...
 -Lucene Document N with ID, page number N and tokenized content of Page N

 Delete of MyDocument - IndexWriter#deleteDocuments(Term:ID=foo)

 Update of stored index fields - IndexWriter#updateDocument(Term: ID=foo,
 page number = 0)

 Search with index and content.

 Step 1: Search on stored index fields - List of IDs
 Step 2: Search on ID field (list from above OR'ed together) and content -
 List of IDs and page numbers

 Does this work?

 What drawbacks has this approch?
 Is there another way to achieve what I want?

 Thank you.

 P.S.

 There are millions of documents with a page range from 1 to 10.000.

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]

RE: Empty lucene-similarity jars on maven mirrors

2008-01-09 Thread Steven A Rowe
Hi Sanjay,

On 01/09/2008 at 3:02 PM, Sanjay Dahiya wrote:
 lucene-similarity (2.1.0 and 2.2.0) jar files available on maven mirrors
 don't contain any files.

That's because the o.a.l.search.similar package (the sole contents of the 
contrib/similarity/ directory) has been empty since the 2.1.0 release.  The 
idea behind its continued existence in this state is for it to be a home for 
future contributions of custom similarity implementations.

From Lucene's changelog (trunk/CHANGES.txt):

   --
   === Release 2.1.0 2007-02-14 ===
   [...]
   Bug fixes
   [...]
   24. LUCENE-728: Removed duplicate/old MoreLikeThis and SimilarityQueries
   classes from contrib/similarity, as their new home is under
   contrib/queries.
   (Otis Gospodnetic)
   --

Here's the issue that tracked/documented this change:

http://issues.apache.org/jira/browse/LUCENE-728

Steve

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: how do I get my own TopDocHitCollector?

2008-01-09 Thread Antony Bowesman

Beard, Brian wrote:

Question:

The documents that I index have two id's - a unique document id and a
record_id that can link multiple documents together that belong to a
common record.

I'd like to use something like TopDocs to return the first 1024 results
that have unique record_id's, but I will want to skip some of the
returned documents that have the same record_id. We're using the
ParallelMultiSearcher. 


I read that I could use a HitCollector and throw an exception to get it
to stop, but is there a cleaner way?


I'm doing a similar thing.  I have external Ids (equivalent to your record_id), 
which have one or more Lucene Documents associated with them.  I wrote a custom 
HitCollector that uses a Map to hold the external ids collected so far, along 
with the collected document.


I had to write my own priority queue to know when an object was dropped off the 
bottom of the score-sorted queue, but the latest PriorityQueue on the trunk now 
has insertWithOverflow(), which does the same thing.


Note that ResultDoc extends ScoreDoc, so that the external Id of the item 
dropped off the queue can be used to remove it from my Map.


The code snippet is roughly as below (I am caching my external Ids, hence the 
cache usage):


    protected Map<OfficeId, ScoreDoc> results;

    public void collect(int doc, float score)
    {
        if (score > 0.0f)
        {
            totalHits++;
            if (pq.size() < numHits || score > minScore)
            {
                OfficeId id = cache.get(doc);
                ResultDoc rd = (ResultDoc) results.get(id);
                //  No current result for this ID yet found
                if (rd == null)
                {
                    rd = new ResultDoc(id, doc, score);
                    ResultDoc added = pq.insert(rd);
                    if (added == null)
                    {
                        //  Nothing dropped off the bottom
                        results.put(id, rd);
                    }
                    else
                    {
                        //  Return value dropped off the bottom
                        results.remove(added.id);
                        results.put(id, rd);
                        remaining++;
                    }
                }
                //  Already found this ID, so replace high score if necessary
                else
                {
                    if (score > rd.score)
                    {
                        pq.remove(rd);
                        rd.score = score;
                        pq.insert(rd);
                    }
                }
                //  Readjust our minimum score again from the top entry
                minScore = pq.peek().score;
            }
            else
                remaining++;
        }
    }

HTH
Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Otis Gospodnetic
Ariel,

I believe PDFBox is not the fastest thing and was built more to handle all 
possible PDFs than for speed (just my impression - Ben, PDFBox's author might 
still be on this list and might comment).  Pulling data from NFS to index seems 
like a bad idea.  I hope at least the indices are local and not on a remote 
NFS...

We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one) and 
indexing over NFS was slooow.

Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Ariel [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Wednesday, January 9, 2008 2:50:41 PM
Subject: Why is lucene so slow indexing in nfs file system ?

Hi:
I have seen the post in
http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
 and
I am implementing a similar application in a distributed enviroment, a
cluster of nodes only 5 nodes. The operating system I use is
 Linux(Centos)
so I am using nfs file system too to access the home directory where
 the
documents to be indexed reside and I would like to know how much time
 an
application spends to index a big amount of documents like 10 Gb ?
I use lucene version 2.2.0, CPU processor xeon dual 2.4 Ghz 512 Mb in
 every
nodes, LAN: 1Gbits/s.

The problem I have is that my application spends a lot of time to index
 all
the documents, the delay to index 10 gb of pdf documents is about 2
 days (to
convert pdf to text I am using pdfbox) that is of course a lot of time,
others applications based in lucene, for instance ibm omnifind only
 takes 5
hours to index the same amount of pdfs documents. I would like to find
 out
why my application has this big delay to index, any help is welcome.
Dou you know others distributed architecture application that uses
 lucene to
index big amounts of documents ? How long time it takes to index ?
I hope yo can help me
Greetings




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: linkedin group for lucene interest group

2008-01-09 Thread Otis Gospodnetic
Hm, propaganda! :)
There is also a Lucene group on Simpy with lots of Lucene/search/IR resources - 
http://www.simpy.com/group/363

You'll see some familiar names from the list on the right side of the screen.  
Let me know if you want to join.

Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: John Wang [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Wednesday, January 9, 2008 6:23:30 PM
Subject: linkedin group for lucene interest group

To join:
http://www.linkedin.com/e/gis/49647/019FD71A8AEF

-John




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]