SAN stands for Storage Area Network, and FC for Fibre Channel. From one customer's experience I can confirm that a SAN scales pretty well and is pretty simple to use. It does cost some money, though.
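On the multi-threaded indexing Otis suggests in the quoted thread below, here is a minimal sketch of the worker-thread pattern. It uses only the JDK so it can be run as is. DocSink and extractText are hypothetical stand-ins I made up for illustration: DocSink plays the role of IndexWriter.addDocument(Document), and extractText the role of the PDF-to-text step. Treat it as a rough outline under those assumptions, not as the way Lucene itself has to be wired up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for IndexWriter.addDocument(Document); not a Lucene type.
// Implementations must be safe to call from several threads at once.
interface DocSink {
    void add(String text);
}

class ParallelIndexer {

    // Hypothetical placeholder for the slow per-document step
    // (for example PDF-to-text extraction).
    static String extractText(String path) {
        return "text of " + path;
    }

    // Runs extraction on `workers` threads; every thread feeds the same sink.
    // Returns the number of documents handed to the sink.
    static int indexAll(List<String> paths, int workers, DocSink sink) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String path : paths) {
                pending.add(pool.submit(() -> sink.add(extractText(path))));
            }
            int done = 0;
            for (Future<?> f : pending) {
                f.get(); // wait for the task; a worker failure surfaces here
                done++;
            }
            return done;
        } catch (Exception e) {
            throw new RuntimeException("indexing failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With real Lucene the sink would wrap a single shared IndexWriter, which is documented as thread-safe, and the number of workers would roughly match the number of CPUs on the box.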
-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding!

On Jan 10, 2008 3:26 PM, Ariel <[EMAIL PROTECTED]> wrote:
> Thanks for your suggestions.
>
> I'm sorry, I didn't know: what do you mean by "SAN" and "FC"?
>
> Another thing: I have visited the Lucene home page and version 2.3 has not
> been released there. Could you tell me where the download link is?
>
> Thanks in advance.
> Ariel
>
> On Jan 10, 2008 2:59 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>
> > Ariel,
> >
> > Comments inline.
> >
> > ----- Original Message ----
> > From: Ariel <[EMAIL PROTECTED]>
> > To: java-user@lucene.apache.org
> > Sent: Thursday, January 10, 2008 10:05:28 AM
> > Subject: Re: Why is lucene so slow indexing in nfs file system ?
> >
> > In a distributed environment the application must make extensive use of
> > the network, and there is no way to access the documents in a remote
> > repository other than through the NFS file system.
> >
> > OG: What about SAN connected over FC for example?
> >
> > One thing I must clarify: I index the documents in memory, using
> > RAMDirectory. When the RAMDirectory reaches the limit (I have set about
> > 10 MB), I serialize the index to disk (NFS) to merge it with the central
> > index (the central index is on the NFS file system). Is that correct?
> >
> > OG: Nah, don't bother with RAMDirectory, just use FSDirectory and it will
> > do in-memory thing for you. Make good use of your RAM and use 2.3 which
> > gives you more control over RAM use during indexing.
> > Parallelizing indexing over multiple machines and merging at the end is
> > faster, so that's a good approach. Also, if your boxes have multiple CPUs
> > write your code so that it has multiple worker threads that do indexing
> > and feed docs to IndexWriter.addDocument(Document) to keep the CPUs fully
> > utilized.
> >
> > OG: Oh, something faster than PDFBox? There is (can't remember the name
> > now... itextstream or something like that?), though it may not be free
> > like PDFBox.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > On Jan 10, 2008 8:45 AM, Ariel <[EMAIL PROTECTED]> wrote:
> >
> > > Thanks to all of you for your answers. I am going to change a few
> > > things in my application and run tests.
> > > One thing: I haven't found another good PDF-to-text converter like
> > > PDFBox. Do you know of a faster one?
> > > Greetings
> > > Thanks for your answers
> > > Ariel
> > >
> > > On Jan 9, 2008 11:08 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> > >
> > > > Ariel,
> > > >
> > > > I believe PDFBox is not the fastest thing and was built more to
> > > > handle all possible PDFs than for speed (just my impression - Ben,
> > > > PDFBox's author might still be on this list and might comment).
> > > > Pulling data from NFS to index seems like a bad idea. I hope at
> > > > least the indices are local and not on a remote NFS...
> > > >
> > > > We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which
> > > > one) and indexing over NFS was slooooooow.
> > > >
> > > > Otis
> > > >
> > > > --
> > > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > > >
> > > > ----- Original Message ----
> > > > From: Ariel <[EMAIL PROTECTED]>
> > > > To: java-user@lucene.apache.org
> > > > Sent: Wednesday, January 9, 2008 2:50:41 PM
> > > > Subject: Why is lucene so slow indexing in nfs file system ?
> > > >
> > > > Hi:
> > > > I have seen the post in
> > > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
> > > > and I am implementing a similar application in a distributed
> > > > environment, a cluster of only 5 nodes. The operating system I use
> > > > is Linux (CentOS), so I am also using the NFS file system to access
> > > > the home directory where the documents to be indexed reside. I would
> > > > like to know how long an application takes to index a large amount
> > > > of documents, like 10 GB.
> > > > I use Lucene version 2.2.0. CPU: Xeon dual 2.4 GHz, 512 MB in every
> > > > node. LAN: 1 Gbit/s.
> > > >
> > > > The problem I have is that my application takes a long time to index
> > > > all the documents: indexing 10 GB of PDF documents takes about 2
> > > > days (I am using PDFBox to convert PDF to text). That is of course a
> > > > lot of time; other applications based on Lucene, for instance IBM
> > > > OmniFind, take only 5 hours to index the same amount of PDF
> > > > documents. I would like to find out why my application is this slow
> > > > at indexing; any help is welcome.
> > > > Do you know of other distributed applications that use Lucene to
> > > > index large amounts of documents? How long do they take to index?
> > > > I hope you can help me
> > > > Greetings
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]