Hi,

My second suggestion is basically to store the user documents (Word docs) directly in the Lucene index.

1) If you are using Lucene 1.4.3, you can do something like this:

    // suppose the uploaded Word doc is now in a byte array
    byte[] wordDoc = getUploadedWordDoc();

    // add the byte array to the Lucene index as a stored, unindexed field
    Document doc = new Document();
    doc.add(Field.UnIndexed("originalDoc", getBase64(wordDoc)));

The getBase64 method basically transforms the bytes into ASCII text, as follows:

    // Base64 here is org.apache.commons.codec.binary.Base64
    String getBase64(byte[] wordDoc) throws UnsupportedEncodingException {
        byte[] chars = Base64.encodeBase64(wordDoc);
        return new String(chars, "US-ASCII");
    }

You can get Base64 from the Jakarta Commons Codec library:
http://jakarta.apache.org/commons/codec/apidocs/org/apache/commons/codec/binary/Base64.html
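Getting the original document back at search time is just the reverse. A minimal sketch, assuming the field name from above, an open IndexSearcher called searcher, and a matching document number docId (decodeBase64 is from the same Commons Codec class):

    // look up the stored field and decode it back to the original bytes
    Document doc = searcher.doc(docId);
    String encoded = doc.get("originalDoc");
    byte[] wordDoc = Base64.decodeBase64(encoded.getBytes("US-ASCII"));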
2) Correct me if I am wrong, but I think the latest Lucene dev code base can add binary content to the index directly. Looking at
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/document/Field.java?view=markup

it has:

    /**
     * Create a stored field with binary value. Optionally the value may be compressed.
     *
     * @param name The name of the field
     * @param value The binary value
     * @param store How <code>value</code> should be stored (compressed or not.)
     */
    public Field(String name, byte[] value, Store store) {
    .....

So, I guess if you use the latest Lucene dev code base, you can do:

    byte[] wordDoc = getUploadedWordDoc();
    Document doc = new Document();
    doc.add(new Field("originalDoc", wordDoc, Field.Store.YES));

I think the Lucene index is pretty good at storing millions of small documents. However, there are two concerns you might want to address:

1) There is no transaction support for index manipulation. I am not sure what happens if the machine shuts down while the program is storing an original Word document. Will the index be corrupted?

2) Since the Lucene index is basically files in a physical directory, an index file could eventually hit a hard size limit, and then you would need a way around it. (Split the index into two indexes, or perhaps configure the IndexWriter's maxMergeDocs limit, whose default is IndexWriter.DEFAULT_MAX_MERGE_DOCS?) For example, I think some versions of windoze (e.g., those using the FAT file system) have a file size limit of 2GB.
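On that second concern, a minimal sketch of capping segment size under Lucene 1.4.3, where maxMergeDocs is a public field on IndexWriter (the index path, analyzer, and limit below are just placeholders):

    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    // never merge more than 100,000 documents into one segment,
    // so no single segment file can grow without bound
    writer.maxMergeDocs = 100000;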
Let me know if this helps.

Cheers,
Jian

On 6/29/05, bib_lucene bib <[EMAIL PROTECTED]> wrote:
> Thanks Jian
>
> I need to retrieve the original document sometimes. I did not quite
> understand your second suggestion. Can you please help me understand it
> better? A pointer to some web resource would also help.
>
> jian chen <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Depending on the operating system, there might be a hard limit on the
> number of files in one directory (windoze versions). Even with operating
> systems that don't have a hard limit (linux), it is still better not to
> put too many files in one directory.
>
> Typically, the file system won't be very efficient at file retrieval if
> there are more than a couple thousand files in one directory.
>
> There are some ways to tackle this issue.
>
> 1) Use a hash function to distribute the files into different
> subdirectories based on the file name. For example, use the MD5 or CRC
> algorithm in Java to hash the file name to a number, then use that number
> to construct the directory path. For example, if the number you hashed is
> 123456, you can make 123 the sub-dir name, 456 the sub-sub-dir name, and
> so forth. (See the sketch at the end of this message.)
>
> I think the SQUID web proxy server uses this approach for its file caching.
>
> 2) Why not use Lucene's indexing algorithm and store the binary files in
> the Lucene index?! I love the indexing algorithm, in that you don't need
> to manage free space the way a typical file system does, because the
> merge process takes care of reclaiming free space automatically.
>
> Do these two suggestions sound good?
>
> Jian
>
> On 6/29/05, bib_lucene bib wrote:
> > Hi All
> >
> > In my webapp I have people uploading their documents. My server is
> > windows/tomcat. I am thinking there will be a limit on the number of
> > files in a directory. Typically application users will upload 3-5 page
> > Word docs.
> >
> > 1. How does one design the system such that there will not be any
> > problem as the users keep uploading their files, even if a million
> > files are reached?
> > 2. Is there a sample application that does this?
> > 3. Should I have Lucene update the index after each upload, or should I
> > do it once a day?
> >
> > Thanks
> > Bib
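For reference, a minimal sketch of the subdirectory-hashing scheme from 1) in the quoted message above (the CRC32 hash, the two-level layout, and the dirFor helper are all just illustrative; MD5 via java.security.MessageDigest would work the same way):

    import java.io.File;
    import java.util.zip.CRC32;

    public class UploadStore {
        // map a file name to a nested directory under baseDir,
        // e.g. a hash of 123456 yields baseDir/123/456/
        static File dirFor(File baseDir, String fileName) {
            CRC32 crc = new CRC32();
            crc.update(fileName.getBytes());
            long n = crc.getValue() % 1000000L;  // keep six digits of the hash
            String digits = String.valueOf(1000000L + n).substring(1);  // zero-pad
            File dir = new File(baseDir, digits.substring(0, 3)
                    + File.separator + digits.substring(3));
            dir.mkdirs();
            return dir;
        }
    }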