Don't commit after adding each and every document.
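To make the cost concrete, here is a minimal, self-contained sketch of that advice. A stub writer stands in for Lucene's IndexWriter (the class names and counts are illustrative, not Lucene's API): committing after every add means one expensive flush-and-fsync per document, while adding everything and committing once does the same work with a single flush.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stub standing in for Lucene's IndexWriter; it only counts calls.
class StubWriter {
    final AtomicInteger adds = new AtomicInteger();
    final AtomicInteger commits = new AtomicInteger();
    void updateDocument(String id) { adds.incrementAndGet(); }  // add/replace a doc
    void commit() { commits.incrementAndGet(); }                // expensive: flushes + fsyncs
}

public class CommitBatching {
    // Anti-pattern from the thread below: one commit per document.
    static StubWriter perDocCommit(int docs) {
        StubWriter w = new StubWriter();
        for (int i = 0; i < docs; i++) { w.updateDocument("doc" + i); w.commit(); }
        return w;
    }
    // Suggested pattern: add everything, commit once at the end.
    static StubWriter singleCommit(int docs) {
        StubWriter w = new StubWriter();
        for (int i = 0; i < docs; i++) { w.updateDocument("doc" + i); }
        w.commit();
        return w;
    }
    public static void main(String[] args) {
        System.out.println(perDocCommit(30000).commits.get()); // 30000 commits
        System.out.println(singleCommit(30000).commits.get()); // 1 commit
    }
}
```

With 30k files, that is the difference between 30,000 commits and one; committing every few thousand documents is a common middle ground if you need partial results to become visible during indexing.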
On Tue, Sep 3, 2013 at 7:20 AM, nischal reddy
<nischal.srini...@gmail.com> wrote:

> Hi,
>
> Some more updates on my progress:
>
> I have made the indexing in my application multithreaded. I used a
> thread pool executor with a pool size of 4, but saw only a very slight,
> negligible increase in performance; it is still taking around 20
> minutes to index around 30k files.
>
> Some more info on what I am doing.
>
> Method where the indexing is done:
>
> private void indexAllFields(IResource resource) {
>     IFile ifile = (IFile) resource;
>     File file = resource.getLocation().toFile();
>     Document doc = new Document();
>     try {
>         doc.add(new StringField(FIELD_FILE_PATH,
>                 getIndexFilePath(resource), Store.YES));
>         doc.add(new StringField(FIELD_FILE_TYPE,
>                 ifile.getFileExtension().toLowerCase(), Store.YES));
>         //indexContents(file, doc);
>         /**
>          * Calling updateDocument will make sure that only one indexed
>          * document will be added per IFile, because this method deletes
>          * any existing document with the given Term and adds a new
>          * document. This fixes Sonic00039677.
>          */
>         //iWriter.addDocument(doc);
>         iWriter.updateDocument(new Term(FIELD_FILE_PATH,
>                 getIndexFilePath(resource)), doc);
>         iWriter.commit();
>     } catch (FileNotFoundException e) {
>
>     } catch (IOException e) {
>
>     }
> }
>
> // Runnable to schedule an indexing job
> class IndexingJob implements Runnable {
>
>     private IResource resource;
>
>     public IndexingJob(IResource resource) {
>         this.resource = resource;
>     }
>
>     @Override
>     public void run() {
>         indexAllFields(resource);
>     }
> }
>
> // Method to queue files to be indexed
> void doJob() {
>
>     ThreadPoolExecutor executor = new ThreadPoolExecutor(4, 6,
>             Long.MAX_VALUE, TimeUnit.SECONDS, workQueue);
>     for (IResource iResource : files) {
>         addToIndexQueue(iResource, executor);
>         //updateBasedOnTimeStamp(iResource);
>     }
>     executor.shutdown();
>
>     try {
>         executor.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
>     } catch (InterruptedException e) {
>         // TODO Auto-generated catch block
>         e.printStackTrace();
>     }
> }
>
> Still, with the multithreaded approach it is taking very long.
>
> TIA,
> Nischal Y
>
>
> On Mon, Sep 2, 2013 at 8:07 PM, Erick Erickson
> <erickerick...@gmail.com> wrote:
>
> > Stop. Back up. Test. <G>....
> >
> > The very _first_ thing I'd do is just comment out the bit that
> > actually indexes the content. I'm guessing you have some loop like:
> >
> >     while (more files) {
> >         read the file
> >         transform the data
> >         create a Lucene document
> >         index the document
> >     }
> >
> > Just comment out the "index the document" line and see how long
> > _that_ takes. 9 times out of 10, the bottleneck is here. As a
> > comparison, I can index 3-4K docs/second on my laptop. This is using
> > Solr and the Wikipedia dump, so the docs are several K each.
> >
> > So, if you're going to multi-thread, you'll probably want to
> > multi-thread the acquisition of the data and feed that through a
> > separate thread that actually does the indexing; you don't want
> > multiple IndexWriters active at once.
> >
> > FWIW,
> > Erick
> >
> >
> > On Mon, Sep 2, 2013 at 10:13 AM, nischal reddy
> > <nischal.srini...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am thinking of making my Lucene indexing multithreaded; can
> > > someone throw some light on the best approach to follow for
> > > achieving this?
> > >
> > > I will give a short gist of what I am trying to do; please suggest
> > > the best way to tackle this.
> > >
> > > What am I trying to do?
> > >
> > > I am building an index for files (around 30,000 files) and will
> > > later use this index to search the contents of the files. The usual
> > > sequential approach works fine but takes a humongous amount of time
> > > (around 30 minutes; is this the expected time, or am I screwing up
> > > things somewhere?).
> > >
> > > What am I thinking of doing?
> > >
> > > To improve the performance, I am thinking of making my application
> > > multithreaded.
> > >
> > > Need suggestions :)
> > >
> > > Please suggest the best ways to do this; normally, how long does
> > > Lucene take to index 30k files?
> > >
> > > Please suggest some links to examples (or best practices for
> > > multithreading Lucene) to make my application more robust.
> > >
> > > TIA,
> > > Nischal Y
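Erick's suggestion (parallelize the data acquisition, funnel documents to a single indexing thread) can be sketched with plain java.util.concurrent primitives. Everything below is illustrative: the "parse" and "index" steps are stubs, and in real code the consumer would hold the one IndexWriter and call a single commit() at the end. Note also that Lucene's IndexWriter is itself thread-safe, so sharing one writer instance across several indexing threads is another common design; the point is one writer per index, not one per thread.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class ProducerConsumerIndexing {
    static final String POISON = "__DONE__";  // sentinel telling the indexer to stop

    static int index(int fileCount, int producers) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1000);
        AtomicInteger indexed = new AtomicInteger();

        // Single consumer: the only thread that would touch the IndexWriter.
        Thread indexer = new Thread(() -> {
            try {
                String doc;
                while (!(doc = queue.take()).equals(POISON)) {
                    indexed.incrementAndGet();  // writer.updateDocument(...) in real code
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        indexer.start();

        // Producers: read/parse files in parallel (stubbed as string building).
        ExecutorService pool = Executors.newFixedThreadPool(producers);
        for (int i = 0; i < fileCount; i++) {
            final int id = i;
            pool.execute(() -> {
                try { queue.put("contents-of-file-" + id); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }
        pool.shutdown();
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);

        queue.put(POISON);     // no more work is coming
        indexer.join();
        return indexed.get();  // a single writer.commit() would go here
    }

    public static void main(String[] args) throws Exception {
        System.out.println(index(30000, 4) + " documents indexed");
    }
}
```

This layout also makes the bottleneck measurable, in the spirit of the "comment out the indexing" test above: if the producers (file I/O and parsing) dominate, adding producer threads helps; if the indexing thread dominates, look at commit frequency and analysis cost instead.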