Hi,

An update on my progress: I have made the indexing in my application
multithreaded using a ThreadPoolExecutor with a pool size of 4, but the
performance gain was negligible. It is still taking around 20 minutes to
index around 30k files.

Some more detail on what I am doing.

Method where the indexing is done:
private void indexAllFields(IResource resource) {
    IFile ifile = (IFile) resource;
    File file = resource.getLocation().toFile();
    Document doc = new Document();
    try {
        doc.add(new StringField(FIELD_FILE_PATH,
                getIndexFilePath(resource), Store.YES));
        doc.add(new StringField(FIELD_FILE_TYPE,
                ifile.getFileExtension().toLowerCase(), Store.YES));
        //indexContents(file, doc);
        /*
         * Calling updateDocument (rather than addDocument) ensures that
         * only one indexed document is added per IFile, because this
         * method deletes any existing document with the given Term and
         * then adds the new document.
         * This fixes Sonic00039677.
         */
        //iWriter.addDocument(doc);
        iWriter.updateDocument(new Term(FIELD_FILE_PATH,
                getIndexFilePath(resource)), doc);
        iWriter.commit();
    } catch (IOException e) {
        // don't swallow the exception silently (FileNotFoundException
        // is a subclass of IOException, so one catch covers both)
        e.printStackTrace();
    }
}
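One thing that jumps out in indexAllFields is that iWriter.commit() is called once per document, and commits are expensive in Lucene; committing in batches (or just once at the end) usually matters far more than the thread count. Below is a minimal, runnable sketch of the batching idea. FakeWriter is a hypothetical stand-in that only counts calls, not the real Lucene IndexWriter; the batch size of 1000 is illustrative:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class BatchCommitSketch {

    /** Hypothetical stand-in for Lucene's IndexWriter; it only counts calls. */
    static class FakeWriter {
        final AtomicInteger updates = new AtomicInteger();
        final AtomicInteger commits = new AtomicInteger();
        void updateDocument(String path) { updates.incrementAndGet(); }
        void commit() { commits.incrementAndGet(); }
    }

    /** Index all paths, committing once per `batch` documents (plus once
     *  for any remainder) instead of once per document. */
    static FakeWriter indexAll(List<String> paths, int batch) {
        FakeWriter w = new FakeWriter();
        int sinceCommit = 0;
        for (String p : paths) {
            w.updateDocument(p);
            if (++sinceCommit >= batch) {
                w.commit();
                sinceCommit = 0;
            }
        }
        if (sinceCommit > 0) {
            w.commit(); // flush the tail batch
        }
        return w;
    }

    public static void main(String[] args) {
        List<String> paths = Collections.nCopies(30_000, "some/file.txt");
        FakeWriter w = indexAll(paths, 1_000);
        // 30 commits instead of the 30,000 a commit-per-document loop issues
        System.out.println(w.updates.get() + " updates, " + w.commits.get() + " commits");
    }
}
```

With the real IndexWriter the same loop shape applies: keep updateDocument inside the loop, move commit() outside it (or call it every N documents if you need intermediate durability).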
// Runnable to schedule an indexing job
class IndexingJob implements Runnable {
    private final IResource resource;

    public IndexingJob(IResource resource) {
        this.resource = resource;
    }

    @Override
    public void run() {
        indexAllFields(resource);
    }
}
// Method to queue files to be indexed
void doJob() {
    ThreadPoolExecutor executor = new ThreadPoolExecutor(4, 6,
            Long.MAX_VALUE, TimeUnit.SECONDS, workQueue);
    for (IResource iResource : files) {
        addToIndexQueue(iResource, executor);
        //updateBasedOnTimeStamp(iResource);
    }
    executor.shutdown();
    try {
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
        // restore the interrupt status instead of ignoring it
        Thread.currentThread().interrupt();
    }
}
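For the threading itself, a common shape is the one Erick describes below: a pool of reader threads doing the slow I/O, all feeding a single thread that owns the one IndexWriter. Here is a toy, self-contained sketch of that pipeline; no Lucene is involved, the "indexing" is simulated with a counter, and the queue size, sentinel value, and thread counts are illustrative assumptions:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PipelineSketch {

    /** Reads nFiles simulated "files" with readerThreads threads, indexes
     *  them on one writer thread, and returns how many were indexed. */
    static int runPipeline(int nFiles, int readerThreads) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
        final String POISON = "\u0000EOF"; // sentinel marking end of input

        // Single writer thread: the only place the one IndexWriter would be used.
        AtomicInteger indexed = new AtomicInteger();
        Thread writer = new Thread(() -> {
            try {
                for (String doc; !(doc = queue.take()).equals(POISON); ) {
                    indexed.incrementAndGet(); // iWriter.updateDocument(...) would go here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        // Reader pool: parallelizes the slow part (reading/parsing files).
        ExecutorService readers = Executors.newFixedThreadPool(readerThreads);
        for (int i = 0; i < nFiles; i++) {
            final int id = i;
            readers.execute(() -> {
                String contents = "contents of file " + id; // pretend file read
                try {
                    queue.put(contents);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        readers.shutdown();
        readers.awaitTermination(1, TimeUnit.MINUTES);
        queue.put(POISON); // all readers done; tell the writer to stop
        writer.join();
        return indexed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("indexed " + runPipeline(1_000, 4) + " docs");
    }
}
```

Worth noting: Lucene's IndexWriter is documented as thread-safe, so another option is to keep your current per-file Runnables calling updateDocument on the shared writer; the pipeline above mainly helps when file reading/parsing, not the writer, is the bottleneck.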
Even with the multithreaded approach, it is still taking very long.
TIA,
Nischal Y
On Mon, Sep 2, 2013 at 8:07 PM, Erick Erickson <[email protected]> wrote:
> Stop. Back up. Test. <G>....
>
> The very _first_ thing I'd do is just comment out the bit that
> actually indexes the content. I'm guessing you have some
> loop like:
>
> while (more files) {
> read the file
> transform the data
> create a Lucene document
> index the document
> }
>
> Just comment out the "index the document" line and see how
> long _that_ takes. 9 times out of 10, the bottleneck is here.
> As a comparison, I can index 3-4K docs/second on my laptop.
> This is using Solr and is the Wikipedia dump so the docs
> are several K each.
>
> So, if you're going to multi-thread, you'll probably want to
> multi-thread the acquisition of the data and feed that
> through a separate thread that actually does the indexing;
> you don't want multiple IndexWriters active at once.
>
> FWIW,
> Erick
>
>
>
> On Mon, Sep 2, 2013 at 10:13 AM, nischal reddy
> <[email protected]> wrote:
>
> > Hi,
> >
> > I am thinking of making my Lucene indexing multithreaded; can someone
> > throw some light on the best approach for achieving this?
> >
> > I will give a short gist of what I am trying to do; please suggest the
> > best way to tackle it.
> >
> > What am I trying to do?
> >
> > I am building an index for files (around 30,000 files), and will later use
> > this index to search the contents of the files. The usual sequential
> > approach works fine but takes a humongous amount of time (around 30
> > minutes; is this the expected time, or am I screwing up somewhere?).
> >
> > What am I planning to do?
> >
> > To improve the performance, I am thinking of making my application
> > multithreaded.
> >
> > Need suggestions :)
> >
> > Please suggest the best ways to do this. Also, how long does Lucene
> > normally take to index 30k files?
> >
> > Please suggest some links to examples (or best practices for
> > multithreading Lucene) to make my application more robust.
> >
> > TIA,
> > Nischal Y
> >
>