Hi Shawn,

I'm currently running 4 threads concurrently to do the indexing, which means I run the script in 4 different command prompt windows. The IDs have been configured so that the runs will not overwrite each other's documents during indexing. Is that considered multi-threading?
The rates are all below 0.2GB/hr for each individual thread, and the overall rate is just 0.7GB/hr.

Regards,
Edwin

On 20 April 2016 at 21:43, Shawn Heisey <apa...@elyograg.org> wrote:

> On 4/19/2016 10:12 PM, Zheng Lin Edwin Yeo wrote:
> > Thanks for the information Shawn.
> >
> > I believe it could be due to the types of file that is being indexed.
> > Currently, I'm indexing the EML files which are in HTML format, and they
> > are more rich in content (with in line images and full text), while
> > previously the EML files are in Plain Text format, with the images as
> > attachments.
> >
> > Will this be the cause of the slow indexing speed which I'm facing now?
> > It is more than 3 times slower than what I had previously.
>
> I assume that you are using the Extracting Request Handler for this. I
> know almost nothing about Tika, but I would imagine that extracting data
> from rich text documents is not a fast process, and that plain text
> documents would be a lot faster. I could be wrong -- I've never used
> the ERH myself.
>
> If you want a setup like this to go faster, you probably need to make
> your indexing process multi-threaded. Ideally, such an application
> would be written in Java and would incorporate Tika into the client-side
> code. Tika can be very unstable, so running it inside Solr (the
> Extracting Request Handler) can make Solr itself unstable.
>
> Thanks,
> Shawn
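
For reference, a minimal sketch of the kind of multi-threaded client suggested in the quoted reply might look like the following. Strictly speaking, four separate command windows are four processes rather than four threads; the sketch instead uses one Java process with a fixed pool of worker threads. Everything named here is an assumption for illustration only: the class ParallelIndexer, the method indexAll, and the extractText and sendToSolr parameters are hypothetical stand-ins for client-side Tika extraction and for sending the document to Solr. No real Tika or SolrJ calls are shown.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch: one process, several worker threads.
public class ParallelIndexer {

    /**
     * Runs extract-then-send for every path on a fixed pool of worker threads.
     * extractText and sendToSolr are placeholders supplied by the caller
     * (for example, client-side Tika parsing and a Solr client add call).
     * Returns how many files were extracted and sent without an exception.
     */
    public static int indexAll(List<String> paths,
                               int threads,
                               Function<String, String> extractText,
                               Consumer<String> sendToSolr) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger indexed = new AtomicInteger();

        for (String path : paths) {
            pool.submit(() -> {
                try {
                    String text = extractText.apply(path); // slow step, runs in parallel
                    sendToSolr.accept(text);
                    indexed.incrementAndGet();
                } catch (RuntimeException e) {
                    // One bad file should not stop the other workers.
                    System.err.println("Skipping " + path + ": " + e.getMessage());
                }
            });
        }

        pool.shutdown();                          // accept no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for queued work (arbitrary limit)
        return indexed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> paths = List.of("a.eml", "b.eml", "c.eml", "d.eml");
        int n = indexAll(paths, 4,
                path -> "text of " + path,  // stand-in for extraction
                text -> { });               // stand-in for the Solr update
        System.out.println("Indexed " + n + " documents");
    }
}
```

Because extraction happens in the client here, a file that crashes the parser only costs that one task and does not affect Solr itself, which is the stability concern raised in the quoted reply. The pool size of 4 simply mirrors the four command windows and would need tuning against the available CPU cores.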