On 4/19/2016 10:12 PM, Zheng Lin Edwin Yeo wrote:
> Thanks for the information Shawn.
>
> I believe it could be due to the types of file that is being indexed.
> Currently, I'm indexing the EML files which are in HTML format, and they
> are more rich in content (with in line images and full text), while
> previously the EML files are in Plain Text format, with the images as
> attachments.
>
> Will this be the cause of the slow indexing speed which I'm facing now? It
> is more than 3 times slower than what I had previously.

I assume that you are using the Extracting Request Handler for this. I
know almost nothing about Tika, but I would imagine that extracting data
from rich text documents is not a fast process, and that plain text
documents would be a lot faster. I could be wrong -- I've never used the
ERH myself.

If you want a setup like this to go faster, you probably need to make
your indexing process multi-threaded. Ideally, such an application would
be written in Java and would incorporate Tika into the client-side code.
Tika can be very unstable, so running it inside Solr (the Extracting
Request Handler) can make Solr itself unstable.

Thanks,
Shawn
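
P.S. For what it's worth, here is a rough sketch of the shape a multi-threaded client-side indexer might take. It is only an illustration, not tested against a real Solr install. The class name ParallelIndexer and the two callbacks are made up for this example: "extractor" stands in for whatever Tika-based parsing code you write on the client, and "sender" stands in for whatever code sends the extracted text to Solr (for example through SolrJ). Neither callback is a real Tika or Solr API.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical helper: runs extraction and sending for many files in parallel.
final class ParallelIndexer {
    // Assumption: wraps your client-side Tika parsing (file in, plain text out).
    private final Function<Path, String> extractor;
    // Assumption: wraps your Solr client call (for example a SolrJ add).
    private final BiConsumer<Path, String> sender;
    private final int threads;

    ParallelIndexer(Function<Path, String> extractor,
                    BiConsumer<Path, String> sender,
                    int threads) {
        this.extractor = extractor;
        this.sender = sender;
        this.threads = threads;
    }

    // Processes every file on a fixed-size thread pool and returns the
    // number of files that failed during extraction or sending.
    int indexAll(List<Path> files) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger failures = new AtomicInteger();
        for (Path file : files) {
            pool.execute(() -> {
                try {
                    String text = extractor.apply(file);
                    sender.accept(file, text);
                } catch (RuntimeException e) {
                    // A bad document fails here, in the client, and the
                    // remaining files still get processed.
                    failures.incrementAndGet();
                }
            });
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            // The one-hour limit is arbitrary; adjust it for your data set.
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return failures.get();
    }
}
```

You would construct it with your own two callbacks and a thread count, then call indexAll() with the list of EML files. Because a parse failure is caught per file, one bad document only increments the failure count, and Solr never sees the parsing work at all.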