On 1-5-2015 14:28, subhabrata.bane...@gmail.com wrote:
> Dear Group,
> 
> I have several millions of documents in several folders and subfolders in my 
> machine.
> I tried to write a script as follows, to extract all the .doc files and to 
> convert them in text, but it seems it is taking too much of time. 
> 

[snip]

> But it seems it is taking too much of time, to convert to text and to append 
> to list. Is there any way I may do it fast? I am using Python2.7 on Windows 7 
> Professional Edition. Apology for any indentation error. 
> 
> If any one may kindly suggest a solution.

Have you profiled and identified the part of your script that is slow?
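
If not, a minimal sketch with the standard library's cProfile could look
like the following. It is written for Python 3; on 2.7 you would swap
io.StringIO for StringIO.StringIO. The names profile_call and convert_all
are hypothetical: convert_all is only a stand-in for your own conversion
loop, which I snipped above.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Collect the report in a string instead of printing it directly.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and show the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def convert_all(paths):
    # Hypothetical placeholder for the real walk-and-convert loop.
    return [p.upper() for p in paths]


if __name__ == "__main__":
    result, report = profile_call(convert_all, ["a.doc", "b.doc"])
    print(report)
```

Run it on a small sample folder first. If nearly all of the cumulative
time sits in the call that talks to Word, the Python side of the script
is not where the time goes.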

At first sight, though, your Python code, while not optimal, contains no
obvious performance problems. The slow part is most likely the COM interop
call to Winword and retrieving the text through that interface. Imagine
opening Word for "several million documents"; no wonder it doesn't perform.

Investigate tools like antiword, wv, or docx2txt. I suspect they are quite
a bit faster than relying on Word itself.
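
As a rough sketch of that approach, assuming antiword is installed and on
your PATH, and that it writes the extracted text to standard output when
given a file name: the helper names find_doc_files and doc_to_text are
made up for illustration, and the UTF-8 decoding is a guess that may need
adjusting for your documents and locale.

```python
import os
import subprocess


def find_doc_files(root):
    """Yield the path of every .doc file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".doc"):
                yield os.path.join(dirpath, name)


def doc_to_text(path, converter="antiword"):
    """Run the converter on one file and return what it printed.

    Assumes the converter takes the file name as its only argument
    and prints plain text to stdout.
    """
    output = subprocess.check_output([converter, path])
    # The output encoding is an assumption; undecodable bytes are replaced.
    return output.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # "." is a placeholder; point this at your own document folder.
    for doc_path in find_doc_files("."):
        text = doc_to_text(doc_path)
        print(doc_path, len(text))
```

With several million files you probably want to write each result to disk
as you go rather than append everything to one list in memory.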


Irmen


-- 
https://mail.python.org/mailman/listinfo/python-list
