-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/113217/#review41690
-----------------------------------------------------------

Ship it!

Writing a proper parser for the binary formats is quite hard. I think this
approach makes sense for now.

Btw, I don't see any cmake changes in the patch. I think you might have just
forgotten to add them.

Please ship this to master, and thanks for taking care of this.


services/fileindexer/indexer/officeextractor.cpp
<http://git.reviewboard.kde.org/r/113217/#comment30455>

    It was really simple code. Attribution wasn't really required :)


services/fileindexer/indexer/officeextractor.cpp
<http://git.reviewboard.kde.org/r/113217/#comment30456>

    Maybe put this in a QScopedPointer so that it is deleted when it goes
    out of scope? Otherwise we seem to have a minor memory leak.


- Vishesh Handa


On Oct. 12, 2013, 1:43 p.m., Denis Steckelmacher wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://git.reviewboard.kde.org/r/113217/
> -----------------------------------------------------------
>
> (Updated Oct. 12, 2013, 1:43 p.m.)
>
>
> Review request for Nepomuk.
>
>
> Repository: nepomuk-core
>
>
> Description
> -------
>
> This patch adds a File Extractor for doc, xls and ppt files (the binary MS
> Office formats). The current version of the extractor is very simple and only
> indexes the plain text content of the files (neither title nor owner
> information is extracted). The extractor is a tiny wrapper around the
> "catdoc", "catppt" and "xls2csv" command-line utilities. These tools are
> packaged in the "catdoc" package of Debian and openSUSE.
>
> These utilities are released under the GNU GPLv2. If I recall correctly, the
> LGPLv2.1 Nepomuk libraries can use these tools provided no library calls are
> made to them. The extractor uses QProcess to launch an instance of catdoc,
> catppt or xls2csv, giving it the name of the file to index, and gets the
> plain text from the standard output of this process. I hope this complies
> with the GPL.
>
> The commands are located at run-time using KStandardDirs. This way, no new
> build dependency is added to Nepomuk, and it is up to the user or the
> distribution to add "catdoc" to the dependency list of Nepomuk. If a command
> is not found, the indexer is disabled for the specific MIME type handled by
> the command.
>
>
> Diffs
> -----
>
>   services/fileindexer/indexer/officeextractor.cpp PRE-CREATION
>   services/fileindexer/indexer/officeextractor.h PRE-CREATION
>   services/fileindexer/indexer/nepomukofficeextractor.desktop PRE-CREATION
>
> Diff: http://git.reviewboard.kde.org/r/113217/diff/
>
>
> Testing
> -------
>
> I have run the indexer on several DOC, XLS and PPT files I have on my
> computer. The indexer doesn't work on encrypted files (catdoc refuses to
> parse them). This is unfortunate because some interesting Excel files are
> password-protected only on select pages, or only the editing of certain cells
> is prohibited. The rest of the file can contain valuable data and should be
> indexed.
>
>
> Thanks,
>
> Denis Steckelmacher
>
>
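
For readers following the description above, here is a minimal sketch of the
"locate the tool at run-time, launch it, read its standard output" approach.
It is not the code from the patch: it uses plain C++ with POSIX calls so that
it stands alone, with a PATH search standing in for the KStandardDirs lookup
and popen() standing in for QProcess. The names findExecutable, shellQuote,
runAndCapture and extractPlainText are hypothetical.

```cpp
#include <cstdio>
#include <cstdlib>
#include <memory>
#include <sstream>
#include <string>
#include <unistd.h>

// Hypothetical helper: search $PATH for an executable.
// Assumption: this only approximates what a KStandardDirs lookup would do.
std::string findExecutable(const std::string &name)
{
    const char *path = std::getenv("PATH");
    if (!path)
        return std::string();

    std::istringstream dirs(path);
    std::string dir;
    while (std::getline(dirs, dir, ':')) {
        if (dir.empty())
            continue;
        const std::string candidate = dir + "/" + name;
        if (access(candidate.c_str(), X_OK) == 0)
            return candidate;
    }
    return std::string(); // not found
}

// Wrap an argument in single quotes so the shell started by popen()
// treats it as one word, even if it contains spaces or quotes.
std::string shellQuote(const std::string &arg)
{
    std::string quoted = "'";
    for (char c : arg) {
        if (c == '\'')
            quoted += "'\\''";
        else
            quoted += c;
    }
    quoted += "'";
    return quoted;
}

// Run "program file" and return everything it wrote to standard output.
// popen() is used here in place of QProcess (an assumption of this sketch).
std::string runAndCapture(const std::string &program, const std::string &file)
{
    const std::string command = shellQuote(program) + " " + shellQuote(file);

    // The deleter closes the pipe when 'pipe' goes out of scope.
    std::unique_ptr<FILE, int (*)(FILE *)> pipe(popen(command.c_str(), "r"), pclose);
    if (!pipe)
        return std::string();

    std::string output;
    char buffer[4096];
    size_t n;
    while ((n = fread(buffer, 1, sizeof buffer, pipe.get())) > 0)
        output.append(buffer, n);
    return output;
}

// Returns an empty string when the tool is not installed, so a caller can
// skip the corresponding MIME type instead of failing.
std::string extractPlainText(const std::string &tool, const std::string &file)
{
    const std::string program = findExecutable(tool);
    if (program.empty())
        return std::string();
    return runAndCapture(program, file);
}
```

A call such as extractPlainText("catdoc", "/path/to/report.doc") (the path is
only an example) would return the tool's output, or an empty string when
catdoc is not installed. The std::unique_ptr with a pclose deleter releases
the pipe on scope exit, which is the same idea as the QScopedPointer
suggestion in the review.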
