Dear Denis,

Thank you for your contribution to Tika!
Filing an issue would be great. Head over here:

https://issues.apache.org/jira/browse/TIKA

Please sign up for an account, create an issue, and then attach your patch there. I for one would welcome the contribution and am happy to help shepherd it into the sources.

Thank you!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: kildishev <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, July 1, 2013 5:00 AM
To: "[email protected]" <[email protected]>
Cc: Khoroshilov <[email protected]>
Subject: Patches for parser.microsoft.WordExtractor

>Dear Tika developers,
>
>My name is Denis Kildishev and I work at the Institute for System
>Programming of the Russian Academy of Sciences (ISPRAS). We use Apache
>Tika in our open source project Requality
>(https://forge.ispras.ru/projects/reqdb) for doc->xhtml conversion. One
>of our requirements is an XHTML visual representation that is close to
>that of the original doc.
>
>Working with the current version of Tika, we found that some
>improvements can be made to it. I'd like to introduce some
>modifications that we made to the Word Extractor from the parsers
>package. They include support for lists, table borders (according to
>the 2007 specification), and some additional changes to styling and
>indents. Also, our version of this parser has an XHTML command buffer
>that helps deal with the problem of nested tables. If possible, I'd
>like to contribute those changes back to the Tika project. As the first
>of several possible patches, I'd like to present changes to the table
>representation.
>
>In this patch, the information about border color follows the
>specification of the 2007 format. Cell spanning is taken from the POI
>HTML parser.
>
>Some of the patches, including this one, alter the structure of the
>generated XHTML file. Several existing unit tests are changed to
>account for this. All those changes preserve the original purpose of
>each test, but achieve it in a different way. One example is the check
>that a table is present in the output file. In the current trunk
>version, this is checked by looking for the exact string "<table>".
>Once we introduce styling on the table, that string no longer matches,
>so we can look for "<table" instead.
>
>I will create a corresponding ticket and attach my patch there. This is
>my first contribution to an Apache project, so I would appreciate it if
>you could guide me on how to proceed.
>
>Yours sincerely,
>Denis Kildishev
>Software Engineering Department, ISPRAS
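
For readers following the thread, the test adjustment described in the quoted message can be illustrated with a small sketch. This is not the actual Tika test code: the class name, the method names, and the sample XHTML strings are all hypothetical, and the sketch only shows why an exact match on "<table>" stops working once the tag carries attributes.

```java
// Hypothetical illustration only; these names do not come from the Tika sources.
final class TableMarkupCheck {

    // Strict check: matches only a table tag that has no attributes.
    static boolean hasBareTableTag(String xhtml) {
        return xhtml.contains("<table>");
    }

    // Relaxed check: matches the start of a table tag, with or without attributes.
    static boolean hasTableStart(String xhtml) {
        return xhtml.contains("<table");
    }

    public static void main(String[] args) {
        // Assumed sample outputs: one unstyled table, one with a style attribute.
        String plain = "<body><table><tr><td>cell</td></tr></table></body>";
        String styled = "<body><table style=\"border: 1px solid\">"
                + "<tr><td>cell</td></tr></table></body>";

        System.out.println(hasBareTableTag(plain));  // true
        System.out.println(hasBareTableTag(styled)); // false: the tag has attributes
        System.out.println(hasTableStart(plain));    // true
        System.out.println(hasTableStart(styled));   // true
    }
}
```

The prefix match is deliberately loose. A test that needs to be stricter could parse the output and look for a table element instead of searching the raw string.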
