[ https://issues.apache.org/jira/browse/TIKA-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Guo updated TIKA-3816:
----------------------------
    Attachment: output.PNG

> Tika cannot parse the text in the table(Microsoft word)
> -------------------------------------------------------
>
>                 Key: TIKA-3816
>                 URL: https://issues.apache.org/jira/browse/TIKA-3816
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.0
>         Environment: OS: Windows 10,
> Software Platform: Java
>            Reporter: Jason Guo
>            Priority: Major
>             Fix For: 2.4.2
>
>         Attachments: output.PNG, test1.docx
>
>
> I am trying to parse a Microsoft Word document (.doc) which contains a table
> that contains a select component and text.
> The code I am using to parse the document is below:
> public static byte[] convertToByteArray(byte[] bytes) throws Exception {
>     Tika tika = new Tika();
>     if(bytes.length > tika.getMaxStringLength()) {
>         tika.setMaxStringLength(bytes.length);
>     }
>     String result = tika.parseToString(new ByteArrayInputStream(bytes));
>     byte[] rv = result.getBytes();
>     return rv;
> }
> The dependencies I am using are:
> compile ('org.apache.tika:tika-parsers-standard-package:2.3.0'){
>     exclude group: 'org.apache.poi', module : 'poi-scratchpad'
>     exclude group: 'org.apache.poi', module : 'poi'
>     // exclude group: 'com.drewnoakes', module : 'metadata-extractor'
> }
> compile 'org.apache.tika:tika-core:2.3.0'
> compile 'org.apache.poi:poi-scratchpad:5.2.1'
> compile 'org.apache.poi:poi:5.2.1'



--
This message was sent by Atlassian Jira
(v8.20.10#820010)