[ https://issues.apache.org/jira/browse/TIKA-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-2874.
-------------------------------
    Resolution: Won't Fix

> Parsing of 4 mb excel file generates 164 mb worth of words
> ----------------------------------------------------------
>
>                 Key: TIKA-2874
>                 URL: https://issues.apache.org/jira/browse/TIKA-2874
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20, 1.19.1
>            Reporter: Sébastien Nussbaumer
>            Priority: Major
>         Attachments: excel_that_generates_huge_number_of_words.xlsx, tika-config.xml
>
>
> When I parse the attached 4 MB Excel file, I get 164 MB worth of words. Looking at the output, I see that some cells are repeated *many hundreds of thousands* of times.
> I tried passing the words through the uniq Linux command-line utility and got a file with a much more reasonable size of 16 KB.
> This is the code I use:
> {code:java}
> TikaConfig config = new TikaConfig(new ClassPathResource("tika-config.xml").getURL());
> Detector detector = config.getDetector();
> Parser autoDetectParser = new AutoDetectParser(config);
> Tika tika = new Tika(detector, autoDetectParser);
> try (LanguageWriter languageWriter = new LanguageWriter(LanguageDetector.getDefaultLanguageDetector().loadModels());
>      OutputStreamWriter outputStreamWriter = new OutputStreamWriter(output, StandardCharsets.UTF_8);
>      CompositeWriter compositeWriter = new CompositeWriter(outputStreamWriter, languageWriter)) {
>     WriteOutContentHandler handler = new WriteOutContentHandler(compositeWriter, indexedChars);
>     ParseContext context = new ParseContext();
>     context.set(Parser.class, tika.getParser());
>     tika.getParser().parse(input, new BodyContentHandler(handler), new Metadata(), context);
> }
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)