[ https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175286#comment-17175286 ]

Peter Lee commented on TIKA-3155:
---------------------------------

Hey. I think it's caused by the Quote Mode of Apache Commons CSV. We can simply fix this by turning the Quote Mode off.

> Parse Error while extracting CSV files
> --------------------------------------
>
> Key: TIKA-3155
> URL: https://issues.apache.org/jira/browse/TIKA-3155
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.24.1
> Reporter: Akash
> Priority: Major
> Attachments: UTF-8_chars.csv
>
>
> We are getting parse error while trying to extract csv files.
> This was working in version 1.9, but exception coming in 1.24.1
>
> {code:java}
> /Exception in thread "main" org.apache.tika.exception.TikaException: exception parsing the csv
>   at org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:198 undefined)
>   at org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 undefined)
>   at org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 undefined)
>   at org.apache.tikar.AutoDetectParser.parse.parse(AutoDetectParser.java:143 undefined)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209 undefined)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496 undefined)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149 undefined)
> Caused by: java.lang.IllegalStateException: IOException reading next record: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
>   at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:145 undefined)
>   at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:155 undefined)
>   at org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:178 undefined)
>   ... 6 more
> Caused by: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
>   at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:288 undefined)
>   at org.apache.commons.csv.Lexer.nextToken(Lexer.java:158 undefined)
>   at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:674 undefined)
>   at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:142 undefined)/
> {code}
> Issue is coming when we encounter double quotes in one of the cells.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
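
For anyone following the thread, here is a minimal sketch of why a stray double quote in a cell can end in "EOF reached before encapsulated token finished", and why ignoring quotes avoids it. It is a toy splitter written for illustration only, not the Commons CSV lexer and not the Tika fix: the class `QuoteToggleDemo`, the method `splitRecord`, and the sample row are all hypothetical, and escaped quotes are not handled. As far as I can tell, the parsing-side setting in Commons CSV is the quote character (`CSVFormat.withQuote`, which accepts `null` to disable quoting) rather than `QuoteMode`, which governs printing; please treat that mapping as my assumption, not something confirmed in this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT the Commons CSV lexer, just enough to show the failure mode.
class QuoteToggleDemo {

    // Splits one record on commas. If quote is non-null, text between a pair of
    // quote characters forms one token and may contain commas. If quote is null,
    // every character is taken literally.
    static List<String> splitRecord(String record, Character quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (quote != null && c == quote) {
                inQuotes = !inQuotes;            // enter or leave a quoted token
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());  // field boundary
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (inQuotes) {
            // Same spirit as the reported error: input ended inside a quoted token.
            throw new IllegalStateException("EOF reached before quoted token finished");
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String row = "id,5\" screen,ok";         // hypothetical cell with a stray double quote
        System.out.println(splitRecord(row, null));
        try {
            splitRecord(row, '"');
        } catch (IllegalStateException e) {
            System.out.println("quoted parse failed: " + e.getMessage());
        }
    }
}
```

The trade-off is visible in the same sketch: with quoting disabled, a legitimately quoted field such as `"b,c"` is split at its comma and keeps its quote characters, so switching quotes off globally may change the output for well-formed files.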