[ 
https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175213#comment-17175213
 ] 

chenshuming commented on TIKA-3155:
-----------------------------------

Here is why Tika 1.9 and 1.24.1 behave differently:
 # tika-app-1.9.jar treats the test file as text/plain, so there is no exception.
 # tika-app-1.24.1.jar treats the test file as text/csv and uses commons-csv to 
parse it, and commons-csv throws an exception.

 

Here is why commons-csv throws an exception:
 # Line 39 in the test file starts with a double quote.
 # When a field starts with a double quote, commons-csv looks for the matching 
closing double quote. If it reaches the end of the input without finding one, 
it throws an exception with the message "EOF reached before encapsulated token 
finished".
 # So if you change the " on line 39 of the test file to "" or \", it will not 
fail.
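
To make the quoting rule above concrete, here is a minimal sketch of the idea. 
It is a toy reader written for this comment, not the commons-csv Lexer: the 
class name QuotedFieldSketch and the method readFirstField are hypothetical, and 
only the error message text is taken from the stack trace in this issue. It 
shows that once a field opens with a double quote, line breaks no longer end 
it, so an unmatched quote runs to the end of the input.

```java
import java.io.IOException;

// Toy model only: NOT the commons-csv implementation, just the same rule in miniature.
final class QuotedFieldSketch {

    // Returns the first field of the input. A field that starts with '"' runs
    // until the closing '"'; inside it, "" stands for one literal quote and
    // commas or line breaks are ordinary characters.
    static String readFirstField(String input) throws IOException {
        StringBuilder field = new StringBuilder();
        int i = 0;
        if (input.isEmpty() || input.charAt(0) != '"') {
            // Unquoted field: ends at a comma, a line break, or end of input.
            while (i < input.length() && ",\r\n".indexOf(input.charAt(i)) < 0) {
                field.append(input.charAt(i++));
            }
            return field.toString();
        }
        i = 1; // skip the opening quote
        while (i < input.length()) {
            char c = input.charAt(i++);
            if (c != '"') {
                field.append(c);
                continue;
            }
            if (i < input.length() && input.charAt(i) == '"') {
                field.append('"'); // doubled quote inside a quoted field
                i++;
                continue;
            }
            return field.toString(); // closing quote found
        }
        // Ran out of input while still inside the quoted field.
        throw new IOException("EOF reached before encapsulated token finished");
    }

    public static void main(String[] args) {
        try {
            System.out.println(readFirstField("\"a,b\",c"));
            readFirstField("\"never closed\nnext line");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A field that begins with a backslash is not treated as quoted in this sketch, 
which is one way to read why the \" replacement avoids the error.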

 

Anyway, maybe we should improve how commons-csv handles this case. Any ideas?

> Parse Error while extracting CSV files
> --------------------------------------
>
>                 Key: TIKA-3155
>                 URL: https://issues.apache.org/jira/browse/TIKA-3155
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.24.1
>            Reporter: Akash
>            Priority: Major
>         Attachments: UTF-8_chars.csv
>
>
> We get a parse error when trying to extract CSV files.
> This worked in version 1.9, but 1.24.1 throws an exception.
>  
> {code:java}
> Exception in thread "main" org.apache.tika.exception.TikaException: exception parsing the csv
>       at org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:198)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>       at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)
>       at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496)
>       at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)
> Caused by: java.lang.IllegalStateException: IOException reading next record: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
>       at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:145)
>       at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:155)
>       at org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:178)
>       ... 6 more
> Caused by: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
>       at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:288)
>       at org.apache.commons.csv.Lexer.nextToken(Lexer.java:158)
>       at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:674)
>       at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:142)
> {code}
> The issue occurs when one of the cells contains a double quote.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
