[ 
https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176206#comment-17176206
 ] 

Peter Lee commented on TIKA-3155:
---------------------------------

As I understand it, this is how Tika handles a CSV file:
1. Try to parse it with commons-csv first.
2. If an IllegalStateException is thrown, parse the remaining data in the 
InputStream as plain text.
 
Unfortunately, in this case commons-csv has consumed all the data in the 
InputStream before it throws the IllegalStateException, so nothing is left in 
the InputStream for the plain-text fallback to parse.
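
The effect can be reproduced with a small sketch that uses only the JDK. 
Note that strictParse below is a hypothetical stand-in for a strict CSV 
parser, not the commons-csv or Tika code: it only mimics the behaviour of 
detecting an unterminated quoted field at end of stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ConsumedStreamDemo {

    // Hypothetical strict parser: reads the whole stream and only at EOF
    // notices that a quoted field was never closed.
    static void strictParse(InputStream in) throws IOException {
        boolean insideQuotes = false;
        int b;
        while ((b = in.read()) != -1) {
            if (b == '"') {
                insideQuotes = !insideQuotes;
            }
        }
        if (insideQuotes) {
            throw new IllegalStateException(
                    "EOF reached before encapsulated token finished");
        }
    }

    // Counts the bytes still readable from the stream after the strict
    // parse has been attempted (and has possibly failed).
    static int bytesLeftAfterParse(String csv) {
        InputStream in =
                new ByteArrayInputStream(csv.getBytes(StandardCharsets.UTF_8));
        try {
            try {
                strictParse(in);
            } catch (IllegalStateException e) {
                // The failure is reported only once the stream is exhausted.
            }
            int left = 0;
            while (in.read() != -1) {
                left++;
            }
            return left;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // An unterminated quote: nothing remains for a plain-text fallback.
        System.out.println(bytesLeftAfterParse("a,b\n1,\"unterminated\n"));
    }
}
```

With this input the count printed is 0, which is why a fallback that reads 
from the same InputStream has nothing to work with.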
 
If we don't try commons-csv first, we can't know whether it will throw an 
IllegalStateException. 
But if we try and it does throw, there is nothing we can do, because all the 
data has already been consumed.
 
Maybe we could read all the data from the InputStream into a byte array, so 
that we can try several parsing approaches as many times as needed. 
But this may cost a lot of memory, and I don't think it is a smart approach.
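
A minimal sketch of the byte-array idea, again using only the JDK. The 
names readAll, strictParse and parseWithFallback are hypothetical and are 
not part of Tika or commons-csv; strictParse stands in for the commons-csv 
attempt. The whole input is held in memory, which is exactly the cost 
mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BufferedFallbackDemo {

    // Copies the whole stream into memory so it can be read more than once.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    // Hypothetical strict parser: fails at EOF on an unterminated quote.
    static void strictParse(InputStream in) throws IOException {
        boolean insideQuotes = false;
        int b;
        while ((b = in.read()) != -1) {
            if (b == '"') {
                insideQuotes = !insideQuotes;
            }
        }
        if (insideQuotes) {
            throw new IllegalStateException(
                    "EOF reached before encapsulated token finished");
        }
    }

    // Tries the strict parse first; on failure, falls back to plain text
    // using the same buffered bytes instead of the exhausted stream.
    static String parseWithFallback(InputStream in) {
        try {
            byte[] data = readAll(in);
            try {
                strictParse(new ByteArrayInputStream(data));
                return "csv";
            } catch (IllegalStateException e) {
                return "text:" + new String(data, StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bad = "a,\"b".getBytes(StandardCharsets.UTF_8);
        System.out.println(parseWithFallback(new ByteArrayInputStream(bad)));
    }
}
```

Because each attempt gets a fresh ByteArrayInputStream over the same bytes, 
the fallback still sees the full content after the first attempt fails.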
 
Maybe it's better to change nothing and just let users fix their CSV files 
when they encounter the IllegalStateException.

> Parse Error while extracting CSV files
> --------------------------------------
>
>                 Key: TIKA-3155
>                 URL: https://issues.apache.org/jira/browse/TIKA-3155
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.24.1
>            Reporter: Akash
>            Priority: Major
>         Attachments: UTF-8_chars.csv
>
>
> We are getting a parse error while trying to extract CSV files.
> This worked in version 1.9, but throws an exception in 1.24.1.
>  
> {code:java}
> Exception in thread "main" org.apache.tika.exception.TikaException: exception parsing the csv
>       at org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:198)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>       at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)
>       at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496)
>       at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)
> Caused by: java.lang.IllegalStateException: IOException reading next record: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
>       at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:145)
>       at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:155)
>       at org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:178)
>       ... 6 more
> Caused by: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
>       at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:288)
>       at org.apache.commons.csv.Lexer.nextToken(Lexer.java:158)
>       at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:674)
>       at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:142)
> {code}
> The issue occurs when we encounter double quotes in one of the cells.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
