I filed a JIRA ticket about this issue: https://issues.apache.org/jira/browse/SPARK-21024
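Until that ticket is resolved, one possible workaround is to pre-filter the raw lines before the CSV parser ever sees them, so rows with too many columns never reach univocity. This is only a sketch: it assumes a comma delimiter with no quoted fields containing commas or embedded newlines, the `csv(Dataset[String])` overload it uses exists from Spark 2.2 onward (not 2.1.1), and all paths and the app name are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object PreFilterCsv {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-prefilter").getOrCreate()

    val maxColumns = 20480

    // Read the file as plain text first, so the CSV parser never sees
    // rows that would exceed the column limit. split(",", -1) keeps
    // trailing empty fields so the count is accurate.
    val rawLines = spark.read.textFile("/path/to/input.csv")
    val filtered = rawLines.filter(_.split(",", -1).length <= maxColumns)

    // Parse CSV from the pre-filtered Dataset[String] (Spark 2.2+),
    // still dropping any remaining malformed rows.
    val df = spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .csv(filtered)

    df.write.parquet("/path/to/output")
  }
}
```

The key point is that the column-count check happens on plain strings, outside the CSV parser, so univocity's `TextParsingException` is never triggered for over-wide rows.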
On Thu, Jun 8, 2017 at 1:27 AM, Chanh Le <giaosu...@gmail.com> wrote:
> Can you recommend one?
>
> Thanks.
>
> On Thu, Jun 8, 2017 at 2:47 PM Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> You can change the CSV parser library.
>>
>> On 8. Jun 2017, at 08:35, Chanh Le <giaosu...@gmail.com> wrote:
>>
>> I did add mode -> DROPMALFORMED, but it still couldn't ignore the row, because the error is raised from the CSV library that Spark uses.
>>
>> On Thu, Jun 8, 2017 at 12:11 PM Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>> The CSV data source allows you to skip invalid lines - this should also include lines that have more than maxColumns columns. Choose mode "DROPMALFORMED".
>>>
>>> On 8. Jun 2017, at 03:04, Chanh Le <giaosu...@gmail.com> wrote:
>>>
>>> Hi Takeshi, Jörn Franke,
>>>
>>> The problem is that even if I increase maxColumns, some lines still have more columns than the limit I set, and a large limit costs a lot of memory. So I just want to skip any line that has more columns than the maxColumns I set.
>>>
>>> Regards,
>>> Chanh
>>>
>>> On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>>
>>>> Is it not enough to set `maxColumns` in the CSV options?
>>>>
>>>> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
>>>>
>>>> // maropu
>>>>
>>>> On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>
>>>>> The Spark CSV data source should be able to.
>>>>>
>>>>> On 7. Jun 2017, at 17:50, Chanh Le <giaosu...@gmail.com> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I am using Spark 2.1.1 to read CSV files and convert them to Avro files. One problem I am facing is that if one row of a CSV file has more columns than maxColumns (default is 20480), the parsing process stops.
>>>>>
>>>>> Internal state when error was thrown: line=1, column=3, record=0, charIndex=12
>>>>> com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 2
>>>>> Hint: Number of columns processed may have exceeded limit of 2 columns. Use settings.setMaxColumns(int) to define the maximum number of columns your input can have
>>>>> Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
>>>>> Parser Configuration: CsvParserSettings:
>>>>>
>>>>> I did some investigation in the univocity <https://github.com/uniVocity/univocity-parsers> library, but the way it handles this case is to throw an error, which is why Spark stops the process.
>>>>>
>>>>> How can I skip the invalid row and just continue parsing the next valid one? Are there any libraries that could replace univocity for this job?
>>>>>
>>>>> Thanks & regards,
>>>>> Chanh
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
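For reference, the options discussed in this thread can be combined as below. This is a minimal sketch: the option names match the Spark 2.1 CSVOptions source linked above, while the path and the `40000` limit are illustrative values only.

```scala
val df = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED") // drop rows Spark flags as malformed
  .option("maxColumns", "40000")   // raise univocity's column cap (default 20480)
  .csv("/path/to/input.csv")
```

As the thread shows, in Spark 2.1.1 this combination still fails on rows wider than `maxColumns`: univocity throws `TextParsingException` before Spark's malformed-row handling runs, which is exactly what SPARK-21024 tracks.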