I did add mode -> DROPMALFORMED, but it still couldn't skip those lines, because the error is raised from inside the CSV library (univocity) that Spark uses, before Spark's malformed-record handling gets a chance to drop the row.
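
For the record, here is a rough sketch (Scala, Spark 2.1.1) of what I tried, plus the pre-filtering workaround I have in mind. The input/output paths are placeholders, and splitting on the raw delimiter ignores quoting, so the column count is only an approximation:

// What I tried -- still aborts, because univocity throws
// TextParsingException before DROPMALFORMED can drop the row:
val df = spark.read
  .option("mode", "DROPMALFORMED")
  .option("maxColumns", "20480")
  .csv("/data/input")                                // placeholder path

// Workaround sketch: pre-filter oversized lines as plain text, then
// parse only the survivors as CSV. The raw split ignores quoted
// fields, so it only approximates the real column count.
val maxCols = 20480
val goodLines = spark.read.textFile("/data/input")   // Dataset[String]
  .filter(_.split(",", -1).length <= maxCols)        // drop oversized rows

goodLines.write.text("/data/filtered")               // placeholder path

val parsed = spark.read
  .option("maxColumns", maxCols.toString)
  .csv("/data/filtered")

(I believe Spark 2.2 adds a csv(Dataset[String]) overload that would avoid the intermediate write, but on 2.1.1 writing the filtered lines out first seems necessary.)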
On Thu, Jun 8, 2017 at 12:11 PM Jörn Franke <jornfra...@gmail.com> wrote:

> The CSV data source allows you to skip invalid lines - this should also
> include lines that have more than maxColumns. Choose mode "DROPMALFORMED".
>
> On 8. Jun 2017, at 03:04, Chanh Le <giaosu...@gmail.com> wrote:
>
> Hi Takeshi, Jörn Franke,
>
> The problem is that even if I increase maxColumns, some lines still have
> more columns than the limit I set, and a large limit costs a lot of
> memory. So I just want to skip any line that has more columns than the
> maxColumns I set.
>
> Regards,
> Chanh
>
>
> On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro <linguin....@gmail.com>
> wrote:
>
>> Is it not enough to set `maxColumns` in CSV options?
>>
>> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
>>
>> // maropu
>>
>> On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> The Spark CSV data source should be able to handle this.
>>>
>>> On 7. Jun 2017, at 17:50, Chanh Le <giaosu...@gmail.com> wrote:
>>>
>>> Hi everyone,
>>> I am using Spark 2.1.1 to read CSV files and convert them to Avro files.
>>> One problem I am facing is that if one row of a CSV file has more
>>> columns than maxColumns (the default is 20480), the parsing process
>>> stops:
>>>
>>> Internal state when error was thrown: line=1, column=3, record=0,
>>> charIndex=12
>>> com.univocity.parsers.common.TextParsingException:
>>> java.lang.ArrayIndexOutOfBoundsException - 2
>>> Hint: Number of columns processed may have exceeded limit of 2 columns.
>>> Use settings.setMaxColumns(int) to define the maximum number of columns
>>> your input can have
>>> Ensure your configuration is correct, with delimiters, quotes and escape
>>> sequences that match the input format you are trying to parse
>>> Parser Configuration: CsvParserSettings:
>>>
>>> I did some investigation in the univocity
>>> <https://github.com/uniVocity/univocity-parsers> library, but the way it
>>> handles this case is to throw an error, which is why Spark stops the
>>> process.
>>>
>>> How can I skip the invalid rows and just continue parsing the next
>>> valid one? Are there any libraries that could replace univocity for
>>> this job?
>>>
>>> Thanks & regards,
>>> Chanh
>>> --
>>> Regards,
>>> Chanh
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>
> --
> Regards,
> Chanh

--
Regards,
Chanh