Hi Takeshi, Thank you very much.
Regards,
Chanh

On Thu, Jun 8, 2017 at 11:05 PM Takeshi Yamamuro <linguin....@gmail.com> wrote:

> I filed a jira about this issue:
> https://issues.apache.org/jira/browse/SPARK-21024
>
> On Thu, Jun 8, 2017 at 1:27 AM, Chanh Le <giaosu...@gmail.com> wrote:
>
>> Can you recommend one?
>>
>> Thanks.
>>
>> On Thu, Jun 8, 2017 at 2:47 PM Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> You can change the CSV parser library.
>>>
>>> On 8. Jun 2017, at 08:35, Chanh Le <giaosu...@gmail.com> wrote:
>>>
>>> I did add mode -> DROPMALFORMED, but it still couldn't ignore the row, because the error is raised from the CSV library that Spark is using.
>>>
>>> On Thu, Jun 8, 2017 at 12:11 PM Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>>> The CSV data source allows you to skip invalid lines - this should also include lines that have more than maxColumns. Choose mode "DROPMALFORMED".
>>>>
>>>> On 8. Jun 2017, at 03:04, Chanh Le <giaosu...@gmail.com> wrote:
>>>>
>>>> Hi Takeshi, Jörn Franke,
>>>>
>>>> The problem is that even if I increase maxColumns, some lines still have more columns than the limit I set, and a large limit costs a lot of memory. So I just want to skip any line that has more columns than the maxColumns I set.
>>>>
>>>> Regards,
>>>> Chanh
>>>>
>>>> On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>>
>>>>> Is it not enough to set `maxColumns` in CSV options?
>>>>>
>>>>> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
>>>>>
>>>>> // maropu
>>>>>
>>>>> On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>
>>>>>> The Spark CSV data source should be able to do this.
>>>>>>
>>>>>> On 7. Jun 2017, at 17:50, Chanh Le <giaosu...@gmail.com> wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I am using Spark 2.1.1 to read CSV files and convert them to Avro files. One problem I am facing is that if one row of a CSV file has more columns than maxColumns (default is 20480), the parsing process stops:
>>>>>>
>>>>>> Internal state when error was thrown: line=1, column=3, record=0, charIndex=12
>>>>>> com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 2
>>>>>> Hint: Number of columns processed may have exceeded limit of 2 columns. Use settings.setMaxColumns(int) to define the maximum number of columns your input can have
>>>>>> Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
>>>>>> Parser Configuration: CsvParserSettings:
>>>>>>
>>>>>> I did some investigation in the univocity <https://github.com/uniVocity/univocity-parsers> library, but the way it handles this case is to throw an error, which is why Spark stops the process.
>>>>>>
>>>>>> How can I skip the invalid row and just continue to parse the next valid one? Are there any libraries that could replace univocity for this job?
>>>>>>
>>>>>> Thanks & regards,
>>>>>> Chanh
>>>>>
>>>>> --
>>>>> ---
>>>>> Takeshi Yamamuro

--
Regards,
Chanh
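[Editor's note] The workaround Chanh is asking for - dropping any row whose column count exceeds a limit before the CSV parser aborts the whole job - can be sketched independently of Spark. The snippet below is a minimal illustration of that filtering idea in plain Python with the standard csv module; it is not Spark's or univocity's API, and the delimiter, limit, and sample data are invented for the example. It also assumes no quoted fields containing embedded newlines, since it parses one physical line at a time.

```python
import csv
import io

def parse_skipping_wide_rows(raw_lines, max_columns, delimiter=","):
    """Parse CSV lines one at a time, silently dropping any row with
    more than max_columns fields instead of aborting the whole job.
    Simplification: assumes no embedded newlines inside quoted fields."""
    rows = []
    for line in raw_lines:
        # Parse just this line to count its fields.
        parsed = next(csv.reader(io.StringIO(line), delimiter=delimiter))
        if len(parsed) > max_columns:
            continue  # skip the over-wide row and keep going
        rows.append(parsed)
    return rows

data = [
    "a,b,c",
    "1,2,3,4,5,6",  # over-wide row that would otherwise stop parsing
    "x,y,z",
]
print(parse_skipping_wide_rows(data, max_columns=3))
# -> [['a', 'b', 'c'], ['x', 'y', 'z']]
```

In Spark itself, the same idea would amount to reading the file as plain text, filtering lines by field count, and only then parsing the survivors as CSV; as the thread notes, the `maxColumns` option linked above only raises the limit rather than skipping offending rows, which is the behavior the SPARK-21024 ticket was filed about.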