Hi Takeshi, Thank you very much.
Regards,
Chanh

On Thu, Jun 8, 2017 at 11:05 PM Takeshi Yamamuro <linguin....@gmail.com> wrote:

> I filed a jira about this issue:
> https://issues.apache.org/jira/browse/SPARK-21024
>
> On Thu, Jun 8, 2017 at 1:27 AM, Chanh Le <giaosu...@gmail.com> wrote:
>
>> Can you recommend one?
>>
>> Thanks.
>>
>> On Thu, Jun 8, 2017 at 2:47 PM Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> You can change the CSV parser library.
>>>
>>> On 8. Jun 2017, at 08:35, Chanh Le <giaosu...@gmail.com> wrote:
>>>
>>> I did add mode -> DROPMALFORMED, but it still couldn't ignore the row, because the error is raised from the CSV library that Spark is using.
>>>
>>> On Thu, Jun 8, 2017 at 12:11 PM Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>>> The CSV data source allows you to skip invalid lines - this should also include lines that have more than maxColumns. Choose mode "DROPMALFORMED".
>>>>
>>>> On 8. Jun 2017, at 03:04, Chanh Le <giaosu...@gmail.com> wrote:
>>>>
>>>> Hi Takeshi, Jörn Franke,
>>>>
>>>> The problem is that even if I increase maxColumns, some lines still have more columns than the limit I set, and a large limit costs a lot of memory. So I just want to skip any line that has more columns than the maxColumns I set.
>>>>
>>>> Regards,
>>>> Chanh
>>>>
>>>> On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>>
>>>>> Is it not enough to set `maxColumns` in CSV options?
>>>>>
>>>>> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
>>>>>
>>>>> // maropu
>>>>>
>>>>> On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>
>>>>>> The Spark CSV data source should be able to do this.
>>>>>>
>>>>>> On 7. Jun 2017, at 17:50, Chanh Le <giaosu...@gmail.com> wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I am using Spark 2.1.1 to read CSV files and convert them to Avro files. One problem I am facing is that if one row of a CSV file has more columns than maxColumns (default is 20480), the parsing process stops:
>>>>>>
>>>>>> Internal state when error was thrown: line=1, column=3, record=0, charIndex=12
>>>>>> com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 2
>>>>>> Hint: Number of columns processed may have exceeded limit of 2 columns. Use settings.setMaxColumns(int) to define the maximum number of columns your input can have
>>>>>> Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
>>>>>> Parser Configuration: CsvParserSettings:
>>>>>>
>>>>>> I did some investigation in the univocity <https://github.com/uniVocity/univocity-parsers> library, but the way it handles this case is to throw an error, which is why Spark stops the process.
>>>>>>
>>>>>> How can I skip the invalid row and just continue to parse the next valid one? Are there any libraries that could replace univocity for this job?
>>>>>>
>>>>>> Thanks & regards,
>>>>>> Chanh
>>>>>
>>>>> --
>>>>> ---
>>>>> Takeshi Yamamuro

--
Regards,
Chanh
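[Editor's note] The workaround Chanh is asking for - dropping any row whose column count exceeds a limit before the CSV parser aborts the whole job - can be sketched independently of Spark. The snippet below is a minimal illustration of that filtering idea in plain Python with the standard csv module; it is not Spark's or univocity's API, and the delimiter, limit, and sample data are invented for the example. It also assumes no quoted fields containing embedded newlines, since it parses one physical line at a time.

```python
import csv
import io

def parse_skipping_wide_rows(raw_lines, max_columns, delimiter=","):
    """Parse CSV lines one at a time, silently dropping any row with
    more than max_columns fields instead of aborting the whole job.
    Simplification: assumes no embedded newlines inside quoted fields."""
    rows = []
    for line in raw_lines:
        # Parse just this line to count its fields.
        parsed = next(csv.reader(io.StringIO(line), delimiter=delimiter))
        if len(parsed) > max_columns:
            continue  # skip the over-wide row and keep going
        rows.append(parsed)
    return rows

data = [
    "a,b,c",
    "1,2,3,4,5,6",  # over-wide row that would otherwise stop parsing
    "x,y,z",
]
print(parse_skipping_wide_rows(data, max_columns=3))
# -> [['a', 'b', 'c'], ['x', 'y', 'z']]
```

In Spark itself, the same idea would amount to reading the file as plain text, filtering lines by field count, and only then parsing the survivors as CSV; as the thread notes, the `maxColumns` option linked above only raises the limit rather than skipping offending rows, which is the behavior the SPARK-21024 ticket was filed about.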