I filed a JIRA ticket about this issue: https://issues.apache.org/jira/browse/SPARK-21024
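Until that ticket is resolved, one possible workaround is to pre-filter the raw lines before the CSV parser ever sees them, so rows with too many columns never reach univocity. This is only a sketch: it assumes a comma delimiter with no quoted fields containing commas or embedded newlines, the `csv(Dataset[String])` overload it uses exists from Spark 2.2 onward (not 2.1.1), and all paths and the app name are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object PreFilterCsv {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-prefilter").getOrCreate()

    val maxColumns = 20480

    // Read the file as plain text first, so the CSV parser never sees
    // rows that would exceed the column limit. split(",", -1) keeps
    // trailing empty fields so the count is accurate.
    val rawLines = spark.read.textFile("/path/to/input.csv")
    val filtered = rawLines.filter(_.split(",", -1).length <= maxColumns)

    // Parse CSV from the pre-filtered Dataset[String] (Spark 2.2+),
    // still dropping any remaining malformed rows.
    val df = spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .csv(filtered)

    df.write.parquet("/path/to/output")
  }
}
```

The key point is that the column-count check happens on plain strings, outside the CSV parser, so univocity's `TextParsingException` is never triggered for over-wide rows.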
On Thu, Jun 8, 2017 at 1:27 AM, Chanh Le <giaosu...@gmail.com> wrote:
> Can you recommend one?
>
> Thanks.
>
> On Thu, Jun 8, 2017 at 2:47 PM Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> You can change the CSV parser library.
>>
>> On 8. Jun 2017, at 08:35, Chanh Le <giaosu...@gmail.com> wrote:
>>
>> I did add mode -> DROPMALFORMED, but it still couldn't ignore the row, because the error is raised from the CSV library that Spark uses.
>>
>> On Thu, Jun 8, 2017 at 12:11 PM Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>> The CSV data source allows you to skip invalid lines - this should also include lines that have more than maxColumns columns. Choose mode "DROPMALFORMED".
>>>
>>> On 8. Jun 2017, at 03:04, Chanh Le <giaosu...@gmail.com> wrote:
>>>
>>> Hi Takeshi, Jörn Franke,
>>>
>>> The problem is that even if I increase maxColumns, some lines still have more columns than the limit I set, and a large limit costs a lot of memory. So I just want to skip any line that has more columns than the maxColumns I set.
>>>
>>> Regards,
>>> Chanh
>>>
>>> On Thu, Jun 8, 2017 at 12:48 AM Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>>
>>>> Is it not enough to set `maxColumns` in the CSV options?
>>>>
>>>> https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L116
>>>>
>>>> // maropu
>>>>
>>>> On Wed, Jun 7, 2017 at 9:45 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>
>>>>> The Spark CSV data source should be able to.
>>>>>
>>>>> On 7. Jun 2017, at 17:50, Chanh Le <giaosu...@gmail.com> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I am using Spark 2.1.1 to read CSV files and convert them to Avro files. One problem I am facing is that if one row of a CSV file has more columns than maxColumns (default is 20480), the parsing process stops.
>>>>>
>>>>> Internal state when error was thrown: line=1, column=3, record=0, charIndex=12
>>>>> com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 2
>>>>> Hint: Number of columns processed may have exceeded limit of 2 columns. Use settings.setMaxColumns(int) to define the maximum number of columns your input can have
>>>>> Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
>>>>> Parser Configuration: CsvParserSettings:
>>>>>
>>>>> I did some investigation in the univocity <https://github.com/uniVocity/univocity-parsers> library, but the way it handles this case is to throw an error, which is why Spark stops the process.
>>>>>
>>>>> How can I skip the invalid row and just continue parsing the next valid one? Are there any libraries that could replace univocity for this job?
>>>>>
>>>>> Thanks & regards,
>>>>> Chanh
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
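For reference, the options discussed in this thread can be combined as below. This is a minimal sketch: the option names match the Spark 2.1 CSVOptions source linked above, while the path and the `40000` limit are illustrative values only.

```scala
val df = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED") // drop rows Spark flags as malformed
  .option("maxColumns", "40000")   // raise univocity's column cap (default 20480)
  .csv("/path/to/input.csv")
```

As the thread shows, in Spark 2.1.1 this combination still fails on rows wider than `maxColumns`: univocity throws `TextParsingException` before Spark's malformed-row handling runs, which is exactly what SPARK-21024 tracks.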