Since you are reading each file as a single record via wholeTextFiles and
then flattening it into records, I think you can just drop the few lines you
want at that point.

Can you just drop or skip a few lines in reader.readAll().map(...)?

Also, are you sure this is an issue in Spark rather than in the external CSV
library?

If you think so, do you mind sharing the stack trace?
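For illustration, here is a minimal plain-Scala sketch of that idea (no Spark or OpenCSV on hand, so a naive split on commas stands in for reader.readAll(); the sample text and the column guard are hypothetical):

```scala
// Minimal sketch: drop the two report lines and skip rows that are empty
// or too short to index, before building the output tuples. A naive
// split(",") stands in for OpenCSV's reader.readAll() here.
object SkipReportLines {
  def parse(txt: String): Seq[Array[String]] =
    txt.split("\n").toSeq
      .drop(2)                   // the two leading report lines
      .filter(_.trim.nonEmpty)   // blank lines would also fail indexing
      .map(_.split(","))
      .filter(_.length >= 5)     // guard before accessing data(4) etc.

  def main(args: Array[String]): Unit = {
    val txt = "report line 1\nreport line 2\na,b,c,d,e\n\n1,2,3,4,5"
    parse(txt).foreach(row => println(row.mkString("|")))
  }
}
```

In the real job the same drop/filter steps would sit between readAll() and the map(data => Row(...)) inside the flatMap.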

On 11 Sep 2016 1:50 a.m., "Selvam Raman" <sel...@gmail.com> wrote:

> Hi,
>
> I had already seen these two options; anyway, thanks for the idea.
>
> I am using wholeTextFiles to read my data (because there are \n characters
> in the middle of it) and OpenCSV to parse it. In my data the first two
> lines are just a report. How can I eliminate them?
>
> *How can I eliminate the first two lines after reading from wholeTextFiles?*
>
> val test = wholeTextFiles.flatMap { case (_, txt) =>
>   val reader = new CSVReader(new StringReader(txt))
>   reader.readAll().map(data => Row(data(3), data(4), data(7), data(9), data(14)))
> }
>
> The above code throws an ArrayIndexOutOfBoundsException for the empty lines
> and the report lines.
>
>
> On Sat, Sep 10, 2016 at 3:02 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Hi Selvam,
>>
>> If your report lines start with some comment character (e.g. #), you can
>> skip them via the comment option [1].
>>
>> If you are using Spark 1.x, then you might be able to do this by manually
>> skipping lines in the RDD and then converting it to a DataFrame as below:
>>
>> I haven’t tested this but I think this should work.
>>
>> val rdd = sparkContext.textFile("...")
>> val filteredRdd = rdd.mapPartitionsWithIndex { (idx, iter) =>
>>   if (idx == 0) {
>>     iter.drop(10)  // drop the report lines; partition 0 holds the start of the file
>>   } else {
>>     iter
>>   }
>> }
>> val df = new CsvParser().csvRdd(sqlContext, filteredRdd)
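(A plain-Scala simulation of the pattern quoted above, with hypothetical data and drop(2) instead of drop(10): it shows why only the partition with index 0 needs trimming, assuming the report sits entirely in the first partition.)

```scala
// Simulation of the mapPartitionsWithIndex pattern: the partition with
// index 0 holds the head of the file, so the report lines are dropped
// there, and every other partition passes through untouched.
object SkipHeadOfFirstPartition {
  def skipHead(partitions: Seq[Seq[String]], n: Int): Seq[String] =
    partitions.zipWithIndex.flatMap { case (part, idx) =>
      if (idx == 0) part.drop(n) else part
    }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq("report 1", "report 2", "a,b,c"), // partition 0: file head
      Seq("d,e,f", "g,h,i"))                // partition 1: rest of the file
    skipHead(partitions, 2).foreach(println)
  }
}
```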
>>
>> If you are using Spark 2.0, then there seems to be no way to manually
>> modify the source data, because loading an existing RDD or Dataset[String]
>> into a DataFrame is not yet supported.
>>
>> There is an open issue for this [2]. I hope this is helpful.
>>
>> Thanks.
>>
>> [1] https://github.com/apache/spark/blob/27209252f09ff73c58e60c6df8aaba73b308088c/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L369
>> [2] https://issues.apache.org/jira/browse/SPARK-15463
>>
>>
>>
>>
>> On 10 Sep 2016 6:14 p.m., "Selvam Raman" <sel...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am using spark-csv to read a CSV file. The issue is that the first n
>>> lines of my files contain a report, followed by the actual data (the
>>> header and the rest of the data).
>>>
>>> So how can I skip the first n lines in spark-csv? I don't have any
>>> specific comment character in the first byte.
>>>
>>> Please give me some idea.
>>>
>>> --
>>> Selvam Raman
>>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>>
>>
>
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>
