Re: Question about SPARK-11374 (skip.header.line.count)

Dongjoon Hyun Sat, 10 Dec 2016 18:29:02 -0800

Thank you for the opinion, Felix.

Bests,
Dongjoon.


On Sat, Dec 10, 2016 at 11:00 AM, Felix Cheung <[email protected]>
wrote:

> +1. I think it's useful to always have a pure SQL way to skip headers for
> the plain text / CSV files that lots of companies have.
>
>
> ------------------------------
> *From:* Dongjoon Hyun <[email protected]>
> *Sent:* Friday, December 9, 2016 9:42:58 AM
> *To:* Dongjin Lee; [email protected]
> *Subject:* Re: Question about SPARK-11374 (skip.header.line.count)
>
> Thank you for the opinion, Dongjin!
>
>
> On Thu, Dec 8, 2016 at 21:56 Dongjin Lee <[email protected]> wrote:
>
>> +1 for this idea. I need it too.
>>
>> Regards,
>> Dongjin
>>
>> On Fri, Dec 9, 2016 at 8:59 AM, Dongjoon Hyun <[email protected]>
>> wrote:
>>
>> Hi, All.
>>
>> Could you give me your opinion?
>>
>> There is an old Spark issue, SPARK-11374, about removing header lines
>> from text files.
>>
>> Currently, Spark supports removing CSV header lines in the following way.
>>
>> ```
>> scala> spark.read.option("header","true").csv("/data").show
>> +---+---+
>> | c1| c2|
>> +---+---+
>> |  1|  a|
>> |  2|  b|
>> +---+---+
>> ```
>>
>> In the SQL world, we could support that the Hive way, with
>> `skip.header.line.count`.
>>
>> ```
>> scala> sql("CREATE TABLE t1 (id INT, value VARCHAR(10)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/data' TBLPROPERTIES('skip.header.line.count'='1')")
>>
>> scala> sql("SELECT * FROM t1").show
>> +---+-----+
>> | id|value|
>> +---+-----+
>> |  1|    a|
>> |  2|    b|
>> +---+-----+
>> ```
>>
>> Although I made a PR for this based on the JIRA issue, I want to know
>> whether this is really a needed feature.
>>
>> Is it needed for your use cases? Or is it enough for you to remove them in
>> a preprocessing stage?
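>>
>> For comparison, the preprocessing alternative might look roughly like
>> the sketch below. It is only an illustration: `skipHeader` is a
>> hypothetical helper, not a Spark API, and it assumes the header lines
>> sit at the start of a single input file, so that they land in
>> partition 0.
>>
>> ```scala
>> // Hypothetical helper (not part of Spark): drop the first `count` lines,
>> // but only for partition 0, which holds the beginning of the input.
>> def skipHeader(partitionIndex: Int, lines: Iterator[String], count: Int = 1): Iterator[String] =
>>   if (partitionIndex == 0) lines.drop(count) else lines
>>
>> // In spark-shell, assuming a single file under /data:
>> //   sc.textFile("/data").mapPartitionsWithIndex((i, it) => skipHeader(i, it))
>> ```
>>
>> If the location holds several files that each carry their own header,
>> this sketch removes only the first file's header, which is one reason a
>> table-level property could be more convenient.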
>>
>> If this is too old and no longer appropriate these days, I'll close the PR and
>> JIRA issue as WON'T FIX.
>>
>> Thank you all in advance!
>>
>> Bests,
>> Dongjoon.
>>
>> --
>> Dongjin Lee
>>
>> Software developer in Line+. So interested in massive-scale machine
>> learning.
>> facebook: www.facebook.com/dongjin.lee.kr
>> linkedin: kr.linkedin.com/in/dongjinleekr
>> github: github.com/dongjinleekr
>> twitter: www.twitter.com/dongjinleekr
