Re: Question about SPARK-11374 (skip.header.line.count)

Mingjie Tang Sat, 10 Dec 2016 18:46:12 -0800

+1, it is useful.

On Sat, Dec 10, 2016 at 9:28 PM, Dongjoon Hyun <[email protected]> wrote:


> Thank you for the opinion, Felix.
>
> Bests,
> Dongjoon.
>
> On Sat, Dec 10, 2016 at 11:00 AM, Felix Cheung <[email protected]>
> wrote:
>
>> +1. I think it's useful to always have a pure SQL way to skip headers for
>> the plain text / CSV files that lots of companies have.
>>
>>
>> ------------------------------
>> *From:* Dongjoon Hyun <[email protected]>
>> *Sent:* Friday, December 9, 2016 9:42:58 AM
>> *To:* Dongjin Lee; [email protected]
>> *Subject:* Re: Question about SPARK-11374 (skip.header.line.count)
>>
>> Thank you for the opinion, Dongjin!
>>
>>
>> On Thu, Dec 8, 2016 at 21:56 Dongjin Lee <[email protected]> wrote:
>>
>>> +1 For this idea. I need it also.
>>>
>>> Regards,
>>> Dongjin
>>>
>>> On Fri, Dec 9, 2016 at 8:59 AM, Dongjoon Hyun <[email protected]>
>>> wrote:
>>>
>>> Hi, All.
>>>
>>> Could you give me your opinions?
>>>
>>> There is an old Spark issue, SPARK-11374, about removing header lines
>>> from text files.
>>>
>>> Currently, Spark supports removing CSV header lines in the following way.
>>>
>>> ```
>>> scala> spark.read.option("header","true").csv("/data").show
>>> +---+---+
>>> | c1| c2|
>>> +---+---+
>>> |  1|  a|
>>> |  2|  b|
>>> +---+---+
>>> ```
>>>
>>> In the SQL world, we could support that the Hive way, with the
>>> `skip.header.line.count` table property.
>>>
>>> ```
>>> scala> sql("""CREATE TABLE t1 (id INT, value VARCHAR(10))
>>>   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
>>>   STORED AS TEXTFILE LOCATION '/data'
>>>   TBLPROPERTIES('skip.header.line.count'='1')""")
>>>
>>> scala> sql("SELECT * FROM t1").show
>>> +---+-----+
>>> | id|value|
>>> +---+-----+
>>> |  1|    a|
>>> |  2|    b|
>>> +---+-----+
>>> ```
>>>
>>> Although I made a PR for this based on the JIRA issue, I want to know
>>> whether this is really a needed feature.
>>>
>>> Is it needed for your use cases? Or is it enough for you to remove the
>>> header lines in a preprocessing stage?
>>>
>>> If this is too old and no longer appropriate these days, I'll close the
>>> PR and JIRA issue as WON'T FIX.
>>>
>>> Thank you all in advance!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> --
>>> *Dongjin Lee*
>>>
>>> *Software developer in Line+. So interested in massive-scale machine
>>> learning.*
>>> facebook: www.facebook.com/dongjin.lee.kr
>>> linkedin: kr.linkedin.com/in/dongjinleekr
>>> github: github.com/dongjinleekr
>>> twitter: www.twitter.com/dongjinleekr
>>>
>
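
For readers skimming the thread, here is a rough sketch of what a `skip.header.line.count`-style setting amounts to for a single delimited text file: drop the first N lines, then split the rest on the delimiter. This is plain Scala, not the code from the PR and not a Spark or Hive API. The names `SkipHeaderSketch`, `parseDelimited`, and `skipCount` are made up for illustration, and the sketch assumes the count applies to one file at a time and that fields contain no quoting or escaping.

```scala
// Hypothetical helper, not a Spark or Hive API: it only illustrates the
// idea of discarding leading header lines before parsing delimited text.
object SkipHeaderSketch {

  /** Drops the first `skipCount` lines, then splits each remaining line on `delimiter`. */
  def parseDelimited(
      lines: Seq[String],
      skipCount: Int,
      delimiter: Char = ','): Seq[Seq[String]] = {
    require(skipCount >= 0, "skipCount must be non-negative")
    lines
      .drop(skipCount)                  // header lines are discarded, never parsed
      .filter(_.nonEmpty)               // this sketch ignores blank lines
      .map(_.split(delimiter).toList)   // naive split: no quoting or escaping
  }

  def main(args: Array[String]): Unit = {
    // Same data as the example in the quoted message: one header line, two rows.
    val file = Seq("c1,c2", "1,a", "2,b")
    parseDelimited(file, skipCount = 1).foreach(row => println(row.mkString("|")))
  }
}
```

With `skipCount = 0` the header line would come back as an ordinary data row, which is the situation the table property is meant to avoid.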
