+1, it is useful.

On Sat, Dec 10, 2016 at 9:28 PM, Dongjoon Hyun <dongj...@apache.org> wrote:
> Thank you for the opinion, Felix.
>
> Bests,
> Dongjoon.
>
> On Sat, Dec 10, 2016 at 11:00 AM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> +1. I think it's useful to always have a pure SQL way to skip the header
>> for plain text / CSV files, which lots of companies have.
>>
>> ------------------------------
>> *From:* Dongjoon Hyun <dongj...@apache.org>
>> *Sent:* Friday, December 9, 2016 9:42:58 AM
>> *To:* Dongjin Lee; dev@spark.apache.org
>> *Subject:* Re: Question about SPARK-11374 (skip.header.line.count)
>>
>> Thank you for the opinion, Dongjin!
>>
>> On Thu, Dec 8, 2016 at 21:56 Dongjin Lee <dong...@apache.org> wrote:
>>
>>> +1 for this idea. I need it too.
>>>
>>> Regards,
>>> Dongjin
>>>
>>> On Fri, Dec 9, 2016 at 8:59 AM, Dongjoon Hyun <dongj...@apache.org>
>>> wrote:
>>>
>>> Hi, All.
>>>
>>> Could you give me your opinions?
>>>
>>> There is an old Spark issue, SPARK-11374, about removing header lines
>>> from text files.
>>>
>>> Currently, Spark supports removing CSV header lines in the following way.
>>>
>>> ```
>>> scala> spark.read.option("header","true").csv("/data").show
>>> +---+---+
>>> | c1| c2|
>>> +---+---+
>>> |  1|  a|
>>> |  2|  b|
>>> +---+---+
>>> ```
>>>
>>> In the SQL world, we can support that the Hive way, with
>>> `skip.header.line.count`.
>>>
>>> ```
>>> scala> sql("CREATE TABLE t1 (id INT, value VARCHAR(10)) ROW FORMAT
>>> DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/data'
>>> TBLPROPERTIES('skip.header.line.count'='1')")
>>> scala> sql("SELECT * FROM t1").show
>>> +---+-----+
>>> | id|value|
>>> +---+-----+
>>> |  1|    a|
>>> |  2|    b|
>>> +---+-----+
>>> ```
>>>
>>> Although I made a PR for this based on the JIRA issue, I want to know
>>> whether this is a really needed feature.
>>>
>>> Is it needed for your use cases? Or is it enough for you to remove the
>>> header lines in a preprocessing stage?
>>>
>>> If this is too old and no longer appropriate, I'll close the PR and
>>> the JIRA issue as WON'T FIX.
>>>
>>> Thank you all in advance!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>> --
>>> *Dongjin Lee*
>>>
>>> *Software developer in Line+. So interested in massive-scale machine
>>> learning.*
>>> facebook: www.facebook.com/dongjin.lee.kr
>>> linkedin: kr.linkedin.com/in/dongjinleekr
>>> github: github.com/dongjinleekr
>>> twitter: www.twitter.com/dongjinleekr
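
For comparison, the preprocessing alternative raised in the original question can be sketched in plain Scala. This is only an illustration of the intended semantics (drop the first N lines of each file), not the code from the PR; the object name `SkipHeader` and the helper `skipHeaderLines` are hypothetical and exist in neither Spark nor Hive.

```scala
// Hypothetical helper, not part of Spark or Hive: drops the first `n` lines
// of a single file's contents, which is what skip.header.line.count = n
// asks for on a per-file basis.
object SkipHeader {
  def skipHeaderLines(lines: Iterator[String], n: Int): Iterator[String] =
    lines.drop(math.max(n, 0)) // a negative count is treated as "skip nothing"

  def main(args: Array[String]): Unit = {
    // Stand-in for the lines of one CSV file under /data.
    val file = Iterator("c1,c2", "1,a", "2,b")
    skipHeaderLines(file, 1).foreach(println) // prints "1,a" then "2,b"
  }
}
```

A real preprocessing step would have to apply this to each input file separately, since every file carries its own header; that per-file bookkeeping is the part a table property would take over.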