+1

On Wed, Sep 19, 2018 at 8:07 AM Ted Yu <yuzhih...@gmail.com> wrote:
> +1
>
> -------- Original message --------
> From: Dongjin Lee <dong...@apache.org>
> Date: 9/19/18 7:20 AM (GMT-08:00)
> To: dev <dev@spark.apache.org>
> Subject: Re: from_csv
>
> Another +1.
>
> I have already run into this case several times.
>
> On Mon, Sep 17, 2018 at 11:03 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> +1 for this idea, since text parsing in CSV/JSON is quite common.
>>
>> One thing to consider is schema inference, as with the JSON
>> functionality. For JSON, we added schema_of_json, and the same thing
>> should apply to CSV too. If we see more need for it, we can consider a
>> function like schema_of_csv as well.
>>
>> On Sun, Sep 16, 2018 at 4:41 PM Maxim Gekk <maxim.g...@databricks.com> wrote:
>>
>>> Hi Reynold,
>>>
>>> > i'd make this as consistent as to_json / from_json as possible
>>>
>>> Sure, the new function from_csv() has the same signature as from_json().
>>>
>>> > how would this work in sql? i.e. how would passing options in work?
>>>
>>> The options are passed to the function via a map, for example:
>>>
>>>   select from_csv('26/08/2015', 'time Timestamp',
>>>                   map('timestampFormat', 'dd/MM/yyyy'))
>>>
>>> On Sun, Sep 16, 2018 at 7:01 AM Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> makes sense - i'd make this as consistent as to_json / from_json as
>>>> possible.
>>>>
>>>> how would this work in sql? i.e. how would passing options in work?
>>>>
>>>> --
>>>> excuse the brevity and lower case due to wrist injury
>>>>
>>>> On Sat, Sep 15, 2018 at 2:58 AM Maxim Gekk <maxim.g...@databricks.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I would like to propose a new function, from_csv(), for parsing
>>>>> columns containing strings in CSV format. Here is my PR:
>>>>> https://github.com/apache/spark/pull/22379
>>>>>
>>>>> A use case is loading a dataset from external storage, a DBMS, or a
>>>>> system like Kafka, where CSV content was dumped as one of the
>>>>> columns/fields. Other columns could hold related information such as
>>>>> timestamps, ids, data sources, etc. The column with CSV strings can be
>>>>> parsed by the existing csv() method of DataFrameReader, but in that
>>>>> case we have to "clean up" the dataset and drop the other columns,
>>>>> because the csv() method requires a Dataset[String]. Joining the
>>>>> parsing result back to the original dataset by position is expensive
>>>>> and inconvenient. Instead, users parse CSV columns with string
>>>>> functions, an approach that is usually error-prone, especially for
>>>>> quoted values and other special cases.
>>>>>
>>>>> The methods proposed in the PR should provide a better user
>>>>> experience for parsing CSV-like columns. Please share your thoughts.
>>>>>
>>>>> --
>>>>> Maxim Gekk
>>>>> Technical Solutions Lead
>>>>> Databricks Inc.
>>>>> maxim.g...@databricks.com
>>>>> databricks.com
>
> --
> Dongjin Lee
>
> A hitchhiker in the mathematical world.
>
> github: github.com/dongjinleekr
> linkedin: kr.linkedin.com/in/dongjinleekr
> slideshare: www.slideshare.net/dongjinleekr

--
John Zhuge
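
For concreteness, here is a minimal Scala sketch of the usage Maxim describes, assuming the API lands with the from_json-style signature proposed in PR 22379. The dataset, column names, and schema below are illustrative only, not taken from the PR:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder()
      .appName("from_csv-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical Kafka-like dataset: metadata columns plus a CSV payload.
    val events = Seq(
      (1L, "2018-09-15T10:00:00", "26/08/2015,alice,42"),
      (2L, "2018-09-15T10:00:01", "27/08/2015,bob,7")
    ).toDF("id", "ingested_at", "payload")

    val schema = new StructType()
      .add("time", TimestampType)
      .add("user", StringType)
      .add("count", IntegerType)

    // Parse the CSV column in place; the other columns stay attached,
    // avoiding the Dataset[String] detour and the positional join
    // described in the proposal.
    val parsed = events.withColumn(
      "parsed",
      from_csv($"payload", schema, Map("timestampFormat" -> "dd/MM/yyyy")))

    parsed.select($"id", $"parsed.time", $"parsed.user", $"parsed.count").show()

Contrast this with the pre-PR workaround: select the payload column as a Dataset[String], run it through DataFrameReader.csv(), and join the result back to the original dataset by position.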
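And a sketch of the schema-inference pairing Hyukjin raises. schema_of_json already covers the JSON path; schema_of_csv is only floated as an idea in the thread, so the CSV analogue below is hypothetical, not a shipped API:

    // JSON path: infer a schema from a sample literal.
    spark.sql("""SELECT schema_of_json('{"time":"26/08/2015","count":1}')""").show(false)

    // Hypothetical CSV analogue floated above (not implemented as of this thread):
    //   SELECT schema_of_csv('26/08/2015,1')
    // It would return a DDL-formatted schema string usable as from_csv's
    // second argument, mirroring the schema_of_json / from_json pairing.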