+1

On Wed, Sep 19, 2018 at 8:07 AM Ted Yu <yuzhih...@gmail.com> wrote:
> +1
>
> -------- Original message --------
> From: Dongjin Lee <dong...@apache.org>
> Date: 9/19/18 7:20 AM (GMT-08:00)
> To: dev <dev@spark.apache.org>
> Subject: Re: from_csv
>
> Another +1.
>
> I have already run into this case several times.
>
> On Mon, Sep 17, 2018 at 11:03 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> +1 for this idea, since text parsing in CSV/JSON is quite common.
>>
>> One thing to consider is schema inference, as with the JSON
>> functionality. For JSON, we added schema_of_json, and the same thing
>> should apply to CSV too. If we see more need for it, we can consider a
>> function like schema_of_csv as well.
>>
>> On Sun, Sep 16, 2018 at 4:41 PM Maxim Gekk <maxim.g...@databricks.com> wrote:
>>
>>> Hi Reynold,
>>>
>>> > i'd make this as consistent as to_json / from_json as possible
>>>
>>> Sure, the new function from_csv() has the same signature as from_json().
>>>
>>> > how would this work in sql? i.e. how would passing options in work?
>>>
>>> The options are passed to the function via a map, for example:
>>>
>>>   select from_csv('26/08/2015', 'time Timestamp',
>>>                   map('timestampFormat', 'dd/MM/yyyy'))
>>>
>>> On Sun, Sep 16, 2018 at 7:01 AM Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> makes sense - i'd make this as consistent as to_json / from_json as
>>>> possible.
>>>>
>>>> how would this work in sql? i.e. how would passing options in work?
>>>>
>>>> --
>>>> excuse the brevity and lower case due to wrist injury
>>>>
>>>> On Sat, Sep 15, 2018 at 2:58 AM Maxim Gekk <maxim.g...@databricks.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I would like to propose a new function, from_csv(), for parsing
>>>>> columns containing strings in CSV format. Here is my PR:
>>>>> https://github.com/apache/spark/pull/22379
>>>>>
>>>>> A use case is loading a dataset from external storage, a DBMS, or a
>>>>> system like Kafka, where CSV content was dumped as one of the
>>>>> columns/fields. Other columns could hold related information such as
>>>>> timestamps, ids, data sources, etc. The column with CSV strings can be
>>>>> parsed by the existing csv() method of DataFrameReader, but in that
>>>>> case we have to "clean up" the dataset and drop the other columns,
>>>>> because the csv() method requires a Dataset[String]. Joining the
>>>>> parsing result back to the original dataset by position is expensive
>>>>> and inconvenient. Instead, users parse CSV columns with string
>>>>> functions, an approach that is usually error-prone, especially for
>>>>> quoted values and other special cases.
>>>>>
>>>>> The methods proposed in the PR should provide a better user
>>>>> experience for parsing CSV-like columns. Please share your thoughts.
>>>>>
>>>>> --
>>>>> Maxim Gekk
>>>>> Technical Solutions Lead
>>>>> Databricks Inc.
>>>>> maxim.g...@databricks.com
>>>>> databricks.com
>
> --
> Dongjin Lee
>
> A hitchhiker in the mathematical world.
>
> github: github.com/dongjinleekr
> linkedin: kr.linkedin.com/in/dongjinleekr
> slideshare: www.slideshare.net/dongjinleekr

--
John Zhuge
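
For concreteness, here is a minimal Scala sketch of the usage Maxim describes, assuming the API lands with the from_json-style signature proposed in PR 22379. The dataset, column names, and schema below are illustrative only, not taken from the PR:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder()
      .appName("from_csv-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical Kafka-like dataset: metadata columns plus a CSV payload.
    val events = Seq(
      (1L, "2018-09-15T10:00:00", "26/08/2015,alice,42"),
      (2L, "2018-09-15T10:00:01", "27/08/2015,bob,7")
    ).toDF("id", "ingested_at", "payload")

    val schema = new StructType()
      .add("time", TimestampType)
      .add("user", StringType)
      .add("count", IntegerType)

    // Parse the CSV column in place; the other columns stay attached,
    // avoiding the Dataset[String] detour and the positional join
    // described in the proposal.
    val parsed = events.withColumn(
      "parsed",
      from_csv($"payload", schema, Map("timestampFormat" -> "dd/MM/yyyy")))

    parsed.select($"id", $"parsed.time", $"parsed.user", $"parsed.count").show()

Contrast this with the pre-PR workaround: select the payload column as a Dataset[String], run it through DataFrameReader.csv(), and join the result back to the original dataset by position.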
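And a sketch of the schema-inference pairing Hyukjin raises. schema_of_json already covers the JSON path; schema_of_csv is only floated as an idea in the thread, so the CSV analogue below is hypothetical, not a shipped API:

    // JSON path: infer a schema from a sample literal.
    spark.sql("""SELECT schema_of_json('{"time":"26/08/2015","count":1}')""").show(false)

    // Hypothetical CSV analogue floated above (not implemented as of this thread):
    //   SELECT schema_of_csv('26/08/2015,1')
    // It would return a DDL-formatted schema string usable as from_csv's
    // second argument, mirroring the schema_of_json / from_json pairing.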