+1
-------- Original message --------
From: Dongjin Lee <dong...@apache.org>
Date: 9/19/18 7:20 AM (GMT-08:00)
To: dev <dev@spark.apache.org>
Subject: Re: from_csv
Another +1.
I already experienced this case several times.

On Mon, Sep 17, 2018 at 11:03 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
+1 for this idea since text parsing in CSV/JSON is quite common.
One thing to consider is schema inference, as with the JSON functionality. For 
JSON we added schema_of_json for this purpose, and the same approach should 
apply to CSV too.
If we see more need for it, we can consider a function like schema_of_csv 
as well.

On Sun, Sep 16, 2018 at 4:41 PM, Maxim Gekk <maxim.g...@databricks.com> wrote:
Hi Reynold,
> i'd make this as consistent as to_json / from_json as possible
Sure, new function from_csv() has the same signature as from_json().
> how would this work in sql? i.e. how would passing options in work?
The options are passed to the function via a map, for example:

select from_csv('26/08/2015', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'))
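For readers less familiar with the date pattern letters: 'dd/MM/yyyy' is a day/month/year pattern, which corresponds to '%d/%m/%Y' in Python's strptime notation. A minimal sketch of the same parse outside Spark (illustrative only, not Spark code):

```python
from datetime import datetime

# 'dd/MM/yyyy' in Spark's pattern syntax ~ '%d/%m/%Y' in strptime notation.
parsed = datetime.strptime('26/08/2015', '%d/%m/%Y')
print(parsed.year, parsed.month, parsed.day)  # 2015 8 26
```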

On Sun, Sep 16, 2018 at 7:01 AM Reynold Xin <r...@databricks.com> wrote:
makes sense - i'd make this as consistent as to_json / from_json as possible. 
how would this work in sql? i.e. how would passing options in work?
--excuse the brevity and lower case due to wrist injury

On Sat, Sep 15, 2018 at 2:58 AM Maxim Gekk <maxim.g...@databricks.com> wrote:
Hi All,
I would like to propose new function from_csv() for parsing columns containing 
strings in CSV format. Here is my PR: https://github.com/apache/spark/pull/22379
A use case is loading a dataset from external storage, a DBMS, or a system like 
Kafka, where CSV content was dumped into one of the columns/fields. Other columns 
could contain related information such as timestamps, ids, data sources, etc. 
The column with CSV strings can be parsed by the existing csv() method of 
DataFrameReader, but in that case we have to "clean up" the dataset and remove 
the other columns, since csv() requires a Dataset[String]. Joining the result of 
parsing back to the original dataset by position is expensive and inconvenient. 
Instead, users parse CSV columns with string functions. That approach is usually 
error prone, especially for quoted values and other special cases.
The methods proposed in the PR should provide a better user experience for 
parsing CSV-like columns. Please share your thoughts.
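The quoting problem mentioned above is easy to demonstrate outside Spark. A minimal sketch in plain Python (illustrative only, not the proposed API), comparing naive string splitting against a real CSV parser:

```python
import csv
import io

# A line where one field contains a comma inside quotes:
line = 'id1,"value, with comma",26/08/2015'

# Naive string splitting breaks the quoted field apart:
naive = line.split(',')  # 4 pieces instead of 3

# A real CSV parser keeps the quoted field intact:
proper = next(csv.reader(io.StringIO(line)))
print(naive)   # ['id1', '"value', ' with comma"', '26/08/2015']
print(proper)  # ['id1', 'value, with comma', '26/08/2015']
```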
-- 

Maxim Gekk
Technical Solutions Lead
Databricks Inc.
maxim.g...@databricks.com
databricks.com






-- 
Dongjin Lee
A hitchhiker in the mathematical world.
github: github.com/dongjinleekr
linkedin: kr.linkedin.com/in/dongjinleekr
slideshare: www.slideshare.net/dongjinleekr
