Hi

The Spark Postgres JDBC reader is limited: it relies on plain SELECT
statements paged with a fetchsize, and it crashes on large tables even
when multiple partitions are set up with lower/upper bounds.
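
For reference, the partitioned JDBC read I mean looks roughly like this
(standard Spark JDBC options; connection details and bounds are placeholders):

import org.apache.spark.sql.SparkSession

object JdbcPartitionedRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

    // Standard Spark JDBC partitioned read; every partition still runs a
    // plain SELECT paged by fetchsize, which is what struggles on big tables.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "mytable")
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .option("fetchsize", "10000")
      .load()

    df.show(5)
  }
}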

I am about to write a new Postgres JDBC reader based on "COPY TO STDOUT".
It would stream the data and produce CSV on the filesystem (HDFS or
local). The CSV would then be parsed with the Spark CSV reader to
produce a dataframe. It would send multiple "COPY TO STDOUT" commands,
one for each executor.
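
As a sketch of what I have in mind, each COPY would go through the
Postgres JDBC driver's CopyManager, roughly like this (connection
details, query, and the output path are placeholders):

import java.io.{BufferedOutputStream, FileOutputStream}
import java.sql.DriverManager
import org.postgresql.PGConnection

object CopyOutSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder connection details.
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/mydb", "user", "password")
    try {
      // The Postgres JDBC driver exposes COPY through CopyManager.
      val copyManager = conn.unwrap(classOf[PGConnection]).getCopyAPI
      val out = new BufferedOutputStream(new FileOutputStream("/tmp/mytable.csv"))
      try {
        // COPY ... TO STDOUT streams the rows as CSV without fetchsize paging.
        copyManager.copyOut(
          "COPY (SELECT * FROM mytable) TO STDOUT WITH (FORMAT CSV)", out)
      } finally out.close()
    } finally conn.close()
  }
}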

Right now, I am able to loop over an output stream and write the
resulting string somewhere.
I am wondering what the best way would be to process the resulting
string stream, in particular how to direct it to an HDFS folder, or
maybe parse it on the fly into a dataframe.
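
One option I am considering, sketched under the assumption that each
task gets its own id range and the Hadoop FileSystem API is reachable
from the executors: write each COPY stream to a part file under an HDFS
directory, then read the directory back with the Spark CSV reader.
Paths, ranges, and table/column names below are illustrative.

import java.net.URI
import java.sql.DriverManager
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import org.postgresql.PGConnection

object CopyToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("copy-to-hdfs").getOrCreate()
    val outputDir = "hdfs:///tmp/mytable_csv"             // illustrative path
    val ranges = Seq((1L, 500000L), (500001L, 1000000L))  // illustrative bounds

    // Run one COPY per partition on the executors; each writes its own part file.
    spark.sparkContext.parallelize(ranges, ranges.size).foreachPartition { it =>
      val fs = FileSystem.get(new URI(outputDir), new Configuration())
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://dbhost:5432/mydb", "user", "password")
      try {
        val copyManager = conn.unwrap(classOf[PGConnection]).getCopyAPI
        it.foreach { case (lo, hi) =>
          val out = fs.create(new Path(s"$outputDir/part-$lo-$hi.csv"))
          try {
            copyManager.copyOut(
              s"COPY (SELECT * FROM mytable WHERE id BETWEEN $lo AND $hi) " +
              "TO STDOUT WITH (FORMAT CSV)", out)
          } finally out.close()
        }
      } finally conn.close()
    }

    // Parse the part files back into a dataframe with the Spark CSV reader.
    val df = spark.read.option("inferSchema", "true").csv(outputDir)
    df.show(5)
  }
}

I am not sure whether writing to HDFS first is better than piping the
stream straight into a parser, which is exactly the question above.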

Thanks,

-- 
nicolas
