The resulting library is on GitHub: https://github.com/EDS-APHP/spark-postgres
While there is room for improvement, it is able to read/write Postgres data
with the COPY statement, allowing **very large** tables to be read and
written without problems.
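
The approach sketched in the quoted mail below boils down to the Postgres
CopyManager feeding the Spark CSV reader. A minimal illustration of that idea
(not the library's actual API; connection details, table and paths are made
up, and spark is assumed to be an existing SparkSession):

    import java.io.FileOutputStream
    import java.sql.DriverManager
    import org.postgresql.PGConnection

    // open a plain JDBC connection and grab the postgres COPY api
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/mydb", "user", "pass")
    val copyApi = conn.unwrap(classOf[PGConnection]).getCopyAPI

    // COPY streams the whole table as CSV, with no server-side cursor involved
    val out = new FileOutputStream("/tmp/mytable.csv")
    copyApi.copyOut("COPY mytable TO STDOUT WITH (FORMAT csv)", out)
    out.close()
    conn.close()

    // the CSV is then read back as a dataframe with the spark csv reader
    val df = spark.read.option("header", "false").csv("file:///tmp/mytable.csv")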


On Sat, Dec 29, 2018 at 01:06:00PM +0100, Nicolas Paris wrote:
> Hi
> 
> The Spark Postgres JDBC reader is limited because it relies on basic
> SELECT statements with a fetchsize, and it crashes on large tables even
> when multiple partitions are set up with lower/upper bounds.
> 
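For reference, the plain partitioned JDBC read in question is the one
configured like this (table, column and bounds are only examples):

    val df = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")
      .option("user", "user")
      .option("password", "pass")
      .option("dbtable", "mytable")
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "8")
      .option("fetchsize", "10000")
      .load()
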
> I am about to write a new Postgres JDBC reader based on "COPY TO STDOUT".
> It would stream the data and produce CSV on the filesystem (HDFS or
> local). The CSV would then be parsed with the Spark CSV reader to
> produce a dataframe. It would send multiple "COPY TO STDOUT" statements,
> one per executor.
> 
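A rough sketch of that idea, issuing one COPY per Spark partition from the
executors (the modulo split, table name and local paths are only
illustrative):

    import java.io.FileOutputStream
    import java.sql.DriverManager
    import org.postgresql.PGConnection

    val url = "jdbc:postgresql://localhost:5432/mydb"
    val numParts = 4

    // each task opens its own connection and copies its slice to a local csv
    spark.sparkContext.parallelize(0 until numParts, numParts).foreach { p =>
      val conn = DriverManager.getConnection(url, "user", "pass")
      val copyApi = conn.unwrap(classOf[PGConnection]).getCopyAPI
      val out = new FileOutputStream(s"/tmp/mytable_part_$p.csv")
      copyApi.copyOut(
        s"COPY (SELECT * FROM mytable WHERE id % $numParts = $p) TO STDOUT WITH (FORMAT csv)",
        out)
      out.close()
      conn.close()
    }
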
> Right now, I am able to loop over an output stream and write the strings
> somewhere.
> I am wondering what the best way would be to process the resulting
> string stream, in particular how to direct it to an HDFS folder, or
> maybe parse it on the fly into a dataframe.
> 
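For the HDFS part, one simple option is to write the COPY stream straight
into a Hadoop FileSystem output stream, since it is a plain
java.io.OutputStream (connection details and paths are examples):

    import java.sql.DriverManager
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.postgresql.PGConnection

    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/mydb", "user", "pass")
    val copyApi = conn.unwrap(classOf[PGConnection]).getCopyAPI

    // FSDataOutputStream is an OutputStream, so COPY can write into it directly
    val target = new Path("hdfs:///tmp/mytable/part-0.csv")
    val fs = target.getFileSystem(new Configuration())
    val hdfsOut = fs.create(target)
    copyApi.copyOut("COPY mytable TO STDOUT WITH (FORMAT csv)", hdfsOut)
    hdfsOut.close()
    conn.close()
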
> Thanks,
> 
> -- 
> nicolas
> 

-- 
nicolas
