Hi,

The Spark Postgres JDBC reader is limited: it relies on plain SELECT statements with a fetchsize, and it crashes on large tables even when multiple partitions are set up with lower/upper bounds.
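For reference, a minimal example of the kind of partitioned read I mean (URL, table, column, and bounds are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

    // Spark splits the [lowerBound, upperBound] range of the partition
    // column into numPartitions strides and issues one SELECT per stride,
    // each streamed with the given fetchsize.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/mydb")
      .option("dbtable", "big_table")
      .option("user", "user")
      .option("password", "secret")
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "16")
      .option("fetchsize", "10000")
      .load()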
I am about to write a new Postgres JDBC reader based on "COPY TO STDOUT". It would stream the data and produce CSV on the filesystem (HDFS or local); the CSV would then be parsed with the Spark CSV reader to produce a DataFrame. It would issue a separate "COPY TO STDOUT" for each executor.

Right now I am able to loop over an output stream and write the string somewhere. I am wondering what the best way to process the resulting string stream would be; in particular, the best way to direct it to an HDFS folder, or perhaps to parse it on the fly into a DataFrame.

Thanks,
-- nicolas
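P.S. For anyone curious, here is a rough sketch of the per-partition COPY I have in mind, using the stock org.postgresql CopyManager API and the Hadoop FileSystem API. The URL, table, WHERE clauses, and output paths are placeholders, and error handling is minimal:

    import java.sql.DriverManager
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.postgresql.PGConnection

    // Runs on one executor: streams one slice of the table straight
    // into its own CSV part file on HDFS, with no cursor or fetchsize.
    def copyPartitionToHdfs(whereClause: String, partId: Int): Unit = {
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://host:5432/mydb", "user", "secret")
      try {
        val copyApi = conn.unwrap(classOf[PGConnection]).getCopyAPI
        val path = new Path(s"hdfs:///tmp/big_table/part-$partId.csv")
        val out = path.getFileSystem(new Configuration()).create(path)
        try {
          // COPY pushes rows directly onto the output stream.
          copyApi.copyOut(
            s"COPY (SELECT * FROM big_table WHERE $whereClause) " +
              "TO STDOUT WITH (FORMAT csv)",
            out)
        } finally out.close()
      } finally conn.close()
    }

    // Afterwards the part files would be read back as one DataFrame:
    // val df = spark.read.csv("hdfs:///tmp/big_table")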