Looking good Nicolas, thanks for sharing. Since there is also PySpark support, it should be relatively straightforward to invoke the spark-postgres library from Airflow.
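Since the library is compatible with Apache Livy (mentioned below), one way Airflow could trigger it is by POSTing a batch job to Livy's /batches endpoint. A minimal sketch of building that request body follows; the script path, table name, and Livy host are hypothetical placeholders, not part of spark-postgres itself:

```python
import json


def livy_batch_payload(file, args=None, conf=None):
    """Build the JSON body for a POST to Livy's /batches endpoint.

    Only "file" is required by the Livy batches API; "args" and
    "conf" are optional extras.
    """
    payload = {"file": file}
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return json.dumps(payload)


# Hypothetical example: the HDFS path and arguments are placeholders
# for a PySpark script that calls into spark-postgres.
body = livy_batch_payload(
    "hdfs:///jobs/spark_postgres_load.py",
    args=["--table", "my_table"],
)
```

An Airflow task could then send this body with a SimpleHttpOperator or a plain HTTP call to `http://<livy-host>:8998/batches` (host assumed).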
Cheers, Fokko

On Sat, 9 Feb 2019 at 12:16, Nicolas Paris <[email protected]> wrote:
> Hi
>
> Be careful with Spark JDBC as a replacement for Sqoop on large tables.
> Sqoop can handle a source table of any size, while the Spark JDBC
> design cannot: although it provides a way to distribute the read over
> multiple partitions, Spark is limited by executor memory, whereas
> Sqoop is limited only by HDFS space.
>
> As a result, I have written a Spark library (for Postgres only right
> now) which overcomes the core Spark JDBC limitations. It handles any
> workload, and my tests show it was 8 times faster than Sqoop. I have
> not tested it with Airflow, but it is compatible with Apache Livy and
> PySpark.
>
> https://github.com/EDS-APHP/spark-postgres
>
> On Fri, Feb 01, 2019 at 01:53:57PM +0100, Iván Robla Albarrán wrote:
> > Hi,
> >
> > I am searching for a way to substitute Apache Sqoop.
> >
> > I am analyzing SparkJDBCOperator, but I don't understand how I have
> > to use it.
> >
> > Is it a version of the SparkSubmit operator that includes a JDBC
> > connection?
> >
> > Do I need to include Spark code?
> >
> > Any example?
> >
> > Thanks, I am very lost.
> >
> > Regards,
> > Iván Robla
>
> --
> nicolas
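For context on the partitioning limitation discussed above: Spark's JDBC source splits a read into `numPartitions` WHERE-clause predicates derived from `partitionColumn`, `lowerBound`, and `upperBound`, and each resulting partition must fit in executor memory. A simplified sketch of that stride logic (an illustration only, not Spark's actual implementation, which also handles non-integral strides and timestamp columns):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Generate WHERE-clause predicates the way Spark's JDBC source
    splits a table: equal-width strides over [lower, upper), with the
    first partition also picking up NULLs and the boundary partitions
    left open-ended so no rows outside the bounds are dropped."""
    stride = (upper - lower) // num_partitions
    predicates = []
    current = lower
    for i in range(num_partitions):
        if i == 0:
            # First partition is open below and catches NULL keys.
            predicates.append(f"{column} < {current + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open above.
            predicates.append(f"{column} >= {current}")
        else:
            predicates.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return predicates


# Example: splitting an id range 0..100 into 4 reads.
preds = jdbc_partition_predicates("id", 0, 100, 4)
```

Each predicate becomes one concurrent SELECT against the source database, which is why a badly skewed partition column can still blow out a single executor even when the total table would fit in HDFS.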
