On Sun, Feb 10, 2019 at 12:45:33PM +0100, Driesprong, Fokko wrote:
> Since there is also PySpark support, it should be relatively
> straightforward to invoke the spark-postgres library from Airflow.
Yes, PySpark is supported. In version 3 of the spark-postgres library I
will improve the PySpark API, which is currently not user friendly.
Still, I have tested it successfully with PySpark, and it gives the same
performance and reliability as Scala Spark.

The idea might work for MySQL and Oracle as well, but that would require
a deep understanding of those databases and of database-specific
features such as Postgres's COPY command, which spark-postgres uses.

Regards

> On Sat, Feb 9, 2019 at 12:16, Nicolas Paris <[email protected]> wrote:
> >
> > Hi
> >
> > Be careful with Spark JDBC as a replacement for Sqoop on large
> > tables. Sqoop can handle a source table of any size, while the Spark
> > JDBC design cannot: although it provides a way to distribute the read
> > over multiple partitions, Spark is limited by the executors' memory,
> > whereas Sqoop is limited only by HDFS space.
> >
> > As a result, I have written a Spark library (for Postgres only right
> > now) which overcomes the core Spark JDBC limitations. It handles any
> > workload, and my tests show it was 8 times faster than Sqoop. I have
> > not tested it with Airflow, but it is compatible with Apache Livy and
> > PySpark.
> >
> > https://github.com/EDS-APHP/spark-postgres
> >
> > On Fri, Feb 01, 2019 at 01:53:57PM +0100, Iván Robla Albarrán wrote:
> > > Hi,
> > >
> > > I am searching for a way to replace Apache Sqoop.
> > >
> > > I am analyzing SparkJDBCOperator, but I don't understand how to
> > > use it. Is it a version of the SparkSubmitOperator that takes a
> > > JDBC connection as its connection?
> > >
> > > Do I need to include Spark code? Is there any example?
> > >
> > > Thanks, I am very lost.
> > >
> > > Regards,
> > > Iván Robla
> >
> > --
> > nicolas

--
nicolas
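[Editor's note] The partitioned Spark JDBC read that the thread warns about can be sketched as follows. Spark turns the `partitionColumn`/`lowerBound`/`upperBound`/`numPartitions` options into one WHERE predicate per partition, and each partition's rows must fit in an executor's memory, which is the limitation Nicolas describes. The sketch below is illustrative only (the helper name `jdbc_partition_predicates` is invented for this example; it mimics the stride logic of Spark's JDBC source rather than reproducing its code):

```python
# Illustrative sketch: how a Spark-JDBC-style read is split into
# per-partition WHERE predicates. NOT Spark's actual implementation.

def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Build one WHERE-clause predicate per partition, in the style of
    Spark's partitionColumn/lowerBound/upperBound/numPartitions options."""
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        if i == 0:
            # First partition also picks up NULLs and anything below lowerBound.
            predicates.append(f"{column} < {current + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is unbounded above, so rows past upperBound are not lost.
            predicates.append(f"{column} >= {current}")
        else:
            predicates.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return predicates

for pred in jdbc_partition_predicates("id", 0, 1_000_000, 4):
    print(pred)
```

In PySpark itself the equivalent read would be issued with `spark.read.jdbc(url, table, column="id", lowerBound=0, upperBound=1_000_000, numPartitions=4, properties=...)`; note that a skewed `id` distribution still concentrates rows in a few partitions, which is why per-executor memory remains the bottleneck.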
