I have a large dataset stored in a BigQuery table and I would like to load it into a PySpark RDD for ETL processing.
I realized that BigQuery supports the Hadoop Input/Output Format (https://cloud.google.com/hadoop/writing-with-bigquery-connector), and PySpark should be able to use this interface to create an RDD via the method "newAPIHadoopRDD" (http://spark.apache.org/docs/latest/api/python/pyspark.html).

Unfortunately, the documentation on both ends seems scarce and goes beyond my knowledge of Hadoop/Spark/BigQuery. Has anybody figured out how to do this?

Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/BigQuery-connector-for-pyspark-via-Hadoop-Input-Format-example-tp23900.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
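For concreteness, here is a sketch of what I imagine the call would look like, pieced together from the connector docs and the newAPIHadoopRDD signature. It is untested: the class names come from the BigQuery connector documentation, and the project/bucket/dataset/table values are placeholders, not real resources.

```python
# Sketch: loading a BigQuery table into a PySpark RDD via the Hadoop
# connector. Untested -- class names are from the BigQuery connector docs;
# the project, bucket, dataset, and table names below are placeholders.

def bq_input_conf(project, bucket, dataset, table):
    """Build the Hadoop configuration dict the BigQuery connector reads."""
    return {
        "mapred.bq.project.id": project,
        "mapred.bq.gcs.bucket": bucket,        # GCS bucket for temporary export files
        "mapred.bq.input.project.id": project,
        "mapred.bq.input.dataset.id": dataset,
        "mapred.bq.input.table.id": table,
    }

conf = bq_input_conf("my-project", "my-bucket", "my_dataset", "my_table")

# With a SparkContext `sc` and the connector jar on the driver/executor
# classpath, I would expect something like:
#
# rdd = sc.newAPIHadoopRDD(
#     "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
#     "org.apache.hadoop.io.LongWritable",
#     "com.google.gson.JsonObject",
#     conf=conf)
#
# presumably yielding (offset, json_text) pairs, one per table row.
```

Is this roughly the right shape, or am I missing required configuration keys?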