I'm trying to set up a PySpark ETL job that takes in JSON log files and
spits out fact table files for upload to Redshift. Is there an efficient
way to send different event types to different outputs without having to
just read the same cached RDD twice? I have my first RDD which is just a JSON
I don't have a solution for you, but it sounds like you might want to
follow this issue:
SPARK-3533: Add saveAsTextFileByKey() method to RDDs
https://issues.apache.org/jira/browse/SPARK-3533
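In the meantime, a common stand-in for the missing saveAsTextFileByKey() is to write one file per key yourself. The sketch below is plain Python (no Spark dependency) just to illustrate the intended behaviour; the function name `save_as_text_file_by_key` and the sample events are my own, and in a real job you would apply this per-partition (e.g. via `foreachPartition`) rather than on a driver-side iterable:

```python
import json
import os
import tempfile

def save_as_text_file_by_key(records, out_dir):
    """Write (key, value) pairs to one text file per key, mimicking
    the behaviour proposed in SPARK-3533. `records` is any iterable
    of (key, value) pairs; values are serialized as JSON lines."""
    handles = {}
    try:
        for key, value in records:
            # Open each key's output file lazily on first sight of the key.
            if key not in handles:
                handles[key] = open(os.path.join(out_dir, f"{key}.txt"), "w")
            handles[key].write(json.dumps(value) + "\n")
    finally:
        for h in handles.values():
            h.close()

# Example: route two event types to two separate output files.
events = [
    ("click", {"user": 1}),
    ("view", {"user": 2}),
    ("click", {"user": 3}),
]
out_dir = tempfile.mkdtemp()
save_as_text_file_by_key(events, out_dir)
```

This makes a single pass over the data, which is the point of the JIRA: the alternative of calling `rdd.filter(...)` once per event type re-traverses the cached RDD for every output.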
On Wed Nov 19 2014 at 6:41:11 AM, Tom Seddon <mr.tom.sed...@gmail.com> wrote: