Efficient way to split an input data set into different output files

2014-11-19 Thread Tom Seddon
I'm trying to set up a PySpark ETL job that takes in JSON log files and spits out fact table files for upload to Redshift. Is there an efficient way to send different event types to different outputs without having to read the same cached RDD twice? I have my first RDD which is just a json

Re: Efficient way to split an input data set into different output files

2014-11-19 Thread Nicholas Chammas
I don't have a solution for you, but it sounds like you might want to follow this issue: SPARK-3533 (Add saveAsTextFileByKey() method to RDDs): https://issues.apache.org/jira/browse/SPARK-3533

On Wed, 19 Nov 2014 at 6:41:11 AM, Tom Seddon mr.tom.sed...@gmail.com wrote:
> I'm trying to set up a
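To make the semantics concrete: what SPARK-3533 proposes is a single pass over the data that routes each record to an output determined by its key, instead of caching the RDD and filtering it once per event type. The plain-Python sketch below illustrates that single-pass routing outside of Spark (no cluster needed); the `event_type` field name is an assumption about the log schema, not something stated in the thread.

```python
import json
import os
import tempfile


def split_by_event_type(lines, out_dir):
    """Single-pass split: route each JSON log line to a per-event-type file.

    Mirrors the semantics saveAsTextFileByKey() would give: one pass over
    the input, one output per key. Plain-Python sketch only; the field
    name 'event_type' is an assumed part of the log schema.
    """
    handles = {}  # event type -> open file handle
    try:
        for line in lines:
            record = json.loads(line)
            event_type = record["event_type"]  # assumed key field
            fh = handles.get(event_type)
            if fh is None:
                # First time we see this event type: open its output file.
                fh = open(os.path.join(out_dir, f"{event_type}.jsonl"), "w")
                handles[event_type] = fh
            fh.write(line + "\n")
    finally:
        for fh in handles.values():
            fh.close()
    return sorted(handles)


# Usage: three log lines of two event types land in two files in one pass.
logs = [
    '{"event_type": "click", "user": 1}',
    '{"event_type": "view", "user": 2}',
    '{"event_type": "click", "user": 3}',
]
out_dir = tempfile.mkdtemp()
keys = split_by_event_type(logs, out_dir)
```

In Spark itself, the usual workaround before such a method lands is `saveAsHadoopFile` with a `MultipleTextOutputFormat` subclass, which applies the same idea (output path chosen per key) at the partition level.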