> I think my best option is to partition my data in directories by day
> before running my Spark application, and then direct
> my Spark application to load RDDs from each directory when
> I want to load a date range. How does this sound?
>
If your upstream system can write data by day, then it makes perfect sense
to do that and load into Spark only the data required for processing. This
also saves you the filter step, and hopefully time and memory. If you want
the bigger dataset back, you can always union multiple days of data (RDDs)
together.
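
For example, a minimal Scala sketch of that pattern. The layout (one
directory per day under /data/events) and the plain-text records are
assumptions for illustration; SparkContext.union is what stitches the
per-day RDDs back into a single RDD:

import java.time.LocalDate
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LoadDateRange {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-date-range"))

    // Hypothetical layout: one directory per day, e.g. /data/events/2015-03-01/
    val basePath = "/data/events"

    // Enumerate the days in the requested range (inclusive).
    val start = LocalDate.parse("2015-03-01")
    val end   = LocalDate.parse("2015-03-07")
    val days  = Iterator.iterate(start)(_.plusDays(1))
                        .takeWhile(!_.isAfter(end))
                        .toSeq

    // Load one RDD per day's directory, then union them into one RDD,
    // so no filter over the full dataset is ever needed.
    val perDay: Seq[RDD[String]] = days.map(d => sc.textFile(s"$basePath/$d"))
    val range: RDD[String] = sc.union(perDay)

    println(s"Loaded ${range.count()} records for ${days.size} days")
    sc.stop()
  }
}

(sc.textFile also accepts a comma-separated list of paths, so building one
string of per-day paths would work just as well as the explicit union.)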
