https://issues.apache.org/jira/browse/SPARK-3586 talks about creating a file DStream that can monitor for new files recursively, but this functionality has not been added yet.
I don't see an easy way out. You will have to create your folders based on a timeline (it looks like you are already doing that) and run a new job over the new folders created in each interval. This will have to be automated using an external script.

Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
www.snappydata.io

On Wed, Mar 9, 2016 at 10:33 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
> I am wondering if anyone can help.
>
> Our company stores zipped CSV files in S3, which has been a big headache
> from the start. I was wondering if anyone has created a way to iterate
> through several subdirectories (s3n://events/2016/03/01/00,
> s3n://2016/03/01/01, etc.) in S3 to find the newest files and load them. It
> would be a big bonus to include the unzipping of the file in the process so
> that the CSV can be loaded directly into a dataframe for further
> processing. I’m pretty sure that the S3 part of this request is not
> uncommon. I would think the file being zipped is uncommon. If anyone can
> help, I would truly be grateful, for I am new to Scala and Spark. This would
> be a great help in learning.
>
> Thanks,
> Ben
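For the interval-driven job Hemant describes, something along these lines could serve as a starting point. It is a rough, untested sketch: it assumes the files are gzip-compressed (*.csv.gz — Hadoop decompresses gzip transparently, whereas true .zip archives need the extra handling shown further down), that the bucket layout matches the s3n://events/yyyy/MM/dd/HH paths above, and that the spark-csv package (com.databricks:spark-csv) is on the classpath. An external scheduler such as cron would submit it once per interval.

import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object LoadLatestHour {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LoadLatestHour"))
    val sqlContext = new SQLContext(sc)

    // Build the path of the most recent hourly folder, e.g. s3n://events/2016/03/01/00
    val fmt = new SimpleDateFormat("yyyy/MM/dd/HH")
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    val hourPath = s"s3n://events/${fmt.format(new Date())}/*.csv.gz"

    // Gzip'd text files are decompressed on the fly, so the CSVs load
    // straight into a DataFrame through the spark-csv data source.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(hourPath)

    df.show()
    sc.stop()
  }
}

Because each run only touches the newest folder, the jobs stay small; the wrapper script just recomputes the hour and resubmits.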
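If the files are real .zip archives rather than gzip, Hadoop has no built-in codec for them, so one common workaround is to read each archive as a single binary file and unpack it with ZipInputStream on the executors. Again only a sketch (with sc being the same SparkContext as above); it assumes each archive fits comfortably in executor memory, since every file is materialized whole.

import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream

import org.apache.spark.input.PortableDataStream

// Each archive arrives as one (path, stream) pair; unpack every entry into its text lines.
val csvLines = sc.binaryFiles("s3n://events/2016/03/01/00/*.zip")
  .flatMap { case (_, stream: PortableDataStream) =>
    val zis = new ZipInputStream(stream.open())
    Iterator.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { _ =>
        val reader = new BufferedReader(new InputStreamReader(zis))
        Iterator.continually(reader.readLine()).takeWhile(_ != null)
      }
      .toList // materialize the lines before the stream goes out of scope
  }

// csvLines is an RDD[String] of raw CSV rows; from there it can be split into
// columns and turned into a DataFrame, or handed to the spark-csv parser.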