Hi Divya,

There are a number of ways you can do this.
Get today's date in epoch format. These are my package imports:

import java.util.Calendar
import org.joda.time._
import java.math.BigDecimal
import java.sql.{Timestamp, Date}
import org.joda.time.format.DateTimeFormat

// Get epoch time now
scala> val epoch = System.currentTimeMillis
epoch: Long = 1474996552292

// Get thirty days ago in epoch time
scala> val thirtydaysago = epoch - (30 * 24 * 60 * 60 * 1000L)
thirtydaysago: Long = 1472404552292
// Note the L for Long at the end, so the multiplication does not overflow Int

// Define a function to convert epoch millis to a string, to double-check
// that it is indeed 30 days ago
scala> def timeToStr(epochMillis: Long): String = {
     |   DateTimeFormat.forPattern("YYYY-MM-dd HH:mm:ss").print(epochMillis)
     | }
timeToStr: (epochMillis: Long)String

scala> timeToStr(epoch)
res4: String = 2016-09-27 18:15:52

So you need to pick files >= file_thirtydaysago up to file_epoch.

Regardless, I think you can do better by partitioning the directories. With a file created every 5 minutes you will have 288 files generated daily (12 * 24). Just partition into a sub-directory daily. Flume can do that for you, or you can do it in a shell script.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On 27 September 2016 at 15:53, Peter Figliozzi <pete.figlio...@gmail.com> wrote:

> If you're up for a fancy but excellent solution:
>
> - Store your data in Cassandra.
> - Use the expiring data feature (TTL)
>   <https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html>
>   so data will automatically be removed a month later.
> - Now in your Spark process, just read from the database and you don't
>   have to worry about the timestamp.
> - You'll still have all your old files if you need to refer back to them.
>
> Pete
>
> On Tue, Sep 27, 2016 at 2:52 AM, Divya Gehlot <divya.htco...@gmail.com> wrote:
>
>> Hi,
>> The input data files for my Spark job are generated every five minutes,
>> and the file names follow the epoch time convention as below:
>>
>> InputFolder/batch-1474959600000
>> InputFolder/batch-1474959900000
>> InputFolder/batch-1474960200000
>> InputFolder/batch-1474960500000
>> InputFolder/batch-1474960800000
>> InputFolder/batch-1474961100000
>> InputFolder/batch-1474961400000
>> InputFolder/batch-1474961700000
>> InputFolder/batch-1474962000000
>> InputFolder/batch-1474962300000
>>
>> As per the requirement, I need to read one month of data back from the
>> current timestamp.
>>
>> Would really appreciate it if anybody could help me.
>>
>> Thanks,
>> Divya
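
A minimal sketch of the filename-filtering approach Mich describes, pulling the "pick files >= file_thirtydaysago up to file_epoch" step together. This is an illustration, not code from the thread: the object name `BatchFileFilter` and the helper names are hypothetical, and it lists a local directory with `java.io.File` (for HDFS you would use the Hadoop `FileSystem` API instead, which is not shown here).

```scala
import java.io.File
import scala.util.Try

object BatchFileFilter {
  // Milliseconds in 30 days; the leading 30L keeps the arithmetic in Long
  val ThirtyDaysMillis: Long = 30L * 24 * 60 * 60 * 1000

  // Extract the epoch millis from a name like "batch-1474959600000";
  // returns None for names that don't match the convention
  def batchEpoch(name: String): Option[Long] =
    if (name.startsWith("batch-"))
      Try(name.stripPrefix("batch-").toLong).toOption
    else None

  // Paths of batch files whose embedded timestamp falls in [now - 30 days, now]
  def lastMonthFiles(inputFolder: File, now: Long): Seq[String] = {
    val cutoff = now - ThirtyDaysMillis
    Option(inputFolder.listFiles).getOrElse(Array.empty[File])
      .toSeq
      .filter(f => batchEpoch(f.getName).exists(ts => ts >= cutoff && ts <= now))
      .map(_.getPath)
      .sorted
  }
}
```

The resulting path list could then be handed to Spark in one read, e.g. `spark.read.text(paths: _*)`, rather than looping over files.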