Hi Divya,

There are a number of ways you can do this.
Get today's date in epoch format. These are my package imports:

import java.util.Calendar
import org.joda.time._
import java.math.BigDecimal
import java.sql.{Timestamp, Date}
import org.joda.time.format.DateTimeFormat

// Get epoch time now
scala> val epoch = System.currentTimeMillis
epoch: Long = 1474996552292

// Get thirty days ago in epoch time
scala> val thirtydaysago = epoch - (30 * 24 * 60 * 60 * 1000L)
thirtydaysago: Long = 1472404552292
// Note the L for Long at the end, so the multiplication does not overflow Int

// Define a function to convert epoch millis to a string, to double-check
// that it is indeed 30 days ago
scala> def timeToStr(epochMillis: Long): String = {
     |   DateTimeFormat.forPattern("YYYY-MM-dd HH:mm:ss").print(epochMillis)
     | }
timeToStr: (epochMillis: Long)String

scala> timeToStr(epoch)
res4: String = 2016-09-27 18:15:52

So you need to pick files >= file_thirtydaysago up to file_epoch.

Regardless, I think you can do better by partitioning the directories. With a file created every 5 minutes you will have 288 files generated daily (12 * 24). Just partition into a sub-directory daily. Flume can do that for you, or you can do it in a shell script.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On 27 September 2016 at 15:53, Peter Figliozzi <pete.figlio...@gmail.com> wrote:

> If you're up for a fancy but excellent solution:
>
> - Store your data in Cassandra.
> - Use the expiring data feature (TTL)
>   <https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html>
>   so data will automatically be removed a month later.
> - Now in your Spark process, just read from the database and you don't
>   have to worry about the timestamp.
> - You'll still have all your old files if you need to refer back to them.
>
> Pete
>
> On Tue, Sep 27, 2016 at 2:52 AM, Divya Gehlot <divya.htco...@gmail.com> wrote:
>
>> Hi,
>> The input data files for my Spark job are generated every five minutes,
>> and the file names follow the epoch time convention as below:
>>
>> InputFolder/batch-1474959600000
>> InputFolder/batch-1474959900000
>> InputFolder/batch-1474960200000
>> InputFolder/batch-1474960500000
>> InputFolder/batch-1474960800000
>> InputFolder/batch-1474961100000
>> InputFolder/batch-1474961400000
>> InputFolder/batch-1474961700000
>> InputFolder/batch-1474962000000
>> InputFolder/batch-1474962300000
>>
>> As per the requirement, I need to read one month of data back from the
>> current timestamp.
>>
>> Would really appreciate it if anybody could help me.
>>
>> Thanks,
>> Divya
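
A minimal sketch of the filename-filtering approach Mich describes, pulling the "pick files >= file_thirtydaysago up to file_epoch" step together. This is an illustration, not code from the thread: the object name `BatchFileFilter` and the helper names are hypothetical, and it lists a local directory with `java.io.File` (for HDFS you would use the Hadoop `FileSystem` API instead, which is not shown here).

```scala
import java.io.File
import scala.util.Try

object BatchFileFilter {
  // Milliseconds in 30 days; the leading 30L keeps the arithmetic in Long
  val ThirtyDaysMillis: Long = 30L * 24 * 60 * 60 * 1000

  // Extract the epoch millis from a name like "batch-1474959600000";
  // returns None for names that don't match the convention
  def batchEpoch(name: String): Option[Long] =
    if (name.startsWith("batch-"))
      Try(name.stripPrefix("batch-").toLong).toOption
    else None

  // Paths of batch files whose embedded timestamp falls in [now - 30 days, now]
  def lastMonthFiles(inputFolder: File, now: Long): Seq[String] = {
    val cutoff = now - ThirtyDaysMillis
    Option(inputFolder.listFiles).getOrElse(Array.empty[File])
      .toSeq
      .filter(f => batchEpoch(f.getName).exists(ts => ts >= cutoff && ts <= now))
      .map(_.getPath)
      .sorted
  }
}
```

The resulting path list could then be handed to Spark in one read, e.g. `spark.read.text(paths: _*)`, rather than looping over files.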