Stepping away from any particular framework, it seems to me that you can never guarantee that you read only the rows in that date range.
Even with a sorted array, you need an O(log N) binary search to find each of your boundary dates. You could maintain explicit pointers to those boundaries, but that turns out to be moot because (a) your dates change dynamically, so updating the pointers while maintaining sorted order requires a minimum of O(log N) operations anyway, and (b) you are dealing with files, not arrays: files require you to seek to a particular line number one line at a time, in O(N).

So, back to your Spark-specific question: you cannot do better than O(N) with a file anyway, so why worry about anything more sophisticated than a 'filter' transformation?

On Sep 16, 2013 3:51 PM, "Satheessh" <[email protected]> wrote:

> 1. The date is dynamic. (I.e., if the date is changed we shouldn't read all
> records.) It looks like the solution below will read all the records if the
> date is changed. (Please correct me if I am wrong.)
>
> 2. We can assume the file is sorted by date.
>
> Sent from my iPhone
>
> On Sep 16, 2013, at 5:27 PM, Horia <[email protected]> wrote:
>
> Without sorting, you can implement this using the 'filter' transformation.
> This will eventually read all the rows once, but subsequently only shuffle
> and send the transformed data which passed the filter.
>
> Does this help, or did I misunderstand?
>
> On Sep 16, 2013 1:37 PM, "satheessh chinnu" <[email protected]> wrote:
>
>> I have a text file. Each line is a record, and the first ten characters
>> of each line are a date in YYYY-MM-DD format.
>>
>> I would like to run a map function on this RDD over a specific date range
>> (i.e., from 2005-01-01 to 2007-12-31), and I would like to avoid reading
>> records outside the specified date range (i.e., a kind of primary index
>> sorted by date).
>>
>> Is there a way to implement this?
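The 'filter' transformation suggested above boils down to a predicate on the first ten characters of each line. A minimal sketch, in plain Python with an in-memory list standing in for the RDD (the file name and sample records are made up for illustration; in Spark the same predicate would be passed to `rdd.filter`):

```python
# Each record starts with a YYYY-MM-DD date in its first ten characters.
# Because the format is zero-padded, plain string comparison orders dates
# correctly, so no date parsing is needed for the range test.

def in_range(line, start="2005-01-01", end="2007-12-31"):
    """Predicate for the 'filter' transformation: keep lines whose
    leading date falls in [start, end] inclusive."""
    return start <= line[:10] <= end

# Stand-in for the RDD; in Spark this might come from sc.textFile(...).
records = [
    "2004-12-31 too early",
    "2005-01-01 first day in range",
    "2006-06-15 in range",
    "2007-12-31 last day in range",
    "2008-01-01 too late",
]

# Equivalent in spirit to records_rdd.filter(in_range): every line is
# still read once, but only matching lines flow to later transformations.
kept = [line for line in records if in_range(line)]
```

Changing the date range means only changing the `start`/`end` arguments; the full scan per run is unavoidable with a flat file, as argued above.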

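For contrast, the O(log N) boundary search that a sorted in-memory array permits, but a line-oriented flat file does not, can be sketched with Python's `bisect` module (the sample dates are invented for illustration):

```python
import bisect

# Sorted in-memory list of record dates; zero-padded YYYY-MM-DD strings
# compare in date order.
dates = ["2004-03-01", "2005-01-01", "2005-07-19", "2006-02-02",
         "2007-12-31", "2009-08-08"]

# Two O(log N) binary searches locate the slice covering [start, end]
# without scanning the whole array. A flat file, readable only one line
# at a time, offers no such random access, hence the O(N) floor above.
start, end = "2005-01-01", "2007-12-31"
lo = bisect.bisect_left(dates, start)   # first index with date >= start
hi = bisect.bisect_right(dates, end)    # first index with date > end
selected = dates[lo:hi]
```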