Thanks, that explains it. I understand we cannot do better than O( N ) with files. Is there any way we can achieve log( N ) using Spark? (Maybe storing the data as an in-memory distributed binary tree, i.e. key = date, value = line, then executing the map function on each node.)
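(For illustration only: a minimal plain-Python sketch of what a sorted in-memory index would buy, using hypothetical sample records. It shows the O( log N ) boundary lookup being asked about; it is not a distributed Spark structure.)

```python
import bisect

# Hypothetical sample data: records kept sorted by their date key.
# Each record's first ten characters are a YYYY-MM-DD date, matching
# the file format described further down in this thread.
records = sorted([
    "2004-06-01 record A",
    "2005-03-15 record B",
    "2006-11-30 record C",
    "2008-01-02 record D",
])
dates = [r[:10] for r in records]  # parallel sorted list of date keys

# Binary-search the two boundary dates in O( log N ); every record
# between the two indices lies inside the range, with no scan of the rest.
lo = bisect.bisect_left(dates, "2005-01-01")
hi = bisect.bisect_right(dates, "2007-12-31")
in_range = records[lo:hi]
```

If the date range changes, only the two bisect calls are redone; the sorted structure itself is untouched. This is exactly the property that a flat file cannot provide, since seeking a line in a file is O( N ).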
On Mon, Sep 16, 2013 at 7:12 PM, Horia <[email protected]> wrote:

> Stepping away from any particular framework, it seems to me that you can
> never guarantee that you only read rows in that date range.
>
> Even with a sorted array, you need to do a log( N ) binary search to find
> each of your boundary dates. Unless you maintain explicit pointers to these
> boundaries, which turns out to be moot because a) your dates are changing
> dynamically, so updating them and maintaining the sorted order requires a
> minimum of log( N ) operations anyway, and b) you are dealing with files,
> not arrays - files require you to seek a particular line number one line at
> a time, in O( N ).
>
> So, back to your Spark-specific question: you cannot do better than O( N )
> anyway with a file, so why worry about anything more sophisticated than a
> 'filter' transformation?
>
> On Sep 16, 2013 3:51 PM, "Satheessh" <[email protected]> wrote:
>
>> 1. The date is dynamic (i.e. if the date is changed, we shouldn't read all
>> records). It looks like the solution below will read all the records if the
>> date is changed. (Please correct me if I am wrong.)
>>
>> 2. We can assume the file is sorted by date.
>>
>> Sent from my iPhone
>>
>> On Sep 16, 2013, at 5:27 PM, Horia <[email protected]> wrote:
>>
>> Without sorting, you can implement this using the 'filter' transformation.
>> This will eventually read all the rows once, but subsequently only
>> shuffle and send the transformed data which passed the filter.
>>
>> Does this help, or did I misunderstand?
>>
>> On Sep 16, 2013 1:37 PM, "satheessh chinnu" <[email protected]> wrote:
>>
>>> I have a text file. Each line is a record, and the first ten characters
>>> of each line are a date in YYYY-MM-DD format.
>>>
>>> I would like to run a map function on this RDD with a specific date range
>>> (i.e. from 2005-01-01 to 2007-12-31). I would like to avoid reading the
>>> records outside the specified date range (i.e. a kind of primary index
>>> sorted by date).
>>>
>>> Is there a way to implement this?
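(The 'filter' approach suggested in the quoted thread can be sketched as a date-prefix predicate. In Spark the same predicate would be handed to the RDD's filter transformation, e.g. `sc.textFile(path).filter(in_date_range)`; the sketch below applies it to hypothetical sample lines in plain Python.)

```python
# Each line's first ten characters are a YYYY-MM-DD date. Because the
# format is zero-padded and fixed-width, plain string comparison orders
# dates correctly, so the predicate needs no date parsing.
def in_date_range(line, start="2005-01-01", end="2007-12-31"):
    return start <= line[:10] <= end

# Hypothetical sample lines standing in for the text file's records.
lines = [
    "2004-06-01 record A",
    "2005-03-15 record B",
    "2006-11-30 record C",
    "2008-01-02 record D",
]

# Local equivalent of rdd.filter(in_date_range): every line is examined
# once (O( N )), and only the matching lines survive.
selected = [line for line in lines if in_date_range(line)]
```

As noted above, this reads every row once; its advantage is that only the rows passing the predicate are shuffled and processed by later transformations.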
