RDD.tail()
Has there been any thought to adding a tail() method to RDD? It would be really handy to skip over the first item in an RDD when it contains header information. Even better would be a drop(int) function that would allow you to skip over several lines of header information. Our attempts to do something equivalent with a filter() call seem a bit contorted. Any thoughts?

Thanks,
Philip
Re: RDD.tail()
We have similar needs but, IIRC, I came to the conclusion that this would only work on ordered RDDs, and then you would still have to figure out which partition is the first one. I ended up deciding it would be best to just drop the header lines from a Scala iterator before creating an RDD based on it. Not sure if this was the right thing to do, but would that work for you?

Regards,
Ethan

On Mon, Apr 14, 2014 at 10:24 AM, Philip Ogren philip.og...@oracle.com wrote:
> Has there been any thought to adding a tail() method to RDD? It would be really handy to skip over the first item in an RDD when it contains header information. Even better would be a drop(int) function that would allow you to skip over several lines of header information. Our attempts to do something equivalent with a filter() call seem a bit contorted. Any thoughts?
>
> Thanks,
> Philip
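Ethan's approach of dropping the header lines from a Scala iterator before creating the RDD might look roughly like the sketch below. The object name `DropHeaderOnDriver`, the helper names, and the `path` and `headerLines` parameters are all hypothetical, chosen for illustration. It assumes the input is a single local file small enough to read on the driver.

```scala
import scala.io.Source
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper object; the names are illustrative, not part of Spark.
object DropHeaderOnDriver {
  // Skip the first `headerLines` lines and materialize the rest,
  // so the result no longer depends on the underlying file handle.
  def withoutHeader(lines: Iterator[String], headerLines: Int): Seq[String] =
    lines.drop(headerLines).toVector

  // Read a local file on the driver, drop its header, then distribute the rest.
  def load(sc: SparkContext, path: String, headerLines: Int): RDD[String] = {
    val source = Source.fromFile(path)
    try sc.parallelize(withoutHeader(source.getLines(), headerLines))
    finally source.close()
  }
}
```

Because the whole file passes through the driver, this probably only suits inputs that fit comfortably in driver memory.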
Re: RDD.tail()
You can use mapPartitionsWithIndex and look at the partition index (0 will be the first partition) to decide whether to skip the first line.

Matei

On Apr 14, 2014, at 8:50 AM, Ethan Jewett esjew...@gmail.com wrote:
> We have similar needs but IIRC, I came to the conclusion that this would only work on ordered RDDs, and then you would still have to figure out which partition is the first one. I ended up deciding it would be best to just drop the header lines from a Scala iterator before creating an RDD based on it. Not sure if this was the right thing to do, but would that work for you?
>
> Regards,
> Ethan
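Matei's suggestion could be sketched as follows. The object name `DropHeader`, the method `dropFirst`, and the input path taken from `args(0)` are hypothetical. The sketch assumes all header lines fall inside partition 0, which should hold for a short header at the start of a single text file but not for several files that each carry their own header.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object; only mapPartitionsWithIndex is Spark's own API.
object DropHeader {
  // Drop the first `n` records of partition 0; all other partitions pass through unchanged.
  def dropFirst(rdd: RDD[String], n: Int): RDD[String] =
    rdd.mapPartitionsWithIndex { (index, iter) =>
      if (index == 0) iter.drop(n) else iter
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DropHeader").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.textFile(args(0)) // args(0): hypothetical input path
      dropFirst(lines, 1).take(5).foreach(println)
    } finally sc.stop()
  }
}
```

Since `iter.drop(n)` is lazy and only touches one partition, this avoids testing every record the way a filter() call would.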