RDD equivalent of HBase Scan

2015-03-26 Thread Stuart Layton
HBase scans come with the ability to specify filters that make scans very fast and efficient (as they let you seek to the keys that pass the filter). Do RDDs or Spark DataFrames offer anything similar, or would I be required to use a NoSQL DB like HBase to do something like this? -- Stuart

Re: RDD equivalent of HBase Scan

2015-03-26 Thread Ted Yu
In examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala, TableInputFormat is used. TableInputFormat accepts the parameter public static final String SCAN = "hbase.mapreduce.scan"; if it is specified, a Scan object is created from its String form: if (conf.get(SCAN) != null) {
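As a rough illustration of how such a scan is configured, here is a minimal sketch in plain Python. It only builds the property map and does not talk to HBase or Spark. The row start/stop property names are the ones TableInputFormat falls back to when "hbase.mapreduce.scan" is not set. The helper name hbase_range_conf, the table name, and the row keys are hypothetical.

```python
# Property names read by HBase's TableInputFormat.
INPUT_TABLE = "hbase.mapreduce.inputtable"
SCAN = "hbase.mapreduce.scan"  # serialized Scan; takes precedence when set
SCAN_ROW_START = "hbase.mapreduce.scan.row.start"
SCAN_ROW_STOP = "hbase.mapreduce.scan.row.stop"


def hbase_range_conf(table, start_row, stop_row):
    """Build a property map for a key-range scan (hypothetical helper).

    SCAN is deliberately left unset so that the simpler row start/stop
    properties are the ones used to build the Scan.
    """
    return {
        INPUT_TABLE: table,
        SCAN_ROW_START: start_row,  # inclusive
        SCAN_ROW_STOP: stop_row,    # exclusive
    }


# Hypothetical table and row keys, for illustration only.
conf = hbase_range_conf("events", "user42|", "user43|")
```

A map like this is the kind of thing that could be passed as the conf argument of SparkContext.newAPIHadoopRDD in PySpark, along with the input format and key/value class names; that wiring is not shown here.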

Re: RDD equivalent of HBase Scan

2015-03-26 Thread Stuart Layton
Thanks, but I'm hoping to get away from HBase altogether. I was wondering if there is a way to get similar scan performance directly on cached RDDs or DataFrames. On Thu, Mar 26, 2015 at 9:54 AM, Ted Yu yuzhih...@gmail.com wrote: In

Re: RDD equivalent of HBase Scan

2015-03-26 Thread Sean Owen
An RDD is a very different creature from a NoSQL store, so I would not think of them as in the same ballpark for NoSQL-like workloads. It's not built for point queries or range scans, since any request would launch a distributed job to scan all partitions. It's not something built for, say,
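To make the difference concrete, here is a small sketch in plain Python. It is a simulation with made-up data, not Spark or HBase code. Lists stand in for partitions: full_scan reads every partition the way a plain filter would, while range_seek assumes the data is range-partitioned and sorted by key, so it can skip partitions and binary-search inside the ones it reads. All names here are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical data: 20 (key, value) records.
records = [(k, f"v{k}") for k in range(20)]

# Layout 1: keys spread across partitions with no ordering guarantee.
unsorted_partitions = [records[i::4] for i in range(4)]

# Layout 2: range-partitioned and sorted by key, like a key-ordered store.
sorted_partitions = [records[i:i + 5] for i in range(0, 20, 5)]
partition_keys = [[k for k, _ in part] for part in sorted_partitions]
first_keys = [keys[0] for keys in partition_keys]


def full_scan(partitions, lo, hi):
    """Range query via filtering: every partition is read in full."""
    hits, touched = [], 0
    for part in partitions:
        touched += 1
        hits.extend(kv for kv in part if lo <= kv[0] < hi)
    return sorted(hits), touched


def range_seek(partitions, keys_per_partition, first, lo, hi):
    """Range query with a key index: only partitions that can match are read."""
    hits, touched = [], 0
    if lo >= hi:
        return hits, touched
    # Last partition whose first key is <= lo (or the first partition).
    start = max(bisect_right(first, lo) - 1, 0)
    for idx in range(start, len(partitions)):
        if first[idx] >= hi:
            break  # this and all later partitions start past the range
        keys = keys_per_partition[idx]
        touched += 1
        hits.extend(partitions[idx][bisect_left(keys, lo):bisect_left(keys, hi)])
    return hits, touched


scan_hits, scan_touched = full_scan(unsorted_partitions, 6, 9)
seek_hits, seek_touched = range_seek(
    sorted_partitions, partition_keys, first_keys, 6, 9
)
```

Both calls return the records with keys 6, 7 and 8, but the filter-style scan touches all four partitions while the indexed lookup touches one.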