HBase scans let you specify filters that make scans very fast and
efficient, since they let the scan seek directly to the keys that pass the
filter. Do RDDs or Spark DataFrames offer anything similar, or would I need
to use a NoSQL database such as HBase to do something like this?
--
Stuart
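
To make the contrast in the question concrete, here is a minimal sketch in
plain Java. It uses no HBase or Spark API; the class and method names are
hypothetical and exist only for illustration. It assumes string row keys and
compares a range query over unsorted rows, which must test every entry, with
the same query over a sorted map, which can seek to the start key and stop
at the stop key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical illustration only: none of these names come from HBase or Spark.
final class SeekVsScanSketch {

    // No ordering is assumed, so every row is tested against the range.
    // This is the "scan everything, keep what matches" pattern.
    static List<String> fullScan(List<Map.Entry<String, String>> rows,
                                 String start, String stop) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> row : rows) {
            String key = row.getKey();
            if (key.compareTo(start) >= 0 && key.compareTo(stop) < 0) {
                hits.add(row.getValue());
            }
        }
        return hits;
    }

    // Keys are kept sorted, so subMap can seek to `start` (inclusive)
    // and stop at `stop` (exclusive) without visiting the other keys.
    static List<String> rangeSeek(NavigableMap<String, String> sorted,
                                  String start, String stop) {
        return new ArrayList<>(sorted.subMap(start, true, stop, false).values());
    }

    public static void main(String[] args) {
        NavigableMap<String, String> sorted = new TreeMap<>();
        sorted.put("row-01", "a");
        sorted.put("row-02", "b");
        sorted.put("row-03", "c");
        sorted.put("row-04", "d");

        List<Map.Entry<String, String>> rows = new ArrayList<>(sorted.entrySet());

        // Both calls return the same values; only the amount of work differs.
        System.out.println(fullScan(rows, "row-02", "row-04"));
        System.out.println(rangeSeek(sorted, "row-02", "row-04"));
    }
}
```

Both methods return the same rows. The difference is that the sorted
structure can skip keys outside the range, which is roughly what a
key-ordered store gains from seeking, while a filter over unordered data has
to look at everything.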
In examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala,
TableInputFormat is used.
TableInputFormat accepts the parameter
public static final String SCAN = "hbase.mapreduce.scan";
where, if specified, the Scan object is created from its String form:
if (conf.get(SCAN) != null) {
Thanks, but I'm hoping to get away from HBase altogether. I was wondering
if there is a way to get similar scan performance directly on cached RDDs
or DataFrames.
On Thu, Mar 26, 2015 at 9:54 AM, Ted Yu yuzhih...@gmail.com wrote:
In
An RDD is a very different creature than a NoSQL store, so I would not
think of them as being in the same ballpark for NoSQL-like workloads. It's
not built for point queries or range scans, since any request would
launch a distributed job to scan all partitions. It's not something
built for, say,