An RDD is a very different creature from a NoSQL store, so I would not put the two in the same ballpark for NoSQL-like workloads. An RDD is not built for point queries or range scans, since any request would launch a distributed job to scan all partitions. Nor is it built for, say, thousands of concurrent jobs (queries).
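
To make the difference concrete, here is a rough sketch in plain Java. It does not use the Spark or HBase APIs at all: the class ScanSketch, its methods and the row key format are all made up for illustration. The list of lists stands in for the partitions of a cached RDD, and the TreeMap stands in for a store that keeps rows sorted by key. The point is only that a filter has to examine every row in every partition, while a sorted store can seek to the start key and stop at the end key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class ScanSketch {

    // Hypothetical stand-in for a cached RDD: rows spread over partitions,
    // with no index on the key. Keys are interleaved across partitions.
    public static List<List<String>> partitions(int numPartitions, int rowsPerPartition) {
        List<List<String>> parts = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < rowsPerPartition; i++) {
                rows.add(String.format("row%05d", i * numPartitions + p));
            }
            parts.add(rows);
        }
        return parts;
    }

    // Filter-style range query over [lo, hi): every row of every partition
    // is examined. Matches are appended to 'out'; the return value is the
    // number of rows examined.
    public static int filterScan(List<List<String>> parts, String lo, String hi,
                                 List<String> out) {
        int examined = 0;
        for (List<String> part : parts) {
            for (String key : part) {
                examined++;
                if (key.compareTo(lo) >= 0 && key.compareTo(hi) < 0) {
                    out.add(key);
                }
            }
        }
        return examined;
    }

    // Hypothetical stand-in for a key-sorted store: seek to 'lo' and read
    // forward until 'hi', touching only the rows inside the range.
    public static List<String> sortedRangeScan(TreeMap<String, String> store,
                                               String lo, String hi) {
        return new ArrayList<>(store.subMap(lo, true, hi, false).keySet());
    }

    public static void main(String[] args) {
        List<List<String>> parts = partitions(4, 250);

        // Load the same keys into the sorted stand-in.
        TreeMap<String, String> store = new TreeMap<>();
        for (List<String> part : parts) {
            for (String key : part) {
                store.put(key, "value");
            }
        }

        List<String> matches = new ArrayList<>();
        int examined = filterScan(parts, "row00100", "row00110", matches);
        System.out.println("filter: examined " + examined
                + " rows, matched " + matches.size());

        List<String> ranged = sortedRangeScan(store, "row00100", "row00110");
        System.out.println("sorted store: returned " + ranged.size()
                + " rows by seeking to the start key");
    }
}
```

In this toy setup the filter walks all of the rows to find a handful of matches, and in Spark that walk would additionally be scheduled as a distributed job per query.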

On Thu, Mar 26, 2015 at 1:57 PM, Stuart Layton <[email protected]> wrote:
> Thanks but I'm hoping to get away from hbase all together. I was wondering
> if there is a way to get similar scan performance directly on cached rdd's
> or data frames
>
> On Thu, Mar 26, 2015 at 9:54 AM, Ted Yu <[email protected]> wrote:
>>
>> In examples//src/main/scala/org/apache/spark/examples/HBaseTest.scala,
>> TableInputFormat is used.
>> TableInputFormat accepts parameter
>>
>>   public static final String SCAN = "hbase.mapreduce.scan";
>>
>> where if specified, Scan object would be created from String form:
>>
>>     if (conf.get(SCAN) != null) {
>>       try {
>>         scan = TableMapReduceUtil.convertStringToScan(conf.get(SCAN));
>>
>> You can use TableMapReduceUtil#convertScanToString() to convert a Scan
>> which has filter(s) and pass to TableInputFormat
>>
>> Cheers
>>
>> On Thu, Mar 26, 2015 at 6:46 AM, Stuart Layton <[email protected]>
>> wrote:
>>>
>>> HBase scans come with the ability to specify filters that make scans very
>>> fast and efficient (as they let you seek for the keys that pass the filter).
>>>
>>> Do RDD's or Spark DataFrames offer anything similar or would I be
>>> required to use a NoSQL db like HBase to do something like this?
>>>
>>> --
>>> Stuart Layton
>
> --
> Stuart Layton
