An RDD is a very different creature from a NoSQL store, so I wouldn't
put them in the same ballpark for NoSQL-like workloads. An RDD isn't
built for point queries or range scans: any such request launches a
distributed job that scans every partition. Nor is it built for, say,
thousands of concurrent jobs (queries).
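As a concrete illustration (a rough sketch, assuming an existing
SparkContext `sc`; the key 12345L is arbitrary), even a single-key
lookup against a cached RDD runs as a distributed job, not an indexed
seek:

    val data = sc.parallelize(1 to 1000000).map(i => (i.toLong, s"value-$i"))
    data.cache()

    // No index to seek into: this evaluates the predicate on every
    // cached partition before returning.
    val hit = data.filter { case (k, _) => k == 12345L }.collect()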

On Thu, Mar 26, 2015 at 1:57 PM, Stuart Layton <[email protected]> wrote:
> Thanks, but I'm hoping to get away from HBase altogether. I was wondering
> if there is a way to get similar scan performance directly on cached RDDs
> or DataFrames.
>
> On Thu, Mar 26, 2015 at 9:54 AM, Ted Yu <[email protected]> wrote:
>>
>> In examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala,
>> TableInputFormat is used.
>> TableInputFormat accepts the parameter
>>
>>   public static final String SCAN = "hbase.mapreduce.scan";
>>
>> If it is specified, a Scan object is created from its String form:
>>
>>     if (conf.get(SCAN) != null) {
>>       try {
>>         scan = TableMapReduceUtil.convertStringToScan(conf.get(SCAN));
>>
>> You can use TableMapReduceUtil#convertScanToString() to convert a Scan
>> that has filter(s), and pass the result to TableInputFormat.
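>>
>> A rough sketch of wiring that together (the table name and row-key
>> prefix below are hypothetical; assumes an existing SparkContext `sc`):
>>
>>     import org.apache.hadoop.hbase.HBaseConfiguration
>>     import org.apache.hadoop.hbase.client.{Result, Scan}
>>     import org.apache.hadoop.hbase.filter.PrefixFilter
>>     import org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>     import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
>>     import org.apache.hadoop.hbase.util.Bytes
>>
>>     val conf = HBaseConfiguration.create()
>>     conf.set(TableInputFormat.INPUT_TABLE, "my_table") // hypothetical table
>>
>>     // Build a Scan with a filter, serialize it, and hand it to
>>     // TableInputFormat via the "hbase.mapreduce.scan" property.
>>     val scan = new Scan()
>>     scan.setFilter(new PrefixFilter(Bytes.toBytes("row-prefix"))) // hypothetical prefix
>>     conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))
>>
>>     val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
>>       classOf[ImmutableBytesWritable], classOf[Result])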
>>
>> Cheers
>>
>>
>> On Thu, Mar 26, 2015 at 6:46 AM, Stuart Layton <[email protected]>
>> wrote:
>>>
>>> HBase scans come with the ability to specify filters that make scans very
>>> fast and efficient (as they let you seek to the keys that pass the filter).
>>>
>>> Do RDDs or Spark DataFrames offer anything similar, or would I be
>>> required to use a NoSQL DB like HBase to do something like this?
>>>
>>> --
>>> Stuart Layton
>>
>>
>
>
>
> --
> Stuart Layton

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
