RDD equivalent of HBase Scan

2015-03-26 Thread Stuart Layton
HBase scans come with the ability to specify filters that make scans very fast and efficient (as they let you seek to the keys that pass the filter). Do RDDs or Spark DataFrames offer anything similar, or would I be required to use a NoSQL db like HBase to do something like this? -- Stuart
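
To illustrate the seek behaviour the question refers to, here is a minimal sketch in plain Python. It is not HBase or Spark code. It assumes rows are kept sorted by key, as they are in an HBase table, and contrasts checking every row against a predicate with jumping straight to a key range. The names `full_scan` and `range_scan` and the sample keys are hypothetical, made up for this illustration.

```python
from bisect import bisect_left

def full_scan(sorted_rows, start_key, stop_key):
    """Test every row against the key range, as a plain filter would."""
    return [(k, v) for k, v in sorted_rows if start_key <= k < stop_key]

def range_scan(sorted_rows, keys, start_key, stop_key):
    """Binary-search to the first key >= start_key, then read up to stop_key."""
    lo = bisect_left(keys, start_key)
    hi = bisect_left(keys, stop_key)
    return sorted_rows[lo:hi]

# Hypothetical data: zero-padded keys so lexicographic order matches numeric order.
rows = sorted((f"user{i:04d}", i * i) for i in range(1000))
keys = [k for k, _ in rows]

matches = range_scan(rows, keys, "user0010", "user0013")
```

Both functions return the same rows, but `range_scan` touches only the matching slice, while `full_scan` visits all 1000 rows. A filter over an unsorted, unindexed collection has to do the latter.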

Re: RDD equivalent of HBase Scan

2015-03-26 Thread Stuart Layton
= TableMapReduceUtil.convertStringToScan(conf.get(SCAN)); You can use TableMapReduceUtil#convertScanToString() to convert a Scan that has filter(s) and pass it to TableInputFormat. Cheers On Thu, Mar 26, 2015 at 6:46 AM, Stuart Layton stuart.lay...@gmail.com wrote: HBase scans come with the ability to specify filters that make

What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Stuart Layton
and saving it to S3; however, as I want to optimize for filtering speed, I'm not sure this is the best option. -- Stuart Layton

Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Stuart Layton
should certainly use them for the advanced stuff that expressions can't handle). I opened SPARK-6536 https://issues.apache.org/jira/browse/SPARK-6536 to provide a nicer interface for this. On Wed, Mar 25, 2015 at 7:41 AM, Stuart Layton stuart.lay...@gmail.com wrote: I have a SparkSQL

Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Stuart Layton
-testing/, expected: hdfs://ec2-52-0-159-113.compute-1.amazonaws.com:9000 Is it possible to save a DataFrame to S3 directly using Parquet? -- Stuart Layton