I have created a DataFrame from an HBase table (via Phoenix) that has 500 million rows. From the DataFrame I created an RDD of JavaBeans and use it to join with data from a file.
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

import scala.Tuple2;

// Load the Phoenix table as a DataFrame
Map<String, String> phoenixInfoMap = new HashMap<String, String>();
phoenixInfoMap.put("table", tableName);
phoenixInfoMap.put("zkUrl", zkURL);

DataFrame df = sqlContext.read()
    .format("org.apache.phoenix.spark")
    .options(phoenixInfoMap)
    .load();

// Convert to a pair RDD keyed by ID for the join
JavaRDD<Row> tableRows = df.toJavaRDD();

JavaPairRDD<String, String> dbData = tableRows.mapToPair(
    new PairFunction<Row, String, String>() {
        @Override
        public Tuple2<String, String> call(Row row) throws Exception {
            return new Tuple2<String, String>(row.getAs("ID"), row.getAs("NAME"));
        }
    });

Now my question: let's say the file has 2 million unique entries that match the table. Will the entire table be loaded into memory as an RDD, or will only the matching 2 million records from the table be loaded?

http://stackoverflow.com/questions/37289849/phoenix-spark-load-table-as-dataframe

--
Thanks and Regards
Mohan
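P.S. For context, the join with the file data is along these lines. This is only a simplified sketch: sc, filePath, and the comma-split parsing are placeholders, not my exact code.

// Read the file and key it by ID, same key as dbData
JavaPairRDD<String, String> fileData = sc.textFile(filePath).mapToPair(
    new PairFunction<String, String, String>() {
        @Override
        public Tuple2<String, String> call(String line) throws Exception {
            String[] parts = line.split(",");
            return new Tuple2<String, String>(parts[0], parts[1]); // parts[0] = ID
        }
    });

// Inner join: only keys present on both sides survive
JavaPairRDD<String, Tuple2<String, String>> joined = dbData.join(fileData);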