Loading a large Hbase table into SPARK RDD takes quite long time
I am trying to load a large HBase table into a Spark RDD so that I can run a Spark SQL query on the entity. For an entity with about 6 million rows, loading it into the RDD takes about 35 seconds. Is that expected? Is there any way to shorten the loading time?

I have picked up some tips from http://hbase.apache.org/book/perf.reading.html to speed up the process, e.g. calling scan.setCaching(cacheSize) and adding only the necessary attributes/columns to the scan. Are there other ways to improve the speed?

Here is the code snippet:

    SparkConf sparkConf = new SparkConf().setMaster("spark://url").setAppName("SparkSQLTest");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);

    // HBase connection settings and the table to read
    Configuration hbase_conf = HBaseConfiguration.create();
    hbase_conf.set("hbase.zookeeper.quorum", url);
    hbase_conf.set("hbase.regionserver.port", "60020");
    hbase_conf.set("hbase.master", url);
    hbase_conf.set(TableInputFormat.INPUT_TABLE, entityName);

    // Restrict the scan to the three columns that are needed
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col1"));
    scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col2"));
    scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col3"));
    scan.setCaching(this.cacheSize);
    hbase_conf.set(TableInputFormat.SCAN, convertScanToString(scan));

    JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
        jsc.newAPIHadoopRDD(hbase_conf, TableInputFormat.class,
                            ImmutableBytesWritable.class, Result.class);

    // cache() followed by count() forces the full table load
    logger.info("count is " + hBaseRDD.cache().count());

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Loading-a-large-Hbase-table-into-SPARK-RDD-takes-quite-long-time-tp20396.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
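On the scan.setCaching(cacheSize) tip in the question above: the caching value is the number of rows the client asks for per scanner RPC, so the number of round trips for a full scan is roughly rows / caching. The sketch below only does that arithmetic for the 6 million rows mentioned in the post. The candidate caching values are hypothetical (the actual cacheSize is not given), and the estimate ignores per-region rounding and any size-based limits on a scanner response.

```java
// Rough estimate of scanner round trips for a full table scan.
// Assumption: each scanner RPC returns up to `caching` rows and nothing else limits it.
final class ScanRpcEstimate {

    // Ceiling division: number of RPCs needed to fetch `rows` rows, `caching` rows at a time.
    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and caching must be > 0");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 6_000_000L;                  // row count quoted in the post
        int[] candidates = {1, 100, 1000, 10000}; // hypothetical caching values
        for (int caching : candidates) {
            System.out.printf("caching=%d -> ~%d scanner RPCs%n",
                    caching, roundTrips(rows, caching));
        }
    }
}
```

Larger caching values reduce the number of round trips but make each response bigger, so the value still has to fit comfortably in client and region server memory.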
Re: Loading a large Hbase table into SPARK RDD takes quite long time
Hi,

What is your cluster setup? How much memory do you have? How much space does one row consisting of only the 3 columns consume? Do you run other stuff in the background?

Best regards

On 04.12.2014 at 23:57, bonnahu bonn...@gmail.com wrote:

> I am trying to load a large Hbase table into SPARK RDD to run a SparkSQL query on the entity. For an entity with about 6 million rows, it will take about 35 seconds to load it to RDD. Is it expected? Is there any way to shorten the loading process?
Re: Loading a large Hbase table into SPARK RDD takes quite long time
Hi,

Here is the configuration of the cluster:

Workers: 2
For each worker:
  Cores: 24 Total, 0 Used
  Memory: 69.6 GB Total, 0.0 B Used

I didn't set spark.executor.memory, so it should be at the default value of 512M.

> How much space does one row only consisting of the 3 columns consume?

The 3 columns are very small, probably less than 100 bytes.
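To put the numbers from this thread side by side, here is a back-of-envelope calculation using only figures quoted in the messages: 6 million rows, about 35 seconds, at most roughly 100 bytes per row, and an executor heap of 512M. Treating 100 bytes as the per-row size is an assumption (the post only gives it as an upper bound), and the in-memory size of cached Result objects is likely larger than the raw column bytes, which this sketch does not model.

```java
// Back-of-envelope sizing using only the figures quoted in this thread.
final class LoadEstimate {
    static final long MIB = 1024L * 1024L;

    // Raw payload size, ignoring any per-object or serialization overhead.
    static long rawBytes(long rows, long bytesPerRow) {
        return Math.multiplyExact(rows, bytesPerRow);
    }

    // Average load rate implied by a row count and an elapsed time.
    static double rowsPerSecond(long rows, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return rows / seconds;
    }

    public static void main(String[] args) {
        long rows = 6_000_000L;          // from the original question
        long bytesPerRow = 100L;         // assumed: upper bound given in the reply
        double loadSeconds = 35.0;       // from the original question
        long executorHeap = 512L * MIB;  // spark.executor.memory as stated in the reply

        long raw = rawBytes(rows, bytesPerRow);
        System.out.printf("raw column data: ~%.0f MiB%n", (double) raw / MIB);
        System.out.printf("one executor heap: %d MiB%n", executorHeap / MIB);
        System.out.printf("observed rate: ~%.0f rows/s%n", rowsPerSecond(rows, loadSeconds));
    }
}
```

Under these assumptions the raw column data alone is already slightly larger than a single 512M heap, so whether hBaseRDD.cache() keeps everything in memory depends on how many executors run and how much of each heap Spark reserves for storage.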
Re: Loading a large Hbase table into SPARK RDD takes quite long time
Hi Ted,

Here is the information about the regions:

Region Server                  Region Count
http://regionserver1:60030/    44
http://regionserver2:60030/    39
http://regionserver3:60030/    55
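The region counts matter because TableInputFormat by default produces one input split per region of the scanned table, and each split becomes one Spark partition and task. The sketch below adds up the counts listed in this message and compares them with the 2 x 24 worker cores reported earlier in the thread. It assumes, hypothetically, that all listed regions belong to the table being scanned; the counts are per region server and may include regions of other tables.

```java
// Rough parallelism estimate: regions (= input splits) versus available cores.
// Assumption: every listed region belongs to the scanned table.
final class RegionParallelism {

    static int totalRegions(int[] regionsPerServer) {
        int total = 0;
        for (int count : regionsPerServer) {
            total += count;
        }
        return total;
    }

    // Number of scheduling "waves" needed when each core runs one task at a time.
    static int waves(int tasks, int cores) {
        if (tasks < 0 || cores <= 0) {
            throw new IllegalArgumentException("tasks must be >= 0 and cores must be > 0");
        }
        return (tasks + cores - 1) / cores;
    }

    public static void main(String[] args) {
        int[] regionsPerServer = {44, 39, 55}; // counts listed in this message
        int cores = 2 * 24;                    // 2 workers with 24 cores each, from the thread

        int regions = totalRegions(regionsPerServer);
        System.out.printf("regions (upper bound on splits): %d%n", regions);
        System.out.printf("cores: %d, task waves: %d%n", cores, waves(regions, cores));
    }
}
```

If the table actually spans far fewer regions than this total, the scan would use correspondingly fewer tasks, and most of the cores would sit idle during the load.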