Loading a large HBase table into a Spark RDD takes quite a long time

2014-12-04 Thread bonnahu
I am trying to load a large HBase table into a Spark RDD to run a SparkSQL
query on it. For an entity with about 6 million rows, it takes about 35
seconds to load it into the RDD. Is that expected? Is there any way to
shorten the loading process? I have been picking up tips from
http://hbase.apache.org/book/perf.reading.html to speed up the process,
e.g., scan.setCaching(cacheSize) and adding only the necessary
attributes/columns to the scan. I am just wondering if there are other ways
to improve the speed?

Here is the code snippet:

// Set up the Spark context
SparkConf sparkConf = new SparkConf()
        .setMaster("spark://url")
        .setAppName("SparkSQLTest");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);

// Point the Hadoop configuration at the HBase cluster and table
Configuration hbase_conf = HBaseConfiguration.create();
hbase_conf.set("hbase.zookeeper.quorum", "url");
hbase_conf.set("hbase.regionserver.port", "60020");
hbase_conf.set("hbase.master", "url");
hbase_conf.set(TableInputFormat.INPUT_TABLE, entityName);

// Restrict the scan to the three needed columns and fetch
// cacheSize rows per RPC round trip
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col1"));
scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col2"));
scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col3"));
scan.setCaching(this.cacheSize);
hbase_conf.set(TableInputFormat.SCAN, convertScanToString(scan));

// One RDD partition is created per HBase region
JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
        jsc.newAPIHadoopRDD(hbase_conf, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class);
logger.info("count is " + hBaseRDD.cache().count());
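The snippet calls a convertScanToString helper that is not shown above; a
minimal sketch, assuming HBase 0.96+ with the protobuf-based client (recent
HBase versions also expose the equivalent
TableMapReduceUtil.convertScanToString):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
import org.apache.hadoop.hbase.util.Base64;

// Serialize the Scan into the Base64 string form that TableInputFormat
// reads from the TableInputFormat.SCAN configuration key.
private static String convertScanToString(Scan scan) throws IOException {
    return Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray());
}

The same HBase performance page also suggests scan.setCacheBlocks(false)
for full-table scans feeding batch jobs, so the scan does not churn the
region servers' block cache.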






Re: Loading a large HBase table into a Spark RDD takes quite a long time

2014-12-04 Thread Jörn Franke
Hi,

What is your cluster setup? How much memory do you have? How much space
does one row consisting of only the 3 columns consume? Do you run other
stuff in the background?

Best regards




Re: Loading a large HBase table into a Spark RDD takes quite a long time

2014-12-04 Thread bonnahu
Hi,
Here is the configuration of the cluster:

Workers: 2
For each worker:
Cores: 24 Total, 0 Used
Memory: 69.6 GB Total, 0.0 B Used
I didn't set spark.executor.memory, so it should be the default value of
512 MB.
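
A minimal sketch of raising that limit on the SparkConf from the first
post; the "4g" value is illustrative, not from this thread:

// Give each executor more than the 512 MB default; size the value
// to the data you intend to cache ("4g" is only an example).
SparkConf sparkConf = new SparkConf()
        .setMaster("spark://url")
        .setAppName("SparkSQLTest")
        .set("spark.executor.memory", "4g");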

> How much space does one row consisting of only the 3 columns consume?
The size of the 3 columns is very small, probably less than 100 bytes.
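
A back-of-the-envelope check with those numbers: 6,000,000 rows at roughly
100 bytes each is about 600 MB of raw data, which already exceeds a single
512 MB default executor before any Java object overhead from caching is
counted.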






Re: Loading a large HBase table into a Spark RDD takes quite a long time

2014-12-04 Thread bonnahu
Hi Ted,
Here is the information about the Regions:
Region Server   Region Count
http://regionserver1:60030/ 44
http://regionserver2:60030/ 39
http://regionserver3:60030/ 55
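
Those counts add up to 138 regions (44 + 39 + 55), which is worth noting
because TableInputFormat creates one input split per region: the resulting
hBaseRDD should therefore have about 138 partitions spread across the
cluster's 48 cores, which can be confirmed with
hBaseRDD.partitions().size().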



