Re: Reading from HBase is too slow

2014-10-08 Thread Tao Xiao
Sean, I did specify the number of cores to use as follows:

    ...
    val sparkConf = new SparkConf()
      .setAppName("<<< Reading HBase >>>")
      .set("spark.cores.max", "32")
    val sc = new SparkContext(sparkConf)
    ...

But that does not solve the problem --- only 2 workers are allocated.

Re: Reading from HBase is too slow

2014-10-08 Thread Sean Owen
You do need to specify the number of executor cores to use. Executors are not like mappers; after all, they may do much more in their lifetime than just read splits from HBase, so it would not make sense to determine their number from something the first line of the program does.
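
For what it's worth, spark.cores.max only applies to standalone and Mesos deployments, which would explain why setting it had no effect under YARN. Below is a minimal sketch of asking for executors explicitly at submission time; the flag values are illustrative and the main class name is hypothetical. On the 0.9-era YARN client the equivalent knobs were environment variables such as SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES, with the spark-submit flags arriving in 1.0:

    # Illustrative values; tune executor count, cores, and memory to the cluster.
    spark-submit --master yarn-client \
      --num-executors 20 \
      --executor-cores 4 \
      --executor-memory 3g \
      --class com.example.ReadHBase \
      SparkDemo-0.0.1-SNAPSHOT.jar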

Re: Reading from HBase is too slow

2014-10-08 Thread Tao Xiao
Hi Sean, Do I need to specify the number of executors when submitting the job? I supposed the number of executors would be determined by the number of regions of the table, just as with a MapReduce job, where you needn't specify the number of map tasks when reading from an HBase table.

Re: Reading from HBase is too slow

2014-10-07 Thread Sean Owen
How did you run your program? I don't see from your earlier post that you ever asked for more executors.

Re: Reading from HBase is too slow

2014-10-07 Thread Tao Xiao
I found the reason why reading HBase is too slow. Although each regionserver serves multiple regions for the table I'm reading, the number of Spark workers allocated by Yarn is too low. Actually, I could see that the table has dozens of regions spread over about 20 regionservers, but only two Spark workers were allocated.

Re: Reading from HBase is too slow

2014-10-01 Thread Vladimir Rodionov
Yes, it's in 0.98. CDH is free (w/o subscription) and sometimes it's worth upgrading to the latest version (which is 0.98-based). -Vladimir Rodionov

Re: Reading from HBase is too slow

2014-10-01 Thread Ted Yu
As far as I know, that feature is not in CDH 5.0.0. FYI

Re: Reading from HBase is too slow

2014-10-01 Thread Vladimir Rodionov
Using TableInputFormat is not the fastest way of reading data from HBase. Do not expect hundreds of MB per second. You probably should take a look at M/R over HBase snapshots. https://issues.apache.org/jira/browse/HBASE-8369 -Vladimir Rodionov
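
For anyone following this suggestion, here is a minimal sketch of reading a snapshot from Spark, assuming HBase 0.98+ and a snapshot named my_snapshot created beforehand (e.g. via the HBase shell). The snapshot name and paths are placeholders, and the restore directory must be an HDFS path outside the HBase root directory:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("Snapshot read"))

    // Point the input format at the snapshot; reads go straight to the
    // HFiles in HDFS, bypassing the region servers entirely.
    val job = Job.getInstance(HBaseConfiguration.create())
    TableSnapshotInputFormat.setInput(job, "my_snapshot", new Path("/tmp/snapshot-restore"))

    val rdd = sc.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[TableSnapshotInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(rdd.count())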

Re: Reading from HBase is too slow

2014-10-01 Thread Tao Xiao
I can submit a MapReduce job reading that table; although its processing rate is also a little slower than I expected, it is not as slow as Spark.

Re: Reading from HBase is too slow

2014-09-30 Thread Ted Yu
Can you launch a job which exercises TableInputFormat on the same table without using Spark? This would show whether the slowdown is in HBase code or somewhere else. Cheers
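
One ready-made way to run such a test (a suggestion, not from the thread) is HBase's bundled RowCounter job, which drives TableInputFormat through plain MapReduce; the table name is a placeholder:

    # Counts rows via a MapReduce job over TableInputFormat.
    hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'my_table'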

Re: Reading from HBase is too slow

2014-09-29 Thread Tao Xiao
I checked HBase UI. Well, this table is not completely evenly spread across the nodes, but I think to some extent it can be seen as nearly evenly spread - at least there is no single node with too many regions. Here is a screenshot of the HBase UI.

Re: Reading from HBase is too slow

2014-09-29 Thread Vladimir Rodionov
HBase TableInputFormat creates one input split per region. You cannot achieve a high level of parallelism unless you have at least 5-10 regions per RS. What does that mean? You probably have too few regions. You can verify that in the HBase Web UI. -Vladimir Rodionov
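
Because splits map one-to-one onto regions, the partition count of the RDD gives the ceiling on parallelism from the Spark side. A tiny fragment (hbaseRDD stands for the RDD built with newAPIHadoopRDD, as in the original post at the bottom of the thread):

    // One partition per HBase region: if this prints a small number, adding
    // executors cannot help until the table is split into more regions.
    println("partitions = " + hbaseRDD.partitions.size)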

Re: Reading from HBase is too slow

2014-09-29 Thread Russ Weeks
Hi, Tao, When I used newAPIHadoopRDD (Accumulo, not HBase) I found that I had to specify executor-memory and num-executors explicitly on the command line, or else I didn't get any parallelism across the cluster. I used --executor-memory 3G --num-executors 24, but obviously other parameters will be better for your cluster.

Re: Reading from HBase is too slow

2014-09-29 Thread Nan Zhu
Can you look at your HBase UI to check whether your job is just reading from a single region server? Best, -- Nan Zhu

Re: Reading from HBase is too slow

2014-09-29 Thread Ted Yu
Are the regions for this table evenly spread across nodes in your cluster? Were region servers under (heavy) load when your job ran? Cheers

Re: Reading from HBase is too slow

2014-09-29 Thread Tao Xiao
I submitted the job in Yarn-Client mode using the following script:

    export SPARK_JAR=/usr/games/spark/xt/spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar
    export HADOOP_CLASSPATH=$(hbase classpath)
    export CLASSPATH=$CLASSPATH:/usr/games/spark/xt/SparkDemo-0.0.1-SNAPSHOT.jar:/usr/games/s

Reading from HBase is too slow

2014-09-29 Thread Tao Xiao
I submitted a job in Yarn-Client mode, which simply reads from an HBase table containing tens of millions of records and then does a *count* action. The job runs for a much longer time than I expected, so I wonder whether it was because there was too much data to read. Actually, there are 20 nodes in the cluster.
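
For context, the job in question boils down to something like the following sketch; the table name and app name are placeholders, and this mirrors the standard newAPIHadoopRDD-over-TableInputFormat pattern rather than the exact code from the post:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("Reading HBase"))

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table")

    // One RDD partition is created per region of the table, and the
    // count action scans every region once.
    val hbaseRDD = sc.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(hbaseRDD.count())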