Spark and HBase join issue

2015-03-14 Thread francexo83
Hi all,


I have the following cluster configuration:


   - 5 nodes in a cloud environment.
   - Hadoop 2.5.0.
   - HBase 0.98.6.
   - Spark 1.2.0.
   - 8 cores and 16 GB of RAM on each host.
   - 1 NFS disk with 300 IOPS mounted on hosts 1 and 2.
   - 1 NFS disk with 300 IOPS mounted on hosts 3, 4 and 5.

I tried to run a Spark job in cluster mode that computes the left outer
join between two HBase tables.
The first table stores about 4.1 GB of data spread across 3 regions with
Snappy compression.
The second one stores about 1.2 GB of data spread across 22 regions with
Snappy compression.

I sometimes lose executors during the shuffle phase of the last stage
(saveAsHadoopDataset).
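
For reference, a minimal sketch of this kind of job (the table names, the
extracted columns and the write-back step are placeholders, not the real
code):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.2

object HBaseLeftJoinSketch {

  // Read one HBase table as an RDD keyed by its row key. Result is not
  // Java-serializable, so extract plain values before the shuffle;
  // result.value() (the first cell) stands in for whatever columns the
  // real job reads.
  def hbaseRdd(sc: SparkContext, table: String) = {
    val hconf = HBaseConfiguration.create()
    hconf.set(TableInputFormat.INPUT_TABLE, table)
    sc.newAPIHadoopRDD(hconf, classOf[TableInputFormat],
        classOf[ImmutableBytesWritable], classOf[Result])
      .map { case (rowKey, result) =>
        (Bytes.toString(rowKey.get()), Bytes.toString(result.value()))
      }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-left-join"))

    val big   = hbaseRdd(sc, "table_a")   // ~4.1 GB, 3 regions (3 input splits)
    val small = hbaseRdd(sc, "table_b")   // ~1.2 GB, 22 regions

    // The shuffle in which executors get lost happens here.
    val joined = big.leftOuterJoin(small)

    // ... map the joined rows to (ImmutableBytesWritable, Put) pairs and
    // write them back with saveAsHadoopDataset, as in the real job.
    sc.stop()
  }
}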

Below is my Spark conf:

num-cpu-cores = 20
memory-per-node = 10G
spark.scheduler.mode = FAIR
spark.scheduler.pool = production
spark.shuffle.spill = true
spark.rdd.compress = true
spark.core.connection.auth.wait.timeout = 2000
spark.sql.shuffle.partitions = 100
spark.default.parallelism = 50
spark.speculation = false
spark.shuffle.memoryFraction = 0.1
spark.cores.max = 30
spark.driver.memory = 10g
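
For context, roughly the same settings expressed as a SparkConf (a sketch:
num-cpu-cores and memory-per-node are not Spark property names, so the
explicit executor sizing below is only one possible reading of them, not
taken from the actual job):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("hbase-left-join")
  .set("spark.executor.instances", "5")   // assumption, not in the post
  .set("spark.executor.cores", "4")       // assumption, not in the post
  .set("spark.executor.memory", "10g")    // assumption, not in the post
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.pool", "production")  // usually set per thread via
                                              // sc.setLocalProperty
  .set("spark.shuffle.spill", "true")
  .set("spark.rdd.compress", "true")
  .set("spark.core.connection.auth.wait.timeout", "2000")
  .set("spark.sql.shuffle.partitions", "100")
  .set("spark.default.parallelism", "50")
  .set("spark.speculation", "false")
  .set("spark.shuffle.memoryFraction", "0.1")
  .set("spark.cores.max", "30")
  .set("spark.driver.memory", "10g")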

Are the resources too low to handle this kind of operation?

If so, could you share the right configuration for this kind of task?

Thank you in advance.

F.


Re: Spark and HBase join issue

2015-03-14 Thread Ted Yu
The 4.1 GB table has 3 regions. This means that there would be at least 2
nodes that don't host any of its regions.
Can you split this table into 12 (or more) regions?
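
For example, with the 0.98 client API something like this asks the master
to split it (a sketch; "table_a" stands in for your big table, and the
HBase shell's split command does the same thing):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HBaseAdmin

// Split the big table so its data ends up on more region servers.
val admin = new HBaseAdmin(HBaseConfiguration.create())
try {
  // With no explicit split point each region is split at its midpoint;
  // repeat, or pass explicit split keys, until you reach 12+ regions.
  admin.split("table_a")
} finally {
  admin.close()
}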

BTW what's the value for spark.yarn.executor.memoryOverhead ?
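
If it is not set, it goes in alongside the rest of the conf, e.g. (the
1024 MB below is only an illustrative value):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "10g")
  // Extra off-heap room YARN reserves per executor, in MB; 1024 is just
  // an example, tune it for the job.
  .set("spark.yarn.executor.memoryOverhead", "1024")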

Cheers
