The 4.1 GB table has only 3 regions. This means there would be at least 2 nodes that don't host any of its regions. Can you split this table into 12 (or more) regions?
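For reference, a split can be triggered from the HBase shell; this is a minimal sketch where the table name 'big_table' is hypothetical. Each `split` pass halves the existing regions, so repeating it takes 3 regions toward 12 or more:

```shell
# HBase shell sketch (table name is a placeholder):
# split every region of 'big_table' at its midpoint, then
# let the balancer spread the new regions across region servers.
hbase shell <<'EOF'
split 'big_table'
balancer
EOF
```

Running the `split` pass twice roughly quadruples the region count, after which all 5 nodes can carry part of the table's read load.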
BTW what's the value for spark.yarn.executor.memoryOverhead?

Cheers

On Sat, Mar 14, 2015 at 10:52 AM, francexo83 <francex...@gmail.com> wrote:
> Hi all,
>
> I have the following cluster configuration:
>
> - 5 nodes in a cloud environment.
> - Hadoop 2.5.0.
> - HBase 0.98.6.
> - Spark 1.2.0.
> - 8 cores and 16 GB of RAM on each host.
> - 1 NFS disk with 300 IOPS mounted on hosts 1 and 2.
> - 1 NFS disk with 300 IOPS mounted on hosts 3, 4, and 5.
>
> I tried to run a Spark job in cluster mode that computes the left outer
> join between two HBase tables.
> The first table stores about 4.1 GB of data spread across 3 regions
> with Snappy compression.
> The second one stores about 1.2 GB of data spread across 22 regions with
> Snappy compression.
>
> I sometimes get executor lost in the shuffle phase during the last
> stage (saveAsHadoopDataset).
>
> Below is my Spark conf:
>
> num-cpu-cores = 20
> memory-per-node = 10G
> spark.scheduler.mode = FAIR
> spark.scheduler.pool = production
> spark.shuffle.spill = true
> spark.rdd.compress = true
> spark.core.connection.auth.wait.timeout = 2000
> spark.sql.shuffle.partitions = 100
> spark.default.parallelism = 50
> spark.speculation = false
> spark.shuffle.memoryFraction = 0.1
> spark.cores.max = 30
> spark.driver.memory = 10g
>
> Are the resources too low to handle this kind of operation?
>
> If yes, could you share with me the right configuration to perform this
> kind of task?
>
> Thank you in advance.
>
> F.
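If the overhead was left at its default, raising it explicitly is worth trying; executor-lost errors during shuffle are often YARN killing containers that exceed their memory limit. A hedged sketch of how it could be passed (the jar name is a placeholder, and 1024 MB is an illustrative value, not a tuned recommendation):

```shell
# Sketch: set the off-heap overhead explicitly for 10 GB executors.
# In Spark 1.2 the default is roughly max(384 MB, 7% of executor memory),
# i.e. about 716 MB here; 1024 MB leaves extra headroom for shuffle buffers.
spark-submit \
  --master yarn-cluster \
  --executor-memory 10g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  your-job.jar
```

Note that executor memory plus overhead must still fit inside YARN's per-container maximum, so the executor heap may need to shrink to make room.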