Hi, 

I'm reading data from HBase using the latest (2.0.0-SNAPSHOT) HBase-Spark 
integration module. HBase is deployed on a cluster of 3 machines, and Spark is 
deployed as a standalone cluster on the same machines. 

I am doing a join between two JavaPairRDDs that are constructed from two 
separate HBase tables. Each RDD is obtained from an HBase table scan and then 
transformed into a pair RDD keyed by the table's row key. 
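Roughly, the two pair RDDs are built like the following minimal sketch 
(assuming the JavaHBaseContext.hbaseRDD API of the hbase-spark module; the 
table names, the "cf"/"q" column read into the value and the String conversion 
of the row key are placeholders, not my actual code): 

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.spark.JavaHBaseContext;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class JoinSketch {

    // Key each scanned row by its row key and keep one column value,
    // both as Strings ("cf"/"q" are placeholder family/qualifier names).
    static JavaPairRDD<String, String> toPairRdd(
            JavaRDD<Tuple2<ImmutableBytesWritable, Result>> scanned) {
        return scanned.mapToPair(t -> new Tuple2<>(
                Bytes.toString(t._1().copyBytes()),
                Bytes.toString(t._2().getValue(Bytes.toBytes("cf"), Bytes.toBytes("q")))));
    }

    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("join-sketch"));
        JavaHBaseContext hbaseContext = new JavaHBaseContext(jsc, HBaseConfiguration.create());

        // One scan per table; each element is (row key, Result)
        JavaPairRDD<String, String> left =
                toPairRdd(hbaseContext.hbaseRDD(TableName.valueOf("table_a"), new Scan()));
        JavaPairRDD<String, String> right =
                toPairRdd(hbaseContext.hbaseRDD(TableName.valueOf("table_b"), new Scan()));

        // Join on the row key; this is the result that comes out empty on the cluster
        System.out.println("join count = " + left.join(right).count());

        jsc.stop();
    }
}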

When I run my Spark program as a standalone process, either on my development 
machine or on one of the cluster machines, the join returns a correct, 
non-empty result. When I submit the exact same program to the Spark cluster, 
the join comes out empty. In both cases I'm connecting to the Spark master on 
the cluster.  In summary: 

1) mvn exec:java <my program> prints out a correct, non-empty join 
2) spark-submit --deploy-mode client --class same_main_class --master cluster_master_url prints out an empty join 
3) spark-submit --deploy-mode cluster --class same_main_class --master cluster_master_url also prints out an empty join 

The Spark version deployed is 1.5.1. The same version is declared as a Maven 
dependency. I've also tried 1.5.2 and 1.6.0, redeploying the cluster etc. 
I've spent a few days trying to troubleshoot this but to no avail. I print out 
a count of the RDDs that I'm joining and it always gives me the correct count. 
Only the join doesn't work when I submit it as a job to the cluster, regardless 
of where the Spark driver is. 
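
Concretely, the check is along these lines (left and right being the two pair 
RDDs from the sketch above): 

System.out.println("left count  = " + left.count());              // correct, non-empty
System.out.println("right count = " + right.count());             // correct, non-empty
System.out.println("join count  = " + left.join(right).count());  // 0 when submitted to the cluster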

Can anybody give me some pointers on how to debug this? I assume the RDDs are 
partitioned and shuffled behind the scenes, but whatever is happening there is 
not behaving correctly: there are no exceptions, errors or even warnings, and I 
have no clue why the join would be empty. Again: identical code works when run 
as a standalone program, but not when submitted to the cluster. 

I'm mainly looking for troubleshooting tips here! 

Thanks much in advance! 
Boris
