Hi All,

I am trying to run a SQL query on HBase using a Spark job. So far I am able
to get the desired results, but as the data set size increases the Spark job
takes a long time.
I believe I am doing something wrong, as after going through the
documentation and videos discussing Spark performance, it should not take
more than a couple of seconds.

PFB code snippet.
The HBase table contains 10 lakh (1 million) rows.
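
For context, conf below is a plain HBase configuration pointing
TableInputFormat at the input table; the table name here is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;

// placeholder table name, the real one differs
Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "person");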

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

// scan the whole HBase table into an RDD of (row key, row) pairs
JavaPairRDD<ImmutableBytesWritable, Result> pairRdd = ctx.newAPIHadoopRDD(
        conf, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class).cache();

// map each HBase row to a Person bean
JavaRDD<Person> people = pairRdd
        .map(new Function<Tuple2<ImmutableBytesWritable, Result>, Person>() {
            public Person call(Tuple2<ImmutableBytesWritable, Result> v1)
                    throws Exception {
                System.out.println("coming"); // debug output, once per row
                Person person = new Person();
                // row key is "<calling>_<suffix>"; strip the suffix
                String key = Bytes.toString(v1._2.getRow());
                key = key.substring(0, key.lastIndexOf("_"));
                person.setCalling(Long.parseLong(key));
                // read the si:called and si:at columns as longs
                person.setCalled(Bytes.toLong(v1._2.getValue(
                        Bytes.toBytes("si"), Bytes.toBytes("called"))));
                person.setTime(Bytes.toLong(v1._2.getValue(
                        Bytes.toBytes("si"), Bytes.toBytes("at"))));
                return person;
            }
        });
JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);
schemaPeople.registerAsTable("people");

// SQL can be run over RDDs that have been registered as tables.
JavaSchemaRDD teenagers = sqlCtx
        .sql("SELECT count(*) FROM people GROUP BY calling");
teenagers.printSchema();
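
For reference, Person is a plain serializable JavaBean; the field types
below are my assumption, inferred from the setters used above:

import java.io.Serializable;

public class Person implements Serializable {
    private long calling; // parsed from the row key
    private long called;  // column si:called
    private long time;    // column si:at

    public long getCalling() { return calling; }
    public void setCalling(long calling) { this.calling = calling; }
    public long getCalled() { return called; }
    public void setCalled(long called) { this.called = called; }
    public long getTime() { return time; }
    public void setTime(long time) { this.time = time; }
}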


I am running Spark in standalone mode via the start-all.sh script, with 2
workers.
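
Roughly, I start the cluster and submit the job like this (the master host,
main class and jar path are placeholders):

sbin/start-all.sh
bin/spark-submit --master spark://<master-host>:7077 \
        --class <main-class> <application-jar>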

Any pointers would be a great help.
Regards,




