Hi All,

I am trying to run a SQL query on HBase using a Spark job. So far I am able to get the desired results, but as the data set size increases the Spark job takes a long time. I believe I am doing something wrong; after going through the documentation and videos discussing Spark performance, it should not take more than a couple of seconds.
PFB the code snippet. The HBase table contains 1 million (10 lakh) rows:

    JavaPairRDD<ImmutableBytesWritable, Result> pairRdd = ctx
        .newAPIHadoopRDD(conf, TableInputFormat.class,
            ImmutableBytesWritable.class,
            org.apache.hadoop.hbase.client.Result.class).cache();

    JavaRDD<Person> people = pairRdd
        .map(new Function<Tuple2<ImmutableBytesWritable, Result>, Person>() {
            public Person call(Tuple2<ImmutableBytesWritable, Result> v1) throws Exception {
                System.out.println("coming");
                Person person = new Person();
                String key = Bytes.toString(v1._2.getRow());
                key = key.substring(0, key.lastIndexOf("_"));
                person.setCalling(Long.parseLong(key));
                person.setCalled(Bytes.toLong(v1._2.getValue(
                    Bytes.toBytes("si"), Bytes.toBytes("called"))));
                person.setTime(Bytes.toLong(v1._2.getValue(
                    Bytes.toBytes("si"), Bytes.toBytes("at"))));
                return person;
            }
        });

    JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);
    schemaPeople.registerAsTable("people");

    // SQL can be run over RDDs that have been registered as tables.
    JavaSchemaRDD teenagers = sqlCtx
        .sql("SELECT count(*) FROM people GROUP BY calling");
    teenagers.printSchema();

I am running Spark using the start-all.sh script with 2 workers.

Any pointers will be of great help.

Regards,

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Hbase-job-taking-long-time-tp11541.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
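One concrete hazard in the map() function above: key.lastIndexOf("_") returns -1 when a row key contains no underscore, and substring(0, -1) then throws StringIndexOutOfBoundsException, failing the whole task. A minimal sketch of a guarded version of that parsing step (the class and method names here are my own, for illustration, not from the original post):

```java
// Sketch of the row-key parsing done inside the map() function above.
// RowKeyParser and callingFromRowKey are hypothetical names for illustration.
public class RowKeyParser {

    // Extracts the "calling" number from a row key shaped like "<calling>_<suffix>".
    // Guards against keys without an underscore, where the original
    // substring(0, lastIndexOf("_")) call would throw
    // StringIndexOutOfBoundsException.
    static long callingFromRowKey(String rowKey) {
        int idx = rowKey.lastIndexOf('_');
        String prefix = (idx >= 0) ? rowKey.substring(0, idx) : rowKey;
        return Long.parseLong(prefix);
    }

    public static void main(String[] args) {
        System.out.println(callingFromRowKey("9198765_001")); // prints 9198765
        System.out.println(callingFromRowKey("42"));          // prints 42
    }
}
```

A Long.parseLong on the prefix will still fail for non-numeric keys, so if the table can contain such rows, that case needs its own handling too.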