Re: Spark Hbase job taking long time

2014-08-12 Thread Amit Singh Hora
Hi,
Today I created a table with 3 regions and 2 jobtrackers, but the Spark job
is still taking a lot of time.
I also noticed that the memory of the client was increasing linearly. Is the
Spark job first bringing the complete data set into memory?


On Thu, Aug 7, 2014 at 7:31 PM, Ted Yu [via Apache Spark User List] 
ml-node+s1001560n11651...@n3.nabble.com wrote:

 Forgot to include user@

 Another email from Amit indicated that there is 1 region in his table.
 This wouldn't give you the benefit TableInputFormat is expected to deliver.

 Please split your table into multiple regions.

 See http://hbase.apache.org/book.html#d3593e6847 and related links.
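
 For context on the advice above: TableInputFormat produces one input split per region, so a table with a single region is read by a single task no matter how many workers are available. One common way to get several regions is to pre-split the table at creation time. The sketch below only computes candidate split points; it assumes (this is not stated in the thread) that row keys begin with a fixed-width decimal number, and the class and method names are hypothetical. The resulting strings would still have to be converted to bytes and handed to HBase when the table is created, for example through the `createTable` overload that accepts split keys.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes evenly spaced split points for row keys that
// start with a fixed-width decimal number (an assumption, not from the thread).
final class SplitKeys {

    /**
     * Returns (regions - 1) split points spread evenly over [min, max),
     * zero-padded to `width` digits so that string order matches numeric order.
     */
    static List<String> evenSplits(long min, long max, int regions, int width) {
        if (regions < 1 || max <= min || width < 1) {
            throw new IllegalArgumentException("need regions >= 1, max > min, width >= 1");
        }
        List<String> splits = new ArrayList<>();
        long span = max - min;
        for (int i = 1; i < regions; i++) {
            long point = min + (span * i) / regions;
            splits.add(String.format("%0" + width + "d", point));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Four regions over a 10-digit key space need three split points.
        System.out.println(evenSplits(0L, 10_000_000_000L, 4, 10));
    }
}
```

 With the values in `main`, the split points are 2500000000, 5000000000 and 7500000000. Even spacing only balances the regions if the keys themselves are roughly uniformly distributed; skewed keys would need split points taken from the actual data.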

 Cheers


 On Wed, Aug 6, 2014 at 6:41 AM, Ted Yu [hidden email] wrote:

 Can you try specifying some value (e.g. 100) for
 hbase.mapreduce.scan.cachedrows in your conf?
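
 To illustrate why this setting can matter: the value controls how many rows the scanner fetches per call to the region server, so a larger value means fewer round trips for a full scan. The sketch below is only back-of-the-envelope arithmetic, assuming each call returns up to `caching` rows; the class and method names are hypothetical, and the effective default caching depends on the HBase version, which the thread does not state. In the job itself the value would presumably be set on the Hadoop configuration before creating the RDD, e.g. `conf.set("hbase.mapreduce.scan.cachedrows", "100")`.

```java
// Rough estimate of scanner round trips for a full table scan.
// Assumption: each scanner call returns up to `caching` rows, so scanning
// `rows` rows takes about ceil(rows / caching) calls.
final class ScanRoundTrips {

    /** Returns ceil(rows / caching) using integer arithmetic only. */
    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and caching > 0");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // 10 lakh rows, as stated in the thread
        for (int caching : new int[] {1, 100, 1000}) {
            System.out.println("caching=" + caching
                    + " -> ~" + roundTrips(rows, caching) + " round trips");
        }
    }
}
```

 Under these assumptions, 1,000,000 rows need about 1,000,000 calls with a caching value of 1 and about 10,000 calls with a value of 100. Larger values trade fewer round trips for more memory held per call.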

 bq. table contains 10lakh rows

 How many rows are there in the table?

 nit: Example uses classOf[TableInputFormat] instead of
 TableInputFormat.class.

 Cheers


 On Wed, Aug 6, 2014 at 5:54 AM, Amit Singh Hora [hidden email] wrote:

 Hi All,

 I am trying to run a SQL query on HBase using a Spark job. So far I am able
 to get the desired results, but as the data set size increases the Spark job
 takes a long time.
 I believe I am doing something wrong, because after going through the
 documentation and videos discussing Spark performance, it should not take
 more than a couple of seconds.

 Please find the code snippet below.
 The HBase table contains 10 lakh (1,000,000) rows.

 // Read the HBase table as (row key, Result) pairs and cache the RDD.
 JavaPairRDD<ImmutableBytesWritable, Result> pairRdd = ctx
         .newAPIHadoopRDD(conf,
                 TableInputFormat.class,
                 ImmutableBytesWritable.class,
                 org.apache.hadoop.hbase.client.Result.class)
         .cache();

 // Convert each HBase Result into a Person bean.
 JavaRDD<Person> people = pairRdd
         .map(new Function<Tuple2<ImmutableBytesWritable, Result>, Person>() {

             public Person call(Tuple2<ImmutableBytesWritable, Result> v1)
                     throws Exception {

                 System.out.println("comming");
                 Person person = new Person();
                 // The row key has the form <calling>_<suffix>.
                 String key = Bytes.toString(v1._2.getRow());
                 key = key.substring(0, key.lastIndexOf("_"));
                 person.setCalling(Long.parseLong(key));
                 person.setCalled(Bytes.toLong(v1._2.getValue(
                         Bytes.toBytes("si"), Bytes.toBytes("called"))));
                 person.setTime(Bytes.toLong(v1._2.getValue(
                         Bytes.toBytes("si"), Bytes.toBytes("at"))));
                 return person;
             }
         });
 JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);
 schemaPeople.registerAsTable("people");

 // SQL can be run over RDDs that have been registered as tables.
 JavaSchemaRDD teenagers = sqlCtx
         .sql("SELECT count(*) from people group by calling");
 teenagers.printSchema();
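
 A side note on the row-key handling in the snippet above: `key.substring(0, key.lastIndexOf("_"))` throws a StringIndexOutOfBoundsException when a key contains no underscore, because `lastIndexOf` then returns -1. The sketch below isolates that parsing step with an explicit check. The class name, method name and the sample key are hypothetical; only the `<number>_<suffix>` shape is taken from the snippet.

```java
// Hypothetical helper that isolates the row-key parsing from the map function.
final class RowKeys {

    /** Extracts the numeric prefix before the last '_' of a row key. */
    static long callingNumber(String rowKey) {
        int idx = rowKey.lastIndexOf('_');
        if (idx <= 0) {
            // No underscore, or nothing in front of it: fail with a clear message.
            throw new IllegalArgumentException(
                    "row key has no '<number>_' prefix: " + rowKey);
        }
        return Long.parseLong(rowKey.substring(0, idx));
    }

    public static void main(String[] args) {
        // Sample key is made up for illustration.
        System.out.println(callingNumber("9876543210_1407312000"));
    }
}
```

 Keeping the parsing in a small static method also makes it easy to test without a running cluster.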


 I am running Spark using the start-all.sh script with 2 workers.

 Any pointers would be a great help.
 Regards,





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Hbase-job-taking-long-time-tp11541.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Hbase-job-taking-long-time-tp11541p11998.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Hbase job taking long time

2014-08-07 Thread Ted Yu
Forgot to include user@

Another email from Amit indicated that there is 1 region in his table.
This wouldn't give you the benefit TableInputFormat is expected to deliver.

Please split your table into multiple regions.

See http://hbase.apache.org/book.html#d3593e6847 and related links.

Cheers

