Forgot to include user@.

Another email from Amit indicated that there is 1 region in his table. This wouldn't give you the benefit TableInputFormat is expected to deliver: TableInputFormat creates one input split per region, so a single-region table is scanned by a single task.
Please split your table into multiple regions. See
http://hbase.apache.org/book.html#d3593e6847 and related links.

Cheers

On Wed, Aug 6, 2014 at 6:41 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Can you try specifying some value (100, e.g.) for
> "hbase.mapreduce.scan.cachedrows" in your conf?
>
> bq. table contains 10lakh rows
>
> How many rows are there in the table?
>
> nit: Example uses classOf[TableInputFormat] instead of
> TableInputFormat.class.
>
> Cheers
>
>
> On Wed, Aug 6, 2014 at 5:54 AM, Amit Singh Hora <hora.a...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I am trying to run a SQL query on HBase using a Spark job. So far I am
>> able to get the desired results, but as the data set size increases the
>> Spark job takes a long time. I believe I am doing something wrong; after
>> going through the documentation and videos discussing Spark performance,
>> it should not take more than a couple of seconds.
>>
>> PFB code snippet. The HBase table contains 10 lakh (1 million) rows.
>>
>> JavaPairRDD<ImmutableBytesWritable, Result> pairRdd = ctx
>>         .newAPIHadoopRDD(conf, TableInputFormat.class,
>>                 ImmutableBytesWritable.class,
>>                 org.apache.hadoop.hbase.client.Result.class).cache();
>>
>> JavaRDD<Person> people = pairRdd
>>         .map(new Function<Tuple2<ImmutableBytesWritable, Result>, Person>() {
>>             public Person call(Tuple2<ImmutableBytesWritable, Result> v1)
>>                     throws Exception {
>>                 System.out.println("comming");
>>                 Person person = new Person();
>>                 String key = Bytes.toString(v1._2.getRow());
>>                 key = key.substring(0, key.lastIndexOf("_"));
>>                 person.setCalling(Long.parseLong(key));
>>                 person.setCalled(Bytes.toLong(v1._2.getValue(
>>                         Bytes.toBytes("si"), Bytes.toBytes("called"))));
>>                 person.setTime(Bytes.toLong(v1._2.getValue(
>>                         Bytes.toBytes("si"), Bytes.toBytes("at"))));
>>                 return person;
>>             }
>>         });
>>
>> JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);
>> schemaPeople.registerAsTable("people");
>>
>> // SQL can be run over RDDs that have been registered as tables.
>> JavaSchemaRDD teenagers = sqlCtx
>>         .sql("SELECT count(*) from people group by calling");
>> teenagers.printSchema();
>>
>> I am running Spark using the start-all.sh script with 2 workers.
>>
>> Any pointers will be of great help.
>>
>> Regards,
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Hbase-job-taking-long-time-tp11541.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
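The two suggestions in this thread (raising scan caching and pre-splitting the table into multiple regions) can be sketched roughly as below. This is a sketch against the HBase 0.98-era client API in use at the time of the thread, not Amit's actual setup: the table name "calls", the column family "si", and the split keys are assumptions made for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseScanTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // 1. Fetch more rows per RPC so the scan is not dominated by
        //    round trips -- the property Ted suggested setting to e.g. 100.
        conf.set(TableInputFormat.INPUT_TABLE, "calls");  // hypothetical table name
        conf.set("hbase.mapreduce.scan.cachedrows", "100");

        // 2. Pre-split the table at creation time so TableInputFormat
        //    produces one input split (and hence one Spark partition)
        //    per region, instead of a single split for the whole table.
        byte[][] splitKeys = {  // hypothetical boundaries for numeric row keys
            Bytes.toBytes("2000000000"),
            Bytes.toBytes("4000000000"),
            Bytes.toBytes("6000000000"),
            Bytes.toBytes("8000000000"),
        };
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("calls"));
        desc.addFamily(new HColumnDescriptor("si"));
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.createTable(desc, splitKeys);  // creates 5 regions up front
        admin.close();
    }
}
```

With five regions, the newAPIHadoopRDD call in Amit's snippet would yield five partitions, letting the two workers scan in parallel rather than funnel the whole table through one task.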