The scenario is to use an HTable instance to scan multiple rowkey ranges
inside Spark tasks, as in the code below:
Option 1:
val users = input
  .map { case (deviceId, uid) => uid }
  .distinct()
  .sortBy(x => x)
  .mapPartitions { iterator =>
    val conf = HBaseConfiguration.create()
    val table = new HTable(conf, "actions")
    val result = iterator.map { userId =>
      (userId, getUserActions(table, userId, timeStart, timeStop))
    }
    table.close()
    result
  }

But it fails with the following exception:
org.apache.spark.SparkException: Task not serializable
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:1264)
        at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:60)
        ...
Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
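
From the $iwC frames in the trace, the closure seems to be serialized
together with the surrounding spark-shell wrapper objects, so any
non-serializable val from an earlier shell line (such as a Configuration
created on the driver) can get dragged in. Below is a minimal sketch of
what I would expect to work: everything the task needs is created inside
the partition, and the results are materialized before the table is
closed (iterator.map is lazy, so in the code above the table would
otherwise be closed before any row is read). getUserActions, timeStart
and timeStop are the same assumed helpers as in option 1.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable

val users = input
  .map { case (deviceId, uid) => uid }
  .distinct()
  .sortBy(x => x)
  .mapPartitions { iterator =>
    // Created on the executor, so nothing here is part of the closure.
    val conf = HBaseConfiguration.create()
    val table = new HTable(conf, "actions")
    // toList forces the scans to run now, while the table is still open.
    val result = iterator.map { userId =>
      (userId, getUserActions(table, userId, timeStart, timeStop))
    }.toList
    table.close()
    result.iterator
  }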

The reason for not using sc.newAPIHadoopRDD is that it supports only one
scan per job:
val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])
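
For reference, that single scan is handed over through the Configuration,
roughly like this (startRow and stopRow stand in for one rowkey range;
the property constants are from TableInputFormat itself):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "actions")
// Only one start/stop pair fits here, hence one scan per job.
conf.set(TableInputFormat.SCAN_ROW_START, startRow)
conf.set(TableInputFormat.SCAN_ROW_STOP, stopRow)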

And if MultiTableInputFormat is used, it is not feasible for the driver
to pack all the rowkey ranges into the HBaseConfiguration:
Option 2:
sc.newAPIHadoopRDD(conf, classOf[MultiTableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])
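
For context, MultiTableInputFormat reads every scan out of a single
Configuration property (MultiTableInputFormat.SCANS), so each per-user
rowkey range would have to be serialized into the driver-side conf up
front. A sketch, assuming the rowkey is userId + timestamp, userIds is
already collected on the driver, and convertScanToString is accessible
in your HBase version:

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.mapreduce.{MultiTableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes

val scanStrings = userIds.map { uid =>
  val scan = new Scan(Bytes.toBytes(uid + timeStart), Bytes.toBytes(uid + timeStop))
  // Each scan must carry its table name as an attribute.
  scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes("actions"))
  TableMapReduceUtil.convertScanToString(scan)
}
// With millions of users this single conf entry becomes impractically
// large; conf is the Configuration passed to newAPIHadoopRDD above.
conf.setStrings(MultiTableInputFormat.SCANS, scanStrings: _*)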

One workaround may be to divide all the rowkey ranges into several
batches and run option 2 once per batch (a rough sketch follows), but I
would prefer option 1. So, is there any solution for option 1?
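
A rough sketch of that splitting workaround, assuming userIds is a
driver-side Seq[String] and toScanString is a hypothetical helper that
builds and serializes one Scan per user, as in the snippet above:

val rdds = userIds.grouped(10000).map { batch =>
  val conf = HBaseConfiguration.create()
  conf.setStrings(MultiTableInputFormat.SCANS, batch.map(toScanString): _*)
  sc.newAPIHadoopRDD(conf, classOf[MultiTableInputFormat],
    classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
    classOf[org.apache.hadoop.hbase.client.Result])
}.toSeq
val allActions = rdds.reduce(_ union _)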


