RE: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-12 Thread Amit Hora
Finally I tried setting the configuration manually using
sc.hadoopConfiguration.set for
dfs.nameservices
dfs.ha.namenodes.hdpha
dfs.namenode.rpc-address.hdpha.n1

and it worked. I don't know why these settings were not being read from the files under
HADOOP_CONF_DIR.
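
For reference, a minimal sketch of that manual workaround in Scala. The namenode host
names follow the nodes mentioned later in this thread; the 8020 RPC port and the failover
proxy provider line are assumptions based on stock HDFS HA client settings, not taken from
the original post.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("LDA Sample").setMaster("local[2]"))
val hc = sc.hadoopConfiguration
hc.set("dfs.nameservices", "hdpha")
hc.set("dfs.ha.namenodes.hdpha", "n1,n2")
hc.set("dfs.namenode.rpc-address.hdpha.n1", "hdp231:8020")        // active NN, port assumed
hc.set("dfs.namenode.rpc-address.hdpha.n2", "ambarimaster:8020")  // standby NN, port assumed
hc.set("dfs.client.failover.proxy.provider.hdpha",                // standard HA client setting
  "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

val distFile = sc.textFile("hdfs://hdpha/mini_newsgroups/")
println(distFile.count())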

-Original Message-
From: "Amit Hora" 
Sent: ‎4/‎13/‎2016 11:41 AM
To: "Jörn Franke" 
Cc: "user@spark.apache.org" 
Subject: RE: Unable to Access files in Hadoop HA enabled from using Spark

There are DNS entries for both of my namenodes.
Ambarimaster is the standby and it resolves to its IP perfectly.
Hdp231 is the active namenode and it also resolves to its IP.
Hdpha is my Hadoop HA cluster (nameservice) name,
and hdfs-site.xml has the entries for this configuration.


From: Jörn Franke
Sent: ‎4/‎13/‎2016 11:37 AM
To: Amit Singh Hora
Cc: user@spark.apache.org
Subject: Re: Unable to Access files in Hadoop HA enabled from using Spark


Is the host in /etc/hosts ?

> On 13 Apr 2016, at 07:28, Amit Singh Hora  wrote:
> 
> I am trying to access a directory in Hadoop from my Spark code on my local
> machine. Hadoop is HA enabled.
> 
> val conf = new SparkConf().setAppName("LDA Sample").setMaster("local[2]")
> val sc=new SparkContext(conf)
> val distFile = sc.textFile("hdfs://hdpha/mini_newsgroups/")
> println(distFile.count())
> but I am getting the error
> 
> java.net.UnknownHostException: hdpha
> hdpha does not resolve to a particular machine; it is the nameservice name I have chosen
> for my HA Hadoop cluster. I have already copied all the Hadoop configuration files to my
> local machine and have set the environment variable HADOOP_CONF_DIR, but still no success.
> 
> Any suggestion will be of great help.
> 
> Note: Hadoop HA itself is working properly, as I have tried uploading a file to
> Hadoop and it works.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-Access-files-in-Hadoop-HA-enabled-from-using-Spark-tp26768.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

RE: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-12 Thread Amit Hora
There are DNS entries for both of my namenodes.
Ambarimaster is the standby and it resolves to its IP perfectly.
Hdp231 is the active namenode and it also resolves to its IP.
Hdpha is my Hadoop HA cluster (nameservice) name,
and hdfs-site.xml has the entries for this configuration.

-Original Message-
From: "Jörn Franke" 
Sent: ‎4/‎13/‎2016 11:37 AM
To: "Amit Singh Hora" 
Cc: "user@spark.apache.org" 
Subject: Re: Unable to Access files in Hadoop HA enabled from using Spark

Is the host in /etc/hosts ?

> On 13 Apr 2016, at 07:28, Amit Singh Hora  wrote:
> 
> I am trying to access a directory in Hadoop from my Spark code on my local
> machine. Hadoop is HA enabled.
> 
> val conf = new SparkConf().setAppName("LDA Sample").setMaster("local[2]")
> val sc=new SparkContext(conf)
> val distFile = sc.textFile("hdfs://hdpha/mini_newsgroups/")
> println(distFile.count())
> but I am getting the error
> 
> java.net.UnknownHostException: hdpha
> hdpha does not resolve to a particular machine; it is the nameservice name I have chosen
> for my HA Hadoop cluster. I have already copied all the Hadoop configuration files to my
> local machine and have set the environment variable HADOOP_CONF_DIR, but still no success.
> 
> Any suggestion will be of great help.
> 
> Note: Hadoop HA itself is working properly, as I have tried uploading a file to
> Hadoop and it works.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-Access-files-in-Hadoop-HA-enabled-from-using-Spark-tp26768.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


RE: SPARKONHBase checkpointing issue

2015-10-27 Thread Amit Hora
Thanks for sharing the link. Yes, I understand that accumulator and broadcast
variable state is not recovered from a checkpoint, but is there any way to say that
the HBaseContext in this setup should not be recovered from the checkpoint and must
instead be reinitialized?
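
One possible direction is the lazily initialized singleton pattern commonly used for
objects that should not live inside the checkpoint: keep the HBaseContext out of the
checkpointed state and rebuild it on demand after a restart. The sketch below assumes the
Cloudera SparkOnHBase package name and is only an illustration of the idea, not a verified
fix for checkpoint recovery.

import com.cloudera.spark.hbase.HBaseContext   // package name assumed for SparkOnHBase
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.spark.SparkContext

// Hypothetical helper: the context is created lazily on first use and is never part of
// the DStream checkpoint, so a restarted job gets a fresh HBaseContext.
object HBaseContextHolder {
  private var instance: HBaseContext = _

  def get(sc: SparkContext): HBaseContext = synchronized {
    if (instance == null) {
      val hconf = HBaseConfiguration.create()
      // set hbase.zookeeper.quorum, clientPort, znode.parent etc. as in getHbaseContext below
      instance = new HBaseContext(sc, hconf)
    }
    instance
  }
}

// The streaming code would then obtain the context per batch (e.g. inside foreachRDD)
// instead of capturing it once when the DStream graph, and hence the checkpoint, is built.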

-Original Message-
From: "Adrian Tanase" 
Sent: ‎27-‎10-‎2015 18:08
To: "Amit Singh Hora" ; "user@spark.apache.org" 

Subject: Re: SPARKONHBase checkpointing issue

Does this help?

https://issues.apache.org/jira/browse/SPARK-5206



On 10/27/15, 1:53 PM, "Amit Singh Hora"  wrote:

>Hi all,
>
>I am using Cloudera's SparkOnHBase to bulk insert into HBase. Please find the
>code below:
>object test {
>  
>def main(args: Array[String]): Unit = {
>
>
>
>   val conf = ConfigFactory.load("connection.conf").getConfig("connection")
>val checkpointDirectory=conf.getString("spark.checkpointDir")
>val ssc = StreamingContext.getOrCreate(checkpointDirectory, ()=>{
>  functionToCreateContext(checkpointDirectory)
>})
> 
>
>ssc.start()
>ssc.awaitTermination()
>
> }
>
>def getHbaseContext(sc: SparkContext,conf: Config): HBaseContext={
>  println("always gets created")
>   val hconf = HBaseConfiguration.create();
>val timeout= conf.getString("hbase.zookeepertimeout")
>val master=conf.getString("hbase.hbase_master")
>val zk=conf.getString("hbase.hbase_zkquorum")
>val zkport=conf.getString("hbase.hbase_zk_port")
>
>  hconf.set("zookeeper.session.timeout",timeout);
>hconf.set("hbase.client.retries.number", Integer.toString(1));
>hconf.set("zookeeper.recovery.retry", Integer.toString(1));
>hconf.set("hbase.master", master);
>hconf.set("hbase.zookeeper.quorum",zk);
>hconf.set("zookeeper.znode.parent", "/hbase-unsecure");
>hconf.set("hbase.zookeeper.property.clientPort",zkport );
>
>   
>val hbaseContext = new HBaseContext(sc, hconf);
>return hbaseContext
>}
>  def functionToCreateContext(checkpointDirectory: String): StreamingContext
>= {
>println("creating for frst time")
>val conf = ConfigFactory.load("connection.conf").getConfig("connection")
>val brokerlist = conf.getString("kafka.broker")
>val topic = conf.getString("kafka.topic")
>
>val Array(brokers, topics) = Array(brokerlist, topic)
>
>
>val sparkConf = new SparkConf().setAppName("HBaseBulkPutTimestampExample
>" )
>sparkConf.set("spark.cleaner.ttl", "2");
>sparkConf.setMaster("local[2]")
>
>
> val topicsSet = topic.split(",").toSet
>val batchduration = conf.getString("spark.batchduration").toInt
>val ssc: StreamingContext = new StreamingContext(sparkConf,
>Seconds(batchduration))
>  ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
> val kafkaParams = Map[String, String]("metadata.broker.list" ->
>brokerlist, "auto.offset.reset" -> "smallest")
>val messages = KafkaUtils.createDirectStream[String, String,
>StringDecoder, StringDecoder](
>  ssc, kafkaParams, topicsSet)
>val lines=messages.map(_._2)
>   
>
>
>getHbaseContext(ssc.sparkContext,conf).streamBulkPut[String](lines,
>  "ecs_test",
>  (putRecord) => {
>if (putRecord.length() > 0) {
>  var maprecord = new HashMap[String, String];
>  val mapper = new ObjectMapper();
>
>  //convert JSON string to Map
>  maprecord = mapper.readValue(putRecord,
>new TypeReference[HashMap[String, String]]() {});
>  
>  var ts: Long = maprecord.get("ts").toLong
>  
>   var tweetID:Long= maprecord.get("id").toLong
>  val key=ts+"_"+tweetID;
>  
>  val put = new Put(Bytes.toBytes(key))
>  maprecord.foreach(kv => {
> 
> 
>put.add(Bytes.toBytes("test"),Bytes.toBytes(kv._1),Bytes.toBytes(kv._2))
>  
>
>  })
>
>
>  put
>} else {
>  null
>}
>  },
>  false);
>
>ssc
>  
>  }
>}
>
>I am not able to recover from the checkpoint after a restart; I always get
>"Unable to getConfig from broadcast".
>
>After debugging further I can see that the method that creates the HBaseContext
>actually broadcasts the configuration and context object passed to it.
>
>As a solution I just want to recreate the HBaseContext in every case,
>whether the checkpoint exists or not.
>
>
>
>--
>View this message in context: 
>http://apache-spark-user-list.1001560.n3.nabble.com/SPARKONHBase-checkpointing-issue-tp25211.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>


RE: Spark opening to many connection with zookeeper

2015-10-20 Thread Amit Hora
Hi All,

I am using HBase 1.1.1. I came across a post describing the hbase-spark module included
in HBase core.

I am trying to use HBaseContext but can't find the appropriate library; while trying
to add the following to my pom I am getting a "missing artifact" error:

<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase</artifactId>
  <version>1.1.1</version>
</dependency>

-Original Message-
From: "Ted Yu" 
Sent: ‎20-‎10-‎2015 21:30
To: "Amit Hora" 
Cc: "user" 
Subject: Re: Spark opening to many connection with zookeeper

I need to dig deeper into saveAsHadoopDataset to see what might have caused the 
effect you observed.


Cheers


On Tue, Oct 20, 2015 at 8:57 AM, Amit Hora  wrote:

Hi Ted,

I made a mistake last time. Yes, the connections are very controlled when I use put
as you suggested: iterating over the RDD per partition, making one connection within
each partition, and executing a put with the list of Puts for HBase.

But why were the connections growing so much when I used the Hadoop conf
and the saveAsHadoopDataset method?
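
For reference, a minimal sketch of that per-partition put pattern, assuming the RDD holds
JSON strings as elsewhere in this thread; the table name, column family, and row key are
placeholders.

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// One HBase connection per partition, one table.put(List<Put>) call per partition,
// and both the table and the connection closed before the partition finishes.
def bulkPutPerPartition(rdd: RDD[String]): Unit = {
  rdd.foreachPartition { records =>
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("ecs_test"))  // placeholder table name
    try {
      val puts = records.map { json =>
        val put = new Put(Bytes.toBytes(json.hashCode.toString))    // placeholder row key
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("raw"), Bytes.toBytes(json))
        put
      }.toList.asJava
      if (!puts.isEmpty) table.put(puts)
    } finally {
      table.close()
      connection.close()
    }
  }
}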


From: Amit Hora
Sent: ‎20-‎10-‎2015 20:38
To: Ted Yu
Cc: user
Subject: RE: Spark opening to many connection with zookeeper


I used that also, but the number of connections kept increasing: it started from 10
and went up to 299.
Then I changed my ZooKeeper conf to set the max client connections to just 30 and
restarted the job.
Now the connections have stayed between 18 and 24 for the last 2 hours.

I am unable to understand this behaviour.


From: Ted Yu
Sent: ‎20-‎10-‎2015 20:19
To: Amit Hora
Cc: user
Subject: Re: Spark opening to many connection with zookeeper


Can you take a look at example 37 on page 225 of:

http://hbase.apache.org/apache_hbase_reference_guide.pdf



You can use the following method of Table:
  void put(List<Put> puts) throws IOException;
After the put() returns, the connection is closed.
Cheers


On Tue, Oct 20, 2015 at 2:40 AM, Amit Hora  wrote:

One region 


From: Ted Yu
Sent: ‎20-‎10-‎2015 15:01
To: Amit Singh Hora
Cc: user
Subject: Re: Spark opening to many connection with zookeeper


How many regions do your table have ?


Which hbase release do you use ?


Cheers


On Tue, Oct 20, 2015 at 12:32 AM, Amit Singh Hora  wrote:

Hi All,

My Spark job started reporting ZooKeeper errors. After looking at the zkdump from the
HBase master I realized that a large number of connections are being made from the
nodes where the Spark workers are running. I believe the connections are somehow not
getting closed, and that is what is leading to the error.

please find the code below

val conf = ConfigFactory.load("connection.conf").getConfig("connection")
  val hconf = HBaseConfiguration.create();
hconf.set(TableOutputFormat.OUTPUT_TABLE,
conf.getString("hbase.tablename"))
hconf.set("zookeeper.session.timeout",
conf.getString("hbase.zookeepertimeout"));
hconf.set("hbase.client.retries.number", Integer.toString(1));
hconf.set("zookeeper.recovery.retry", Integer.toString(1));
hconf.set("hbase.master", conf.getString("hbase.hbase_master"));

hconf.set("hbase.zookeeper.quorum",conf.getString("hbase.hbase_zkquorum"));
// zkquorum consists of 5 nodes
hconf.set("zookeeper.znode.parent", "/hbase-unsecure");
hconf.set("hbase.zookeeper.property.clientPort",
conf.getString("hbase.hbase_zk_port"));

hconf.set(TableOutputFormat.OUTPUT_TABLE,conf.getString("hbase.tablename"))
val jobConfig: JobConf = new JobConf(hconf, this.getClass)
jobConfig.set("mapreduce.output.fileoutputformat.outputdir",
"/user/user01/out")
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE,
conf.getString("hbase.tablename"))

 try{
 rdd.map(convertToPut).
saveAsHadoopDataset(jobConfig)
 }

the method convertToPut does nothing but just convert the JSON into HBase Put
objects

After I killed the application/driver the number of connections decreased
drastically.

Kindly help in understanding and resolving the issue.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-opening-to-many-connection-with-zookeeper-tp25137.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

RE: Spark opening to many connection with zookeeper

2015-10-20 Thread Amit Hora
Please do share if you come across any hints.

-Original Message-
From: "Ted Yu" 
Sent: ‎20-‎10-‎2015 21:30
To: "Amit Hora" 
Cc: "user" 
Subject: Re: Spark opening to many connection with zookeeper

I need to dig deeper into saveAsHadoopDataset to see what might have caused the 
effect you observed.


Cheers


On Tue, Oct 20, 2015 at 8:57 AM, Amit Hora  wrote:

Hi Ted,

I made a mistake last time. Yes, the connections are very controlled when I use put
as you suggested: iterating over the RDD per partition, making one connection within
each partition, and executing a put with the list of Puts for HBase.

But why were the connections growing so much when I used the Hadoop conf
and the saveAsHadoopDataset method?


From: Amit Hora
Sent: ‎20-‎10-‎2015 20:38
To: Ted Yu
Cc: user
Subject: RE: Spark opening to many connection with zookeeper


I used that also, but the number of connections kept increasing: it started from 10
and went up to 299.
Then I changed my ZooKeeper conf to set the max client connections to just 30 and
restarted the job.
Now the connections have stayed between 18 and 24 for the last 2 hours.

I am unable to understand this behaviour.


From: Ted Yu
Sent: ‎20-‎10-‎2015 20:19
To: Amit Hora
Cc: user
Subject: Re: Spark opening to many connection with zookeeper


Can you take a look at example 37 on page 225 of:

http://hbase.apache.org/apache_hbase_reference_guide.pdf



You can use the following method of Table:
  void put(List<Put> puts) throws IOException;
After the put() returns, the connection is closed.
Cheers


On Tue, Oct 20, 2015 at 2:40 AM, Amit Hora  wrote:

One region 


From: Ted Yu
Sent: ‎20-‎10-‎2015 15:01
To: Amit Singh Hora
Cc: user
Subject: Re: Spark opening to many connection with zookeeper


How many regions do your table have ?


Which hbase release do you use ?


Cheers


On Tue, Oct 20, 2015 at 12:32 AM, Amit Singh Hora  wrote:

Hi All,

My Spark job started reporting ZooKeeper errors. After looking at the zkdump from the
HBase master I realized that a large number of connections are being made from the
nodes where the Spark workers are running. I believe the connections are somehow not
getting closed, and that is what is leading to the error.

please find the code below

val conf = ConfigFactory.load("connection.conf").getConfig("connection")
  val hconf = HBaseConfiguration.create();
hconf.set(TableOutputFormat.OUTPUT_TABLE,
conf.getString("hbase.tablename"))
hconf.set("zookeeper.session.timeout",
conf.getString("hbase.zookeepertimeout"));
hconf.set("hbase.client.retries.number", Integer.toString(1));
hconf.set("zookeeper.recovery.retry", Integer.toString(1));
hconf.set("hbase.master", conf.getString("hbase.hbase_master"));

hconf.set("hbase.zookeeper.quorum",conf.getString("hbase.hbase_zkquorum"));
// zkquorum consists of 5 nodes
hconf.set("zookeeper.znode.parent", "/hbase-unsecure");
hconf.set("hbase.zookeeper.property.clientPort",
conf.getString("hbase.hbase_zk_port"));

hconf.set(TableOutputFormat.OUTPUT_TABLE,conf.getString("hbase.tablename"))
val jobConfig: JobConf = new JobConf(hconf, this.getClass)
jobConfig.set("mapreduce.output.fileoutputformat.outputdir",
"/user/user01/out")
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE,
conf.getString("hbase.tablename"))

 try{
 rdd.map(convertToPut).
saveAsHadoopDataset(jobConfig)
 }

the method convertToPut does nothing but just convert the JSON into HBase Put
objects

After I killed the application/driver the number of connections decreased
drastically.

Kindly help in understanding and resolving the issue.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-opening-to-many-connection-with-zookeeper-tp25137.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

RE: Spark opening to many connection with zookeeper

2015-10-20 Thread Amit Hora
Hi Ted,

I made a mistake last time. Yes, the connections are very controlled when I use put
as you suggested: iterating over the RDD per partition, making one connection within
each partition, and executing a put with the list of Puts for HBase.

But why were the connections growing so much when I used the Hadoop conf
and the saveAsHadoopDataset method?

-Original Message-
From: "Amit Hora" 
Sent: ‎20-‎10-‎2015 20:38
To: "Ted Yu" 
Cc: "user" 
Subject: RE: Spark opening to many connection with zookeeper

I used that also, but the number of connections kept increasing: it started from 10
and went up to 299.
Then I changed my ZooKeeper conf to set the max client connections to just 30 and
restarted the job.
Now the connections have stayed between 18 and 24 for the last 2 hours.

I am unable to understand this behaviour.


From: Ted Yu
Sent: ‎20-‎10-‎2015 20:19
To: Amit Hora
Cc: user
Subject: Re: Spark opening to many connection with zookeeper


Can you take a look at example 37 on page 225 of:

http://hbase.apache.org/apache_hbase_reference_guide.pdf



You can use the following method of Table:
  void put(List<Put> puts) throws IOException;
After the put() returns, the connection is closed.
Cheers


On Tue, Oct 20, 2015 at 2:40 AM, Amit Hora  wrote:

One region 


From: Ted Yu
Sent: ‎20-‎10-‎2015 15:01
To: Amit Singh Hora
Cc: user
Subject: Re: Spark opening to many connection with zookeeper


How many regions do your table have ?


Which hbase release do you use ?


Cheers


On Tue, Oct 20, 2015 at 12:32 AM, Amit Singh Hora  wrote:

Hi All,

My Spark job started reporting ZooKeeper errors. After looking at the zkdump from the
HBase master I realized that a large number of connections are being made from the
nodes where the Spark workers are running. I believe the connections are somehow not
getting closed, and that is what is leading to the error.

please find the code below

val conf = ConfigFactory.load("connection.conf").getConfig("connection")
  val hconf = HBaseConfiguration.create();
hconf.set(TableOutputFormat.OUTPUT_TABLE,
conf.getString("hbase.tablename"))
hconf.set("zookeeper.session.timeout",
conf.getString("hbase.zookeepertimeout"));
hconf.set("hbase.client.retries.number", Integer.toString(1));
hconf.set("zookeeper.recovery.retry", Integer.toString(1));
hconf.set("hbase.master", conf.getString("hbase.hbase_master"));

hconf.set("hbase.zookeeper.quorum",conf.getString("hbase.hbase_zkquorum"));
// zkquorum consists of 5 nodes
hconf.set("zookeeper.znode.parent", "/hbase-unsecure");
hconf.set("hbase.zookeeper.property.clientPort",
conf.getString("hbase.hbase_zk_port"));

hconf.set(TableOutputFormat.OUTPUT_TABLE,conf.getString("hbase.tablename"))
val jobConfig: JobConf = new JobConf(hconf, this.getClass)
jobConfig.set("mapreduce.output.fileoutputformat.outputdir",
"/user/user01/out")
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE,
conf.getString("hbase.tablename"))

 try{
 rdd.map(convertToPut).
saveAsHadoopDataset(jobConfig)
 }

the method convertToPut does nothing but just convert the JSON into HBase Put
objects

After I killed the application/driver the number of connections decreased
drastically.

Kindly help in understanding and resolving the issue.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-opening-to-many-connection-with-zookeeper-tp25137.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

RE: Spark opening to many connection with zookeeper

2015-10-20 Thread Amit Hora
I used that also, but the number of connections kept increasing: it started from 10
and went up to 299.
Then I changed my ZooKeeper conf to set the max client connections to just 30 and
restarted the job.
Now the connections have stayed between 18 and 24 for the last 2 hours.

I am unable to understand this behaviour.

-Original Message-
From: "Ted Yu" 
Sent: ‎20-‎10-‎2015 20:19
To: "Amit Hora" 
Cc: "user" 
Subject: Re: Spark opening to many connection with zookeeper

Can you take a look at example 37 on page 225 of:

http://hbase.apache.org/apache_hbase_reference_guide.pdf



You can use the following method of Table:
  void put(List<Put> puts) throws IOException;
After the put() returns, the connection is closed.
Cheers


On Tue, Oct 20, 2015 at 2:40 AM, Amit Hora  wrote:

One region 


From: Ted Yu
Sent: ‎20-‎10-‎2015 15:01
To: Amit Singh Hora
Cc: user
Subject: Re: Spark opening to many connection with zookeeper


How many regions do your table have ?


Which hbase release do you use ?


Cheers


On Tue, Oct 20, 2015 at 12:32 AM, Amit Singh Hora  wrote:

Hi All,

My Spark job started reporting ZooKeeper errors. After looking at the zkdump from the
HBase master I realized that a large number of connections are being made from the
nodes where the Spark workers are running. I believe the connections are somehow not
getting closed, and that is what is leading to the error.

please find the code below

val conf = ConfigFactory.load("connection.conf").getConfig("connection")
  val hconf = HBaseConfiguration.create();
hconf.set(TableOutputFormat.OUTPUT_TABLE,
conf.getString("hbase.tablename"))
hconf.set("zookeeper.session.timeout",
conf.getString("hbase.zookeepertimeout"));
hconf.set("hbase.client.retries.number", Integer.toString(1));
hconf.set("zookeeper.recovery.retry", Integer.toString(1));
hconf.set("hbase.master", conf.getString("hbase.hbase_master"));

hconf.set("hbase.zookeeper.quorum",conf.getString("hbase.hbase_zkquorum"));
// zkquorum consists of 5 nodes
hconf.set("zookeeper.znode.parent", "/hbase-unsecure");
hconf.set("hbase.zookeeper.property.clientPort",
conf.getString("hbase.hbase_zk_port"));

hconf.set(TableOutputFormat.OUTPUT_TABLE,conf.getString("hbase.tablename"))
val jobConfig: JobConf = new JobConf(hconf, this.getClass)
jobConfig.set("mapreduce.output.fileoutputformat.outputdir",
"/user/user01/out")
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE,
conf.getString("hbase.tablename"))

 try{
 rdd.map(convertToPut).
saveAsHadoopDataset(jobConfig)
 }

the method convertToPut does nothing but just convert the JSON into HBase Put
objects

After I killed the application/driver the number of connections decreased
drastically.

Kindly help in understanding and resolving the issue.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-opening-to-many-connection-with-zookeeper-tp25137.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

RE: Spark opening to many connection with zookeeper

2015-10-20 Thread Amit Hora
One region 

-Original Message-
From: "Ted Yu" 
Sent: ‎20-‎10-‎2015 15:01
To: "Amit Singh Hora" 
Cc: "user" 
Subject: Re: Spark opening to many connection with zookeeper

How many regions do your table have ?


Which hbase release do you use ?


Cheers


On Tue, Oct 20, 2015 at 12:32 AM, Amit Singh Hora  wrote:

Hi All,

My Spark job started reporting ZooKeeper errors. After looking at the zkdump from the
HBase master I realized that a large number of connections are being made from the
nodes where the Spark workers are running. I believe the connections are somehow not
getting closed, and that is what is leading to the error.

please find the code below

val conf = ConfigFactory.load("connection.conf").getConfig("connection")
  val hconf = HBaseConfiguration.create();
hconf.set(TableOutputFormat.OUTPUT_TABLE,
conf.getString("hbase.tablename"))
hconf.set("zookeeper.session.timeout",
conf.getString("hbase.zookeepertimeout"));
hconf.set("hbase.client.retries.number", Integer.toString(1));
hconf.set("zookeeper.recovery.retry", Integer.toString(1));
hconf.set("hbase.master", conf.getString("hbase.hbase_master"));

hconf.set("hbase.zookeeper.quorum",conf.getString("hbase.hbase_zkquorum"));
// zkquorum consists of 5 nodes
hconf.set("zookeeper.znode.parent", "/hbase-unsecure");
hconf.set("hbase.zookeeper.property.clientPort",
conf.getString("hbase.hbase_zk_port"));

hconf.set(TableOutputFormat.OUTPUT_TABLE,conf.getString("hbase.tablename"))
val jobConfig: JobConf = new JobConf(hconf, this.getClass)
jobConfig.set("mapreduce.output.fileoutputformat.outputdir",
"/user/user01/out")
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE,
conf.getString("hbase.tablename"))

 try{
 rdd.map(convertToPut).
saveAsHadoopDataset(jobConfig)
 }

the method convertToPut does nothing but just convert the JSON into HBase Put
objects

After I killed the application/driver the number of connections decreased
drastically.

Kindly help in understanding and resolving the issue.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-opening-to-many-connection-with-zookeeper-tp25137.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: HBase Spark Streaming giving error after restore

2015-10-16 Thread Amit Hora
Hi,

Apologies for the delayed response.
Please find the full stack trace below:
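
Before the trace: the cast failure reported here (scala.runtime.BoxedUnit cannot be cast
to org.apache.hadoop.hbase.client.Mutation) usually means that the pair RDD handed to
saveAsHadoopDataset carries Unit values instead of Put/Delete mutations, for example when
the mapping function ends in a side-effecting statement. A minimal sketch of the expected
record shape, using a hypothetical convertToPut along the lines of the one in the other
threads, with placeholder row key and column family:

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

// TableOutputFormat expects each record to be a (key, Mutation) pair, so the function
// must return the Put itself; if the body ends with a Unit-returning statement, Scala
// infers Unit for the value and the writer fails with exactly this ClassCastException.
def convertToPut(json: String): (ImmutableBytesWritable, Put) = {
  val put = new Put(Bytes.toBytes(json.hashCode.toString))          // placeholder row key
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("raw"), Bytes.toBytes(json))
  (new ImmutableBytesWritable, put)   // return the pair explicitly, not Unit
}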

java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to
org.apache.hadoop.hbase.client.Mutation
at
org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:85)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/10/16 18:50:03 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
1, localhost, ANY, 1185 bytes)
15/10/16 18:50:03 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/10/16 18:50:03 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
localhost): java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be
cast to org.apache.hadoop.hbase.client.Mutation
at
org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:85)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

15/10/16 18:50:03 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times;
aborting job
15/10/16 18:50:03 INFO TaskSchedulerImpl: Cancelling stage 0
15/10/16 18:50:03 INFO Executor: Executor is trying to kill task 1.0 in
stage 0.0 (TID 1)
15/10/16 18:50:03 INFO TaskSchedulerImpl: Stage 0 was cancelled
15/10/16 18:50:03 INFO DAGScheduler: Job 0 failed: foreachRDD at
TwitterStream.scala:150, took 5.956054 s
15/10/16 18:50:03 INFO JobScheduler: Starting job streaming job
144500141 ms.0 from job set of time 144500141 ms
15/10/16 18:50:03 ERROR JobScheduler: Error running job streaming job
144500140 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage
0.0 (TID 0, localhost): java.lang.ClassCastException:
scala.runtime.BoxedUnit cannot be cast to
org.apache.hadoop.hbase.client.Mutation
at
org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:85)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: