about spark on hbase problem

2021-08-17 Thread igyu
System.setProperty("java.security.krb5.conf", config.getJSONObject("auth").getString("krb5"))

val conf = HBaseConfiguration.create()
val zookeeper = config.getString("zookeeper")
val port = config.getString("port")
conf.set(HConstants.ZOOKEEPER_QUORUM, zookeeper)
conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, port)
conf.set("hadoop.security.authentication", "Kerberos")
conf.set(HConstants.ZK_CLIENT_KEYTAB_FILE, config.getJSONObject("auth").getString("keytab"))
conf.set(HConstants.ZK_CLIENT_KERBEROS_PRINCIPAL, config.getJSONObject("auth").getString("principal"))
conf.set(HConstants.ZK_SERVER_KEYTAB_FILE, config.getJSONObject("auth").getString("keytab"))
conf.set(HConstants.ZK_SERVER_KERBEROS_PRINCIPAL, config.getJSONObject("auth").getString("principal"))

HBaseAdmin.available(conf)

I get an error:

21/08/18 12:03:18 INFO ZooKeeper: Initiating client connection, connectString=bigdser5:2181,bigdser2:2181,bigdser3:2181 sessionTimeout=9 watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$16/828536028@17ea11aa
21/08/18 12:03:18 INFO ClientCnxn: Opening socket connection to server bigdser5/10.3.87.27:2181. Will not attempt to authenticate using SASL (unknown error)
21/08/18 12:03:18 INFO ClientCnxn: Socket connection established, initiating session, client: /10.2.65.14:5908, server: bigdser5/10.3.87.27:2181
21/08/18 12:03:18 INFO ClientCnxn: Session establishment complete on server bigdser5/10.3.87.27:2181, sessionid = 0x37acbbd0723c7cd, negotiated timeout = 6
Exception in thread "main" org.apache.hadoop.hbase.MasterNotRunningException: org.apache.hadoop.hbase.MasterNotRunningException: java.io.IOException: Call to bigdser2/10.3.87.24:16000 failed on local exception: java.io.IOException: An established connection was aborted by the software in your host machine.
at org.apache.hadoop.hbase.client.ConnectionImplementation.isMasterRunning(ConnectionImplementation.java:585)
at org.apache.hadoop.hbase.client.HBaseAdmin.available(HBaseAdmin.java:2366)
at com.join.hbase.reader.HbaseReader.readFrom(HbaseReader.scala:36)
at com.join.Synctool$.main(Synctool.scala:524)
at com.join.Synctool.main(Synctool.scala)
Caused by: org.apache.hadoop.hbase.MasterNotRunningException: java.io.IOException: Call to bigdser2/10.3.87.24:16000 failed on local exception: java.io.IOException: An established connection was aborted by the software in your host machine.
at org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStub(ConnectionImplementation.java:1175)
at org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService(ConnectionImplementation.java:1234)
at org.apache.hadoop.hbase.client.ConnectionImplementation.isMasterRunning(ConnectionImplementation.java:583)
... 4 more
Caused by: java.io.IOException: Call to bigdser2/10.3.87.24:16000 failed on local exception: java.io.IOException: An established connection was aborted by the software in your host machine.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:221)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:390)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection$4.operationComplete(NettyRpcConnection.java:292)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection$4.operationComplete(NettyRpcConnection.java:284)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:502)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:476)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:415)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:152)
at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:95)
at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:30)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection.write(NettyRpcConnection.java:284)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection.access$1100(NettyRpcConnection.java:71)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection$6$1.run(NettyRpcConnection.java:344)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
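One thing to note: the ZK_CLIENT_*/ZK_SERVER_* keys set above only configure ZooKeeper's own SASL login (and the log even shows the ZK session succeeding without SASL), not the HBase RPC to the master on port 16000. A commonly cited pattern for Kerberized HBase clients is to set hbase.security.authentication and log the process in through Hadoop's UserGroupInformation before any connection is created. This is only a sketch under that assumption, not a confirmed fix for this cluster; the principal/keytab values are illustrative placeholders for the ones in the JSON config:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.security.UserGroupInformation

val conf = HBaseConfiguration.create()
conf.set("hadoop.security.authentication", "kerberos")
// Client-side HBase RPC authentication; without it the RPC layer stays SIMPLE.
conf.set("hbase.security.authentication", "kerberos")
// Illustrative values; substitute the principal/keytab from config.getJSONObject("auth").
val principal = "hbase-client@EXAMPLE.COM"
val keytab = "/etc/security/keytabs/hbase-client.keytab"

// Log the process in before creating any HBase connection.
UserGroupInformation.setConfiguration(conf)
UserGroupInformation.loginUserFromKeytab(principal, keytab)
```

Clients typically also need hbase.master.kerberos.principal and hbase.regionserver.kerberos.principal set to match the server principals.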

Java : Testing RDD aggregateByKey

2021-08-17 Thread Pedro Tuero
Context: spark-core_2.12-3.1.1
Testing with maven and eclipse.

I'm modifying a project and a test stops working as expected.
The difference is in the parameters passed to the aggregateByKey
function of JavaPairRDD.

JavaSparkContext is created this way:
new JavaSparkContext(new SparkConf()
.setMaster("local[1]")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"));
Then I construct a JavaPairRDD using sparkContext.parallelizePairs, call
a method that performs an aggregateByKey over the input JavaPairRDD, and test
that the result is as expected.

When I use the overload at JavaPairRDD line 369 (calling
.aggregateByKey(zeroValue, combiner, merger)):

def aggregateByKey[U](zeroValue: U, seqFunc: JFunction2[U, V, U],
    combFunc: JFunction2[U, U, U]): JavaPairRDD[K, U] = {
  implicit val ctag: ClassTag[U] = fakeClassTag
  fromRDD(rdd.aggregateByKey(zeroValue)(seqFunc, combFunc))
}

the test works as expected. But when I use the overload at JavaPairRDD
line 355 (calling .aggregateByKey(zeroValue, numPartitions, combiner, merger)):

def aggregateByKey[U](zeroValue: U, numPartitions: Int, seqFunc: JFunction2[U, V, U],
    combFunc: JFunction2[U, U, U]): JavaPairRDD[K, U] = {
  implicit val ctag: ClassTag[U] = fakeClassTag
  fromRDD(rdd.aggregateByKey(zeroValue, numPartitions)(seqFunc, combFunc))
}

the result is always empty. It looks like there is a problem with the
HashPartitioner created in PairRDDFunctions:

def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  aggregateByKey(zeroValue, new HashPartitioner(numPartitions))(seqOp, combOp)
}

vs:

def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
}
I can't debug it properly with Eclipse; the error occurs while threads are
in Spark code (the system editor can only open file-based resources).
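For reference, the contract both overloads share can be modeled without Spark: each partition folds its values into a copy of zeroValue with seqFunc, and the per-partition partial results are merged with combFunc. A plain-Scala sketch of those semantics (illustrative names, no Spark involved) can serve as an oracle to compare the RDD output against for the same input, which helps separate a fault in the aggregation functions from a fault in the partitioning path:

```scala
// Model aggregateByKey on plain collections: each inner Seq plays one partition.
def aggregateByKeyModel[K, V, U](partitions: Seq[Seq[(K, V)]], zero: U)(
    seqFunc: (U, V) => U, combFunc: (U, U) => U): Map[K, U] = {
  // Per-partition fold with seqFunc, starting from zero for each key.
  val partials: Seq[Map[K, U]] = partitions.map { part =>
    part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
      acc.updated(k, seqFunc(acc.getOrElse(k, zero), v))
    }
  }
  // Merge the per-partition partial results with combFunc.
  partials.foldLeft(Map.empty[K, U]) { (acc, m) =>
    m.foldLeft(acc) { case (a, (k, u)) =>
      a.updated(k, a.get(k).map(combFunc(_, u)).getOrElse(u))
    }
  }
}

val data = Seq(Seq(("a", 1), ("b", 2)), Seq(("a", 3)))
val sums = aggregateByKeyModel(data, 0)(_ + _, _ + _)
println(sums) // Map(a -> 4, b -> 2)
```

Whatever numPartitions is, the model's answer should not change; if the RDD's does, the partitioner path (or serialization of zeroValue and the functions) is the place to look.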

Does anyone know how to resolve this issue?

Thanks in advance,
Pedro.


Re: Databricks notebook - cluster taking a long time to get created, often timing out

2021-08-17 Thread Denny Lee
Hi Karan,

You may want to ping Databricks Help  or
Forums  as this is a Databricks
specific question.  I'm a little surprised that a Databricks cluster would
take this long to create, so it may be best to use those forums to
grok the cause.

HTH!
Denny




On Mon, Aug 16, 2021 at 11:10 PM, karan alang  wrote:

> Hello - I've been using the Databricks notebook (for PySpark or Scala/Spark
> development), and recently have had issues where cluster creation takes a
> long time, often timing out.
>
> Any ideas on how to resolve this?
> Any other alternatives to the Databricks notebook?
>


Re: Databricks notebook - cluster taking a long time to get created, often timing out

2021-08-17 Thread Jeff Zhang
Maybe you can try the Zeppelin notebook: http://zeppelin.apache.org/


karan alang  wrote on Tue, Aug 17, 2021 at 2:11 PM:

> Hello - I've been using the Databricks notebook (for PySpark or Scala/Spark
> development), and recently have had issues where cluster creation takes a
> long time, often timing out.
>
> Any ideas on how to resolve this?
> Any other alternatives to the Databricks notebook?
>
>

-- 
Best Regards

Jeff Zhang


Databricks notebook - cluster taking a long time to get created, often timing out

2021-08-17 Thread karan alang
Hello - I've been using the Databricks notebook (for PySpark or Scala/Spark
development), and recently have had issues where cluster creation takes a
long time, often timing out.

Any ideas on how to resolve this?
Any other alternatives to the Databricks notebook?