Re: Spark yarn cluster

2020-07-11 Thread Diwakar Dhanuskodi
Thanks Martin.

I was not clear in my question initially. Thanks for understanding and
clarifying.

The idea, as you said, is to explore the possibility of using YARN for
cluster scheduling, with Spark being used without HDFS. Thanks again for the
clarification.

On Sat, Jul 11, 2020 at 1:27 PM Juan Martín Guillén <
juanmartinguil...@yahoo.com.ar> wrote:

> Hi Diwakar,
>
> A YARN cluster without Hadoop is kind of a fuzzy concept.
>
> You may well want to have Hadoop and yet not need MapReduce, using Spark
> instead. That is the main reason to use Spark in a Hadoop cluster anyway.
>
> On the other hand, it is highly probable you will want to use HDFS,
> although it is not strictly necessary.
>
> So, answering your question: you are using Hadoop by using YARN, because
> it is one of its three main components, but that doesn't mean you need to
> use the other components of the Hadoop cluster, namely MapReduce and HDFS.
>
> That being said, if you just need cluster scheduling and are not using
> MapReduce or HDFS, it is possible you will be fine with the Spark
> Standalone cluster.
>
> Regards,
> Juan Martín.
>
> El sábado, 11 de julio de 2020 13:57:40 ART, Diwakar Dhanuskodi <
> diwakar.dhanusk...@gmail.com> escribió:
>
>
> Hi ,
>
> Could it be possible to set up Spark within a YARN cluster which may not
> have Hadoop?
>
> Thanks.
>


Re: Spark yarn cluster

2020-07-11 Thread Juan Martín Guillén
 Hi Diwakar,

A YARN cluster without Hadoop is kind of a fuzzy concept.
You may well want to have Hadoop and yet not need MapReduce, using Spark
instead. That is the main reason to use Spark in a Hadoop cluster anyway.
On the other hand, it is highly probable you will want to use HDFS, although
it is not strictly necessary.
So, answering your question: you are using Hadoop by using YARN, because it is
one of its three main components, but that doesn't mean you need to use the
other components of the Hadoop cluster, namely MapReduce and HDFS.
That being said, if you just need cluster scheduling and are not using
MapReduce or HDFS, it is possible you will be fine with the Spark Standalone
cluster.
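
To make the distinction concrete, here is a minimal, hedged sketch of the two
submission modes (the class name, jar name and master host are placeholders,
not taken from this thread):

  # YARN: needs a Hadoop/YARN cluster, but HDFS and MapReduce stay optional
  spark-submit --master yarn --deploy-mode cluster \
    --class com.example.MyApp my-app.jar

  # Spark Standalone: no Hadoop components at all, only Spark's own master and workers
  spark-submit --master spark://<standalone-master-host>:7077 --deploy-mode cluster \
    --class com.example.MyApp my-app.jar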

Regards,
Juan Martín.

El sábado, 11 de julio de 2020 13:57:40 ART, Diwakar Dhanuskodi 
 escribió:  
 
 Hi,
Could it be possible to set up Spark within a YARN cluster which may not have
Hadoop?
Thanks.

Spark yarn cluster

2020-07-11 Thread Diwakar Dhanuskodi
Hi,

Could it be possible to set up Spark within a YARN cluster which may not have
Hadoop?

Thanks.


Re: SPark - YARN Cluster Mode

2017-02-27 Thread ayan guha
Hi

Thanks a lot, I used a properties file to resolve the issue. I think the
documentation should mention it, though.

On Tue, 28 Feb 2017 at 5:05 am, Marcelo Vanzin  wrote:

> >  none of my Config settings
>
> Is it none of the configs or just the queue? You can't set the YARN
> queue in cluster mode through code, it has to be set in the command
> line. It's a chicken & egg problem (in cluster mode, the YARN app is
> created before your code runs).
>
>  --properties-file works the same as setting options in the command
> line, so you can use that instead.
>
>
> On Sun, Feb 26, 2017 at 4:52 PM, ayan guha  wrote:
> > Hi
> >
> > I am facing an issue with Cluster Mode, with pyspark
> >
> > Here is my code:
> >
> > conf = SparkConf()
> > conf.setAppName("Spark Ingestion")
> > conf.set("spark.yarn.queue","root.Applications")
> > conf.set("spark.executor.instances","50")
> > conf.set("spark.executor.memory","22g")
> > conf.set("spark.yarn.executor.memoryOverhead","4096")
> > conf.set("spark.executor.cores","4")
> > conf.set("spark.sql.hive.convertMetastoreParquet", "false")
> > sc = SparkContext(conf = conf)
> > sqlContext = HiveContext(sc)
> >
> > r = sc.parallelize(xrange(1,1))
> > print r.count()
> >
> > sc.stop()
> >
> > The problem is none of my Config settings are passed on to Yarn.
> >
> > spark-submit --master yarn --deploy-mode cluster ayan_test.py
> >
> > I tried the same code with deploy-mode=client and all config are passing
> > fine.
> >
> > Am I missing something? Will introducing --properties-file be of any help?
> > Can anybody share some working example?
> >
> > Best
> > Ayan
> >
> > --
> > Best Regards,
> > Ayan Guha
>
>
>
> --
> Marcelo
>
-- 
Best Regards,
Ayan Guha


Re: SPark - YARN Cluster Mode

2017-02-27 Thread Marcelo Vanzin
>  none of my Config settings

Is it none of the configs or just the queue? You can't set the YARN
queue in cluster mode through code, it has to be set in the command
line. It's a chicken & egg problem (in cluster mode, the YARN app is
created before your code runs).

 --properties-file works the same as setting options in the command
line, so you can use that instead.
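
As an illustration of that route, here is a minimal, hedged sketch: move the
YARN settings out of the code into a properties file and hand it to
spark-submit. The file name is made up; the values are simply the ones from
the conf.set calls quoted below.

  # job.properties (hypothetical file name)
  spark.yarn.queue                          root.Applications
  spark.executor.instances                  50
  spark.executor.memory                     22g
  spark.yarn.executor.memoryOverhead        4096
  spark.executor.cores                      4
  spark.sql.hive.convertMetastoreParquet    false

  spark-submit --master yarn --deploy-mode cluster \
    --properties-file job.properties ayan_test.py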


On Sun, Feb 26, 2017 at 4:52 PM, ayan guha  wrote:
> Hi
>
> I am facing an issue with Cluster Mode, with pyspark
>
> Here is my code:
>
> conf = SparkConf()
> conf.setAppName("Spark Ingestion")
> conf.set("spark.yarn.queue","root.Applications")
> conf.set("spark.executor.instances","50")
> conf.set("spark.executor.memory","22g")
> conf.set("spark.yarn.executor.memoryOverhead","4096")
> conf.set("spark.executor.cores","4")
> conf.set("spark.sql.hive.convertMetastoreParquet", "false")
> sc = SparkContext(conf = conf)
> sqlContext = HiveContext(sc)
>
> r = sc.parallelize(xrange(1,1))
> print r.count()
>
> sc.stop()
>
> The problem is none of my Config settings are passed on to Yarn.
>
> spark-submit --master yarn --deploy-mode cluster ayan_test.py
>
> I tried the same code with deploy-mode=client and all config are passing
> fine.
>
> Am I missing something? Will introducing --properties-file be of any help? Can
> anybody share some working example?
>
> Best
> Ayan
>
> --
> Best Regards,
> Ayan Guha



-- 
Marcelo




Re: SPark - YARN Cluster Mode

2017-02-26 Thread ayan guha
Also, I wanted to add that if I specify the conf on the command line, it
seems to be working.

For example, if I use

spark-submit --master yarn --deploy-mode cluster --conf
spark.yarn.queue=root.Application ayan_test.py 10

Then it is going to correct queue.

Any help would be great

Best
Ayan

On Mon, Feb 27, 2017 at 11:52 AM, ayan guha  wrote:

> Hi
>
> I am facing an issue with Cluster Mode, with pyspark
>
> Here is my code:
>
> conf = SparkConf()
> conf.setAppName("Spark Ingestion")
> conf.set("spark.yarn.queue","root.Applications")
> conf.set("spark.executor.instances","50")
> conf.set("spark.executor.memory","22g")
> conf.set("spark.yarn.executor.memoryOverhead","4096")
> conf.set("spark.executor.cores","4")
> conf.set("spark.sql.hive.convertMetastoreParquet", "false")
> sc = SparkContext(conf = conf)
> sqlContext = HiveContext(sc)
>
> r = sc.parallelize(xrange(1,1))
> print r.count()
>
> sc.stop()
>
> The problem is none of my Config settings are passed on to Yarn.
>
> spark-submit --master yarn --deploy-mode cluster ayan_test.py
>
> I tried the same code with deploy-mode=client and all config are passing
> fine.
>
> Am I missing something? Will introducing --property-file be of any help?
> Can anybody share some working example?
>
> Best
> Ayan
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Best Regards,
Ayan Guha


SPark - YARN Cluster Mode

2017-02-26 Thread ayan guha
Hi

I am facing an issue with Cluster Mode, with pyspark

Here is my code:

conf = SparkConf()
conf.setAppName("Spark Ingestion")
conf.set("spark.yarn.queue","root.Applications")
conf.set("spark.executor.instances","50")
conf.set("spark.executor.memory","22g")
conf.set("spark.yarn.executor.memoryOverhead","4096")
conf.set("spark.executor.cores","4")
conf.set("spark.sql.hive.convertMetastoreParquet", "false")
sc = SparkContext(conf = conf)
sqlContext = HiveContext(sc)

r = sc.parallelize(xrange(1,1))
print r.count()

sc.stop()

The problem is none of my Config settings are passed on to Yarn.

spark-submit --master yarn --deploy-mode cluster ayan_test.py

I tried the same code with deploy-mode=client and all config are passing
fine.

Am I missing something? Will introducing --properties-file be of any help?
Can anybody share some working example?
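
For what it's worth, a hedged sketch of a working submission with the same
settings expressed as spark-submit options instead of conf.set calls in the
code (the values are simply copied from the code above; this mirrors the
command-line approach discussed earlier in the thread):

  spark-submit --master yarn --deploy-mode cluster \
    --name "Spark Ingestion" \
    --queue root.Applications \
    --num-executors 50 \
    --executor-memory 22g \
    --executor-cores 4 \
    --conf spark.yarn.executor.memoryOverhead=4096 \
    --conf spark.sql.hive.convertMetastoreParquet=false \
    ayan_test.py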

Best
Ayan

-- 
Best Regards,
Ayan Guha


Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread ayan guha
You may try copying the file to the same location on all nodes and reading it
from that place.
On 24 Sep 2016 00:20, "ABHISHEK"  wrote:

> I have tried with hdfs/tmp location but it didn't work. Same error.
>
> On 23 Sep 2016 19:37, "Aditya"  wrote:
>
>> Hi Abhishek,
>>
>> Try below spark submit.
>> spark-submit --master yarn --deploy-mode cluster  --files hdfs://
>> abc.com:8020/tmp/abc.drl --class com.abc.StartMain
>> abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar abc.drl
>> 
>>
>> On Friday 23 September 2016 07:29 PM, ABHISHEK wrote:
>>
>> Thanks for your response Aditya and Steve.
>> Steve:
>> I have tried specifying both /tmp/filename in hdfs and local path but it
>> didn't work.
>> You may be write that Kie session is configured  to  access files from
>> Local path.
>> I have attached code here for your reference and if you find some thing
>> wrong, please help to correct it.
>>
>> Aditya:
>> I have attached code here for reference. --File option will distributed
>> reference file to all node but  Kie session is not able  to pickup it.
>>
>> Thanks,
>> Abhishek
>>
>> On Fri, Sep 23, 2016 at 2:25 PM, Steve Loughran 
>> wrote:
>>
>>>
>>> On 23 Sep 2016, at 08:33, ABHISHEK  wrote:
>>>
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.io.FileNotFoundException: hdfs:/abc.com:8020/user/abhiet
>>> c/abc.drl (No such file or directory)
>>> at java.io.FileInputStream.open(Native Method)
>>> at java.io.FileInputStream.(FileInputStream.java:146)
>>> at org.drools.core.io.impl.FileSystemResource.getInputStream(Fi
>>> leSystemResource.java:123)
>>> at org.drools.compiler.kie.builder.impl.KieFileSystemImpl.write
>>> (KieFileSystemImpl.java:58)
>>>
>>>
>>>
>>> Looks like this .KieFileSystemImpl class only works with local files, so
>>> when it gets an HDFS path in it tries to open it and gets confused.
>>>
>>> you may need to write to a local FS temp file then copy it into HDFS
>>>
>>
>>
>>
>>


Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread ABHISHEK
I have tried with hdfs/tmp location but it didn't work. Same error.

On 23 Sep 2016 19:37, "Aditya"  wrote:

> Hi Abhishek,
>
> Try below spark submit.
> spark-submit --master yarn --deploy-mode cluster  --files hdfs://
> abc.com:8020/tmp/abc.drl --class com.abc.StartMain
> abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar abc.drl
> 
>
> On Friday 23 September 2016 07:29 PM, ABHISHEK wrote:
>
> Thanks for your response Aditya and Steve.
> Steve:
> I have tried specifying both /tmp/filename in hdfs and local path but it
> didn't work.
> You may be write that Kie session is configured  to  access files from
> Local path.
> I have attached code here for your reference and if you find some thing
> wrong, please help to correct it.
>
> Aditya:
> I have attached code here for reference. --File option will distributed
> reference file to all node but  Kie session is not able  to pickup it.
>
> Thanks,
> Abhishek
>
> On Fri, Sep 23, 2016 at 2:25 PM, Steve Loughran 
> wrote:
>
>>
>> On 23 Sep 2016, at 08:33, ABHISHEK  wrote:
>>
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.io.FileNotFoundException: hdfs:/abc.com:8020/user/abhiet
>> c/abc.drl (No such file or directory)
>> at java.io.FileInputStream.open(Native Method)
>> at java.io.FileInputStream.(FileInputStream.java:146)
>> at org.drools.core.io.impl.FileSystemResource.getInputStream(Fi
>> leSystemResource.java:123)
>> at org.drools.compiler.kie.builder.impl.KieFileSystemImpl.
>> write(KieFileSystemImpl.java:58)
>>
>>
>>
>> Looks like this .KieFileSystemImpl class only works with local files, so
>> when it gets an HDFS path in it tries to open it and gets confused.
>>
>> you may need to write to a local FS temp file then copy it into HDFS
>>
>
>
>
>


Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread Aditya

Hi Abhishek,

Try below spark submit.
spark-submit --master yarn --deploy-mode cluster \
  --files hdfs://abc.com:8020/tmp/abc.drl \
  --class com.abc.StartMain \
  abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar abc.drl


On Friday 23 September 2016 07:29 PM, ABHISHEK wrote:

Thanks for your response Aditya and Steve.
Steve:
I have tried specifying both /tmp/filename in hdfs and local path but 
it didn't work.
You may be write that Kie session is configured  to  access files from 
Local path.
I have attached code here for your reference and if you find some 
thing wrong, please help to correct it.


Aditya:
I have attached code here for reference. --File option will 
distributed reference file to all node but  Kie session is not able 
 to pickup it.


Thanks,
Abhishek

On Fri, Sep 23, 2016 at 2:25 PM, Steve Loughran 
> wrote:




On 23 Sep 2016, at 08:33, ABHISHEK > wrote:

at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException:
hdfs:/abc.com:8020/user/abhietc/abc.drl
(No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.(FileInputStream.java:146)
at

org.drools.core.io.impl.FileSystemResource.getInputStream(FileSystemResource.java:123)
at

org.drools.compiler.kie.builder.impl.KieFileSystemImpl.write(KieFileSystemImpl.java:58)



Looks like this .KieFileSystemImpl class only works with local
files, so when it gets an HDFS path in it tries to open it and
gets confused.

you may need to write to a local FS temp file then copy it into HDFS








Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread ABHISHEK
Thanks for your response, Aditya and Steve.
Steve:
I have tried specifying both /tmp/filename in HDFS and a local path, but it
didn't work.
You may be right that the Kie session is configured to access files from a
local path.
I have attached the code here for your reference; if you find something
wrong, please help me correct it.

Aditya:
I have attached the code here for reference. The --files option will
distribute the reference file to all nodes, but the Kie session is not able
to pick it up.

Thanks,
Abhishek

On Fri, Sep 23, 2016 at 2:25 PM, Steve Loughran 
wrote:

>
> On 23 Sep 2016, at 08:33, ABHISHEK  wrote:
>
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.FileNotFoundException: hdfs:/abc.com:8020/user/
> abhietc/abc.drl (No such file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.(FileInputStream.java:146)
> at org.drools.core.io.impl.FileSystemResource.getInputStream(
> FileSystemResource.java:123)
> at org.drools.compiler.kie.builder.impl.KieFileSystemImpl.write(
> KieFileSystemImpl.java:58)
>
>
>
> Looks like this .KieFileSystemImpl class only works with local files, so
> when it gets an HDFS path in it tries to open it and gets confused.
>
> you may need to write to a local FS temp file then copy it into HDFS
>
object Mymain {
  def main(args: Array[String]): Unit = {

    // @ abhishek
    // val fileName = "./abc.drl"   // code works if I run the app in local mode
    val fileName = args(0)

    val conf = new SparkConf().setAppName("LocalStreaming")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(1))

    // Kafka consumer configuration
    val brokers = "190.51.231.132:9092"
    val groupId = "testgroup"
    val offsetReset = "smallest"
    val pollTimeout = "1000"
    val topics = "NorthPole"
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
      ConsumerConfig.GROUP_ID_CONFIG -> groupId,
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> offsetReset,
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false",
      "spark.kafka.poll.time" -> pollTimeout)

    // HBase output configuration for the detail and summary tables
    val detailTable = "emp1"
    val summaryTable = "emp2"
    val confHBase = HBaseConfiguration.create()
    confHBase.set("hbase.zookeeper.quorum", "190.51.231.132")
    confHBase.set("hbase.zookeeper.property.clientPort", "2181")
    confHBase.set("hbase.master", "190.51.231.132:6")
    val emp_detail_config = Job.getInstance(confHBase)
    val emp_summary_config = Job.getInstance(confHBase)

    emp_detail_config.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "emp_detail")
    emp_detail_config.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
    emp_summary_config.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "emp_summary")
    emp_summary_config.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

    messages.foreachRDD(rdd => {
      if (rdd.count > 0) {
        val messageRDD: RDD[String] = rdd.map { y => y._2 }
        val inputJsonObjectRDD = messageRDD.map(row => Utilities.convertToJsonObject(row))

        inputJsonObjectRDD.map(row => BuildJsonStrctures.buildTaxCalcDetail(row))
          .saveAsNewAPIHadoopDataset(emp_detail_config.getConfiguration)

        val inputObjectRDD = messageRDD.map(row => Utilities.convertToSubmissionJavaObject(row))
        // The Kie session is created on the executors with the file name passed in as args(0)
        val executedObjectRDD = inputObjectRDD.mapPartitions(row => KieSessionFactory.execute(row, fileName.toString()))
        val executedJsonRDD = executedObjectRDD.map(row => Utilities.convertToSubmissionJSonString(row))
          .map(row => Utilities.convertToJsonObject(row))

        val summaryInputJsonRDD = executedObjectRDD

        summaryInputJsonRDD.map(row => BuildJsonStrctures.buildSummary2(row))
          .saveAsNewAPIHadoopDataset(emp_summary_config.getConfiguration)
      } else {
        println("No message received") // this works only in master local mode
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
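
For what it's worth, a hedged sketch of how the executor-side call could pick
up the copy shipped with --files instead of the hdfs:// URI. It reuses the
names from the attachment above and Aditya's suggested spark-submit; it is an
untested illustration, not a confirmed fix:

  // Submit with the rules file attached and pass only its base name as args(0):
  //   spark-submit --master yarn --deploy-mode cluster \
  //     --files hdfs://abc.com:8020/tmp/abc.drl \
  //     --class com.abc.StartMain abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar abc.drl
  //
  // YARN copies every --files entry into each container's working directory,
  // so on the executors the relative name "abc.drl" resolves to a real local
  // file, which is what the Drools FileSystemResource expects.
  val fileName = args(0) // "abc.drl", not an hdfs:// URI

  val executedObjectRDD = inputObjectRDD.mapPartitions { rows =>
    KieSessionFactory.execute(rows, fileName) // opens ./abc.drl locally on each node
  }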

Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread Steve Loughran

On 23 Sep 2016, at 08:33, ABHISHEK 
> wrote:

at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: hdfs:/abc.com:8020/user/abhietc/abc.drl (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at org.drools.core.io.impl.FileSystemResource.getInputStream(FileSystemResource.java:123)
at org.drools.compiler.kie.builder.impl.KieFileSystemImpl.write(KieFileSystemImpl.java:58)


Looks like this KieFileSystemImpl class only works with local files, so when
it gets an HDFS path it tries to open it as a local file and gets confused.

you may need to write to a local FS temp file then copy it into HDFS


Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread Aditya

Hi Abhishek,

From your spark-submit it seems you're passing the file as a parameter to
the driver program. So it depends on what exactly you are doing with that
parameter. Using the --files option, the file will be available on all the
worker nodes, but if your code references it by the originally specified
path in distributed mode, it won't find the file on the worker nodes.


If you can share the snippet of code it will be easier to debug.

On Friday 23 September 2016 01:03 PM, ABHISHEK wrote:

Hello there,

I have Spark Application which refer to an external file ‘abc.drl’ and 
having unstructured data.
Application is able to find this reference file if I  run app in Local 
mode but in Yarn with Cluster mode, it is not able to  find the file 
in the specified path.
I tried with both local and hdfs path with –-files option but it 
didn’t work.



What is working ?
1.Current  Spark Application runs fine if I run it in Local mode as 
mentioned below.

In below command   file path is local path not HDFS.
spark-submit --master local[*]  --class "com.abc.StartMain" 
abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar /home/abhietc/abc/abc.drl


3.I want to run this Spark application using Yarn with cluster mode.
For that, I used below command but application is not able to find the 
path for the reference file abc.drl.I tried giving both local and HDFS 
path but didn’t work.


spark-submit --master yarn --deploy-mode cluster  --files 
/home/abhietc/abc/abc.drl --class com.abc.StartMain 
abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar /home/abhietc/abc/abc.drl


spark-submit --master yarn --deploy-mode cluster  --files 
hdfs://abhietc.com:8020/user/abhietc/abc.drl 
 --class 
com.abc.StartMain abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar 
hdfs://abhietc.com:8020/user/abhietc/abc.drl 



spark-submit --master yarn --deploy-mode cluster  --files 
hdfs://abc.com:8020/tmp/abc.drl  
--class com.abc.StartMain abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar 
hdfs://abc.com:8020/tmp/abc.drl 



Error Messages:
Surprising we are not doing any Write operation on reference file but 
still log shows that application is trying to write to file instead 
reading the file.

Also log shows File not found exception for both HDFS and Local path.
-
16/09/20 14:49:50 ERROR scheduler.JobScheduler: Error running job 
streaming job 1474363176000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 
in stage 1.0 (TID 4, abc.com ): 
java.lang.RuntimeException: Unable to write Resource: 
FileResource[file=hdfs:/abc.com:8020/user/abhietc/abc.drl 
]
at 
org.drools.compiler.kie.builder.impl.KieFileSystemImpl.write(KieFileSystemImpl.java:71)
at 
com.hmrc.taxcalculator.KieSessionFactory$.getNewSession(KieSessionFactory.scala:49)
at 
com.hmrc.taxcalculator.KieSessionFactory$.getKieSession(KieSessionFactory.scala:21)
at 
com.hmrc.taxcalculator.KieSessionFactory$.execute(KieSessionFactory.scala:27)
at 
com.abc.StartMain$$anonfun$main$1$$anonfun$4.apply(TaxCalculatorMain.scala:124)
at 
com.abc.StartMain$$anonfun$main$1$$anonfun$4.apply(TaxCalculatorMain.scala:124)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

at org.apache.spark.scheduler.Task.run(Task.scala:89)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: 
hdfs:/abc.com:8020/user/abhietc/abc.drl 
 (No such file or directory)

at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.(FileInputStream.java:146)
at 
org.drools.core.io.impl.FileSystemResource.getInputStream(FileSystemResource.java:123)
at 

Spark Yarn Cluster with Reference File

2016-09-23 Thread ABHISHEK
Hello there,

I have a Spark application which refers to an external file 'abc.drl'
containing unstructured data.
The application is able to find this reference file if I run the app in local
mode, but in YARN cluster mode it is not able to find the file in the
specified path.
I tried both a local and an HDFS path with the --files option, but it didn't
work.


What is working?
1. The current Spark application runs fine if I run it in local mode, as
mentioned below.
In the command below, the file path is a local path, not HDFS.
spark-submit --master local[*]  --class "com.abc.StartMain"
abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar /home/abhietc/abc/abc.drl

3. I want to run this Spark application on YARN in cluster mode.
For that, I used the commands below, but the application is not able to find
the path for the reference file abc.drl. I tried giving both local and HDFS
paths, but it didn't work.

spark-submit --master yarn --deploy-mode cluster  --files
/home/abhietc/abc/abc.drl --class com.abc.StartMain
abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar /home/abhietc/abc/abc.drl

spark-submit --master yarn --deploy-mode cluster  --files hdfs://
abhietc.com:8020/user/abhietc/abc.drl --class com.abc.StartMain
abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar hdfs://
abhietc.com:8020/user/abhietc/abc.drl

spark-submit --master yarn --deploy-mode cluster  --files hdfs://
abc.com:8020/tmp/abc.drl --class com.abc.StartMain
abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar hdfs://abc.com:8020/tmp/abc.drl


Error Messages:
Surprisingly, we are not doing any write operation on the reference file, but
the log still shows that the application is trying to write to the file
instead of reading it.
The log also shows a FileNotFoundException for both the HDFS and the local path.
-
16/09/20 14:49:50 ERROR scheduler.JobScheduler: Error running job streaming job 1474363176000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, abc.com): java.lang.RuntimeException: Unable to write Resource: FileResource[file=hdfs:/abc.com:8020/user/abhietc/abc.drl]
   at org.drools.compiler.kie.builder.impl.KieFileSystemImpl.write(KieFileSystemImpl.java:71)
   at com.hmrc.taxcalculator.KieSessionFactory$.getNewSession(KieSessionFactory.scala:49)
   at com.hmrc.taxcalculator.KieSessionFactory$.getKieSession(KieSessionFactory.scala:21)
   at com.hmrc.taxcalculator.KieSessionFactory$.execute(KieSessionFactory.scala:27)
   at com.abc.StartMain$$anonfun$main$1$$anonfun$4.apply(TaxCalculatorMain.scala:124)
   at com.abc.StartMain$$anonfun$main$1$$anonfun$4.apply(TaxCalculatorMain.scala:124)
   at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
   at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
   at org.apache.spark.scheduler.Task.run(Task.scala:89)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: hdfs:/abc.com:8020/user/abhietc/abc.drl (No such file or directory)
   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.<init>(FileInputStream.java:146)
   at org.drools.core.io.impl.FileSystemResource.getInputStream(FileSystemResource.java:123)
   at org.drools.compiler.kie.builder.impl.KieFileSystemImpl.write(KieFileSystemImpl.java:58)
   ... 19 more
--
Cheers,
Abhishek


Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-17 Thread Mich Talebzadeh
Hi

CLOSE_WAIT!

According to this <https://access.redhat.com/solutions/437133> link

- CLOSE_WAIT - Indicates that the server has received the first FIN signal
from the client and the connection is in the process of being closed. So
this essentially means that this is a state where the socket is waiting for
the application to execute close(). A socket can stay in the CLOSE_WAIT
state indefinitely until the application closes it. Faulty scenarios would
be things like a file descriptor leak, or the server never executing
close() on the socket, leading to a pile-up of CLOSE_WAIT sockets.
- The CLOSE_WAIT status means that the other side has initiated a
connection close, but the application on the local side has not yet closed
the socket.

Normally it should be LISTEN or ESTABLISHED.

HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 September 2016 at 16:14, <anupama.gangad...@daimler.com> wrote:

> Hi,
>
>
>
> Yes. I am able to connect to Hive from simple Java program running in the
> cluster. When using spark-submit I faced the issue.
>
> The output of command is given below
>
>
>
> $> netstat -alnp |grep 10001
>
> (Not all processes could be identified, non-owned process info
>
> will not be shown, you would have to be root to see it all.)
>
> tcp        1      0 53.244.194.223:25612    53.244.194.221:10001    CLOSE_WAIT  -
>
>
>
> Thanks
>
> Anupama
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Saturday, September 17, 2016 12:36 AM
> *To:* Gangadhar, Anupama (623)
> *Cc:* user @spark
> *Subject:* Re: Error trying to connect to Hive from Spark (Yarn-Cluster
> Mode)
>
>
>
> Is your Hive Thrift Server up and running on port 10001, i.e.
> jdbc:hive2://<hive server2 host>:10001?
>
>
>
> Do  the following
>
>
>
>  netstat -alnp |grep 10001
>
> and see whether it is actually running
>
>
>
> HTH
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On 16 September 2016 at 19:53, <anupama.gangad...@daimler.com> wrote:
>
> Hi,
>
>
>
> I am trying to connect to Hive from Spark application in Kerborized
> cluster and get the following exception.  Spark version is 1.4.1 and Hive
> is 1.2.1. Outside of spark the connection goes through fine.
>
> Am I missing any configuration parameters?
>
>
>
> java.sql.SQLException: Could not open connection to
> jdbc:hive2://<hive server2 host>:10001/default;principal=hive/<hive server2 host>;ssl=false;transportMode=http;httpPath=cliservice: null
>
>at org.apache.hive.jdbc.HiveConnection.openTransport(
> HiveConnection.java:206)
>
>at org.apache.hive.jdbc.HiveConnection.(
> HiveConnection.java:178)
>
>at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.
> java:105)
>
>at java.sql.DriverManager.getConnection(DriverManager.
> java:571)
>
>at java.sql.DriverManager.getConnection(DriverManager.
> java:215)
>
>at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
>
>at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
>
>at org.apache.spark.api.java.JavaPairRDD$$anonfun$
> toScalaFunction$1.apply(JavaPairRDD.scala:1027)
>
>at scala.collection.Iterator$$anon$11.next(Iterator.scala:
> 328)
>
>at scala.collection.Iterator$$anon$11.next(Iterator.scala:
> 328)
>
>at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.
> apply$mcV$sp(PairRDDFunctions.scala:1109)
>
>at org.apache.spark.r

RE: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-17 Thread anupama . gangadhar
Hi,

Yes. I am able to connect to Hive from a simple Java program running in the
cluster. When using spark-submit I faced the issue.
The output of the command is given below:

$> netstat -alnp |grep 10001
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp        1      0 53.244.194.223:25612    53.244.194.221:10001    CLOSE_WAIT  -

Thanks
Anupama

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Saturday, September 17, 2016 12:36 AM
To: Gangadhar, Anupama (623)
Cc: user @spark
Subject: Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

Is your Hive Thrift Server up and running on port 10001, i.e. jdbc:hive2://<hive server2 host>:10001?

Do  the following

 netstat -alnp |grep 10001

and see whether it is actually running

HTH







Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 16 September 2016 at 19:53, <anupama.gangad...@daimler.com> wrote:
Hi,

I am trying to connect to Hive from Spark application in Kerborized cluster and 
get the following exception.  Spark version is 1.4.1 and Hive is 1.2.1. Outside 
of spark the connection goes through fine.
Am I missing any configuration parameters?

java.sql.SQLException: Could not open connection to jdbc:hive2://<hive server2 host>:10001/default;principal=hive/<hive server2 host>;ssl=false;transportMode=http;httpPath=cliservice: null
   at 
org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:206)
   at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:178)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:571)
   at java.sql.DriverManager.getConnection(DriverManager.java:215)
   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
   at 
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
   at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
   at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException
   at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
   at 
org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
   at 
org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
   at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:258)
   at 
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
   at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
   at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInform

RE: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-17 Thread anupama . gangadhar
Hi,
@Deepak
I have used a separate user keytab (not the Hadoop services keytab) and am able
to connect to Hive via a simple Java program.
I am able to connect to Hive from spark-shell as well. However, when I submit a
Spark job using this same keytab, I see the issue.
Does the cache have a role to play here? In the cluster, the transport mode is
http and SSL is disabled.

Thanks

Anupama



From: Deepak Sharma [mailto:deepakmc...@gmail.com]
Sent: Saturday, September 17, 2016 8:35 AM
To: Gangadhar, Anupama (623)
Cc: spark users
Subject: Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)


Hi Anupama

To me it looks like issue with the SPN with which you are trying to connect to 
hive2 , i.e. hive@hostname.

Are you able to connect to hive from spark-shell?

Try getting the tkt using any other user keytab but not hadoop services keytab 
and then try running the spark submit.



Thanks

Deepak

On 17 Sep 2016 12:23 am, <anupama.gangad...@daimler.com> wrote:
Hi,

I am trying to connect to Hive from Spark application in Kerborized cluster and 
get the following exception.  Spark version is 1.4.1 and Hive is 1.2.1. Outside 
of spark the connection goes through fine.
Am I missing any configuration parameters?

java.sql.SQLException: Could not open connection to jdbc:hive2://<hive server2 host>:10001/default;principal=hive/<hive server2 host>;ssl=false;transportMode=http;httpPath=cliservice: null
   at 
org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:206)
   at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:178)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:571)
   at java.sql.DriverManager.getConnection(DriverManager.java:215)
   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
   at 
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
   at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
   at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException
   at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
   at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
   at 
org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
   at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:258)
   at 
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
   at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
   at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
   at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
   at 
org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203)
   ... 21 more

In spark conf directory hive-site.xm

Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread Deepak Sharma
Hi Anupama

To me it looks like an issue with the SPN with which you are trying to connect
to hive2, i.e. hive@hostname.

Are you able to connect to Hive from spark-shell?

Try getting the ticket using any other user keytab (but not a Hadoop services
keytab) and then try running the spark-submit.
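
A hedged sketch of that suggestion (principal, realm, keytab path, class and
jar names are placeholders, not taken from this thread). On YARN, spark-submit
also accepts --principal and --keytab, so the job can log in with the same
user keytab:

  # get a ticket as the application user, not as a Hadoop service
  kinit -kt /path/to/appuser.keytab appuser@EXAMPLE.REALM

  # submit in yarn-cluster mode with the same user keytab
  spark-submit --master yarn --deploy-mode cluster \
    --principal appuser@EXAMPLE.REALM \
    --keytab /path/to/appuser.keytab \
    --class com.example.SparkHiveJDBCTest my-app.jar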


Thanks

Deepak

On 17 Sep 2016 12:23 am,  wrote:

> Hi,
>
>
>
> I am trying to connect to Hive from Spark application in Kerborized
> cluster and get the following exception.  Spark version is 1.4.1 and Hive
> is 1.2.1. Outside of spark the connection goes through fine.
>
> Am I missing any configuration parameters?
>
>
>
> java.sql.SQLException: Could not open connection to
> jdbc:hive2://<hive server2 host>:10001/default;principal=hive/<hive server2 host>;ssl=false;transportMode=http;httpPath=cliservice: null
>
>at org.apache.hive.jdbc.HiveConne
> ction.openTransport(HiveConnection.java:206)
>
>at org.apache.hive.jdbc.HiveConne
> ction.(HiveConnection.java:178)
>
>at org.apache.hive.jdbc.HiveDrive
> r.connect(HiveDriver.java:105)
>
>at java.sql.DriverManager.getConn
> ection(DriverManager.java:571)
>
>at java.sql.DriverManager.getConn
> ection(DriverManager.java:215)
>
>at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
>
>at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
>
>at org.apache.spark.api.java.Java
> PairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
>
>at scala.collection.Iterator$$ano
> n$11.next(Iterator.scala:328)
>
>at scala.collection.Iterator$$ano
> n$11.next(Iterator.scala:328)
>
>at org.apache.spark.rdd.PairRDDFu
> nctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$
> apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
>
>at org.apache.spark.rdd.PairRDDFu
> nctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(
> PairRDDFunctions.scala:1108)
>
>at org.apache.spark.rdd.PairRDDFu
> nctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(
> PairRDDFunctions.scala:1108)
>
>at org.apache.spark.util.Utils$.t
> ryWithSafeFinally(Utils.scala:1285)
>
>at org.apache.spark.rdd.PairRDDFu
> nctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(Pai
> rRDDFunctions.scala:1116)
>
>at org.apache.spark.rdd.PairRDDFu
> nctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(Pai
> rRDDFunctions.scala:1095)
>
>at org.apache.spark.scheduler.Res
> ultTask.runTask(ResultTask.scala:63)
>
>at org.apache.spark.scheduler.Task.run(Task.scala:70)
>
>at org.apache.spark.executor.Exec
> utor$TaskRunner.run(Executor.scala:213)
>
>at java.util.concurrent.ThreadPoo
> lExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>at java.util.concurrent.ThreadPoo
> lExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>at java.lang.Thread.run(Thread.java:745)
>
> Caused by: org.apache.thrift.transport.TTransportException
>
>at org.apache.thrift.transport.TI
> OStreamTransport.read(TIOStreamTransport.java:132)
>
>at org.apache.thrift.transport.TT
> ransport.readAll(TTransport.java:84)
>
>at org.apache.thrift.transport.TS
> aslTransport.receiveSaslMessage(TSaslTransport.java:182)
>
>at org.apache.thrift.transport.TS
> aslTransport.open(TSaslTransport.java:258)
>
>at org.apache.thrift.transport.TS
> aslClientTransport.open(TSaslClientTransport.java:37)
>
>at org.apache.hadoop.hive.thrift.
> client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>
>at org.apache.hadoop.hive.thrift.
> client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>
>at java.security.AccessController.doPrivileged(Native
> Method)
>
>at javax.security.auth.Subject.doAs(Subject.java:415)
>
>at org.apache.hadoop.security.Use
> rGroupInformation.doAs(UserGroupInformation.java:1657)
>
>at org.apache.hadoop.hive.thrift.
> client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>
>at org.apache.hive.jdbc.HiveConne
> ction.openTransport(HiveConnection.java:203)
>
>... 21 more
>
>
>
> In spark conf directory hive-site.xml has the following properties
>
>
>
> 
>
>
>
> 
>
>   hive.metastore.kerberos.keytab.file
>
>   /etc/security/keytabs/hive.service.keytab
>
> 
>
>
>
> 
>
>   hive.metastore.kerberos.principal
>
>   hive/_HOST@
>
> 
>
>
>
> 
>
>   hive.metastore.sasl.enabled
>
>   true
>
> 
>
>
>
> 
>
>   hive.metastore.uris
>
>   thrift://:9083
>
> 
>
>
>
> 
>
>   hive.server2.authentication
>
>   KERBEROS
>
> 
>
>
>
> 
>
>   

Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread Mich Talebzadeh
Is your Hive Thrift Server up and running on port 10001, i.e.
jdbc:hive2://<hive server2 host>:10001?

Do  the following

 netstat -alnp |grep 10001

and see whether it is actually running

HTH





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 16 September 2016 at 19:53,  wrote:

> Hi,
>
>
>
> I am trying to connect to Hive from Spark application in Kerborized
> cluster and get the following exception.  Spark version is 1.4.1 and Hive
> is 1.2.1. Outside of spark the connection goes through fine.
>
> Am I missing any configuration parameters?
>
>
>
> java.sql.SQLException: Could not open connection to
> jdbc:hive2://<hive server2 host>:10001/default;principal=hive/<hive server2 host>;ssl=false;transportMode=http;httpPath=cliservice: null
>
>at org.apache.hive.jdbc.HiveConnection.openTransport(
> HiveConnection.java:206)
>
>at org.apache.hive.jdbc.HiveConnection.(
> HiveConnection.java:178)
>
>at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.
> java:105)
>
>at java.sql.DriverManager.getConnection(DriverManager.
> java:571)
>
>at java.sql.DriverManager.getConnection(DriverManager.
> java:215)
>
>at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
>
>at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
>
>at org.apache.spark.api.java.JavaPairRDD$$anonfun$
> toScalaFunction$1.apply(JavaPairRDD.scala:1027)
>
>at scala.collection.Iterator$$anon$11.next(Iterator.scala:
> 328)
>
>at scala.collection.Iterator$$anon$11.next(Iterator.scala:
> 328)
>
>at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.
> apply$mcV$sp(PairRDDFunctions.scala:1109)
>
>at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.
> apply(PairRDDFunctions.scala:1108)
>
>at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.
> apply(PairRDDFunctions.scala:1108)
>
>at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.
> scala:1285)
>
>at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
>
>at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
>
>at org.apache.spark.scheduler.
> ResultTask.runTask(ResultTask.scala:63)
>
>at org.apache.spark.scheduler.Task.run(Task.scala:70)
>
>at org.apache.spark.executor.Executor$TaskRunner.run(
> Executor.scala:213)
>
>at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
>
>at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
>
>at java.lang.Thread.run(Thread.java:745)
>
> Caused by: org.apache.thrift.transport.TTransportException
>
>at org.apache.thrift.transport.TIOStreamTransport.read(
> TIOStreamTransport.java:132)
>
>at org.apache.thrift.transport.
> TTransport.readAll(TTransport.java:84)
>
>at org.apache.thrift.transport.TSaslTransport.
> receiveSaslMessage(TSaslTransport.java:182)
>
>at org.apache.thrift.transport.TSaslTransport.open(
> TSaslTransport.java:258)
>
>at org.apache.thrift.transport.TSaslClientTransport.open(
> TSaslClientTransport.java:37)
>
>at org.apache.hadoop.hive.thrift.
> client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>
>at org.apache.hadoop.hive.thrift.
> client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>
>at java.security.AccessController.doPrivileged(Native
> Method)
>
>at javax.security.auth.Subject.doAs(Subject.java:415)
>
>at org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1657)
>
>at org.apache.hadoop.hive.thrift.
> client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>
>at org.apache.hive.jdbc.HiveConnection.openTransport(
> HiveConnection.java:203)
>
>... 21 more
>
>
>
> In spark conf directory hive-site.xml has the following properties
>
>
>
> 
>
>
>
> 
>
>   

Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread anupama . gangadhar
Hi,

I am trying to connect to Hive from a Spark application in a Kerberized cluster
and get the following exception. The Spark version is 1.4.1 and Hive is 1.2.1.
Outside of Spark the connection goes through fine.
Am I missing any configuration parameters?

java.sql.SQLException: Could not open connection to jdbc:hive2://<hive server2 host>:10001/default;principal=hive/<hive server2 host>;ssl=false;transportMode=http;httpPath=cliservice: null
   at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:206)
   at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:178)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:571)
   at java.sql.DriverManager.getConnection(DriverManager.java:215)
   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
   at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException
   at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
   at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
   at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
   at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:258)
   at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
   at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
   at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
   at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
   at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203)
   ... 21 more

In the Spark conf directory, hive-site.xml has the following properties:

<configuration>

  <property>
    <name>hive.metastore.kerberos.keytab.file</name>
    <value>/etc/security/keytabs/hive.service.keytab</value>
  </property>

  <property>
    <name>hive.metastore.kerberos.principal</name>
    <value>hive/_HOST@</value>
  </property>

  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://:9083</value>
  </property>

  <property>
    <name>hive.server2.authentication</name>
    <value>KERBEROS</value>
  </property>

  <property>
    <name>hive.server2.authentication.kerberos.keytab</name>
    <value>/etc/security/keytabs/hive.service.keytab</value>
  </property>

  <property>
    <name>hive.server2.authentication.kerberos.principal</name>
    <value>hive/_HOST@</value>
  </property>

  <property>
    <name>hive.server2.authentication.spnego.keytab</name>
    <value>/etc/security/keytabs/spnego.service.keytab</value>
  </property>

  <property>
    <name>hive.server2.authentication.spnego.principal</name>
    <value>HTTP/_HOST@</value>
  </property>

</configuration>

--Thank you

If you are not the addressee, please inform us immediately that you have 
received this e-mail by mistake, and delete it. We thank you for your support.



Re: Unable to read JSON input in Spark (YARN Cluster)

2016-01-02 Thread Vijay Gharge
Hi

A few suggestions (a rough sketch of 1 and 2 follows this list):

1. Try the "memory and disk" storage level, to rule out the heap memory
error.
2. Try copying the JSON source file to the local filesystem and reading it
from there (i.e. without HDFS), to verify a minimum working setup.
3. The lzo-related error looks like a native library issue.
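
A minimal, hedged sketch of suggestions 1 and 2 for the spark-shell session
quoted below (the file:// path is a made-up staging location; the rest reuses
names already in the thread):

  import org.apache.spark.sql.SQLContext
  import org.apache.spark.storage.StorageLevel

  val sqlContext = new SQLContext(sc)

  // 2. read a copy of the JSON file from the local filesystem instead of HDFS
  val df = sqlContext.read.json("file:///tmp/expt_session_1231.json")

  // 1. let the cached data spill to disk instead of living only on the heap
  df.persist(StorageLevel.MEMORY_AND_DISK)
  println(df.count())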

On Saturday 2 January 2016, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> Version: Spark 1.5.2
>
> *Spark built with Hive*
> git clone git://github.com/apache/spark.git
> ./make-distribution.sh  --tgz -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0
> -Phive -Phive-thriftserver
>
>
> *Input:*
> -sh-4.1$ hadoop fs -du -h /user/dvasthimal/poc_success_spark/data/input
> 2.5 G  /user/dvasthimal/poc_success_spark/data/input/dw_bid_1231.seq
> 2.5 G
>  /user/dvasthimal/poc_success_spark/data/input/dw_mao_item_best_offr_1231.seq
> *5.9 G
>  /user/dvasthimal/poc_success_spark/data/input/expt_session_1231.json*
> -sh-4.1$
>
> *Spark Shell:*
>
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/apache/hadoop/lib/native/
> export SPARK_HOME=*/home/dvasthimal/spark-1.5.2-bin-2.4.0*
> export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.4.0-hadoop2.4.0.jar
> export HADOOP_CONF_DIR=/apache/hadoop/conf
> cd $SPARK_HOME
> export
> SPARK_CLASSPATH=/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-21.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-EBAY-21/share/hadoop/yarn/lib/guava-11.0.2.jar:/apache/hadoop-2.4.1-EBAY-21/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-21.jar:/home/dvasthimal/pig_jars/sojourner-common-0.1.3-hadoop2.jar:/home/dvasthimal/pig_jars/jackson-mapper-asl-1.8.5.jar:/home/dvasthimal/pig_jars/sojourner-common-0.1.3-hadoop2.jar:/home/dvasthimal/pig_jars/experimentation-reporting-common-0.0.1-SNAPSHOT.jar:/apache/hadoop/share/hadoop/common/lib/hadoop-ebay-2.4.1-EBAY-11.jar
> ./bin/spark-shell
>
> import org.apache.hadoop.io.Text
> import org.codehaus.jackson.map.ObjectMapper
> import com.ebay.hadoop.platform.model.SessionContainer
> import scala.collection.JavaConversions._
> import  com.ebay.globalenv.sojourner.TrackingProperty
> import  java.net.URLDecoder
> import  com.ebay.ep.reporting.common.util.TagsUtil
> import org.apache.hadoop.conf.Configuration
> import sqlContext.implicits._
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val df =
> sqlContext.read.json("/user/dvasthimal/poc_success_spark/data/input/expt_session_1231.json")
>
>
> *Errors:*
> 1.
>
> 16/01/01 18:36:12 INFO json.JSONRelation: Listing
> hdfs://apollo-phx-nn-ha/user/dvasthimal/poc_success_spark/data/input/expt_session_1231.json
> on driver
> 16/01/01 18:36:12 INFO storage.MemoryStore: ensureFreeSpace(268744) called
> with curMem=0, maxMem=556038881
> 16/01/01 18:36:12 INFO storage.MemoryStore: Block broadcast_0 stored as
> values in memory (estimated size 262.4 KB, free 530.0 MB)
> 16/01/01 18:36:12 INFO storage.MemoryStore: ensureFreeSpace(24028) called
> with curMem=268744, maxMem=556038881
> 16/01/01 18:36:12 INFO storage.MemoryStore: Block broadcast_0_piece0
> stored as bytes in memory (estimated size 23.5 KB, free 530.0 MB)
> 16/01/01 18:36:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on localhost:59605 (size: 23.5 KB, free: 530.3 MB)
> 16/01/01 18:36:12 INFO spark.SparkContext: Created broadcast 0 from json
> at <console>:36
> 16/01/01 18:36:12 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl
> library
> java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)
> at java.lang.Runtime.loadLibrary0(Runtime.java:849)
> at java.lang.System.loadLibrary(System.java:1088)
> at
> com.hadoop.compression.lzo.GPLNativeCodeLoader.<clinit>(GPLNativeCodeLoader.java:32)
> at com.hadoop.compression.lzo.LzoCodec.<clinit>(LzoCodec.java:71)
>
>
> 2.
> 16/01/01 18:36:44 INFO executor.Executor: Finished task 3.0 in stage 0.0
> (TID 3). 2256 bytes result sent to driver
> 16/01/01 18:36:44 INFO scheduler.TaskSetManager: Finished task 3.0 in
> stage 0.0 (TID 3) in 32082 ms on localhost (2/24)
> 16/01/01 18:36:54 ERROR executor.Executor: Exception in task 9.0 in stage
> 0.0 (TID 9)
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at org.apache.hadoop.io.Text.setCapacity(Text.java:266)
> at org.apache.hadoop.io.Text.append(Text.java:236)
> at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:243)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
> at
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
> at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
>
> 3.
> TypeRef(TypeSymbol(class $read extends Serializable))
>
> uncaught exception during compilation: java.lang.AssertionError
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 9
> in stage 0.0 failed 1 times, most recent failure: Lost task 9.0 in stage
> 0.0 (TID 9, localhost): 

Unable to read JSON input in Spark (YARN Cluster)

2016-01-01 Thread ๏̯͡๏
Version: Spark 1.5.2

*Spark built with Hive*
git clone git://github.com/apache/spark.git
./make-distribution.sh  --tgz -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0
-Phive -Phive-thriftserver


*Input:*
-sh-4.1$ hadoop fs -du -h /user/dvasthimal/poc_success_spark/data/input
2.5 G  /user/dvasthimal/poc_success_spark/data/input/dw_bid_1231.seq
2.5 G
 /user/dvasthimal/poc_success_spark/data/input/dw_mao_item_best_offr_1231.seq
*5.9 G
 /user/dvasthimal/poc_success_spark/data/input/expt_session_1231.json*
-sh-4.1$

*Spark Shell:*

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/apache/hadoop/lib/native/
export SPARK_HOME=*/home/dvasthimal/spark-1.5.2-bin-2.4.0*
export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.4.0-hadoop2.4.0.jar
export HADOOP_CONF_DIR=/apache/hadoop/conf
cd $SPARK_HOME
export
SPARK_CLASSPATH=/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-21.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-EBAY-21/share/hadoop/yarn/lib/guava-11.0.2.jar:/apache/hadoop-2.4.1-EBAY-21/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-21.jar:/home/dvasthimal/pig_jars/sojourner-common-0.1.3-hadoop2.jar:/home/dvasthimal/pig_jars/jackson-mapper-asl-1.8.5.jar:/home/dvasthimal/pig_jars/sojourner-common-0.1.3-hadoop2.jar:/home/dvasthimal/pig_jars/experimentation-reporting-common-0.0.1-SNAPSHOT.jar:/apache/hadoop/share/hadoop/common/lib/hadoop-ebay-2.4.1-EBAY-11.jar
./bin/spark-shell

import org.apache.hadoop.io.Text
import org.codehaus.jackson.map.ObjectMapper
import com.ebay.hadoop.platform.model.SessionContainer
import scala.collection.JavaConversions._
import  com.ebay.globalenv.sojourner.TrackingProperty
import  java.net.URLDecoder
import  com.ebay.ep.reporting.common.util.TagsUtil
import org.apache.hadoop.conf.Configuration
import sqlContext.implicits._

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df =
sqlContext.read.json("/user/dvasthimal/poc_success_spark/data/input/expt_session_1231.json")


*Errors:*
1.

16/01/01 18:36:12 INFO json.JSONRelation: Listing
hdfs://apollo-phx-nn-ha/user/dvasthimal/poc_success_spark/data/input/expt_session_1231.json
on driver
16/01/01 18:36:12 INFO storage.MemoryStore: ensureFreeSpace(268744) called
with curMem=0, maxMem=556038881
16/01/01 18:36:12 INFO storage.MemoryStore: Block broadcast_0 stored as
values in memory (estimated size 262.4 KB, free 530.0 MB)
16/01/01 18:36:12 INFO storage.MemoryStore: ensureFreeSpace(24028) called
with curMem=268744, maxMem=556038881
16/01/01 18:36:12 INFO storage.MemoryStore: Block broadcast_0_piece0 stored
as bytes in memory (estimated size 23.5 KB, free 530.0 MB)
16/01/01 18:36:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
in memory on localhost:59605 (size: 23.5 KB, free: 530.3 MB)
16/01/01 18:36:12 INFO spark.SparkContext: Created broadcast 0 from json at
<console>:36
16/01/01 18:36:12 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl
library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)
at java.lang.Runtime.loadLibrary0(Runtime.java:849)
at java.lang.System.loadLibrary(System.java:1088)
at
com.hadoop.compression.lzo.GPLNativeCodeLoader.<clinit>(GPLNativeCodeLoader.java:32)
at com.hadoop.compression.lzo.LzoCodec.<clinit>(LzoCodec.java:71)


2.
16/01/01 18:36:44 INFO executor.Executor: Finished task 3.0 in stage 0.0
(TID 3). 2256 bytes result sent to driver
16/01/01 18:36:44 INFO scheduler.TaskSetManager: Finished task 3.0 in stage
0.0 (TID 3) in 32082 ms on localhost (2/24)
16/01/01 18:36:54 ERROR executor.Executor: Exception in task 9.0 in stage
0.0 (TID 9)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at org.apache.hadoop.io.Text.setCapacity(Text.java:266)
at org.apache.hadoop.io.Text.append(Text.java:236)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:243)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)

3.
TypeRef(TypeSymbol(class $read extends Serializable))

uncaught exception during compilation: java.lang.AssertionError
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9
in stage 0.0 failed 1 times, most recent failure: Lost task 9.0 in stage
0.0 (TID 9, localhost): java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at org.apache.hadoop.io.Text.setCapacity(Text.java:266)
at org.apache.hadoop.io.Text.append(Text.java:236)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:243)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
4.
16/01/01 18:36:56 ERROR util.Utils: Uncaught exception in thread Executor
task launch worker-19
java.lang.NullPointerException
at
org.apache.spark.scheduler.Task$$anonfun$run$1.apply$mcV$sp(Task.scala:94)
at 

Re: Autoscaling of Spark YARN cluster

2015-12-14 Thread Deepak Sharma
An approach I can think of is using the Ambari Metrics Service (AMS).
Using these metrics, you can decide whether the cluster is low on
resources.
If yes, call the Ambari management API to add a node to the cluster.

Thanks
Deepak

On Mon, Dec 14, 2015 at 2:48 PM, cs user <acldstk...@gmail.com> wrote:

> Hi Mingyu,
>
> I'd be interested in hearing about anything else you find which might meet
> your needs for this.
>
> One way perhaps this could be done would be to use Ambari. Ambari comes
> with a nice api which you can use to add additional nodes into a cluster:
>
>
> https://github.com/apache/ambari/blob/trunk/ambari-server/docs/api/v1/index.md
>
> Once the node has been built, the ambari agent installed, you can then
> call back to the management node via the api, tell it what you want the new
> node to be, and it will connect, configure your new node and add it to the
> cluster.
>
> You could create a host group within the cluster blueprint with the
> minimal components you need to install to have it operate as a yarn node.
>
> As for the decision to scale, that is outside of the remit of Ambari. I
> guess you could look into using aws autoscaling or you could look into a
> product called scalr, which has an opensource version. We are using this to
> install an ambari cluster using chef to configure the nodes up until the
> point it hands over to Ambari.
>
> Scalr allows you to write custom scaling metrics which you could use to
> query the # of applications queued, # of resources available values and
> add nodes when required.
>
> Cheers!
>
> On Mon, Dec 14, 2015 at 8:57 AM, Mingyu Kim <m...@palantir.com> wrote:
>
>> Hi all,
>>
>> Has anyone tried out autoscaling Spark YARN cluster on a public cloud
>> (e.g. EC2) based on workload? To be clear, I’m interested in scaling the
>> cluster itself up and down by adding and removing YARN nodes based on the
>> cluster resource utilization (e.g. # of applications queued, # of resources
>> available), as opposed to scaling resources assigned to Spark applications,
>> which is natively supported by Spark’s dynamic resource scheduling. I’ve
>> found that Cloudbreak
>> <http://sequenceiq.com/cloudbreak-docs/latest/periscope/#how-it-works> has
>> a similar feature, but it’s in “technical preview”, and I didn’t find much
>> else from my search.
>>
>> This might be a general YARN question, but wanted to check if there’s a
>> solution popular in the Spark community. Any sharing of experience around
>> autoscaling will be helpful!
>>
>> Thanks,
>> Mingyu
>>
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Autoscaling of Spark YARN cluster

2015-12-14 Thread cs user
Hi Mingyu,

I'd be interested in hearing about anything else you find which might meet
your needs for this.

One way this could perhaps be done would be to use Ambari. Ambari comes
with a nice API which you can use to add additional nodes into a cluster:

https://github.com/apache/ambari/blob/trunk/ambari-server/docs/api/v1/index.md

Once the node has been built and the Ambari agent installed, you can then
call back to the management node via the API, tell it what you want the new
node to be, and it will connect, configure your new node and add it to the
cluster (a rough sketch of such a call is below).
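
As a rough illustration only (the Ambari host, cluster name, node name and
credentials below are placeholders, and I am assuming the v1 endpoints from
the API doc linked above; the new component still needs to be installed and
started with further calls), that call-back could look something like this:

import java.net.{HttpURLConnection, URL}
import java.util.Base64

def ambariPost(path: String): Int = {
  val conn = new URL("http://ambari-host:8080" + path)
    .openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setRequestProperty("X-Requested-By", "ambari")   // header Ambari expects on write calls
  val auth = Base64.getEncoder.encodeToString("admin:admin".getBytes("UTF-8"))
  conn.setRequestProperty("Authorization", "Basic " + auth)
  conn.getResponseCode                                  // fires the request, returns the HTTP status
}

// register the freshly built host with the cluster, then attach a NodeManager to it
ambariPost("/api/v1/clusters/mycluster/hosts/new-node.example.com")
ambariPost("/api/v1/clusters/mycluster/hosts/new-node.example.com/host_components/NODEMANAGER")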

You could create a host group within the cluster blueprint with the minimal
components you need to install to have it operate as a yarn node.

As for the decision to scale, that is outside of the remit of Ambari. I
guess you could look into using AWS autoscaling, or you could look into a
product called Scalr, which has an open-source version. We are using this to
install an Ambari cluster, using Chef to configure the nodes up until the
point it hands over to Ambari.

Scalr allows you to write custom scaling metrics, which you could use to
query the # of applications queued and the # of resources available, and
add nodes when required (a rough sketch of such a check follows).
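
A hedged sketch of such a check against the YARN ResourceManager REST API
(the rm-host and the 8192 MB threshold are placeholders; the JSON is parsed
crudely with a regex just to keep the example self-contained):

import scala.io.Source

// cluster-wide metrics exposed by the ResourceManager
val body = Source.fromURL("http://rm-host:8088/ws/v1/cluster/metrics").mkString

// pull a single integer field out of the JSON payload
def intField(name: String): Long =
  ("\"" + name + "\"\\s*:\\s*(\\d+)").r
    .findFirstMatchIn(body).map(_.group(1).toLong).getOrElse(0L)

val appsPending = intField("appsPending")
val availableMB = intField("availableMB")

// example rule: applications are queued and little memory is left -> ask for a node
if (appsPending > 0 && availableMB < 8192) {
  println("scale up: add a YARN node")
}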

Cheers!

On Mon, Dec 14, 2015 at 8:57 AM, Mingyu Kim <m...@palantir.com> wrote:

> Hi all,
>
> Has anyone tried out autoscaling Spark YARN cluster on a public cloud
> (e.g. EC2) based on workload? To be clear, I’m interested in scaling the
> cluster itself up and down by adding and removing YARN nodes based on the
> cluster resource utilization (e.g. # of applications queued, # of resources
> available), as opposed to scaling resources assigned to Spark applications,
> which is natively supported by Spark’s dynamic resource scheduling. I’ve
> found that Cloudbreak
> <http://sequenceiq.com/cloudbreak-docs/latest/periscope/#how-it-works> has
> a similar feature, but it’s in “technical preview”, and I didn’t find much
> else from my search.
>
> This might be a general YARN question, but wanted to check if there’s a
> solution popular in the Spark community. Any sharing of experience around
> autoscaling will be helpful!
>
> Thanks,
> Mingyu
>


Re: Autoscaling of Spark YARN cluster

2015-12-14 Thread Mingyu Kim
Cool. Using Ambari to monitor and scale up/down the cluster sounds
promising. Thanks for the pointer!

Mingyu

From:  Deepak Sharma <deepakmc...@gmail.com>
Date:  Monday, December 14, 2015 at 1:53 AM
To:  cs user <acldstk...@gmail.com>
Cc:  Mingyu Kim <m...@palantir.com>, "user@spark.apache.org"
<user@spark.apache.org>
Subject:  Re: Autoscaling of Spark YARN cluster

An approach I can think of  is using Ambari Metrics Service(AMS)
Using these metrics , you can decide upon if the cluster is low in
resources.
If yes, call the Ambari management API to add the node to the cluster.

Thanks
Deepak

On Mon, Dec 14, 2015 at 2:48 PM, cs user <acldstk...@gmail.com> wrote:
> Hi Mingyu, 
> 
> I'd be interested in hearing about anything else you find which might meet
> your needs for this.
> 
> One way perhaps this could be done would be to use Ambari. Ambari comes with a
> nice api which you can use to add additional nodes into a cluster:
> 
> https://github.com/apache/ambari/blob/trunk/ambari-server/docs/api/v1/index.md
> 
> Once the node has been built, the ambari agent installed, you can then call
> back to the management node via the api, tell it what you want the new node to
> be, and it will connect, configure your new node and add it to the cluster.
> 
> You could create a host group within the cluster blueprint with the minimal
> components you need to install to have it operate as a yarn node.
> 
> As for the decision to scale, that is outside of the remit of Ambari. I guess
> you could look into using aws autoscaling or you could look into a product
> called scalr, which has an opensource version. We are using this to install an
> ambari cluster using chef to configure the nodes up until the point it hands
> over to Ambari. 
> 
> Scalr allows you to write custom scaling metrics which you could use to query
> the # of applications queued, # of resources available values and add nodes
> when required. 
> 
> Cheers!
> 
> On Mon, Dec 14, 2015 at 8:57 AM, Mingyu Kim <m...@palantir.com> wrote:
>> Hi all,
>> 
>> Has anyone tried out autoscaling Spark YARN cluster on a public cloud (e.g.
>> EC2) based on workload? To be clear, I’m interested in scaling the cluster
>> itself up and down by adding and removing YARN nodes based on the cluster
>> resource utilization (e.g. # of applications queued, # of resources
>> available), as opposed to scaling resources assigned to Spark applications,
>> which is natively supported by Spark’s dynamic resource scheduling. I’ve
>> found that Cloudbreak
>> <http://sequenceiq.com/cloudbreak-docs/latest/periscope/#how-it-works> has
>> a similar feature, but it’s in “technical preview”, and I didn’t find much
>> else from my search.
>> 
>> This might be a general YARN question, but wanted to check if there¹s a
>> solution popular in the Spark community. Any sharing of experience around
>> autoscaling will be helpful!
>> 
>> Thanks,
>> Mingyu
> 



-- 
Thanks
Deepak
www.bigdatabig.com 
www.keosha.net 






Autoscaling of Spark YARN cluster

2015-12-14 Thread Mingyu Kim
Hi all,

Has anyone tried out autoscaling Spark YARN cluster on a public cloud (e.g.
EC2) based on workload? To be clear, I’m interested in scaling the cluster
itself up and down by adding and removing YARN nodes based on the cluster
resource utilization (e.g. # of applications queued, # of resources
available), as opposed to scaling resources assigned to Spark applications,
which is natively supported by Spark’s dynamic resource scheduling. I’ve
found that Cloudbreak
<http://sequenceiq.com/cloudbreak-docs/latest/periscope/#how-it-works>  has
a similar feature, but it’s in “technical preview”, and I didn’t find much
else from my search.

This might be a general YARN question, but wanted to check if there¹s a
solution popular in the Spark community. Any sharing of experience around
autoscaling will be helpful!

Thanks,
Mingyu






spark yarn-cluster job failing in batch processing

2015-04-23 Thread sachin Singh
Hi All,
I am trying to execute batch processing in yarn-cluster mode, i.e. I have
many SQL insert queries; based on the argument provided, the job will fetch
the queries, create the context and schema RDD, and insert into Hive tables.

Please note: in standalone mode it is working, and in cluster mode it works
if I configure only one query. I have also configured
yarn.nodemanager.delete.debug-sec = 600

I am using below command-

spark-submit --jars
./analiticlibs/utils-common-1.0.0.jar,./analiticlibs/mysql-connector-java-5.1.17.jar,./analiticlibs/log4j-1.2.17.jar
--files datasource.properties,log4j.properties,hive-site.xml --deploy-mode
cluster --master yarn --num-executors 1 --driver-memory 2g
--driver-java-options -XX:MaxPermSize=1G --executor-memory 1g
--executor-cores 1 --class com.java.analitics.jobs.StandaloneAggregationJob
sparkanalitics-1.0.0.jar daily_agg 2015-04-21


Exception from Container log-

Exception in thread Driver java.lang.ArrayIndexOutOfBoundsException: 2
at
com.java.analitics.jobs.StandaloneAggregationJob.main(StandaloneAggregationJob.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:427)
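
One hedged reading of this trace: ArrayIndexOutOfBoundsException: 2 at
StandaloneAggregationJob.main suggests line 62 reads a third positional
argument, while the spark-submit line above passes only two ("daily_agg" and
"2015-04-21"). A minimal Scala sketch, not the actual job code, of guarding
the arguments so the driver fails with a clear message instead:

object AggregationJobArgsSketch {
  def main(args: Array[String]): Unit = {
    // in yarn-cluster mode the driver sees exactly the program arguments after the app jar
    if (args.length < 3) {
      System.err.println(s"expected 3 arguments, got ${args.length}: ${args.mkString(" ")}")
      sys.exit(1)
    }
    val aggName = args(0)   // e.g. daily_agg
    val runDate = args(1)   // e.g. 2015-04-21
    val third   = args(2)   // whatever the job reads at StandaloneAggregationJob.java:62
    // ... create the SparkContext / HiveContext and run the insert queries here
  }
}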

exception in our exception log file-

 diagnostics: Application application_1429800386537_0001 failed 2 times due
to AM Container for appattempt_1429800386537_0001_02 exited with 
exitCode: 15 due to: Exception from container-launch.
Container id: container_1429800386537_0001_02_01
Exit code: 15
Stack trace: ExitCodeException exitCode=15: 
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 15
.Failing this attempt.. Failing the application.
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: root.hdfs
 start time: 1429800525569
 final status: FAILED
 tracking URL:
http://tejas.alcatel.com:8088/cluster/app/application_1429800386537_0001
 user: hdfs
2015-04-23 20:19:27 DEBUG Client - stopping client from cache:
org.apache.hadoop.ipc.Client@12f5f40b
2015-04-23 20:19:27 DEBUG Utils - Shutdown hook called

need urgent support,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-yarn-cluster-job-failing-in-batch-processing-tp22626.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark yarn cluster Application Master not running yarn container

2014-11-25 Thread firemonk9

I am running a 3-node (32 cores, 60 GB) YARN cluster for Spark jobs.

1) Below are my Yarn memory settings

yarn.nodemanager.resource.memory-mb = 52224
yarn.scheduler.minimum-allocation-mb = 40960
yarn.scheduler.maximum-allocation-mb = 52224
Apache Spark Memory Settings

export SPARK_EXECUTOR_MEMORY=40G
export SPARK_EXECUTOR_CORES=27
export SPARK_EXECUTOR_INSTANCES=3
With the above settings I am hoping to see my job run on two nodes; however,
the job is not running on the node where the Application Master is running.

2) Yarn memory settings

yarn.nodemanager.resource.memory-mb = 52224
yarn.scheduler.minimum-allocation-mb = 20480
yarn.scheduler.maximum-allocation-mb = 52224
Apache Spark Memory Settings

export SPARK_EXECUTOR_MEMORY=18G
export SPARK_EXECUTOR_CORES=13
export SPARK_EXECUTOR_INSTANCES=4
I would like to know how I can run the job on both nodes with the first
memory settings. Thanks for the help.
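
One hedged reading of the two configurations, assuming YARN rounds each
container request up to yarn.scheduler.minimum-allocation-mb: the Application
Master occupies a container of its own, so with the first settings that AM
container is at least 40960 MB, leaving 52224 - 40960 = 11264 MB on the AM's
node, which is too little for a ~41 GB executor container (40 GB plus
overhead); executors can therefore only be placed on the other nodes. With
the second settings, an AM container of ~20480 MB plus an executor container
of ~20480 MB (18 GB plus overhead, rounded up) totals 40960 MB, which fits
within 52224 MB, so the AM node can host an executor as well.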



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-yarn-cluster-Application-Master-not-running-yarn-container-tp19761.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org