command to get list of all persisted RDDs in the Spark 2.0 Scala shell

2017-06-01 Thread nancy henry
Hi Team,

Please let me know how to get a list of all persisted RDDs in the Spark 2.0
shell.


Regards,
Nancy
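
For reference, SparkContext exposes a getPersistentRDDs method; a minimal sketch
of using it from the 2.0 shell (the printed format is just illustrative):

// The shell already provides `sc`; getPersistentRDDs maps each RDD id to the
// persisted RDD itself.
val cached = sc.getPersistentRDDs
cached.foreach { case (id, rdd) =>
  println(s"id=$id name=${rdd.name} storageLevel=${rdd.getStorageLevel}")
}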


access error while trying to run distcp from source cluster

2017-05-25 Thread nancy henry
Hi Team,

I am trying to copy data from cluster A to cluster B, using the same user on both.

I am running the distcp command on source cluster A, but I am getting the error
below:

17/05/25 07:24:08 INFO mapreduce.Job: Running job: job_1492549627402_344485
17/05/25 07:24:17 INFO mapreduce.Job: Job job_1492549627402_344485 running in uber mode : false
17/05/25 07:24:17 INFO mapreduce.Job:  map 0% reduce 0%
17/05/25 07:24:26 INFO mapreduce.Job: Task Id : attempt_1492549627402_344485_m_00_0, Status : FAILED
Error: org.apache.hadoop.security.AccessControlException: User abcde (user id 50006054) has been denied access to create distcptest2
    at com.mapr.fs.MapRFileSystem.makeDir(MapRFileSystem.java:1282)
    at com.mapr.fs.MapRFileSystem.mkdirs(MapRFileSystem.java:1302)
    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1913)
    at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:272)
    at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:51)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:796)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:346)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

This is the full error.


Hive ::: how to select where conditions dynamically using CASE

2017-04-12 Thread nancy henry
Hi ,

Let's say I have an employee table:

testtab1.empid  testtab1.empname  testtab1.joindate  testtab1.bonus
1               sirisha           15-06-2016         60
2               Arun              15-10-2016         20
3               divya             17-06-2016         80
4               rahul             16-01-2016         30
5               kokila            17-02-2016         90
6               utiya             16-09-2016         60
7               satish            17-10-2016         10
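
The body of this message stops at the table above. Reading the subject as
"choose the effective WHERE condition at run time with a CASE expression", a
hedged sketch along those lines (the filterMode flag and the specific conditions
are hypothetical), run through the HiveContext used in the other threads:

// Hypothetical flag deciding which filter applies; in practice it could come
// from args or a hiveconf variable.
val filterMode = "bonus"

val rows = hiveSqlContext.sql(s"""
  SELECT empid, empname, joindate, bonus
  FROM testtab1
  WHERE CASE '$filterMode'
          WHEN 'bonus' THEN bonus >= 50
          WHEN 'name'  THEN empname = 'sirisha'
          ELSE TRUE
        END
""")
rows.show()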


keep or remove sc.stop() coz of RpcEnv already stopped error

2017-03-13 Thread nancy henry
Hi Team,

We are getting the error below if we put sc.stop() in the application.

Can we remove it from the application? I have read that if you do not explicitly
stop the context with sc.stop(), the YARN application will not get registered in
the history server. So what should we do?


WARN Dispatcher: Message RemoteProcessDisconnected dropped.
java.lang.IllegalStateException: RpcEnv already stopped.
    at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
    at org.apache.spark.rpc.netty.Dispatcher.postToAll(Dispatcher.scala:109)
    at org.apache.spark.rpc.netty.NettyRpcHandler.connectionTerminated(NettyRpcEnv.scala:630)
    at org.apache.spark.network.server.TransportRequestHandler.channelUnregistered(TransportRequestHandler.java:94)
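
For what it's worth, a common arrangement (a sketch only, not verified against
this particular failure) is to keep sc.stop(), but call it exactly once, after
all Spark work is done, typically in a finally block so it also runs on error
paths and the application is still recorded in the history server:

import org.apache.spark.{SparkConf, SparkContext}

object StopOnceExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stop-once-example"))
    try {
      // ... all Spark work happens here; nothing touches sc after this block ...
      println(sc.parallelize(1 to 10).sum())
    } finally {
      sc.stop()   // single, final stop; nothing uses the RpcEnv after it shuts down
    }
  }
}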


Re: spark-sql use case beginner question

2017-03-08 Thread nancy henry
Okay, what is the difference between setting hive.execution.engine=spark
and running the script through hivecontext.sql?



On Mar 9, 2017 8:52 AM, "ayan guha" <guha.a...@gmail.com> wrote:

> Hi
>
> Subject to your version of Hive & Spark, you may want to set
> hive.execution.engine=spark as a beeline command-line parameter, assuming you
> are running your Hive scripts through the beeline command line (which is the
> suggested practice for security purposes).
>
>
>
> On Thu, Mar 9, 2017 at 2:09 PM, nancy henry <nancyhenry6...@gmail.com>
> wrote:
>
>>
>> Hi Team,
>>
>> Basically we have all our data in Hive tables, and until now we have been
>> processing it with Hive on MR. Now that we have HiveContext, which can run
>> Hive queries on Spark, we are making all these complex Hive scripts run
>> through a hivecontext.sql(sc.textFile(hivescript)) kind of approach, i.e.
>> basically running the Hive queries on Spark without writing anything in
>> Scala yet. Even so, just making the Hive queries run on Spark already shows
>> a big difference in run time compared to running them on MR.
>>
>> So, since we already have the Hive scripts, should we simply run those
>> complex scripts through hc.sql, given that hc.sql is able to do it?
>>
>> Or is that not best practice? Even though Spark can do it, is it still
>> better to load all those individual Hive tables into Spark, make RDDs, and
>> write Scala code to reproduce the same functionality we have in Hive?
>>
>> It is becoming difficult for us to choose whether to leave it to hc.sql to
>> run the complex scripts as well, or to code it in Scala. Will the manual
>> effort be worth it in terms of performance?
>>
>> An example of our sample scripts:
>>
>> use db;
>> CREATE TEMPORARY FUNCTION tempfunction1 AS 'com.fgh.jkl.TestFunction';
>>
>> -- create the destination table in Hive, then:
>> INSERT OVERWRITE TABLE desttable
>> SELECT (big complex transformations and usage of Hive UDFs)
>> FROM table1, table2, table3
>> JOIN table4 ON (some complex condition)
>> JOIN table7 ON (another complex condition)
>> WHERE (complex filtering);
>>
>> So please help: what would be the best approach, and why should I not give
>> the entire script to HiveContext to make its own RDDs and run on Spark, if
>> we are able to do it?
>>
>> All the examples I see online only show hc.sql("select * from table1") and
>> nothing more complex than that.
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


spark-sql use case beginner question

2017-03-08 Thread nancy henry
Hi Team,

Basically we have all our data in Hive tables, and until now we have been
processing it with Hive on MR. Now that we have HiveContext, which can run Hive
queries on Spark, we are making all these complex Hive scripts run through a
hivecontext.sql(sc.textFile(hivescript)) kind of approach, i.e. basically
running the Hive queries on Spark without writing anything in Scala yet. Even
so, just making the Hive queries run on Spark already shows a big difference in
run time compared to running them on MR.

So, since we already have the Hive scripts, should we simply run those complex
scripts through hc.sql, given that hc.sql is able to do it?

Or is that not best practice? Even though Spark can do it, is it still better to
load all those individual Hive tables into Spark, make RDDs, and write Scala
code to reproduce the same functionality we have in Hive?

It is becoming difficult for us to choose whether to leave it to hc.sql to run
the complex scripts as well, or to code it in Scala. Will the manual effort be
worth it in terms of performance?

An example of our sample scripts:

use db;
CREATE TEMPORARY FUNCTION tempfunction1 AS 'com.fgh.jkl.TestFunction';

-- create the destination table in Hive, then:
INSERT OVERWRITE TABLE desttable
SELECT (big complex transformations and usage of Hive UDFs)
FROM table1, table2, table3
JOIN table4 ON (some complex condition)
JOIN table7 ON (another complex condition)
WHERE (complex filtering);

So please help: what would be the best approach, and why should I not give the
entire script to HiveContext to make its own RDDs and run on Spark, if we are
able to do it?

All the examples I see online only show hc.sql("select * from table1") and
nothing more complex than that.
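
As a point of reference, a minimal sketch (the object name and the
statement-splitting convention are assumptions, not the author's code) of
feeding an existing semicolon-separated Hive script to HiveContext along the
lines described above:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object RunHiveScript {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("run-hive-script"))
    val hc = new HiveContext(sc)

    // args(0) points at the script; statements are separated by ';' and
    // comment lines start with "--".
    val statements = sc.textFile(args(0))
      .collect()
      .filterNot(_.trim.startsWith("--"))
      .mkString(" ")
      .split(";")
      .map(_.trim)
      .filter(_.nonEmpty)

    statements.foreach(hc.sql)   // each statement runs on Spark via HiveContext

    sc.stop()
  }
}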


made spark job throw exception but it still shows finished/succeeded status in yarn

2017-03-07 Thread nancy henry
Hi Team,

I wrote the code below to throw an exception. How do I make it throw the
exception and send the job to FAILED status in YARN when some condition is hit,
while still closing the Spark context and releasing resources?


import org.apache.commons.lang.StringUtils   // or org.apache.commons.lang3, depending on the build
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import scala.util.Try

object Demo {
  def main(args: Array[String]) = {

    var a = 0; var c = 0

    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)

    val hiveSqlContext: HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

    for (a <- 0 to args.length - 1) {
      // Read the script, drop comment lines, and split it into ';'-separated statements.
      val query = sc.textFile(args(a)).collect.filter(query => !query.contains("--")).mkString(" ")
      var queryarray = query.split(";")
      var b = query.split(";").length

      var querystatuscheck = true
      for (c <- 0 to b - 1) {

        if (querystatuscheck) {

          if (!StringUtils.isBlank(queryarray(c))) {

            val querystatus = Try { hiveSqlContext.sql(queryarray(c)) }

            var b = c + 1
            querystatuscheck = querystatus.isSuccess
            System.out.println("Your " + b + " query status is " + querystatus)
            System.out.println("querystatuscheck toString is " + querystatuscheck.toString())

            querystatuscheck.toString() match {
              case "false" => {
                throw querystatus.failed.get
                // Unreachable: the throw above exits before these two lines run,
                // so the context is never stopped on the failure path.
                System.out.println("case true executed")
                sc.stop()
              }
              case _ => {
                // Note: this stops the context after the first successful statement.
                sc.stop()
                System.out.println("case default executed")
              }
            }
          }
        }
      }

      System.out.println("Okay")
    }
  }
}
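
One way to get the behaviour the post asks for (a sketch, not the author's code;
whether YARN ultimately reports FAILED also depends on deploy mode and Spark
version) is to run the statements inside a try, stop the context in a finally,
and let the failure escape main afterwards:

import scala.util.{Failure, Success, Try}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object DemoFailFast {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf())
    val hc = new HiveContext(sc)
    try {
      val statements = args.flatMap { f =>
        sc.textFile(f).collect().mkString(" ").split(";").map(_.trim).filter(_.nonEmpty)
      }
      statements.foreach { stmt =>
        Try(hc.sql(stmt)) match {
          case Success(_)  => println(s"statement ok: $stmt")
          case Failure(ex) => throw ex   // fail fast on the first bad statement
        }
      }
    } finally {
      sc.stop()   // always runs, so resources are released even on failure
    }
    // The rethrown exception still propagates out of main after the finally block,
    // which is what lets YARN (in cluster mode) mark the application FAILED.
  }
}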


care to share latest pom for spark scala applications in eclipse?

2017-02-24 Thread nancy henry
Hi Guys,

Could one of you who is successfully building Maven packages in the Eclipse
Scala IDE please share your pom.xml?


quick question: best to use cluster mode or client mode for production?

2017-02-23 Thread nancy henry
Hi Team,

I have a set of hc.sql("hivequery")-style scripts which I am currently running
in spark-shell.

How should I schedule them in production: by running spark-shell -i script.scala,
or by packaging them into a jar in Eclipse and using spark-submit with
--deploy-mode cluster?

Which is advisable?


please send me pom.xml for scala 2.10

2017-02-21 Thread nancy henry
Hi,

Please send me a copy of a pom.xml, as I am getting a "no sources to compile"
error no matter how I set the source directory in my pom.xml.

It is not picking up the source files from my src/main/scala.

So please send me one

(it should include the Hive context and Spark core dependencies).


how to give hdfs file path as argument to spark-submit

2017-02-17 Thread nancy henry
Hi All,


import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object Step1 {
  def main(args: Array[String]) = {

    val sparkConf = new SparkConf().setAppName("my-app")
    val sc = new SparkContext(sparkConf)

    val hiveSqlContext: HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

    // args(0) is the path to the Hive script; note that scala.io.Source.fromFile
    // reads from the driver's local file system.
    hiveSqlContext.sql(scala.io.Source.fromFile(args(0)).mkString)

    System.out.println("Okay")
  }
}



This is my Spark program, and my Hive script path is passed as args(0):

$SPARK_HOME/bin/./spark-submit --class com.spark.test.Step1 --master yarn
--deploy-mode cluster com.spark.test-0.1-SNAPSHOT.jar
 hdfs://spirui-d86-f03-06:9229/samples/testsubquery.hql

But a FileNotFoundException is thrown. Why?

Where does it expect the file to be: on the local file system or in HDFS?
If in HDFS, how should I give its path?

And is there any better way than this to read a query from a file in HDFS
for the Hive context?
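
For reference, a small sketch of one alternative, consistent with the
sc.textFile approach used in the other threads: let Spark resolve the path, so
an hdfs:// URI passed as args(0) is read from the cluster rather than from the
driver's local disk (this assumes the file holds a single statement, as in
Step1 above):

// Read the script through Spark instead of scala.io.Source, so the URI scheme
// (hdfs://, file://, ...) decides where the file is read from.
val script = sc.textFile(args(0)).collect().mkString("\n")
hiveSqlContext.sql(script)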


scala.io.Source.fromFile protocol for hadoop

2017-02-16 Thread nancy henry
Hello,



hiveSqlContext.sql(scala.io.Source.fromFile(args(0).toString()).mkString).collect()

I have a file on my local system, and I am running spark-submit with
--deploy-mode cluster on Hadoop.

So should args(0) point to a file on the Hadoop cluster or on the local machine?

Should the protocol be file:///? And what is the protocol for Hadoop?
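
A complementary sketch (the paths are hypothetical) using the Hadoop FileSystem
API, which honours the URI scheme: hdfs:// reads from the cluster, file:// from
the local disk of the machine running the driver:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// e.g. "hdfs:///samples/script.hql" or "file:///tmp/script.hql" (hypothetical paths)
val path   = new Path(args(0))
val fs     = path.getFileSystem(new Configuration())   // chosen from the URI scheme
val script = Source.fromInputStream(fs.open(path)).mkString
hiveSqlContext.sql(script)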


Re: Lost executor 4 Container killed by YARN for exceeding memory limits.

2017-02-13 Thread nancy henry
Hi,


How do I set these parameters while launching spark-shell:

spark.shuffle.memoryFraction=0.5

and

spark.yarn.executor.memoryOverhead=1024

I tried passing them like this, but I am getting the warning below:

spark-shell --master yarn --deploy-mode client --driver-memory 16G
--num-executors 500 executor-cores 4 --executor-memory 7G --conf
spark.shuffle.memoryFraction=0.5 --conf
spark.yarn.executor.memoryOverhead=1024

Warning
17/02/13 22:42:02 WARN SparkConf: Detected deprecated memory fraction
settings: [spark.shuffle.memoryFraction]. As of Spark 1.6, execution and
storage memory management are unified. All memory fractions used in the old
model are now deprecated and no longer read. If you wish to use the old
memory management, you may explicitly enable `spark.memory.useLegacyMode`
(not recommended).



On Mon, Feb 13, 2017 at 11:23 PM, Thakrar, Jayesh <
jthak...@conversantmedia.com> wrote:

> Nancy,
>
>
>
> As your log output indicated, your executor exceeded its 11 GB memory limit.
>
> While you might want to address the root cause/data volume as suggested by
> Jon, you can do an immediate test by changing your command as follows
>
>
>
> spark-shell --master yarn --deploy-mode client --driver-memory 16G
> --num-executors 500 executor-cores 7 --executor-memory 14G
>
>
>
> This essentially increases your executor memory from 11 GB to 14 GB.
>
> Note that it will result in a potentially large footprint - from 500x11 to
> 500x14 GB.
>
> You may want to consult with your DevOps/Operations/Spark Admin team first.
>
>
>
> From: Jon Gregg <coble...@gmail.com>
> Date: Monday, February 13, 2017 at 8:58 AM
> To: nancy henry <nancyhenry6...@gmail.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: Lost executor 4 Container killed by YARN for exceeding
> memory limits.
>
>
>
> Setting Spark's memoryOverhead configuration variable is recommended in
> your logs, and has helped me with these issues in the past.  Search for
> "memoryOverhead" here:  http://spark.apache.org/docs/
> latest/running-on-yarn.html
>
>
>
> That said, you're running on a huge cluster as it is.  If it's possible to
> filter your tables down before the join (keeping just the rows/columns you
> need), that may be a better solution.
>
>
>
> Jon
>
>
>
> On Mon, Feb 13, 2017 at 5:27 AM, nancy henry <nancyhenry6...@gmail.com>
> wrote:
>
> Hi All,
>
>
>
> I am getting the error below while trying to join 3 Hive tables in ORC format
> (from 5 tables of about 10 GB) through the Hive context in Spark:
>
>
>
> Container killed by YARN for exceeding memory limits. 11.1 GB of 11 GB
> physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
>
> 17/02/13 02:21:19 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
> Container killed by YARN for exceeding memory limits. 11.1 GB of 11 GB
> physical memory used
>
>
>
>
>
> I am using the memory parameters below to launch the shell. What else could I
> increase among these parameters, or do I need to change any configuration
> settings? Please let me know.
>
>
>
> spark-shell --master yarn --deploy-mode client --driver-memory 16G
> --num-executors 500 executor-cores 7 --executor-memory 10G
>
>
>
>
>


Lost executor 4 Container killed by YARN for exceeding memory limits.

2017-02-13 Thread nancy henry
Hi All,

I am getting the error below while trying to join 3 Hive tables in ORC format
(from 5 tables of about 10 GB) through the Hive context in Spark:

Container killed by YARN for exceeding memory limits. 11.1 GB of 11 GB
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
17/02/13 02:21:19 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
Container killed by YARN for exceeding memory limits. 11.1 GB of 11 GB
physical memory used


I am using the memory parameters below to launch the shell. What else could I
increase among these parameters, or do I need to change any configuration
settings? Please let me know.

spark-shell --master yarn --deploy-mode client --driver-memory 16G
--num-executors 500 executor-cores 7 --executor-memory 10G
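
Alongside raising spark.yarn.executor.memoryOverhead (see the reply above), one
concrete form of Jon's suggestion to cut the data down before the join; table,
column, and filter names here are hypothetical:

// Project and filter each table first, so less data is shuffled for the join.
val t1 = hiveSqlContext.sql("SELECT join_key, col_a FROM orc_table1 WHERE col_a IS NOT NULL")
val t2 = hiveSqlContext.sql("SELECT join_key, col_b FROM orc_table2")
val t3 = hiveSqlContext.sql("SELECT join_key, col_c FROM orc_table3")

val joined = t1.join(t2, "join_key").join(t3, "join_key")
joined.write.saveAsTable("joined_result")   // or registerTempTable(...) to keep querying in SQL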


Is it better to use Java, Python, or Scala for Spark with big data sets

2017-02-09 Thread nancy henry
Hi All,

Is it better to use Java, Python, or Scala for Spark coding?

My work is mainly with file data in CSV format, on which I have to do some rule
checking and rule aggregation,

and then put the final filtered data back into Oracle so that real-time apps can
use it.
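
Whatever the language choice, a rough Scala sketch of the pipeline described
above (paths, rules, and the Oracle connection details are all hypothetical; it
assumes the Spark 2.x DataFrame API and an Oracle JDBC driver on the classpath):

import java.util.Properties
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("csv-rules-to-oracle").getOrCreate()

// Read the CSV input (path and options are placeholders).
val raw = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("hdfs:///data/input/*.csv")

// Placeholder rule check and rule aggregation.
val checked    = raw.filter(col("amount") > 0)
val aggregated = checked.groupBy("account_id").agg(sum("amount").as("total_amount"))

// Write the filtered result back to Oracle over JDBC (connection details are placeholders).
val props = new Properties()
props.setProperty("user", "app_user")
props.setProperty("password", "app_password")
props.setProperty("driver", "oracle.jdbc.OracleDriver")
aggregated.write.mode("overwrite")
  .jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL", "FILTERED_RESULTS", props)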