How does Spark SQL query optimisation work if we are using the .rdd action?

2016-08-13 Thread mayur bhole
HI All,

Let's say we have

val df = bigTableA.join(bigTableB,bigTableA("A")===bigTableB("A"),"left")
val rddFromDF = df.rdd
println(rddFromDF.count)

My understanding is that Spark will convert all DataFrame operations
before "rddFromDF.count" into equivalent RDD operations, since we are not
performing any action on the DataFrame directly. In that case, Spark will not
be using the optimization engine. Is my assumption right? Please point me to
the right resources.
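
A quick way to inspect this in the spark-shell would be to compare the plan Spark
builds for the DataFrame with the lineage of the RDD that comes back, e.g. (reusing
the df above):

// explain(true) prints the parsed, analyzed, optimized and physical plans
df.explain(true)
val rddFromDF = df.rdd
// lineage of the RDD handed back by .rdd
println(rddFromDF.toDebugString)
println(rddFromDF.count)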

[Note: I have posted the same question on SO:
http://stackoverflow.com/questions/38889812/how-spark-dataframe-optimization-engine-works-with-dag
]

Thanks


Re: Why can't I use a broadcast var defined in a global object?

2016-08-13 Thread Ted Yu
Can you (or David) resend David's reply?

I don't see the reply in this thread. 

Thanks

> On Aug 13, 2016, at 8:39 PM, yaochunnan  wrote:
> 
> Hi David, 
> Your answers have solved my problem! Detailed and accurate. Thank you very
> much!
> 
> 
> 
> 
> 




Re: Why can't I use a broadcast var defined in a global object?

2016-08-13 Thread yaochunnan
Hi David, 
Your answers have solved my problem! Detailed and accurate. Thank you very
much!







Re: Does Spark SQL support indexes?

2016-08-13 Thread Chanh Le
Hi Taotao,

Spark SQL doesn't support indexes :).
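
A common workaround for the "small piece of a very large table" case is partition
pruning: a rough sketch below (Parquet files; the column name and paths are made up,
and it assumes a 1.x sqlContext).

import org.apache.spark.sql.functions.col

// Write the data partitioned by the column that is usually filtered on, so a
// query with a predicate on that column only reads the matching directories.
df.write.partitionBy("event_date").parquet("/data/events")

val events = sqlContext.read.parquet("/data/events")
events.where(col("event_date") === "2016-08-13").count()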




> On Aug 14, 2016, at 10:03 AM, Taotao.Li  wrote:
> 
> 
> hi guys, does Spark SQL support indexes? If so, how can I create an index
> on my temp table? If not, how can I handle some specific queries on a very
> large table? It would iterate over the whole table even though all I want is
> just a small piece of it.
> 
> great thanks, 
> 
> 
> ___
> Quant | Engineer | Boy
> ___
> blog:http://litaotao.github.io 
> 
> github: www.github.com/litaotao 
> 
> 



Does Spark SQL support indexes?

2016-08-13 Thread Taotao.Li
hi guys, does Spark SQL support indexes? If so, how can I create an index
on my temp table? If not, how can I handle some specific queries on a very
large table? It would iterate over the whole table even though all I want is
just a small piece of it.

great thanks,


___
Quant | Engineer | Boy
___
blog: http://litaotao.github.io

github: www.github.com/litaotao


Re: KafkaUtils.createStream not picking smallest offset

2016-08-13 Thread Diwakar Dhanuskodi

Not using checkpointing now. The source is producing 1.2 million messages to the
topic. We are using ZooKeeper offsets for other downstreams too. That's the
reason for going with createStream, which stores offsets in ZooKeeper.


Sent from Samsung Mobile.

 Original message 
From: Cody Koeninger
Date: 12/08/2016 23:42 (GMT+05:30)
To: Diwakar Dhanuskodi, user@spark.apache.org
Subject: Re: KafkaUtils.createStream not picking smallest offset
Are you checkpointing?


Beyond that, why are you using createStream instead of createDirectStream?
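
If you do stay on createStream, what usually bites is existing offsets for the
group in ZooKeeper. A minimal sketch of starting clean (untested; hostnames, topic
and group name are placeholders, ssc is your StreamingContext):

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "zookeeper.connect" -> "zk-host:2181",
  "group.id"          -> "my-app-fresh-group",  // brand-new group => no stored offsets
  "auto.offset.reset" -> "smallest")            // only applies when no offsets exist

val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("myTopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)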

On Fri, Aug 12, 2016 at 12:32 PM, Diwakar Dhanuskodi
 wrote:
> Okay.
> I could delete the consumer group in ZooKeeper and start again to reuse
> the same consumer group name. But this is not working. Somehow
> createStream is picking up the offset from somewhere other than
> /consumers/ in ZooKeeper.
>
>
> Sent from Samsung Mobile.
>
>
>
>
>
>
>
>
>  Original message 
> From: Cody Koeninger 
> Date:12/08/2016 18:02 (GMT+05:30)
> To: Diwakar Dhanuskodi 
> Cc:
> Subject: Re: KafkaUtils.createStream not picking smallest offset
>
> Auto offset reset only applies if there aren't offsets available otherwise.
>
> The old high level consumer stores offsets in zookeeper.
>
> If you want to make sure you're starting clean, use a new consumer group
> id.
>
> On Aug 12, 2016 3:35 AM, "Diwakar Dhanuskodi" 
> wrote:
>>
>>
>> Hi,
>> We are  using  spark  1.6.1 and  kafka 0.9.
>>
>> KafkaUtils.createStream is showing strange behaviour even though
>> auto.offset.reset is set to smallest. Whenever we need to restart
>> the stream, it picks up the latest offset, which is not expected.
>> Do we need to set any other properties?
>>
>> createDirectStream works fine in this case.
>>
>>
>> Sent from Samsung Mobile.


Re: mesos or kubernetes ?

2016-08-13 Thread Jacek Laskowski
Hi,

Thanks Michael! That's exactly what I missed in my understanding of
the different options for Spark on XYZ. Thanks!

And the last sentence was excellent in helping me relate DC/OS to, say, CDH.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sat, Aug 13, 2016 at 1:44 PM, Michael Gummelt  wrote:
> DC/OS Spark *is* Apache Spark on Mesos, along with some packaging that makes
> it easy to install and manage on DC/OS.
>
> For example:
>
> $ dcos package install spark
> $ dcos spark run --submit-args="--class SparkPi ..."
>
> The single-command install runs the cluster dispatcher and the history
> server in your cluster via Marathon, so it's HA.
> It provides a local CLI that your end users can use to submit jobs.
> And it's integrated with other DC/OS packages like HDFS.
>
> It sort of does for Spark what e.g. CDH does for Hadoop.
>
> On Sat, Aug 13, 2016 at 1:35 PM, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> I'm wondering why not DC/OS (with Mesos)?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Sat, Aug 13, 2016 at 11:24 AM, guyoh  wrote:
>> > My company is trying to decide whether to use kubernetes or mesos. Since
>> > we
>> > are planning to use Spark in the near future, I was wondering what is
>> > the
>> > best choice for us.
>> > Thanks,
>> > Guy
>> >
>> >
>> >
>> >
>> >
>>
>>
>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere




Re: [SQL] Why does (0 to 9).toDF("num").as[String] work?

2016-08-13 Thread Jacek Laskowski
Hi,

The point is that I could go fully typed with Dataset[String] and wonder
why it's possible with ints.

You're working with DataFrames, which are Dataset[Row]. That's too little
for me these days :)
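
To dig into it I'd look at the extended plans; my guess (worth checking against the
output) is that the String encoder adds an upcast in the deserializer:

// prints parsed / analyzed / optimized / physical plans for both variants
(0 to 9).toDF("num").as[String].explain(true)
(0 to 9).toDF("num").as[String].map(_.toUpperCase).explain(true)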

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sat, Aug 13, 2016 at 1:43 PM, Mich Talebzadeh
 wrote:
> Wouldn't that be as simple as:
>
> scala> (0 to 9).toDF
> res14: org.apache.spark.sql.DataFrame = [value: int]
>
> scala> (0 to 9).toDF.map(_.toString)
> res13: org.apache.spark.sql.Dataset[String] = [value: string]
>
> with my little knowledge
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 13 August 2016 at 21:17, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> Just ran into it and can't explain why it works. Please help me understand
>> it.
>>
>> Q1: Why can I `as[String]` with Ints? Is this type safe?
>>
>> scala> (0 to 9).toDF("num").as[String]
>> res12: org.apache.spark.sql.Dataset[String] = [num: int]
>>
>> Q2: Why can I map over strings even though there are really ints?
>>
>> scala> (0 to 9).toDF("num").as[String].map(_.toUpperCase)
>> res11: org.apache.spark.sql.Dataset[String] = [value: string]
>>
>> Why are the two lines possible?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>




Re: mesos or kubernetes ?

2016-08-13 Thread Michael Gummelt
DC/OS Spark *is* Apache Spark on Mesos, along with some packaging that
makes it easy to install and manage on DC/OS.

For example:

$ dcos package install spark
$ dcos spark run --submit-args="--class SparkPi ..."

The single-command install runs the cluster dispatcher and the
history server in your cluster via Marathon, so it's HA.
It provides a local CLI that your end users can use to submit jobs.
And it's integrated with other DC/OS packages like HDFS.

It sort of does for Spark what e.g. CDH does for Hadoop.

On Sat, Aug 13, 2016 at 1:35 PM, Jacek Laskowski  wrote:

> Hi,
>
> I'm wondering why not DC/OS (with Mesos)?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sat, Aug 13, 2016 at 11:24 AM, guyoh  wrote:
> > My company is trying to decide whether to use kubernetes or mesos. Since
> we
> > are planning to use Spark in the near future, I was wondering what is the
> > best choice for us.
> > Thanks,
> > Guy
> >
> >
> >
> >
> >
>
>
>


-- 
Michael Gummelt
Software Engineer
Mesosphere


Re: [SQL] Why does (0 to 9).toDF("num").as[String] work?

2016-08-13 Thread Mich Talebzadeh
Wouldn't that be as simple as:

scala> (0 to 9).toDF
res14: org.apache.spark.sql.DataFrame = [value: int]

scala> (0 to 9).toDF.map(_.toString)
res13: org.apache.spark.sql.Dataset[String] = [value: string]

with my little knowledge

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 13 August 2016 at 21:17, Jacek Laskowski  wrote:

> Hi,
>
> Just ran into it and can't explain why it works. Please help me understand
> it.
>
> Q1: Why can I `as[String]` with Ints? Is this type safe?
>
> scala> (0 to 9).toDF("num").as[String]
> res12: org.apache.spark.sql.Dataset[String] = [num: int]
>
> Q2: Why can I map over strings even though there are really ints?
>
> scala> (0 to 9).toDF("num").as[String].map(_.toUpperCase)
> res11: org.apache.spark.sql.Dataset[String] = [value: string]
>
> Why are the two lines possible?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
>


Re: Spark 2.0.0 - Java API - Modify a column in a dataframe

2016-08-13 Thread Jacek Laskowski
Hi,

Could Encoders.STRING work?
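
Or, if the goal is just to rewrite one String column in place and keep the schema,
a per-column UDF sidesteps the encoder question entirely. A sketch in Scala (the
Java functions/udf API is analogous; the column name is made up):

import org.apache.spark.sql.functions.{col, udf}

// Rewrites a single String column; every other column and the schema stay as-is.
val fixUp = udf { (s: String) => if (s == null) null else s.trim.toUpperCase }
val result = df.withColumn("name", fixUp(col("name")))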

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Aug 11, 2016 at 5:28 AM, Aseem Bansal  wrote:
> Hi
>
> I have a Dataset
>
> I will change a String to a String, so there will be no schema changes.
>
> Is there a way I can run a map on it? I have seen the function at
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Dataset.html#map(org.apache.spark.api.java.function.MapFunction,%20org.apache.spark.sql.Encoder)
>
> But the problem is the second argument. What should I give? The row is not
> in a specific format, so I cannot go and create an encoder for a bean. I want
> the schema to remain the same.




Re: mesos or kubernetes ?

2016-08-13 Thread Jacek Laskowski
Hi,

I'm wondering why not DC/OS (with Mesos)?

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sat, Aug 13, 2016 at 11:24 AM, guyoh  wrote:
> My company is trying to decide whether to use kubernetes or mesos. Since we
> are planning to use Spark in the near future, I was wondering what is the
> best choice for us.
> Thanks,
> Guy
>
>
>
>
>




[SQL] Why does (0 to 9).toDF("num").as[String] work?

2016-08-13 Thread Jacek Laskowski
Hi,

Just ran into it and can't explain why it works. Please help me understand it.

Q1: Why can I `as[String]` with Ints? Is this type safe?

scala> (0 to 9).toDF("num").as[String]
res12: org.apache.spark.sql.Dataset[String] = [num: int]

Q2: Why can I map over strings even though there are really ints?

scala> (0 to 9).toDF("num").as[String].map(_.toUpperCase)
res11: org.apache.spark.sql.Dataset[String] = [value: string]

Why are the two lines possible?

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski




Re: mesos or kubernetes ?

2016-08-13 Thread Shuai Lin
Good summary! One more advantage of running Spark on Mesos: community
support. There is quite a big user base running Spark on Mesos, so if
you encounter a problem with your deployment, it's very likely you can get
the answer with a simple Google search or by asking on the Spark/Mesos user
lists. By contrast, you're less likely to get the same with Spark on k8s.

On Sun, Aug 14, 2016 at 2:40 AM, Michael Gummelt 
wrote:

> Spark has a first-class scheduler for Mesos, whereas it doesn't for
> Kubernetes.  Running Spark on Kubernetes means running Spark in standalone
> mode, wrapped in a Kubernetes service: https://github.com/kubernetes/
> kubernetes/tree/master/examples/spark
>
> So you're effectively comparing standalone vs. Mesos.  For basic purposes,
> standalone works fine.  Mesos adds support for things like docker images,
> security, resource reservations via roles, targeting specific nodes via
> attributes, etc.
>
> The main benefit of Mesos, however, is that you can share the same
> infrastructure with other, non-Spark services.  We have users, for example,
> running Spark on the same cluster as HDFS, Cassandra, Kafka, web apps,
> Jenkins, etc.  You can do this with Kubernetes to some extent, but running
> in standalone means that the Spark "partition" isn't elastic.  You must
> statically partition to exclusively run Spark.
>
> On Sat, Aug 13, 2016 at 11:24 AM, guyoh  wrote:
>
>> My company is trying to decide whether to use kubernetes or mesos. Since
>> we
>> are planning to use Spark in the near future, I was wondering what is the
>> best choice for us.
>> Thanks,
>> Guy
>>
>>
>>
>>
>>
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>


Re: mesos or kubernetes ?

2016-08-13 Thread Michael Gummelt
Spark has a first-class scheduler for Mesos, whereas it doesn't for
Kubernetes.  Running Spark on Kubernetes means running Spark in standalone
mode, wrapped in a Kubernetes service:
https://github.com/kubernetes/kubernetes/tree/master/examples/spark

So you're effectively comparing standalone vs. Mesos.  For basic purposes,
standalone works fine.  Mesos adds support for things like docker images,
security, resource reservations via roles, targeting specific nodes via
attributes, etc.

The main benefit of Mesos, however, is that you can share the same
infrastructure with other, non-Spark services.  We have users, for example,
running Spark on the same cluster as HDFS, Cassandra, Kafka, web apps,
Jenkins, etc.  You can do this with Kubernetes to some extent, but running
in standalone means that the Spark "partition" isn't elastic.  You must
statically partition to exclusively run Spark.

On Sat, Aug 13, 2016 at 11:24 AM, guyoh  wrote:

> My company is trying to decide whether to use kubernetes or mesos. Since we
> are planning to use Spark in the near future, I was wondering what is the
> best choice for us.
> Thanks,
> Guy
>
>
>
>
>
>


-- 
Michael Gummelt
Software Engineer
Mesosphere


mesos or kubernetes ?

2016-08-13 Thread guyoh
My company is trying to decide whether to use kubernetes or mesos. Since we
are planning to use Spark in the near future, I was wondering what is the
best choice for us. 
Thanks, 
Guy







Re: call a mysql stored procedure from spark

2016-08-13 Thread Mich Talebzadeh
To be executed in MySQL, with the results sent back to Spark?

No I don't think so.

On the other hand, a stored procedure is nothing but compiled code, so can
you use the raw SQL behind the stored proc?

You can certainly send the SQL via JDBC and the RS back.
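
For example, something along these lines (untested; the URL, credentials and query
are placeholders) pushes the raw SQL down and brings the result set back as a
DataFrame:

// "dbtable" accepts a parenthesised subquery, so the SELECT (standing in for the
// SQL behind the stored proc) runs inside MySQL and only the result comes back.
val resultDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/mydb")
  .option("dbtable", "(SELECT id, amount FROM orders WHERE status = 'OPEN') AS t")
  .option("user", "theuser")
  .option("password", "thepassword")
  .load()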

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 13 August 2016 at 18:40, sujeet jog  wrote:

> Hi,
>
> Is there a way to call a stored procedure using spark ?
>
>
> thanks,
> Sujeet
>


call a mysql stored procedure from spark

2016-08-13 Thread sujeet jog
Hi,

Is there a way to call a stored procedure using spark ?


thanks,
Sujeet


Re: Accessing HBase through Spark with Security enabled

2016-08-13 Thread Jacek Laskowski
Hi Aneela,

My (little to no) understanding of how to make it work is to set the
hbase.security.authentication property to kerberos (see [1]).

Spark on YARN uses it to get the tokens for Hive, HBase et al (see [2]).
It happens when the Client starts its conversation with the YARN RM (see [3]).

You should not do that yourself (and BTW you've got a typo in the
spark.yarn.security.tokens.habse.enabled setting). I think that the
entire code you pasted duplicates what Spark does itself before
requesting resources from YARN.

Give it a shot and report back, since I've never worked in such a
configuration and would love to improve in this (security) area.
Thanks!

[1] 
http://www.cloudera.com/documentation/enterprise/5-5-x/topics/cdh_sg_hbase_authentication.html#concept_zyz_vg5_nt__section_s1l_nwv_ls
[2] 
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/HBaseCredentialProvider.scala#L58
[3] 
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L396
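
If it helps, the minimal setup I would try first (untested in a kerberized cluster;
it leans on [1] and [2] so that YARN fetches the HBase token itself, and the
hostnames are placeholders) looks roughly like this:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.spark.{SparkConf, SparkContext}

// Submit with --principal/--keytab and let Spark on YARN obtain the HBase
// delegation token instead of calling UserGroupInformation by hand
// (note the correct spelling "hbase" in the property name).
val conf = new SparkConf()
  .setAppName("hbase-kerberos")
  .set("spark.yarn.security.tokens.hbase.enabled", "true")
val sc = new SparkContext(conf)

val hconf = HBaseConfiguration.create()  // picks up hbase-site.xml from the classpath
hconf.set("hbase.zookeeper.quorum", "hadoop-master")
hconf.set("hbase.zookeeper.property.clientPort", "2181")
hconf.set("hadoop.security.authentication", "kerberos")
hconf.set("hbase.security.authentication", "kerberos")

// ... then build the HBaseContext / scan exactly as in your pasted code.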

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Aug 12, 2016 at 11:30 PM, Aneela Saleem  wrote:
> Thanks for your response Jacek!
>
> Here is the code, how spark accesses HBase:
> System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
> System.setProperty("java.security.auth.login.config",
> "/etc/hbase/conf/zk-jaas.conf");
> val hconf = HBaseConfiguration.create()
> val tableName = "emp"
> hconf.set("hbase.zookeeper.quorum", "hadoop-master")
> hconf.set(TableInputFormat.INPUT_TABLE, tableName)
> hconf.set("hbase.zookeeper.property.clientPort", "2181")
> hconf.set("hbase.master", "hadoop-master:6")
> hconf.set("hadoop.security.authentication", "kerberos")
> hconf.set("hbase.security.authentication", "kerberos")
> hconf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
> hconf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
> UserGroupInformation.setConfiguration(hconf)
> UserGroupInformation.loginUserFromKeytab("spark@platalyticsrealm",
> "/etc/hadoop/conf/sp.keytab")
> conf.set("spark.yarn.security.tokens.habse.enabled", "true")
> conf.set("hadoop.security.authentication", "true")
> conf.set("hbase.security.authentication", "true")
> conf.set("spark.authenticate", "true")
> conf.set("spark.authenticate.secret","None")
> val sc = new SparkContext(conf)
> UserGroupInformation.setConfiguration(hconf)
> val keyTab = "/etc/hadoop/conf/sp.keytab"
> val ugi =
> UserGroupInformation.loginUserFromKeytabAndReturnUGI("spark/hadoop-master@platalyticsrealm",
> keyTab)
> UserGroupInformation.setLoginUser(ugi)
> HBaseAdmin.checkHBaseAvailable(hconf);
> ugi.doAs(new PrivilegedExceptionAction[Void]() {
> override def run(): Void = {
> val conf = new SparkConf().set("spark.shuffle.consolidateFiles", "true")
>
> val sc = new SparkContext(conf)
> val hbaseContext = new HBaseContext(sc, hconf)
>
> val scan = new Scan()
> scan.addColumn(columnName, "column1")
> scan.setTimeRange(0L, 141608330L)
> val rdd = hbaseContext.hbaseRDD("emp", scan)
> println(rdd.count)
> rdd.saveAsTextFile("hdfs://hadoop-master:8020/hbaseTemp/")
> sc.stop()
> return null
> }
> })
> I have tried it with both Spark versions, 2.0 and 1.5.3, but the same exception
> was thrown.
>
> I floated this email on the HBase community as well; they recommended that I use
> the SparkOnHBase Cloudera library and asked me to try the above code, but nothing
> works. I'm stuck here.
>
>
> On Sat, Aug 13, 2016 at 7:07 AM, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> How do you access HBase? What's the version of Spark?
>>
>> (I don't see spark packages in the stack trace)
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Sun, Aug 7, 2016 at 9:02 AM, Aneela Saleem 
>> wrote:
>> > Hi all,
>> >
>> > I'm trying to run a spark job that accesses HBase with security enabled.
>> > When i run the following command:
>> >
>> > /usr/local/spark-2/bin/spark-submit --keytab
>> > /etc/hadoop/conf/spark.keytab
>> > --principal spark/hadoop-master@platalyticsrealm --class
>> > com.platalytics.example.spark.App --master yarn  --driver-class-path
>> > /root/hbase-1.2.2/conf /home/vm6/project-1-jar-with-dependencies.jar
>> >
>> >
>> > I get the following error:
>> >
>> >
>> > 2016-08-07 20:43:57,617 WARN
>> > [hconnection-0x24b5fa45-metaLookup-shared--pool2-t1] ipc.RpcClientImpl:
>> > Exception encountered while connecting to the server :
>> > javax.security.sasl.SaslException: GSS initiate failed [Caused by
>> > GSSException: No valid credentials provided (Mechanism level: Failed to
>> > find
>> > any Kerberos tgt)]
>> > 2016-08-07 20:43:57,619 ERROR
>> > 

Spark stage concurrency

2016-08-13 Thread Mazen
Suppose a Spark job has two stages with independent dependencies (they do not
depend on each other) and they are submitted concurrently/simultaneously (as
tasksets) by the DAG scheduler to the task scheduler. Can someone give more
detailed insight into how the cores available on the executors are distributed
among the two ready stages/tasksets? More precisely:

-   Are tasks from the second taskset/stage not launched until the tasks of
the previous taskset/stage complete? Or,

-   can tasks from both tasksets be launched (granted cores) simultaneously,
depending on the logic implemented by the task scheduler, e.g.
FIFO/Fair?

In general, suppose a new resource offer has triggered the task scheduler to
decide to select some ready tasks (out of n ready tasksets) for execution.
What is the logic implemented by the task scheduler in such a case?
Thanks.
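
For anyone who wants to observe this directly, a small experiment (a sketch; the
sleeps are only there so the two independent shuffle-map stages of the join stay
alive long enough to watch in the web UI how their tasks share executor cores):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("stage-concurrency"))

// Two independent parent (shuffle map) stages feeding one result stage.
val left  = sc.parallelize(1 to 10000, 8).map { x => Thread.sleep(1); (x % 100, x) }
val right = sc.parallelize(1 to 10000, 8).map { x => Thread.sleep(1); (x % 100, x) }

println(left.join(right).count())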








Spark Streaming fault tolerance benchmark

2016-08-13 Thread Dominik Safaric
A few months ago, as part of an empirical study, I started investigating
several stream processing engines, including but not limited to Spark
Streaming.

As the benchmark should extend beyond performance metrics
such as throughput and latency, I've focused on fault tolerance as well,
in particular the rate of data items lost due to various faults. In this
context, I have the following questions.

If the Driver fails, all Executors and their in-memory blocks will be
lost. The state is, however, maintained in HDFS, for example when
checkpointing or when using synchronous Write Ahead Logs. But what happens to
the blocks that have been received by the Receiver but not yet processed?
Assume an unreliable input data source. Secondly, to simulate this scenario
and determine whether and how many data items were lost, how exactly can I
trigger a Driver failure after the application has been operational for a
certain amount of time?
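
For the driver-failure experiment, the setup I would expect to need is roughly the
following (a sketch; the path, batch interval and restart mechanism are
placeholders), after which killing the driver JVM at a chosen point in time is the
failure injection:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/ft-benchmark-checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("ft-benchmark")
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // receiver blocks go to the WAL
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  // ... build the receiver-based input stream and the processing graph here ...
  ssc
}

// On restart (e.g. cluster mode with supervision) the context is rebuilt from the
// checkpoint instead of from scratch.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()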

Thirdly, what other fault tolerance scenarios might result in data items
being lost? I've also looked into back-pressure, but unlike Storm, which
uses a fail-fast back-pressure strategy, Spark Streaming handles
back-pressure gracefully.

Thanks a lot in advance!








Re: Spark 2 cannot create ORC table when CLUSTERED. This worked in Spark 1.6.1

2016-08-13 Thread Mich Talebzadeh
Hi,
SPARK-17047  created
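
In the meantime, in case it is useful, Spark 2.0's own writer has bucketing. A
sketch (not a drop-in replacement for the Hive DDL quoted below, and the ORC
TBLPROPERTIES such as bloom filters and stripe size are not covered):

// bucketBy/sortBy currently only work together with saveAsTable.
df.write
  .format("orc")
  .bucketBy(256, "ID")
  .sortBy("ID")
  .saveAsTable("test.dummy2")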

Thanks


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 13 August 2016 at 02:54, Jacek Laskowski  wrote:

> Hi Mich,
>
> File a JIRA issue, as it seems they overlooked that part. Spark
> 2.0 has less and less HiveQL and more and more native support.
>
> (My take on this is that the days of Hive in Spark are numbered and
> Hive is gonna disappear soon)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Aug 11, 2016 at 10:02 AM, Mich Talebzadeh
>  wrote:
> >
> >
> > This does not work with CLUSTERED BY clause in Spark 2 now!
> >
> > CREATE TABLE test.dummy2
> >  (
> >  ID INT
> >, CLUSTERED INT
> >, SCATTERED INT
> >, RANDOMISED INT
> >, RANDOM_STRING VARCHAR(50)
> >, SMALL_VC VARCHAR(10)
> >, PADDING  VARCHAR(10)
> > )
> > CLUSTERED BY (ID) INTO 256 BUCKETS
> > STORED AS ORC
> > TBLPROPERTIES ( "orc.compress"="SNAPPY",
> > "orc.create.index"="true",
> > "orc.bloom.filter.columns"="ID",
> > "orc.bloom.filter.fpp"="0.05",
> > "orc.stripe.size"="268435456",
> > "orc.row.index.stride"="1" )
> > scala> HiveContext.sql(sqltext)
> > org.apache.spark.sql.catalyst.parser.ParseException:
> > Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 2, pos 0)
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCd
> OABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > Disclaimer: Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> The
> > author will in no case be liable for any monetary damages arising from
> such
> > loss, damage or destruction.
> >
> >
>


Unsubscribe

2016-08-13 Thread bijuna

Unsubscribe





Re: Accessing HBase through Spark with Security enabled

2016-08-13 Thread Aneela Saleem
Thanks for your response Jacek!

Here is the code, how spark accesses HBase:
System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
System.setProperty("java.security.auth.login.config",
"/etc/hbase/conf/zk-jaas.conf");
val hconf = HBaseConfiguration.create()
val tableName = "emp"
hconf.set("hbase.zookeeper.quorum", "hadoop-master")
hconf.set(TableInputFormat.INPUT_TABLE, tableName)
hconf.set("hbase.zookeeper.property.clientPort", "2181")
hconf.set("hbase.master", "hadoop-master:6")
hconf.set("hadoop.security.authentication", "kerberos")
hconf.set("hbase.security.authentication", "kerberos")
hconf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
hconf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
UserGroupInformation.setConfiguration(hconf)
UserGroupInformation.loginUserFromKeytab("spark@platalyticsrealm",
"/etc/hadoop/conf/sp.keytab")
conf.set("spark.yarn.security.tokens.habse.enabled", "true")
conf.set("hadoop.security.authentication", "true")
conf.set("hbase.security.authentication", "true")
conf.set("spark.authenticate", "true")
conf.set("spark.authenticate.secret","None")
val sc = new SparkContext(conf)
UserGroupInformation.setConfiguration(hconf)
val keyTab = "/etc/hadoop/conf/sp.keytab"
val ugi =
UserGroupInformation.loginUserFromKeytabAndReturnUGI("spark/hadoop-master@platalyticsrealm",
keyTab)
UserGroupInformation.setLoginUser(ugi)
HBaseAdmin.checkHBaseAvailable(hconf);
ugi.doAs(new PrivilegedExceptionAction[Void]() {
override def run(): Void = {
val conf = new SparkConf().set("spark.shuffle.consolidateFiles", "true")

val sc = new SparkContext(conf)
val hbaseContext = new HBaseContext(sc, hconf)

val scan = new Scan()
scan.addColumn(columnName, "column1")
scan.setTimeRange(0L, 141608330L)
val rdd = hbaseContext.hbaseRDD("emp", scan)
println(rdd.count)
rdd.saveAsTextFile("hdfs://hadoop-master:8020/hbaseTemp/")
sc.stop()
return null
}
})
I have tried it with both Spark versions, 2.0 and 1.5.3, but the same exception
was thrown.

I floated this email on the HBase community as well; they recommended that I use
the SparkOnHBase Cloudera library and asked me to try the above code, but nothing
works. I'm stuck here.


On Sat, Aug 13, 2016 at 7:07 AM, Jacek Laskowski  wrote:

> Hi,
>
> How do you access HBase? What's the version of Spark?
>
> (I don't see spark packages in the stack trace)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sun, Aug 7, 2016 at 9:02 AM, Aneela Saleem 
> wrote:
> > Hi all,
> >
> > I'm trying to run a spark job that accesses HBase with security enabled.
> > When i run the following command:
> >
> > /usr/local/spark-2/bin/spark-submit --keytab
> /etc/hadoop/conf/spark.keytab
> > --principal spark/hadoop-master@platalyticsrealm --class
> > com.platalytics.example.spark.App --master yarn  --driver-class-path
> > /root/hbase-1.2.2/conf /home/vm6/project-1-jar-with-dependencies.jar
> >
> >
> > I get the following error:
> >
> >
> > 2016-08-07 20:43:57,617 WARN
> > [hconnection-0x24b5fa45-metaLookup-shared--pool2-t1] ipc.RpcClientImpl:
> > Exception encountered while connecting to the server :
> > javax.security.sasl.SaslException: GSS initiate failed [Caused by
> > GSSException: No valid credentials provided (Mechanism level: Failed to
> find
> > any Kerberos tgt)]
> > 2016-08-07 20:43:57,619 ERROR
> > [hconnection-0x24b5fa45-metaLookup-shared--pool2-t1] ipc.RpcClientImpl:
> SASL
> > authentication failed. The most likely cause is missing or invalid
> > credentials. Consider 'kinit'.
> > javax.security.sasl.SaslException: GSS initiate failed [Caused by
> > GSSException: No valid credentials provided (Mechanism level: Failed to
> find
> > any Kerberos tgt)]
> >   at
> > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(
> GssKrb5Client.java:212)
> >   at
> > org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(
> HBaseSaslRpcClient.java:179)
> >   at
> > org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.
> setupSaslConnection(RpcClientImpl.java:617)
> >   at
> > org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.
> access$700(RpcClientImpl.java:162)
> >   at
> > org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.
> run(RpcClientImpl.java:743)
> >   at
> > org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.
> run(RpcClientImpl.java:740)
> >   at java.security.AccessController.doPrivileged(Native Method)
> >   at javax.security.auth.Subject.doAs(Subject.java:415)
> >   at
> > org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1657)
> >   at
> > org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.
> setupIOstreams(RpcClientImpl.java:740)
> >   at
> > org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.
> writeRequest(RpcClientImpl.java:906)
> >