[no subject]

2016-08-13 Thread Jestin Ma
Hi, I'm currently trying to perform an outer join between two DataFrames/Datasets on a column, id; one is ~150 GB, the other about ~50 GB. df1.id is skewed in that there are many 0's, the rest being unique IDs. df2.id is not skewed. If I filter df1.id != 0, then the join works well. If I don't, then the
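A minimal sketch of one common workaround: split the hot key out and join the two halves separately (this assumes DataFrames df1 and df2 joined on id, that id is never null, and that the broadcast hint applies; the hint only helps for inner/one-sided outer joins and may be ignored for a full outer join):

import org.apache.spark.sql.functions.{broadcast, col}

// Split both sides on the hot key so each sub-join sees a disjoint key set.
val hot1  = df1.filter(col("id") === 0)
val rest1 = df1.filter(col("id") =!= 0)
val hot2  = df2.filter(col("id") === 0)
val rest2 = df2.filter(col("id") =!= 0)

// The well-distributed keys shuffle as usual.
val joinedRest = rest1.join(rest2, Seq("id"), "outer")

// The id = 0 slice of df2 should be tiny, so broadcasting it keeps the
// many id = 0 rows of df1 from piling onto a single reducer.
val joinedHot = hot1.join(broadcast(hot2), Seq("id"), "outer")

val result = joinedRest.union(joinedHot)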

Re: Does Spark SQL support indexes?

2016-08-13 Thread Jörn Franke
Use a format that has built-in indexes, such as Parquet or ORC. Do not forget to sort the data on the columns that you filter on. > On 14 Aug 2016, at 05:03, Taotao.Li wrote: > > > hi, guys, does Spark SQL support indexes? if so, how can I create an index > on my temp table? if not, how can
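For instance, a rough sketch of the write side (table, column and path names are made up):

import org.apache.spark.sql.functions.col

// Write ORC sorted on the column you filter by, so each stripe's min/max
// statistics become selective; the same idea applies to Parquet.
df.sortWithinPartitions("customer_id")
  .write
  .format("orc")
  .save("/data/events_orc")

// A read filtering on the sorted column can then skip whole stripes.
val hits = spark.read.format("orc").load("/data/events_orc")
  .filter(col("customer_id") === 42)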

How Spark sql query optimisation work if we are using .rdd action ?

2016-08-13 Thread mayur bhole
Hi All, Let's say we have val df = bigTableA.join(bigTableB, bigTableA("A") === bigTableB("A"), "left") val rddFromDF = df.rdd println(rddFromDF.count) My understanding is that Spark will convert all DataFrame operations before "rddFromDF.count" into equivalent RDD operations, as we are not performin
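One way to check this yourself, reusing the names from the post (a sketch, not a definitive answer):

// explain(true) prints the parsed, analyzed, optimized and physical plans
// Catalyst produced for everything up to this point.
val df = bigTableA.join(bigTableB, bigTableA("A") === bigTableB("A"), "left")
df.explain(true)

// .rdd executes that optimized plan to produce the rows, but anything
// chained onto rddFromDF afterwards is plain RDD code that Catalyst never
// sees (e.g. this count misses DataFrame-level count optimizations).
val rddFromDF = df.rdd
println(rddFromDF.count)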

Re: Why I can't use broadcast var defined in a global object?

2016-08-13 Thread Ted Yu
Can you (or David) resend David's reply? I don't see the reply in this thread. Thanks > On Aug 13, 2016, at 8:39 PM, yaochunnan wrote: > > Hi David, > Your answers have solved my problem! Detailed and accurate. Thank you very > much!

Re: Why I can't use broadcast var defined in a global object?

2016-08-13 Thread yaochunnan
Hi David, Your answers have solved my problem! Detailed and accurate. Thank you very much!

Re: Does Spark SQL support indexes?

2016-08-13 Thread Chanh Le
Hi Taotao, Spark SQL doesn't support indexes :). > On Aug 14, 2016, at 10:03 AM, Taotao.Li wrote: > > > hi, guys, does Spark SQL support indexes? if so, how can I create an index > on my temp table? if not, how can I handle some specific queries on a very > large table? it would iterate al

Does Spark SQL support indexes?

2016-08-13 Thread Taotao.Li
Hi guys, does Spark SQL support indexes? If so, how can I create an index on my temp table? If not, how can I handle specific queries on a very large table? It would scan the entire table even though all I want is just a small piece of it. Great thanks,
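Besides sorted ORC/Parquet (see Jörn's reply above), partition pruning is the usual way to keep a query from scanning the whole table; a sketch with made-up names:

import org.apache.spark.sql.functions.col

// Partition the data on a column your queries filter by, so a query only
// touches the matching directories.
df.write
  .partitionBy("event_date")
  .parquet("/data/events")

// This read prunes down to a single partition directory.
val oneDay = spark.read.parquet("/data/events")
  .filter(col("event_date") === "2016-08-13")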

Re: KafkaUtils.createStream not picking smallest offset

2016-08-13 Thread Diwakar Dhanuskodi
Not using checkpointing now. Source is producing 1.2 million messages to the topic. We are using ZooKeeper offsets for other downstreams too. That's the reason for going with createStream, which stores offsets in ZooKeeper.
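For reference, a sketch of the createStream overload that takes consumer properties (addresses and names are placeholders); note that auto.offset.reset only kicks in when the group has no offset stored in ZooKeeper yet, since an existing offset always wins:

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "zookeeper.connect" -> "zk1:2181",
  "group.id"          -> "my-consumer-group",
  "auto.offset.reset" -> "smallest")

// Receiver-based stream; offsets are tracked in ZooKeeper as before.
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)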

Re: mesos or kubernetes ?

2016-08-13 Thread Jacek Laskowski
Hi, Thanks Michael! That's exactly what I missed in my understanding of the different options for Spark on XYZ. Thanks! And the last sentence was excellent in helping me compare DC/OS to, say, CDH. Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski

Re: [SQL] Why does (0 to 9).toDF("num").as[String] work?

2016-08-13 Thread Jacek Laskowski
Hi, The point is that I could go fully typed with Dataset[String] and wondered why it's possible with Ints. You're working with DataFrames, which are Dataset[Row]. That's too little for me these days :) Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski

Re: mesos or kubernetes ?

2016-08-13 Thread Michael Gummelt
DC/OS Spark *is* Apache Spark on Mesos, along with some packaging that makes it easy to install and manage on DC/OS. For example: $ dcos package install spark $ dcos spark run --submit-args="--class SparkPi ..." The single-command install runs the cluster dispatcher and the history server

Re: [SQL] Why does (0 to 9).toDF("num").as[String] work?

2016-08-13 Thread Mich Talebzadeh
Would it not be as simple as: scala> (0 to 9).toDF res14: org.apache.spark.sql.DataFrame = [value: int] scala> (0 to 9).toDF.map(_.toString) res13: org.apache.spark.sql.Dataset[String] = [value: string] with my little knowledge. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Spark 2.0.0 - Java API - Modify a column in a dataframe

2016-08-13 Thread Jacek Laskowski
Hi, Could Encoders.STRING work? Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Thu, Aug 11, 2016 at 5:28 AM, Aseem Bansal wrote: > Hi > > I have a Dataset >
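For illustration, the Scala equivalents of the two usual approaches (the Java API is analogous; dataset and column names are made up):

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.{col, upper}

// Untyped: withColumn replaces a column via an expression.
val modified = df.withColumn("name", upper(col("name")))

// Typed: map with an explicit encoder, as suggested above.
val names   = df.select("name").as(Encoders.STRING)
val shouted = names.map(_.toUpperCase)(Encoders.STRING)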

Re: mesos or kubernetes ?

2016-08-13 Thread Jacek Laskowski
Hi, I'm wondering why not DC/OS (with Mesos)? Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sat, Aug 13, 2016 at 11:24 AM, guyoh wrote: > My company is tryi

[SQL] Why does (0 to 9).toDF("num").as[String] work?

2016-08-13 Thread Jacek Laskowski
Hi, I just ran into this and can't explain why it works. Please help me understand it. Q1: Why can I `as[String]` with Ints? Is this type safe? scala> (0 to 9).toDF("num").as[String] res12: org.apache.spark.sql.Dataset[String] = [num: int] Q2: Why can I map over strings even though there are really
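For what it's worth, one reading (hedged, not verified against the source) is that as[String] works because the analyzer inserts an implicit upcast from int to string when it binds the encoder, so the schema still reports int while the typed view yields Strings:

val ds = (0 to 9).toDF("num").as[String]  // schema stays [num: int]
val bang = ds.map(_ + "!")                // the lambda really sees Strings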

Re: mesos or kubernetes ?

2016-08-13 Thread Shuai Lin
Good summary! One more advantage of running Spark on Mesos: community support. There is quite a big user base running Spark on Mesos, so if you encounter a problem with your deployment, it's very likely you can get the answer with a simple Google search, or by asking on the Spark/Mesos user lists. By

Re: mesos or kubernetes ?

2016-08-13 Thread Michael Gummelt
Spark has a first-class scheduler for Mesos, whereas it doesn't have one for Kubernetes. Running Spark on Kubernetes means running Spark in standalone mode, wrapped in a Kubernetes service: https://github.com/kubernetes/kubernetes/tree/master/examples/spark So you're effectively comparing standalone vs. M

mesos or kubernetes ?

2016-08-13 Thread guyoh
My company is trying to decide whether to use Kubernetes or Mesos. Since we are planning to use Spark in the near future, I was wondering what is the best choice for us. Thanks, Guy

Re: call a mysql stored procedure from spark

2016-08-13 Thread Mich Talebzadeh
To be executed in MySQL with results sent back to Spark? No, I don't think so. On the other hand, a stored procedure is nothing but compiled code, so can you use the raw SQL behind the stored proc? You can certainly send the SQL via JDBC and get the result set back. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
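A sketch of the JDBC route (URL, credentials and query are placeholders): push the SELECT behind the proc down as a subquery. A procedure with side effects would instead need plain JDBC on the driver (java.sql.DriverManager plus a CallableStatement).

val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/mydb")
  .option("dbtable", "(SELECT * FROM orders WHERE status = 'OPEN') AS t")
  .option("user", "user")
  .option("password", "secret")
  .load()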

call a mysql stored procedure from spark

2016-08-13 Thread sujeet jog
Hi, Is there a way to call a stored procedure using Spark? Thanks, Sujeet

Re: Accessing HBase through Spark with Security enabled

2016-08-13 Thread Jacek Laskowski
Hi Aneela, My (little to no) understanding of how to make it work is to set the hbase.security.authentication property to kerberos (see [1]). Spark on YARN uses it to get the tokens for Hive, HBase et al. (see [2]). It happens when Client starts its conversation with YARN RM (see [3]). You should not d
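A sketch of that client-side setting in code (normally it would come from hbase-site.xml on the classpath instead):

import org.apache.hadoop.hbase.HBaseConfiguration

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.security.authentication", "kerberos")
hbaseConf.set("hadoop.security.authentication", "kerberos")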

Spark stage concurrency

2016-08-13 Thread Mazen
Suppose a Spark job has two independent stages (they do not depend on each other) that are submitted concurrently/simultaneously (as TaskSets) by the DAG scheduler to the task scheduler. Can someone give more detailed insight on how the cores available on executors are distrib
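A related, concrete knob (a sketch, not an answer to the internals question): with spark.scheduler.mode=FAIR, task sets from jobs submitted on separate threads share executor cores rather than draining strictly FIFO; rddA and rddB below are made up:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Assumes spark.scheduler.mode=FAIR in the configuration.
Future { rddA.count() }  // job 1
Future { rddB.count() }  // job 2 competes for the same executor cores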

Spark Streaming fault tolerance benchmark

2016-08-13 Thread Dominik Safaric
A few months ago I started investigating, as part of an empirical research project, several stream processing engines, including but not limited to Spark Streaming. As the benchmark should extend its scope beyond performance metrics such as throughput and latency, I've focused on fault tolerance a

Re: Spark 2 cannot create ORC table when CLUSTERED. This worked in Spark 1.6.1

2016-08-13 Thread Mich Talebzadeh
Hi, SPARK-17047 created. Thanks, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://tale

Unsubscribe

2016-08-13 Thread bijuna
Unsubscribe