Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-02 Thread satyajit vegesna
Hi All, I am trying to run a Spark job using YARN, and I specify the --executor-cores value as 20. But when I check the "nodes of the cluster" page at http://hostname:8088/cluster/nodes I see 4 containers getting created on each of the nodes in the cluster, but can only see 1 vcore getting
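
For reference, here is a minimal sketch of a spark-submit invocation matching the setup described above; the master setting, executor memory, class name, and jar path are placeholders, not the actual command from this thread:

  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --executor-cores 20 \
    --executor-memory 8g \
    --class com.example.MyJob \
    /path/to/my-job.jar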

Re: SQL Based Authorization for SparkSQL

2016-08-02 Thread Ted Yu
There was SPARK-12008 which was closed. Not sure if there is an active JIRA in this regard. On Tue, Aug 2, 2016 at 6:40 PM, 马晓宇 wrote: > Hi guys, > > I wonder if anyone is working on SQL-based authorization already or not. > > This is something we need badly right now

SQL Based Authorization for SparkSQL

2016-08-02 Thread 马晓宇
Hi guys, I wonder if anyone is working on SQL-based authorization already or not. This is something we need badly right now, and we tried to embed a Hive frontend in front of SparkSQL to achieve this, but it's not quite an elegant solution. If SparkSQL has a way to do it or anyone already

Re: AccumulatorV2 += operator

2016-08-02 Thread Holden Karau
I believe it was intentional, with the idea that it would be more unified between the Java and Scala APIs. If you're talking about the javadoc mention in https://github.com/apache/spark/pull/14466/files - I believe the += is meant to refer to what the internal implementation of the add function can be

AccumulatorV2 += operator

2016-08-02 Thread Bryan Cutler
It seems like the += operator is missing from the new accumulator API, although the docs still make reference to it. Does anyone know if it was intentionally left out? I'm happy to do a PR for it, or to update the docs to just use the add() method; I just want to check if there was some reason first.
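
For context, a minimal sketch of using add() with the new accumulator API via the built-in longAccumulator; the app name and counter name are illustrative:

  import org.apache.spark.sql.SparkSession

  object AccumulatorAddExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("accumulator-add").master("local[*]").getOrCreate()
      val sc = spark.sparkContext

      // The built-in LongAccumulator exposes add(), not +=.
      val acc = sc.longAccumulator("record counter")
      sc.parallelize(1 to 100).foreach(_ => acc.add(1L))
      println(s"records seen: ${acc.value}")

      spark.stop()
    }
  }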

Graph edge type pattern matching in GraphX

2016-08-02 Thread Ulanov, Alexander
Dear Spark developers, Could you suggest how to perform pattern matching on the type of the graph edge in the following scenario? I need to perform some math by means of aggregateMessages on the graph edges if the edge attribute is Double. Here is the code: def my[VD: ClassTag, ED: ClassTag] (graph:
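
Since the code snippet above is cut off, here is a hedged sketch of one possible way to branch on the runtime edge type; this is not the original code, and the aggregation inside is purely illustrative:

  import scala.reflect.ClassTag
  import org.apache.spark.graphx.{Graph, VertexRDD}

  // Branch on the edge attribute type at runtime via its ClassTag.
  def my[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Option[VertexRDD[Double]] = {
    if (implicitly[ClassTag[ED]] == ClassTag.Double) {
      val g = graph.asInstanceOf[Graph[VD, Double]]
      // Example math: sum the Double edge attributes arriving at each vertex.
      Some(g.aggregateMessages[Double](ctx => ctx.sendToDst(ctx.attr), _ + _))
    } else {
      None
    }
  }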

Re: What happens in Dataset limit followed by rdd

2016-08-02 Thread Sun Rui
Spark does optimise subsequent limits, for example:

scala> df1.limit(3).limit(1).explain
== Physical Plan ==
CollectLimit 1
+- *SerializeFromObject [assertnotnull(input[0, $line14.$read$$iw$$iw$my, true], top level non-flat input object).x AS x#2]
   +- Scan ExternalRDDScan[obj#1]

However, limit

Re: Testing --supervise flag

2016-08-02 Thread Noorul Islam Kamal Malmiyoda
Widening to dev@spark On Mon, Aug 1, 2016 at 4:21 PM, Noorul Islam K M wrote: > > Hi all, > > I was trying to test the --supervise flag of spark-submit. > > The documentation [1] says that the flag helps in restarting your > application automatically if it exited with non-zero
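
For reference, a sketch of a spark-submit command exercising the flag; --supervise takes effect only in cluster deploy mode (Spark standalone or Mesos), and the master URL, class name, and jar path below are placeholders:

  spark-submit \
    --master spark://master-host:7077 \
    --deploy-mode cluster \
    --supervise \
    --class com.example.MyApp \
    /path/to/my-app.jar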

Re: What happens in Dataset limit followed by rdd

2016-08-02 Thread Maciej Szymkiewicz
Thank you for your prompt response and great examples, Sun Rui, but I am still confused about one thing. Do you see any particular reason not to merge subsequent limits? The following case (limit n (map f (limit m ds))) could be optimized to (map f (limit n (limit m ds))) and further to
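
To make the Lisp-style expressions above concrete, here is a minimal sketch in Dataset API terms; the function f and the values of m and n are illustrative, and explain() can be used to compare the resulting plans:

  import org.apache.spark.sql.SparkSession

  object LimitRewriteSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("limit-rewrite").master("local[*]").getOrCreate()
      import spark.implicits._

      val ds = spark.range(100).as[Long]
      val f: Long => Long = _ * 2
      val (m, n) = (10, 3)

      // (limit n (map f (limit m ds)))
      ds.limit(m).map(f).limit(n).explain()

      // Proposed rewrite: (map f (limit n (limit m ds)))
      ds.limit(m).limit(n).map(f).explain()

      spark.stop()
    }
  }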

Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-02 Thread Nick Pentreath
Note that both HashingTF and CountVectorizer are usually used for creating TF-IDF normalized vectors. The definition (https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) of term frequency in TF-IDF is actually the "number of times the term occurs in the document". So it's perhaps a bit of a
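
A minimal sketch showing that HashingTF produces raw per-document term counts (the definition quoted above); the documents and column names are made up for illustration:

  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
  import org.apache.spark.sql.SparkSession

  object RawTermFrequencySketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("raw-tf").master("local[*]").getOrCreate()

      val docs = spark.createDataFrame(Seq(
        (0, "spark spark hadoop"),
        (1, "spark flink")
      )).toDF("id", "text")

      val words = new Tokenizer().setInputCol("text").setOutputCol("words").transform(docs)
      val tf = new HashingTF().setInputCol("words").setOutputCol("rawTF").setNumFeatures(1 << 10)

      // Each non-zero vector entry is a raw occurrence count,
      // e.g. 2.0 for "spark" in document 0.
      tf.transform(words).select("id", "rawTF").show(false)

      spark.stop()
    }
  }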