Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-02 Thread satyajit vegesna
Hi All,

I am trying to run a Spark job using YARN, and I specify the --executor-cores
value as 20.
But when I check the "Nodes of the cluster" page at
http://hostname:8088/cluster/nodes, I see 4 containers getting created
on each node in the cluster.

However, I only see 1 vcore assigned to each container, even though I
specify --executor-cores 20 while submitting the job using spark-submit.
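
For reference, the same request can be expressed programmatically through
SparkConf; this is a minimal sketch, with the app name and the memory value
as placeholders rather than settings taken from the job above:

import org.apache.spark.{SparkConf, SparkContext}

// spark.executor.cores corresponds to the --executor-cores flag and
// spark.executor.memory to --executor-memory.
val conf = new SparkConf()
  .setAppName("vcore-test")              // placeholder name
  .set("spark.executor.cores", "20")
  .set("spark.executor.memory", "4g")    // placeholder value
val sc = new SparkContext(conf)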

yarn-site.xml

<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>6</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>40</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>7</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>20</value>
</property>



Did anyone face the same issue?

Regards,
Satyajit.


Re: SQL Based Authorization for SparkSQL

2016-08-02 Thread Ted Yu
There was SPARK-12008, which was closed.

Not sure if there is an active JIRA in this regard.

On Tue, Aug 2, 2016 at 6:40 PM, 马晓宇  wrote:

> Hi guys,
>
> I wonder if anyone is already working on SQL-based authorization.
>
> This is something we need badly right now. We tried to embed a Hive
> frontend in front of SparkSQL to achieve this, but it's not quite an
> elegant solution. Does SparkSQL have a way to do it, or is anyone already
> working on it?
>
> If not, we might consider making some contributions here and might need
> guidance during the work.
>
> Thanks.
>
> Shawn
>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


SQL Based Authorization for SparkSQL

2016-08-02 Thread 马晓宇

Hi guys,

I wonder if anyone is already working on SQL-based authorization.

This is something we need badly right now. We tried to embed a Hive
frontend in front of SparkSQL to achieve this, but it's not quite an
elegant solution. Does SparkSQL have a way to do it, or is anyone already
working on it?


If not, we might consider making some contributions here and might need
guidance during the work.


Thanks.

Shawn



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: AccumulatorV2 += operator

2016-08-02 Thread Holden Karau
I believe it was intentional, with the idea that it would be more unified
between the Java and Scala APIs. If you're talking about the javadoc mention in
https://github.com/apache/spark/pull/14466/files - I believe the += is
meant to refer to what the internal implementation of the add function can
be for someone extending the accumulator (though it certainly could cause
confusion).

Reynold can provide a more definitive answer in this case.
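
To make that concrete, here is a minimal sketch of extending AccumulatorV2
(the class name is illustrative), where add() is the public entry point and
+= only appears inside the implementation:

import org.apache.spark.util.AccumulatorV2

// A simple Long-summing accumulator on the Spark 2.0 AccumulatorV2 API.
class SumAccumulator extends AccumulatorV2[Long, Long] {
  private var sum = 0L

  override def isZero: Boolean = sum == 0L
  override def copy(): SumAccumulator = {
    val acc = new SumAccumulator
    acc.sum = sum
    acc
  }
  override def reset(): Unit = { sum = 0L }
  override def add(v: Long): Unit = { sum += v }    // the internal += lives here
  override def merge(other: AccumulatorV2[Long, Long]): Unit = { sum += other.value }
  override def value: Long = sum
}

// Callers use add() rather than +=:
//   val acc = new SumAccumulator
//   sc.register(acc, "sum")
//   rdd.foreach(x => acc.add(x))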

On Tue, Aug 2, 2016 at 1:46 PM, Bryan Cutler  wrote:

> It seems like the += operator is missing from the new accumulator API,
> although the docs still make reference to it.  Anyone know if it was
> intentionally not put in?  I'm happy to do a PR for it or update the docs
> to just use the add() method, just want to check if there was some reason
> first.
>
> Bryan
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


AccumulatorV2 += operator

2016-08-02 Thread Bryan Cutler
It seems like the += operator is missing from the new accumulator API,
although the docs still make reference to it.  Anyone know if it was
intentionally not put in?  I'm happy to do a PR for it or update the docs
to just use the add() method, just want to check if there was some reason
first.

Bryan


Graph edge type pattern matching in GraphX

2016-08-02 Thread Ulanov, Alexander
Dear Spark developers,

Could you suggest how to perform pattern matching on the type of the graph edge
in the following scenario? I need to perform some math by means of
aggregateMessages on the graph edges if the edge type is Double. Here is the code:

def my[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Double = {
  graph match {
    case g: Graph[_, Double] =>
      g.aggregateMessages[Double](t => t.sendToSrc(t.attr), _ + _).values.max
    case _ => 0.0
  }
}

However, it does not work, because aggregateMessages creates a context t of type
[VD, ED, Double]. I expected it to create a context of type [VD, Double, Double]
because of the type pattern matching. Could you suggest what the issue is?
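
For what it's worth, one common workaround sketch for an erased type parameter
like this is to compare ClassTag evidence and cast, instead of matching on the
generic parameter (the helper name mySum is illustrative):

import scala.reflect.{ClassTag, classTag}
import org.apache.spark.graphx.Graph

def mySum[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Double = {
  // ED is erased at runtime, so check the ClassTag evidence and cast
  // once the edge type is known to be Double.
  if (classTag[ED] == classTag[Double]) {
    val g = graph.asInstanceOf[Graph[VD, Double]]
    g.aggregateMessages[Double](t => t.sendToSrc(t.attr), _ + _).values.max()
  } else {
    0.0
  }
}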

Best regards, Alexander


Re: What happens in Dataset limit followed by rdd

2016-08-02 Thread Sun Rui
Spark does optimise subsequent limits, for example:
scala> df1.limit(3).limit(1).explain
== Physical Plan ==
CollectLimit 1
+- *SerializeFromObject [assertnotnull(input[0, $line14.$read$$iw$$iw$my, 
true], top level non-flat input object).x AS x#2]
   +- Scan ExternalRDDScan[obj#1]

However, a limit cannot simply be pushed down across mapping functions, because
the number of rows may change across functions, for example with flatMap().

It seems that a limit could be pushed across map(), which won't change the
number of rows. Maybe there is room here for Spark optimisation.
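
As a small illustration of why the row count matters, a sketch reusing the df1
defined in the quoted test case below:

// map() emits exactly one output row per input row, so a limit could in
// principle be evaluated before it:
df1.map(r => r.getAs[Int](0) + 1).limit(5)

// flatMap() may emit zero or many rows per input row, so evaluating the
// limit before it would change the result:
df1.flatMap(r => Seq.fill(r.getAs[Int](0) % 3)(r.getAs[Int](0))).limit(5)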

> On Aug 2, 2016, at 18:51, Maciej Szymkiewicz  wrote:
> 
> Thank you for your prompt response and great examples, Sun Rui, but I am
> still confused about one thing. Do you see any particular reason not to
> merge subsequent limits? The following case
> 
>(limit n (map f (limit m ds)))
> 
> could be optimized to:
> 
>(map f (limit n (limit m ds)))
> 
> and further to
> 
>(map f (limit (min n m) ds))
> 
> couldn't it?
> 
> 
> On 08/02/2016 11:57 AM, Sun Rui wrote:
>> Based on your code, here is a simpler test case on Spark 2.0:
>> 
>>case class my (x: Int)
>>val rdd = sc.parallelize(0.until(1), 1000).map { x => my(x) }
>>val df1 = spark.createDataFrame(rdd)
>>val df2 = df1.limit(1)
>>df1.map { r => r.getAs[Int](0) }.first
>>df2.map { r => r.getAs[Int](0) }.first // Much slower than the
>>previous line
>> 
>> Actually, Dataset.first is equivalent to Dataset.limit(1).collect, so
>> check the physical plan of the two cases:
>> 
>>scala> df1.map { r => r.getAs[Int](0) }.limit(1).explain
>>== Physical Plan ==
>>CollectLimit 1
>>+- *SerializeFromObject [input[0, int, true] AS value#124]
>>   +- *MapElements , obj#123: int
>>  +- *DeserializeToObject createexternalrow(x#74,
>>StructField(x,IntegerType,false)), obj#122: org.apache.spark.sql.Row
>> +- Scan ExistingRDD[x#74]
>> 
>>scala> df2.map { r => r.getAs[Int](0) }.limit(1).explain
>>== Physical Plan ==
>>CollectLimit 1
>>+- *SerializeFromObject [input[0, int, true] AS value#131]
>>   +- *MapElements , obj#130: int
>>  +- *DeserializeToObject createexternalrow(x#74,
>>StructField(x,IntegerType,false)), obj#129: org.apache.spark.sql.Row
>> +- *GlobalLimit 1
>>+- Exchange SinglePartition
>>   +- *LocalLimit 1
>>  +- Scan ExistingRDD[x#74]
>> 
>> 
>> For the first case, it is related to an optimisation in
>> the CollectLimitExec physical operator. That is, it will first fetch
>> the first partition to get the limit number of rows, 1 in this case; if not
>> satisfied, it then fetches more partitions, until the desired limit is
>> reached. So generally, if the first partition is not empty, only the
>> first partition will be calculated and fetched. Other partitions will
>> not even be computed.
>> 
>> However, in the second case, the optimisation in CollectLimitExec
>> does not help, because the previous limit operation involves a shuffle.
>> All partitions will be computed, LocalLimit(1) is run
>> on each partition to get 1 row, and then all partitions are shuffled
>> into a single partition. CollectLimitExec will fetch 1 row from the
>> resulting single partition.
>> 
>> 
>>> On Aug 2, 2016, at 09:08, Maciej Szymkiewicz wrote:
>>> 
>>> Hi everyone,
>>> 
>>> This doesn't look like something expected, does it?
>>> 
>>> http://stackoverflow.com/q/38710018/1560062
>>> 
>>> A quick glance at the UI suggests that there is a shuffle involved and
>>> the input for first is a ShuffledRowRDD.
>>> -- 
>>> Best regards,
>>> Maciej Szymkiewicz
>> 
> 
> -- 
> Maciej Szymkiewicz



Re: Testing --supervise flag

2016-08-02 Thread Noorul Islam Kamal Malmiyoda
Widening to dev@spark

On Mon, Aug 1, 2016 at 4:21 PM, Noorul Islam K M  wrote:
>
> Hi all,
>
> I was trying to test --supervise flag of spark-submit.
>
> The documentation [1] says that the flag helps in restarting your
> application automatically if it exited with a non-zero exit code.
>
> I am looking for some clarification on that documentation. In this
> context, does application mean the driver?
>
> Will the driver be re-launched if an exception is thrown by the
> application? I tested this scenario and the driver is not re-launched.
>
> ~/spark-1.6.1/bin/spark-submit --deploy-mode cluster --master 
> spark://10.29.83.162:6066 --class 
> org.apache.spark.examples.ExceptionHandlingTest 
> /home/spark/spark-1.6.1/lib/spark-examples-1.6.1-hadoop2.6.0.jar
>
> I killed the driver Java process using the 'kill -9' command, and the driver
> was re-launched.
>
> Is this the only scenario where the driver will be re-launched? Is there a
> way to simulate a non-zero exit code and test the use of the --supervise flag?
>
> Regards,
> Noorul
>
> [1] 
> http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications
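
On the last question above, a minimal driver sketch that exits with a non-zero
code after its job finishes (object and app names are illustrative; whether the
standalone Master then relaunches it under --supervise is exactly what the
question is probing):

import org.apache.spark.{SparkConf, SparkContext}

object AlwaysFailingDriver {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("supervise-test"))
    try {
      sc.parallelize(1 to 10).count()   // some trivial work
    } finally {
      sc.stop()
    }
    sys.exit(1)                         // force a non-zero exit code
  }
}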

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: What happens in Dataset limit followed by rdd

2016-08-02 Thread Maciej Szymkiewicz
Thank you for your prompt response and great examples, Sun Rui, but I am
still confused about one thing. Do you see any particular reason not to
merge subsequent limits? The following case

(limit n (map f (limit m ds)))

could be optimized to:

(map f (limit n (limit m ds)))

and further to

(map f (limit (min n m) ds))

couldn't it?


On 08/02/2016 11:57 AM, Sun Rui wrote:
> Based on your code, here is a simpler test case on Spark 2.0:
>
> case class my (x: Int)
> val rdd = sc.parallelize(0.until(1), 1000).map { x => my(x) }
> val df1 = spark.createDataFrame(rdd)
> val df2 = df1.limit(1)
> df1.map { r => r.getAs[Int](0) }.first
> df2.map { r => r.getAs[Int](0) }.first // Much slower than the
> previous line
>
> Actually, Dataset.first is equivalent to Dataset.limit(1).collect, so
> check the physical plan of the two cases:
>
> scala> df1.map { r => r.getAs[Int](0) }.limit(1).explain
> == Physical Plan ==
> CollectLimit 1
> +- *SerializeFromObject [input[0, int, true] AS value#124]
>+- *MapElements , obj#123: int
>   +- *DeserializeToObject createexternalrow(x#74,
> StructField(x,IntegerType,false)), obj#122: org.apache.spark.sql.Row
>  +- Scan ExistingRDD[x#74]
>
> scala> df2.map { r => r.getAs[Int](0) }.limit(1).explain
> == Physical Plan ==
> CollectLimit 1
> +- *SerializeFromObject [input[0, int, true] AS value#131]
>+- *MapElements , obj#130: int
>   +- *DeserializeToObject createexternalrow(x#74,
> StructField(x,IntegerType,false)), obj#129: org.apache.spark.sql.Row
>  +- *GlobalLimit 1
> +- Exchange SinglePartition
>+- *LocalLimit 1
>   +- Scan ExistingRDD[x#74]
>
>
> For the first case, it is related to an optimisation in
> the CollectLimitExec physical operator. That is, it will first fetch
> the first partition to get the limit number of rows, 1 in this case; if not
> satisfied, it then fetches more partitions, until the desired limit is
> reached. So generally, if the first partition is not empty, only the
> first partition will be calculated and fetched. Other partitions will
> not even be computed.
>
> However, in the second case, the optimisation in CollectLimitExec
> does not help, because the previous limit operation involves a shuffle.
> All partitions will be computed, LocalLimit(1) is run
> on each partition to get 1 row, and then all partitions are shuffled
> into a single partition. CollectLimitExec will fetch 1 row from the
> resulting single partition.
>
>
>> On Aug 2, 2016, at 09:08, Maciej Szymkiewicz wrote:
>>
>> Hi everyone,
>>
>> This doesn't look like something expected, does it?
>>
>> http://stackoverflow.com/q/38710018/1560062
>>
>> A quick glance at the UI suggests that there is a shuffle involved and
>> the input for first is a ShuffledRowRDD.
>> -- 
>> Best regards,
>> Maciej Szymkiewicz
>

-- 
Maciej Szymkiewicz


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: What happens in Dataset limit followed by rdd

2016-08-02 Thread Sun Rui
Based on your code, here is a simpler test case on Spark 2.0:

case class my (x: Int)
val rdd = sc.parallelize(0.until(1), 1000).map { x => my(x) }
val df1 = spark.createDataFrame(rdd)
val df2 = df1.limit(1)
df1.map { r => r.getAs[Int](0) }.first
df2.map { r => r.getAs[Int](0) }.first // Much slower than the previous line

Actually, Dataset.first is equivalent to Dataset.limit(1).collect, so check the 
physical plan of the two cases:

scala> df1.map { r => r.getAs[Int](0) }.limit(1).explain
== Physical Plan ==
CollectLimit 1
+- *SerializeFromObject [input[0, int, true] AS value#124]
   +- *MapElements , obj#123: int
  +- *DeserializeToObject createexternalrow(x#74, 
StructField(x,IntegerType,false)), obj#122: org.apache.spark.sql.Row
 +- Scan ExistingRDD[x#74]

scala> df2.map { r => r.getAs[Int](0) }.limit(1).explain
== Physical Plan ==
CollectLimit 1
+- *SerializeFromObject [input[0, int, true] AS value#131]
   +- *MapElements , obj#130: int
  +- *DeserializeToObject createexternalrow(x#74, 
StructField(x,IntegerType,false)), obj#129: org.apache.spark.sql.Row
 +- *GlobalLimit 1
+- Exchange SinglePartition
   +- *LocalLimit 1
  +- Scan ExistingRDD[x#74]

For the first case, it is related to an optimisation in the CollectLimitExec
physical operator. That is, it will first fetch the first partition to get the
limit number of rows, 1 in this case; if not satisfied, it then fetches more
partitions, until the desired limit is reached. So generally, if the first
partition is not empty, only the first partition will be calculated and
fetched. Other partitions will not even be computed.

However, in the second case, the optimisation in CollectLimitExec does not
help, because the previous limit operation involves a shuffle. All partitions
will be computed, LocalLimit(1) is run on each partition to get 1 row, and
then all partitions are shuffled into a single partition. CollectLimitExec
will fetch 1 row from the resulting single partition.


> On Aug 2, 2016, at 09:08, Maciej Szymkiewicz  wrote:
> 
> Hi everyone, 
> This doesn't look like something expected, does it?
> 
> http://stackoverflow.com/q/38710018/1560062 
> 
> A quick glance at the UI suggests that there is a shuffle involved and the
> input for first is a ShuffledRowRDD.
> -- 
> Best regards,
> Maciej Szymkiewicz



Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-02 Thread Nick Pentreath
Note that both HashingTF and CountVectorizer are usually used for creating
TF-IDF normalized vectors. The definition (
https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) of term frequency
in TF-IDF is actually the "number of times the term occurs in the document".

So it's perhaps a bit of a misnomer, but the implementation is correct.
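
For anyone who does want length-normalised term frequencies, here is a minimal
sketch that divides the raw counts by the document length, assuming a DataFrame
docs with a string column "text" (all column and variable names are placeholders):

import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, size, udf}

// Divide each raw term count by the number of tokens in the document.
val normalize = udf { (v: Vector, numTokens: Int) =>
  val n = math.max(numTokens, 1).toDouble   // guard against empty documents
  val sv = v.toSparse
  Vectors.sparse(sv.size, sv.indices, sv.values.map(_ / n))
}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawTF")

val counts = hashingTF.transform(tokenizer.transform(docs))
val withTF = counts.withColumn("tf", normalize(col("rawTF"), size(col("words"))))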

On Tue, 2 Aug 2016 at 05:44 Yanbo Liang  wrote:

> Hi Hao,
>
> HashingTF directly applies a hash function (MurmurHash3) to the features to
> determine their column index. It takes no account of the term
> frequency or the length of the document. It does similar work to
> sklearn's FeatureHasher. The result is increased speed and reduced memory
> usage, but it does not remember what the input features looked like and
> cannot convert the output back to the original features. Actually, we misnamed
> this transformer: it only does the work of feature hashing rather than
> computing hashing term frequency.
>
> CountVectorizer will select the top vocabSize words ordered by term
> frequency across the corpus to build the hash table of the features, so it
> will consume more memory than HashingTF. However, we can convert the output
> back to the original features.
>
> Neither of the transformers considers the length of each document. If
> you want to compute term frequency divided by the length of the document,
> you should write your own function based on the transformers provided by MLlib.
>
> Thanks
> Yanbo
>
> 2016-08-01 15:29 GMT-07:00 Hao Ren :
>
>> When computing term frequency, we can use either the HashingTF or
>> CountVectorizer feature extractors.
>> However, both of them just use the number of times that a term appears in
>> a document.
>> It is not a true frequency. Actually, it should be divided by the length
>> of the document.
>>
>> Is this the intended behaviour?
>>
>> --
>> Hao Ren
>>
>> Data Engineer @ leboncoin
>>
>> Paris, France
>>
>
>