[jira] [Commented] (SPARK-32530) SPIP: Kotlin support for Apache Spark

2022-12-05 Thread Maziyar PANAHI (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643344#comment-17643344
 ] 

Maziyar PANAHI commented on SPARK-32530:


Not sure if this matters, but as a Scala developer who primarily builds Scala 
applications that use Apache Spark natively, I strongly support the decision to 
make this an official part of the ASF.

I also agree that there is a maintenance cost; however, unlike .NET, it is much 
easier for any of us from the Java/Scala world to contribute to Kotlin. I think 
it is a price worth paying for the sake of longevity. It is clear that Java and 
Scala are not going anywhere, but they are not the first choice for newcomers 
either. More native JVM languages like Kotlin can really help bring more users 
and contributors to the Spark ecosystem in the long term.

> SPIP: Kotlin support for Apache Spark
> -
>
> Key: SPARK-32530
> URL: https://issues.apache.org/jira/browse/SPARK-32530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Pasha Finkeshteyn
>Priority: Major
>
> h2. Background and motivation
> Kotlin is a cross-platform, statically typed, general-purpose JVM language. 
> In the last year more than 5 million developers have used Kotlin in mobile, 
> backend, frontend and scientific development. The number of Kotlin developers 
> grows rapidly every year. 
>  * [According to 
> redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: 
> "Kotlin, the second fastest growing language we’ve seen outside of Swift, 
> made a big splash a year ago at this time when it vaulted eight full spots up 
> the list."
>  * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], 
> Kotlin is the second most popular language on the JVM
>  * [According to 
> StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share 
> increased by 7.8% in 2020.
> We notice the increasing usage of Kotlin in data analysis ([6% of users in 
> 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to 
> 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in 
> 2019), and we expect these numbers to continue to grow. 
> We, the authors of this SPIP, strongly believe that making a Kotlin API officially 
> available to developers can bring new users to Apache Spark and help some of 
> the existing users.
> h2. Goals
> The goal of this project is to bring first-class support for the Kotlin language 
> into the Apache Spark project. We are going to achieve this by adding one more 
> module to the current Apache Spark distribution.
> h2. Non-goals
> There is no goal to replace any existing language support or to change any 
> existing Apache Spark API.
> At this time, there is no goal to support non-core APIs of Apache Spark like 
> Spark ML and Spark structured streaming. This may change in the future based 
> on community feedback.
> There is no goal to provide a CLI for Kotlin for Apache Spark; this will be a 
> separate SPIP.
> There is no goal to provide support for Apache Spark < 3.0.0.
> h2. Current implementation
> A working prototype is available at 
> [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside 
> JetBrains and by early adopters.
> h2. What are the risks?
> There is always a risk that this product won't gain enough popularity and will 
> bring more costs than benefits. This is mitigated by the fact that we don't 
> need to change any existing API, and support can potentially be dropped at any 
> time.
> We also believe that the existing API is rather low-maintenance. It does not 
> introduce anything more complex than what already exists in the Spark codebase. 
> Furthermore, the implementation is compact: less than 2000 lines of code.
> We are committed to maintaining, improving, and evolving the API based on 
> feedback from both the Spark and Kotlin communities. As the Kotlin data 
> community continues to grow, we see the Kotlin API for Apache Spark as an 
> important part of the evolving Kotlin ecosystem, and we intend to fully support it.
> h2. How long will it take?
> A working implementation is already available, and if the community proposes 
> changes or improvements to it, these can be implemented quickly, in weeks if 
> not days.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2022-03-17 Thread Maziyar PANAHI (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-35066:
---
Description: 
Hi,

The following code snippet runs 4-5 times slower in Apache Spark or PySpark 
3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, count, col
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000m") \
    .getOrCreate()

Toys = spark.read \
    .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText", outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected, so all 12 tasks are processed in parallel. However, 
in Spark/PySpark 3.1.1, even though we still have 12 tasks, 10 of them finish 
immediately and only 2 actually process data. (I have tried disabling a couple 
of configs related to similar behavior, but none of them helped.)
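
A quick way to see whether the rows really end up spread across the 12 
partitions, or whether most of them land in only one or two, is to count rows 
per partition with spark_partition_id(). This is only a diagnostic sketch and 
assumes the same toys_with_tokens DataFrame as in the snippet above:

{code:python}
from pyspark.sql.functions import spark_partition_id

# Diagnostic only: count how many rows each of the 12 partitions actually holds.
# A heavily skewed distribution here would explain why only 2 tasks do real work.
toys_with_tokens \
    .withColumn("partition_id", spark_partition_id()) \
    .groupBy("partition_id") \
    .count() \
    .orderBy("partition_id") \
    .show(12, truncate=False)
{code}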

Screenshot of spark 3.1.1 task:

!image-2022-03-17-17-18-36-793.png|width=1073,height=652!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of spark 3.0.2 task:

 

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html] (no longer exists!)

 

You can reproduce this large performance difference between Spark 3.1.1 and 
Spark 3.0.2 by running the shared code on any dataset that is large enough to 
take longer than a minute. I am not sure whether this is related to SQL, to a 
Spark config that exists in 3.x but only takes effect in 3.1.1, or to 
.transform in Spark ML.
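
Since one suspicion is that a configuration default changed between 3.0.2 and 
3.1.1, it may also help to dump a few SQL/shuffle settings on both versions and 
diff them. The keys below are only a starting point for that comparison, not a 
confirmed root cause:

{code:python}
# Print a few shuffle / adaptive-execution settings to compare between 3.0.2 and 3.1.1.
# These keys are just candidates to inspect, not a known explanation for the slowdown.
keys = [
    "spark.sql.adaptive.enabled",
    "spark.sql.adaptive.coalescePartitions.enabled",
    "spark.sql.shuffle.partitions",
    "spark.sql.files.maxPartitionBytes",
]
for k in keys:
    print(k, "=", spark.conf.get(k, "<not set>"))
{code}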

  was:
Hi,

The following code snippet runs 4-5 times slower in Apache Spark or PySpark 
3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected in a way that all 12 tasks are being processed 
altogether. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
them finish immediately and only 2 are being processed. (I've tried to disable 
a couple of configs related to something similar, but none of them worked)

Screenshot of spark 3.1.1 task:

!image-2022-03-17-17-18-36-793.png|width=1073,height=652!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of spark 3.0.2 task:

 

 

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html] (no longer exists!)

 

You can reproduce this big difference of performance between Spark 3.1.1 and 
Spark 3.0.2 by using the shared code with any dataset that is large enough to 
take longer than a minute. Not sure if this is related to SQL, any Spark config 
being enabled in 3.x but not really into action before 3.1.1, or it's about 
.transform in Spark ML.


> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.1.1
> 

[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2022-03-17 Thread Maziyar PANAHI (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-35066:
---
Description: 
Hi,

The following code snippet runs 4-5 times slower in Apache Spark or PySpark 
3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected in a way that all 12 tasks are being processed 
altogether. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
them finish immediately and only 2 are being processed. (I've tried to disable 
a couple of configs related to something similar, but none of them worked)

Screenshot of spark 3.1.1 task:

!image-2022-03-17-17-18-36-793.png|width=1073,height=652!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of spark 3.0.2 task:

 

 

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html] (no longer exists!)

 

You can reproduce this big difference of performance between Spark 3.1.1 and 
Spark 3.0.2 by using the shared code with any dataset that is large enough to 
take longer than a minute. Not sure if this is related to SQL, any Spark config 
being enabled in 3.x but not really into action before 3.1.1, or it's about 
.transform in Spark ML.

  was:
Hi,

The following code snippet runs 4-5 times slower in Apache Spark or PySpark 
3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected in a way that all 12 tasks are being processed 
altogether. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
them finish immediately and only 2 are being processed. (I've tried to disable 
a couple of configs related to something similar, but none of them worked)

Screenshot of spark 3.1.1 task:

!image-2022-03-17-17-18-36-793.png!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of spark 3.0.2 task:

 

 

 

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User 
List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]

 

You can reproduce this big difference of performance between Spark 3.1.1 and 
Spark 3.0.2 by using the shared code with any dataset that is large enough to 
take longer than a minute. Not sure if this is related to SQL, any Spark config 
being enabled in 3.x but not really into action before 3.1.1, or it's about 
.transform in Spark ML.


> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.1.1
> Environment: Spark/PySpark: 3.1.1
> 

[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2022-03-17 Thread Maziyar PANAHI (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-35066:
---
Description: 
Hi,

The following code snippet runs 4-5 times slower in Apache Spark or PySpark 
3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected in a way that all 12 tasks are being processed 
altogether. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
them finish immediately and only 2 are being processed. (I've tried to disable 
a couple of configs related to something similar, but none of them worked)

Screenshot of spark 3.1.1 task:

!image-2022-03-17-17-18-36-793.png!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of spark 3.0.2 task:

 

 

 

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User 
List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]

 

You can reproduce this big difference of performance between Spark 3.1.1 and 
Spark 3.0.2 by using the shared code with any dataset that is large enough to 
take longer than a minute. Not sure if this is related to SQL, any Spark config 
being enabled in 3.x but not really into action before 3.1.1, or it's about 
.transform in Spark ML.

  was:
Hi,

The following code snippet runs 4-5 times slower in Apache Spark or PySpark 
3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected in a way that all 12 tasks are being processed all 
together. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
them finish immediately and only 2 are being processed. (I've tried to disable 
a couple of configs related to something similar, but none of them worked)

Screenshot of spark 3.1.1 task: [Spark UI 
3.1.1|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png]

Screenshot of spark 3.0.2 task: [Spark UI 
3.0.2|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png]

For a longer discussion: [Spark User 
List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]

 

You can reproduce this big difference of performance between Spark 3.1.1 and 
Spark 3.0.2 by using the shared code with any dataset that is large enough to 
take longer than a minute. Not sure if this is related to SQL, any Spark config 
being enabled in 3.x but not really into action before 3.1.1, or it's about 
.transform in Spark ML.


> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
> 

[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2022-03-17 Thread Maziyar PANAHI (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-35066:
---
Attachment: image-2022-03-17-17-19-34-906.png

> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.1.1
> Environment: Spark/PySpark: 3.1.1
> Language: Python 3.7.x / Scala 12
> OS: macOS, Linux, and Windows
> Cloud: Databricks 7.3 for 3.0.1 and 8 for 3.1.1
>Reporter: Maziyar PANAHI
>Priority: Major
> Attachments: Screenshot 2021-04-07 at 11.15.48.png, Screenshot 
> 2021-04-08 at 15.08.09.png, Screenshot 2021-04-08 at 15.13.19-1.png, 
> Screenshot 2021-04-08 at 15.13.19.png, image-2022-03-17-17-18-36-793.png, 
> image-2022-03-17-17-19-11-655.png, image-2022-03-17-17-19-34-906.png
>
>
> Hi,
> The following code snippet runs 4-5 times slower in Apache Spark or PySpark 
> 3.1.1 compared to Apache Spark or PySpark 3.0.2:
>  
> {code:java}
> spark = SparkSession.builder \
>         .master("local[*]") \
>         .config("spark.driver.memory", "16G") \
>         .config("spark.driver.maxResultSize", "0") \
>         .config("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer") \
>         .config("spark.kryoserializer.buffer.max", "2000m") \
>         .getOrCreate()
> Toys = spark.read \
>   .parquet('./toys-cleaned').repartition(12)
> # tokenize the text
> regexTokenizer = RegexTokenizer(inputCol="reviewText",
> outputCol="all_words", pattern="\\W")
> toys_with_words = regexTokenizer.transform(Toys)
> # remove stop words
> remover = StopWordsRemover(inputCol="all_words", outputCol="words")
> toys_with_tokens = remover.transform(toys_with_words).drop("all_words")
> all_words = toys_with_tokens.select(explode("words").alias("word"))
> # group by, sort and limit to 50k
> top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)
> top50k.show()
> {code}
>  
> Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
> partitions are respected in a way that all 12 tasks are being processed all 
> together. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
> them finish immediately and only 2 are being processed. (I've tried to 
> disable a couple of configs related to something similar, but none of them 
> worked)
> Screenshot of spark 3.1.1 task: [Spark UI 
> 3.1.1|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png]
> Screenshot of spark 3.0.2 task: [Spark UI 
> 3.0.2|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png]
> For a longer discussion: [Spark User 
> List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]
>  
> You can reproduce this big difference of performance between Spark 3.1.1 and 
> Spark 3.0.2 by using the shared code with any dataset that is large enough to 
> take longer than a minute. Not sure if this is related to SQL, any Spark 
> config being enabled in 3.x but not really into action before 3.1.1, or it's 
> about .transform in Spark ML.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2022-03-17 Thread Maziyar PANAHI (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-35066:
---
Attachment: image-2022-03-17-17-19-11-655.png

> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.1.1
> Environment: Spark/PySpark: 3.1.1
> Language: Python 3.7.x / Scala 12
> OS: macOS, Linux, and Windows
> Cloud: Databricks 7.3 for 3.0.1 and 8 for 3.1.1
>Reporter: Maziyar PANAHI
>Priority: Major
> Attachments: Screenshot 2021-04-07 at 11.15.48.png, Screenshot 
> 2021-04-08 at 15.08.09.png, Screenshot 2021-04-08 at 15.13.19-1.png, 
> Screenshot 2021-04-08 at 15.13.19.png, image-2022-03-17-17-18-36-793.png, 
> image-2022-03-17-17-19-11-655.png, image-2022-03-17-17-19-34-906.png
>
>
> Hi,
> The following code snippet runs 4-5 times slower in Apache Spark or PySpark 
> 3.1.1 compared to Apache Spark or PySpark 3.0.2:
>  
> {code:java}
> spark = SparkSession.builder \
>         .master("local[*]") \
>         .config("spark.driver.memory", "16G") \
>         .config("spark.driver.maxResultSize", "0") \
>         .config("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer") \
>         .config("spark.kryoserializer.buffer.max", "2000m") \
>         .getOrCreate()
> Toys = spark.read \
>   .parquet('./toys-cleaned').repartition(12)
> # tokenize the text
> regexTokenizer = RegexTokenizer(inputCol="reviewText",
> outputCol="all_words", pattern="\\W")
> toys_with_words = regexTokenizer.transform(Toys)
> # remove stop words
> remover = StopWordsRemover(inputCol="all_words", outputCol="words")
> toys_with_tokens = remover.transform(toys_with_words).drop("all_words")
> all_words = toys_with_tokens.select(explode("words").alias("word"))
> # group by, sort and limit to 50k
> top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)
> top50k.show()
> {code}
>  
> Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
> partitions are respected in a way that all 12 tasks are being processed all 
> together. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
> them finish immediately and only 2 are being processed. (I've tried to 
> disable a couple of configs related to something similar, but none of them 
> worked)
> Screenshot of spark 3.1.1 task: [Spark UI 
> 3.1.1|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png]
> Screenshot of spark 3.0.2 task: [Spark UI 
> 3.0.2|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png]
> For a longer discussion: [Spark User 
> List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]
>  
> You can reproduce this big difference of performance between Spark 3.1.1 and 
> Spark 3.0.2 by using the shared code with any dataset that is large enough to 
> take longer than a minute. Not sure if this is related to SQL, any Spark 
> config being enabled in 3.x but not really into action before 3.1.1, or it's 
> about .transform in Spark ML.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2022-03-17 Thread Maziyar PANAHI (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-35066:
---
Attachment: Screenshot 2021-04-08 at 15.13.19-1.png

> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.1.1
> Environment: Spark/PySpark: 3.1.1
> Language: Python 3.7.x / Scala 12
> OS: macOS, Linux, and Windows
> Cloud: Databricks 7.3 for 3.0.1 and 8 for 3.1.1
>Reporter: Maziyar PANAHI
>Priority: Major
> Attachments: Screenshot 2021-04-07 at 11.15.48.png, Screenshot 
> 2021-04-08 at 15.08.09.png, Screenshot 2021-04-08 at 15.13.19-1.png, 
> Screenshot 2021-04-08 at 15.13.19.png, image-2022-03-17-17-18-36-793.png
>
>
> Hi,
> The following code snippet runs 4-5 times slower in Apache Spark or PySpark 
> 3.1.1 compared to Apache Spark or PySpark 3.0.2:
>  
> {code:java}
> spark = SparkSession.builder \
>         .master("local[*]") \
>         .config("spark.driver.memory", "16G") \
>         .config("spark.driver.maxResultSize", "0") \
>         .config("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer") \
>         .config("spark.kryoserializer.buffer.max", "2000m") \
>         .getOrCreate()
> Toys = spark.read \
>   .parquet('./toys-cleaned').repartition(12)
> # tokenize the text
> regexTokenizer = RegexTokenizer(inputCol="reviewText",
> outputCol="all_words", pattern="\\W")
> toys_with_words = regexTokenizer.transform(Toys)
> # remove stop words
> remover = StopWordsRemover(inputCol="all_words", outputCol="words")
> toys_with_tokens = remover.transform(toys_with_words).drop("all_words")
> all_words = toys_with_tokens.select(explode("words").alias("word"))
> # group by, sort and limit to 50k
> top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)
> top50k.show()
> {code}
>  
> Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
> partitions are respected in a way that all 12 tasks are being processed all 
> together. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
> them finish immediately and only 2 are being processed. (I've tried to 
> disable a couple of configs related to something similar, but none of them 
> worked)
> Screenshot of spark 3.1.1 task: [Spark UI 
> 3.1.1|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png]
> Screenshot of spark 3.0.2 task: [Spark UI 
> 3.0.2|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png]
> For a longer discussion: [Spark User 
> List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]
>  
> You can reproduce this big difference of performance between Spark 3.1.1 and 
> Spark 3.0.2 by using the shared code with any dataset that is large enough to 
> take longer than a minute. Not sure if this is related to SQL, any Spark 
> config being enabled in 3.x but not really into action before 3.1.1, or it's 
> about .transform in Spark ML.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2022-03-17 Thread Maziyar PANAHI (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-35066:
---
Attachment: image-2022-03-17-17-18-36-793.png

> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.1.1
> Environment: Spark/PySpark: 3.1.1
> Language: Python 3.7.x / Scala 12
> OS: macOS, Linux, and Windows
> Cloud: Databricks 7.3 for 3.0.1 and 8 for 3.1.1
>Reporter: Maziyar PANAHI
>Priority: Major
> Attachments: Screenshot 2021-04-07 at 11.15.48.png, Screenshot 
> 2021-04-08 at 15.08.09.png, Screenshot 2021-04-08 at 15.13.19-1.png, 
> Screenshot 2021-04-08 at 15.13.19.png, image-2022-03-17-17-18-36-793.png
>
>
> Hi,
> The following code snippet runs 4-5 times slower in Apache Spark or PySpark 
> 3.1.1 compared to Apache Spark or PySpark 3.0.2:
>  
> {code:java}
> spark = SparkSession.builder \
>         .master("local[*]") \
>         .config("spark.driver.memory", "16G") \
>         .config("spark.driver.maxResultSize", "0") \
>         .config("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer") \
>         .config("spark.kryoserializer.buffer.max", "2000m") \
>         .getOrCreate()
> Toys = spark.read \
>   .parquet('./toys-cleaned').repartition(12)
> # tokenize the text
> regexTokenizer = RegexTokenizer(inputCol="reviewText",
> outputCol="all_words", pattern="\\W")
> toys_with_words = regexTokenizer.transform(Toys)
> # remove stop words
> remover = StopWordsRemover(inputCol="all_words", outputCol="words")
> toys_with_tokens = remover.transform(toys_with_words).drop("all_words")
> all_words = toys_with_tokens.select(explode("words").alias("word"))
> # group by, sort and limit to 50k
> top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)
> top50k.show()
> {code}
>  
> Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
> partitions are respected in a way that all 12 tasks are being processed all 
> together. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
> them finish immediately and only 2 are being processed. (I've tried to 
> disable a couple of configs related to something similar, but none of them 
> worked)
> Screenshot of spark 3.1.1 task: [Spark UI 
> 3.1.1|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png]
> Screenshot of spark 3.0.2 task: [Spark UI 
> 3.0.2|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png]
> For a longer discussion: [Spark User 
> List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]
>  
> You can reproduce this big difference of performance between Spark 3.1.1 and 
> Spark 3.0.2 by using the shared code with any dataset that is large enough to 
> take longer than a minute. Not sure if this is related to SQL, any Spark 
> config being enabled in 3.x but not really into action before 3.1.1, or it's 
> about .transform in Spark ML.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2021-04-14 Thread Maziyar PANAHI (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-35066:
---
Attachment: Screenshot 2021-04-08 at 15.13.19.png
Screenshot 2021-04-08 at 15.08.09.png
Screenshot 2021-04-07 at 11.15.48.png

> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> -
>
> Key: SPARK-35066
> URL: https://issues.apache.org/jira/browse/SPARK-35066
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 3.1.1
> Environment: Spark/PySpark: 3.1.1
> Language: Python 3.7.x / Scala 12
> OS: macOS, Linux, and Windows
> Cloud: Databricks 7.3 for 3.0.1 and 8 for 3.1.1
>Reporter: Maziyar PANAHI
>Priority: Major
> Attachments: Screenshot 2021-04-07 at 11.15.48.png, Screenshot 
> 2021-04-08 at 15.08.09.png, Screenshot 2021-04-08 at 15.13.19.png
>
>
> Hi,
> The following code snippet runs 4-5 times slower in Apache Spark or PySpark 
> 3.1.1 compared to Apache Spark or PySpark 3.0.2:
>  
> {code:java}
> spark = SparkSession.builder \
>         .master("local[*]") \
>         .config("spark.driver.memory", "16G") \
>         .config("spark.driver.maxResultSize", "0") \
>         .config("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer") \
>         .config("spark.kryoserializer.buffer.max", "2000m") \
>         .getOrCreate()
> Toys = spark.read \
>   .parquet('./toys-cleaned').repartition(12)
> # tokenize the text
> regexTokenizer = RegexTokenizer(inputCol="reviewText",
> outputCol="all_words", pattern="\\W")
> toys_with_words = regexTokenizer.transform(Toys)
> # remove stop words
> remover = StopWordsRemover(inputCol="all_words", outputCol="words")
> toys_with_tokens = remover.transform(toys_with_words).drop("all_words")
> all_words = toys_with_tokens.select(explode("words").alias("word"))
> # group by, sort and limit to 50k
> top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)
> top50k.show()
> {code}
>  
> Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
> partitions are respected in a way that all 12 tasks are being processed all 
> together. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
> them finish immediately and only 2 are being processed. (I've tried to 
> disable a couple of configs related to something similar, but none of them 
> worked)
> Screenshot of spark 3.1.1 task: [Spark UI 
> 3.1.1|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png]
> Screenshot of spark 3.0.2 task: [Spark UI 
> 3.0.2|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png]
> For a longer discussion: [Spark User 
> List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]
>  
> You can reproduce this big difference of performance between Spark 3.1.1 and 
> Spark 3.0.2 by using the shared code with any dataset that is large enough to 
> take longer than a minute. Not sure if this is related to SQL, any Spark 
> config being enabled in 3.x but not really into action before 3.1.1, or it's 
> about .transform in Spark ML.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

2021-04-14 Thread Maziyar PANAHI (Jira)
Maziyar PANAHI created SPARK-35066:
--

 Summary: Spark 3.1.1 is slower than 3.0.2 by 4-5 times
 Key: SPARK-35066
 URL: https://issues.apache.org/jira/browse/SPARK-35066
 Project: Spark
  Issue Type: Bug
  Components: ML, SQL
Affects Versions: 3.1.1
 Environment: Spark/PySpark: 3.1.1

Language: Python 3.7.x / Scala 12

OS: macOS, Linux, and Windows

Cloud: Databricks 7.3 for 3.0.1 and 8 for 3.1.1
Reporter: Maziyar PANAHI


Hi,

The following code snippet runs 4-5 times slower in Apache Spark or PySpark 
3.1.1 compared to Apache Spark or PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected in a way that all 12 tasks are being processed all 
together. However, in Spark/PySpark 3.1.1 even though we have 12 tasks, 10 of 
them finish immediately and only 2 are being processed. (I've tried to disable 
a couple of configs related to something similar, but none of them worked)

Screenshot of spark 3.1.1 task: [Spark UI 
3.1.1|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png]

Screenshot of spark 3.0.2 task: [Spark UI 
3.0.2|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png]

For a longer discussion: [Spark User 
List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]

 

You can reproduce this big difference of performance between Spark 3.1.1 and 
Spark 3.0.2 by using the shared code with any dataset that is large enough to 
take longer than a minute. Not sure if this is related to SQL, any Spark config 
being enabled in 3.x but not really into action before 3.1.1, or it's about 
.transform in Spark ML.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26101) Spark Pipe() executes the external app by yarn username not the current username

2019-02-12 Thread Maziyar PANAHI (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766471#comment-16766471
 ] 

Maziyar PANAHI commented on SPARK-26101:


I have worked around this issue, as described here:

[https://stackoverflow.com/a/53395055/1449151]

 

However, I still believe that whichever application requests the YARN container 
should be responsible for passing the user as well. In this case, it is Spark 
that asks YARN for another container, not me in a separate application. If 
Spark's YARN containers run as the user who submitted the job, RDD.pipe() should 
follow the same logic and not expect user-specific configuration on the YARN 
side itself.
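
For what it is worth, one thing to try (shown here as a PySpark sketch, using 
the same sc as in the repro below; it assumes the external application, or the 
HDFS client inside it, honors HADOOP_USER_NAME, which is an assumption rather 
than a confirmed fix) is to pass the submitting user's name into the piped 
process through the env argument of pipe():

{code:python}
import getpass

# Sketch: propagate the submitting user's name to the piped process via pipe()'s
# env argument. Whether the external app (or its HDFS client) honors
# HADOOP_USER_NAME depends on the cluster's security setup, so treat this as a
# workaround attempt, not a guaranteed fix.
test = sc.parallelize(["test user"]).repartition(1)
piped = test.pipe("printenv HADOOP_USER_NAME",
                  env={"HADOOP_USER_NAME": getpass.getuser()})
print(piped.collect())
{code}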

> Spark Pipe() executes the external app by yarn username not the current 
> username
> 
>
> Key: SPARK-26101
> URL: https://issues.apache.org/jira/browse/SPARK-26101
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Maziyar PANAHI
>Priority: Major
>
> Hello,
> I am using *Spark 2.3.0.cloudera3* on a Cloudera cluster. When I start my Spark 
> session (Zeppelin, Shell, or spark-submit), my real username is impersonated 
> successfully. That allows YARN to use the right queue based on the username, 
> and HDFS knows the permissions. (These all work perfectly, meaning the cluster 
> has been set up and configured for user impersonation.)
> Example (running Spark by user panahi with YARN as a master):
> {code:java}
>  
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
> 18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
> with view permissions: Set();
> users with modify permissions: Set(panahi); groups with modify permissions: 
> Set()
> ...
> 18/11/17 13:55:52 INFO yarn.Client:
> client token: N/A
> diagnostics: N/A
> ApplicationMaster host: N/A
> ApplicationMaster RPC port: -1
> queue: root.multivac
> start time: 1542459353040
> final status: UNDEFINED
> tracking URL: 
> http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
> user: panahi
> {code}
>  
> However, when I use *Spark RDD Pipe()*, the external process is executed as the 
> `*yarn*` user. This makes it impossible to use an external app, such as a C/C++ 
> application, that needs read/write access to HDFS, because the user `*yarn*` 
> does not have permissions on the user's directory. (It also causes other 
> security and resource management issues, since all external apps run as the 
> yarn user.)
> *How to produce this issue:*
> {code:java}
> val test = sc.parallelize(Seq("test user")).repartition(1)
> val piped = test.pipe(Seq("whoami"))
> val c = piped.collect()
> result:
> test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at <console>:37
> piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at <console>:37
> c: Array[String] = Array(yarn)
> {code}
>  
> I believe that, since Spark is the actor that invokes this execution inside the 
> YARN cluster, Spark should respect the actual/current username. Or maybe there 
> is another config for impersonation between Spark and YARN in this situation, 
> but I haven't found any.
>  
> Many thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn username not the current username

2018-11-17 Thread Maziyar PANAHI (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-26101:
---
Summary: Spark Pipe() executes the external app by yarn username not the 
current username  (was: Spark Pipe() executes the external app by yarn user not 
the real user)

> Spark Pipe() executes the external app by yarn username not the current 
> username
> 
>
> Key: SPARK-26101
> URL: https://issues.apache.org/jira/browse/SPARK-26101
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Maziyar PANAHI
>Priority: Major
>
> Hello,
> I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark 
> session (Zeppelin, Shell, or spark-submit) my real username is being 
> impersonated successfully. That allows YARN to use the right queue based on 
> the username, also HDFS knows the permissions. (These all work perfectly 
> without any problem. Meaning the cluster has been set up and configured for 
> user impersonation)
> Example (running Spark by user panahi with YARN as a master):
> {code:java}
>  
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
> 18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
> with view permissions: Set();
> users with modify permissions: Set(panahi); groups with modify permissions: 
> Set()
> ...
> 18/11/17 13:55:52 INFO yarn.Client:
> client token: N/A
> diagnostics: N/A
> ApplicationMaster host: N/A
> ApplicationMaster RPC port: -1
> queue: root.multivac
> start time: 1542459353040
> final status: UNDEFINED
> tracking URL: 
> http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
> user: panahi
> {code}
>  
> However, when I use *Spark RDD Pipe()* it is being executed as `*yarn*` user. 
> This makes it impossible to use an external app such as `c/c++` application 
> that needs read/write access to HDFS because the user `*yarn*` does not have 
> permissions on the user's directory. (also other security and resource 
> management issues by executing all the external apps as yarn username)
> *How to produce this issue:*
> {code:java}
> val test = sc.parallelize(Seq("test user")).repartition(1)
> val piped = test.pipe(Seq("whoami"))
> val c = piped.collect()
> result:
> test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition 
> at :37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at 
> pipe at :37 c: Array[String] = Array(yarn) 
> {code}
>  
> I believe since Spark is the key actor to invoke this execution inside YARN 
> cluster, Spark needs to respect the actual/current username. Or maybe there 
> is another config for impersonation between Spark and YARN in this situation, 
> but I haven't found any.
>  
> Many thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user

2018-11-17 Thread Maziyar PANAHI (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-26101:
---
Description: 
Hello,

 

I am using *Spark 2.3.0.cloudera3* on a Cloudera cluster. When I start my Spark 
session (Zeppelin, Shell, or spark-submit), my real username is impersonated 
successfully. That allows YARN to use the right queue based on the username, and 
HDFS knows the permissions. (These all work perfectly, meaning the cluster has 
been set up and configured for user impersonation.)

Example (running Spark by user `panahi` with YARN as a master):
{code:java}
 
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
with view permissions: Set();
users with modify permissions: Set(panahi); groups with modify permissions: 
Set()
...
18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
user: panahi
{code}
 

However, when I use *Spark RDD Pipe()*, the external process is executed as the 
`*yarn*` user. This makes it impossible to use an external app, such as a C/C++ 
application, that needs read/write access to HDFS, because the user `*yarn*` 
does not have permissions on the user's directory. (It also causes other 
security and resource management issues, since all external apps run as the 
yarn user.)

*How to produce this issue:*
{code:java}
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
result:
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at <console>:37
piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at <console>:37
c: Array[String] = Array(yarn)
{code}
 

I believe that, since Spark is the actor that invokes this execution inside the 
YARN cluster, Spark should respect the actual/current username. Or maybe there 
is another config for impersonation between Spark and YARN in this situation, 
but I haven't found any.

 

Many thanks.

  was:
Hello,

 

I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark 
session (Zeppelin, Shell, or spark-submit) my real username is being 
impersonated successfully. That allows YARN to use the right queue based on the 
username, also HDFS knows the permissions.

Example (running Spark by user `panahi`):

 
{code:java}
 
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
with view permissions: Set();
users with modify permissions: Set(panahi); groups with modify permissions: 
Set()
...
18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
user: panahi
{code}
 

However, when I use Spark RDD Pipe() it is being executed as `yarn` user. This 
makes it impossible to use a `c/c++` application that needs read/write access 
to HDFS because the user `yarn` does not have permissions on the user's 
directory.

How to produce this issue:
{code:java}
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
result:
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at 
:37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at 
:37 c: Array[String] = Array(yarn) 
{code}
 

I believe since Spark is the key actor to invoke this execution inside YARN 
cluster, Spark needs to respect the actual/current username. Or maybe there is 
another config for impersonation between Spark and YARN in this situation, but 
I haven't found any.

 

Many thanks.


> Spark Pipe() executes the external app by yarn user not the real user
> -
>
> Key: SPARK-26101
> URL: https://issues.apache.org/jira/browse/SPARK-26101
> 

[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user

2018-11-17 Thread Maziyar PANAHI (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-26101:
---
Description: 
Hello,

 

I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark 
session (Zeppelin, Shell, or spark-submit) my real username is being 
impersonated successfully. That allows YARN to use the right queue based on the 
username, also HDFS knows the permissions.

Example (running Spark by user `panahi`):

 
{code:java}
 
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
with view permissions: Set();
users with modify permissions: Set(panahi); groups with modify permissions: 
Set()
...
18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
user: panahi
{code}
 

However, when I use Spark RDD Pipe() it is being executed as `yarn` user. This 
makes it impossible to use a `c/c++` application that needs read/write access 
to HDFS because the user `yarn` does not have permissions on the user's 
directory.

How to produce this issue:
{code:java}
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
result:
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at 
:37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at 
:37 c: Array[String] = Array(yarn) 
{code}
 

I believe since Spark is the key actor to invoke this execution inside YARN 
cluster, Spark needs to respect the actual/current username. Or maybe there is 
another config for impersonation between Spark and YARN in this situation, but 
I haven't found any.

 

Many thanks.


> Spark Pipe() executes the external app by yarn user not the real user
> -
>
> Key: SPARK-26101
> URL: https://issues.apache.org/jira/browse/SPARK-26101
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Maziyar PANAHI
>Priority: Major
>
> Hello,
>  
> I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark 
> session 

[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user

2018-11-17 Thread Maziyar PANAHI (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-26101:
---
Description: 
Hello,

I am using *Spark 2.3.0.cloudera3* on a Cloudera cluster. When I start my Spark 
session (Zeppelin, shell, or spark-submit), my real username is impersonated 
successfully. That allows YARN to use the right queue based on the username, and 
HDFS knows the permissions. (These all work perfectly without any problem, 
meaning the cluster has been set up and configured for user impersonation.)

Example (running Spark as user panahi with YARN as the master):
{code:java}
 
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
with view permissions: Set();
users with modify permissions: Set(panahi); groups with modify permissions: 
Set()
...
18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
user: panahi
{code}
 

However, when I use *Spark RDD Pipe()*, the external process is executed as the `*yarn*` user. 
This makes it impossible to use an external app, such as a `c/c++` application, 
that needs read/write access to HDFS, because the user `*yarn*` does not have 
permissions on the user's directory. (Executing all the external apps as the 
yarn user also causes other security and resource-management issues.)

*How to produce this issue:*
{code:java}
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
result:
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at <console>:37
piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at <console>:37
c: Array[String] = Array(yarn)
{code}
 

I believe that since Spark is the actor that invokes this execution inside the YARN 
cluster, Spark needs to respect the actual/current username. Or maybe there is 
another configuration for impersonation between Spark and YARN in this situation, but 
I haven't found any.

 

Many thanks.


[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user

2018-11-17 Thread Maziyar PANAHI (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-26101:
---
Description: 
Hello,

 

I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark 
session (Zeppelin, Shell, or spark-submit) my real username is being 
impersonated successfully. That allows YARN to use the right queue based on the 
username, also HDFS knows the permissions.

Example (running Spark by user `panahi`):

```

18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
with view permissions: Set();
users with modify permissions: Set(*panahi*); groups with modify permissions: 
Set()

...

18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: 
[http://hadoop-master-1:8088/proxy/application_1542456252041_0006/]
user: *panahi*

```

However, when I use Spark RDD Pipe() it is being executed as `yarn` user. This 
makes it impossible to use a `c/c++` application that needs read/write access 
to HDFS because the user `yarn` does not have permissions on the user's 
directory.

How to produce this issue:

```

val test = sc.parallelize(Seq("test user")).repartition(1)

val piped = test.pipe(Seq("whoami"))

val c = piped.collect()

*result:*

test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at 
:37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at 
:37 c: Array[String] = Array(*yarn*)

```

I believe since Spark is the key actor to invoke this execution inside YARN 
cluster, Spark needs to respect the actual/current username. Or maybe there is 
another config for impersonation between Spark and YARN in this situation, but 
I haven't found any.

 

Many thanks.


> Spark Pipe() executes the external app by yarn user not the real user
> -
>
> Key: SPARK-26101
> URL: https://issues.apache.org/jira/browse/SPARK-26101
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Maziyar PANAHI
>Priority: Major
>
> Hello,
>  
> I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark 
> session 

[jira] [Created] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user

2018-11-17 Thread Maziyar PANAHI (JIRA)
Maziyar PANAHI created SPARK-26101:
--

 Summary: Spark Pipe() executes the external app by yarn user not 
the real user
 Key: SPARK-26101
 URL: https://issues.apache.org/jira/browse/SPARK-26101
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.3.0
Reporter: Maziyar PANAHI


Hello,

 

I am using `Spark 2.3.0.cloudera3` on Cloudera cluster. When I start my Spark 
session (Zeppelin, Shell, or spark-submit) my real username is being 
impersonated successfully. That allows YARN to use the right queue based on the 
username, also HDFS knows the permissions.

Example (running Spark by user `panahi`):

```

18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
with view permissions: Set();
users with modify permissions: Set(*panahi*); groups with modify permissions: 
Set()

...

18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
user: *panahi*

```

However, when I use Spark RDD Pipe() it is being executed as `yarn` user. This 
makes it impossible to use a `c/c++` application that needs read/write access 
to HDFS because the user `yarn` does not have permissions on the user's 
directory.

How to produce this issue:

```scala

val test = sc.parallelize(Seq("test user")).repartition(1)

val piped = test.pipe(Seq("whoami"))

val c = piped.collect()

*result:*

test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at 
:37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at 
:37 c: Array[String] = Array(*yarn*)

```

I believe since Spark is the key actor to invoke this execution inside YARN 
cluster, Spark needs to respect the actual/current username. Or maybe there is 
another config for impersonation between Spark and YARN in this situation, but 
I haven't found any.

 

Many thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22380) Upgrade protobuf-java (com.google.protobuf) version from 2.5.0 to 3.4.0

2017-10-28 Thread Maziyar PANAHI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223700#comment-16223700
 ] 

Maziyar PANAHI commented on SPARK-22380:


Hi Sean,

Thanks for your reply. In fact, I am looking more for a workaround for this issue than 
for upgrading this dependency in Hadoop, since the current version is compatible with 
all the builds, as you said.

Could you show me how to shade dependencies for Hadoop, or how to add version 
3.4 and ignore 2.5 inside the Spark app? I am using the Cloudera distribution of 
Spark 2, to be precise.
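
For the second option, what I have in mind is roughly the following sketch, assuming 
the application is packaged with sbt-assembly (the shaded package name is arbitrary, 
and the same idea maps to a relocation rule in maven-shade-plugin):
{code:java}
// build.sbt -- rough sketch, assuming the app is built as a fat jar with sbt-assembly.
// Bundle protobuf-java 3.4.0 for CoreNLP and relocate it so it cannot clash with the
// 2.5.0 copy that Spark/Hadoop provide on the cluster classpath.
libraryDependencies += "com.google.protobuf" % "protobuf-java" % "3.4.0"

assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "shaded.com.google.protobuf.@1").inAll
)
{code}
If I understand shading correctly, CoreNLP would also need to be inside the assembly 
so that its own references to protobuf are rewritten to the shaded package.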

Thanks,
Maziyar

> Upgrade protobuf-java (com.google.protobuf) version from 2.5.0 to 3.4.0
> ---
>
> Key: SPARK-22380
> URL: https://issues.apache.org/jira/browse/SPARK-22380
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Deploy
>Affects Versions: 1.6.1, 2.2.0
> Environment: Cloudera 5.13.x
> Spark 2.2.0.cloudera1-1.cdh5.12.0.p0.142354
> And anything beyond Spark 2.2.0
>Reporter: Maziyar PANAHI
>Priority: Blocker
>
> Hi,
> This upgrade is needed when we try to use CoreNLP 3.8 with Spark (1.6+ and 
> 2.2+) due to incompatibilities in the protobuf version used by 
> com.google.protobuf and the one is used in latest Stanford CoreNLP (3.8). The 
> version of protobuf has been set to 2.5.0 in the global properties, and this 
> is stated in the pom.xml file.
> The error that refers to this dependency:
> {code:java}
> java.lang.VerifyError: Bad type on operand stack
> Exception Details:
>   Location:
> 
> com/google/protobuf/GeneratedMessageV3$ExtendableMessage.getExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;I)Ljava/lang/Object;
>  @3: invokevirtual
>   Reason:
> Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current 
> frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
>   Current Frame:
> bci: @3
> flags: { }
> locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 
> 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
> stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 
> 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
>   Bytecode:
> 0x000: 2a2b 1cb6 0024 b0
>   at edu.stanford.nlp.simple.Document.(Document.java:433)
>   at edu.stanford.nlp.simple.Sentence.(Sentence.java:118)
>   at edu.stanford.nlp.simple.Sentence.(Sentence.java:126)
>   ... 56 elided
> {code}
> Is it possible to upgrade this dependency to the latest (3.4) or any 
> workaround besides manually removing protobuf-java-2.5.0.jar and adding 
> protobuf-java-3.4.0.jar?
> You can follow the discussion of how this upgrade would fix the issue:
> https://github.com/stanfordnlp/CoreNLP/issues/556
> Many thanks,
> Maziyar






[jira] [Created] (SPARK-22380) Upgrade protobuf-java (com.google.protobuf) version from 2.5.0 to 3.4.0

2017-10-28 Thread Maziyar PANAHI (JIRA)
Maziyar PANAHI created SPARK-22380:
--

 Summary: Upgrade protobuf-java (com.google.protobuf) version from 
2.5.0 to 3.4.0
 Key: SPARK-22380
 URL: https://issues.apache.org/jira/browse/SPARK-22380
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Deploy
Affects Versions: 2.2.0, 1.6.1
 Environment: Cloudera 5.13.x
Spark 2.2.0.cloudera1-1.cdh5.12.0.p0.142354
And anything beyond Spark 2.2.0
Reporter: Maziyar PANAHI
Priority: Blocker


Hi,

This upgrade is needed when we try to use CoreNLP 3.8 with Spark (1.6+ and 
2.2+), due to an incompatibility between the protobuf version used by Spark 
(com.google.protobuf) and the one used in the latest Stanford CoreNLP (3.8). The 
version of protobuf is set to 2.5.0 in the global properties of the pom.xml file.

The error that refers to this dependency:

{code:java}
java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:

com/google/protobuf/GeneratedMessageV3$ExtendableMessage.getExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;I)Ljava/lang/Object;
 @3: invokevirtual
  Reason:
Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current 
frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
  Current Frame:
bci: @3
flags: { }
locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 
'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 
'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
  Bytecode:
0x000: 2a2b 1cb6 0024 b0

  at edu.stanford.nlp.simple.Document.<init>(Document.java:433)
  at edu.stanford.nlp.simple.Sentence.<init>(Sentence.java:118)
  at edu.stanford.nlp.simple.Sentence.<init>(Sentence.java:126)
  ... 56 elided

{code}

Is it possible to upgrade this dependency to the latest version (3.4), or is there 
any workaround besides manually removing protobuf-java-2.5.0.jar and adding 
protobuf-java-3.4.0.jar?

You can follow the discussion of how this upgrade would fix the issue:
https://github.com/stanfordnlp/CoreNLP/issues/556
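
As a purely diagnostic sketch (run from spark-shell), this is how one could check 
which protobuf-java jar the driver classloader actually resolved, to confirm whether 
2.5.0 or 3.4.0 wins on the classpath:
{code:java}
// Print the location of the jar that provided com.google.protobuf.Message to the
// driver classloader; getCodeSource can be null for bootstrap-loaded classes.
val codeSource = classOf[com.google.protobuf.Message].getProtectionDomain.getCodeSource
println(Option(codeSource).map(_.getLocation.toString).getOrElse("no code source (bootstrap classloader)"))
{code}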


Many thanks,
Maziyar




