[jira] [Commented] (SPARK-32530) SPIP: Kotlin support for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643344#comment-17643344 ] Maziyar PANAHI commented on SPARK-32530: Not sure if this matters, but as a Scala developer who primarily builds Scala applications that use Apache Spark natively, I highly support the decision to make this an official part of the ASF. I also agree there is a maintenance cost; however, unlike .NET, it is much easier for any of us from the Java/Scala world to contribute to Kotlin. I think it's a price worth paying for the sake of longevity. It is clear that Java and Scala are not going anywhere, but they are not the first choice for newcomers either. More native languages on the JVM like Kotlin can really help bring more users and contributors to the Spark ecosystem in the long term. > SPIP: Kotlin support for Apache Spark > - > > Key: SPARK-32530 > URL: https://issues.apache.org/jira/browse/SPARK-32530 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.0.1 > Reporter: Pasha Finkeshteyn > Priority: Major > > h2. Background and motivation > Kotlin is a cross-platform, statically typed, general-purpose JVM language. > In the last year more than 5 million developers have used Kotlin in mobile, > backend, frontend and scientific development. The number of Kotlin developers > grows rapidly every year. > * [According to > redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: > "Kotlin, the second fastest growing language we’ve seen outside of Swift, > made a big splash a year ago at this time when it vaulted eight full spots up > the list." > * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], > Kotlin is the second most popular language on the JVM. > * [According to > StackOverflow|https://insights.stackoverflow.com/survey/2020], Kotlin’s share > increased by 7.8% in 2020. 
> We notice the increasing usage of Kotlin in data analysis ([6% of users in > 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to > 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in > 2019), and we expect these numbers to continue to grow. > We, the authors of this SPIP, strongly believe that making a Kotlin API officially > available to developers can bring new users to Apache Spark and help some of > the existing users. > h2. Goals > The goal of this project is to bring first-class support for the Kotlin language > into the Apache Spark project. We’re going to achieve this by adding one more > module to the current Apache Spark distribution. > h2. Non-goals > There is no goal to replace any existing language support or to change any > existing Apache Spark API. > At this time, there is no goal to support non-core APIs of Apache Spark such as > Spark ML and Spark Structured Streaming. This may change in the future based > on community feedback. > There is no goal to provide a CLI for Kotlin for Apache Spark; this will be a > separate SPIP. > There is no goal to provide support for Apache Spark < 3.0.0. > h2. Current implementation > A working prototype is available at > [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside > JetBrains and by early adopters. > h2. What are the risks? > There is always a risk that this product won’t gain enough popularity and will > bring more costs than benefits. This is mitigated by the fact that we don't > need to change any existing API, and support can potentially be dropped at any > time. > We also believe that the existing API is rather low-maintenance. It does not > bring anything more complex than already exists in the Spark codebase. > Furthermore, the implementation is compact - less than 2000 lines of code. > We are committed to maintaining, improving and evolving the API based on > feedback from both the Spark and Kotlin communities. 
As the Kotlin data community > continues to grow, we see the Kotlin API for Apache Spark as an important part of > the evolving Kotlin ecosystem, and we intend to fully support it. > h2. How long will it take? > A working implementation is already available, and if the community proposes any > changes to improve this implementation, they > can be implemented quickly, in weeks if not days. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times
[ https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maziyar PANAHI updated SPARK-35066: --- Description: Hi, The following code snippet runs 4-5 times slower in Apache Spark or PySpark 3.1.1 compared to Apache Spark or PySpark 3.0.2:
{code:python}
spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000m") \
    .getOrCreate()

Toys = spark.read \
    .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText", outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))

# group by, sort, and keep the top words
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(5)
top50k.show()
{code}
Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 partitions are respected, in that all 12 tasks are processed together. However, in Spark/PySpark 3.1.1, even though we have 12 tasks, 10 of them finish immediately and only 2 actually do the processing. (I've tried disabling a couple of configs related to something similar, but none of them worked.) Screenshot of Spark 3.1.1 task: !image-2022-03-17-17-18-36-793.png|width=1073,height=652! !image-2022-03-17-17-19-11-655.png! Screenshot of Spark 3.0.2 task: !image-2022-03-17-17-19-34-906.png! For a longer discussion: [Spark User List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html] (no longer exists!) 
You can reproduce this large performance difference between Spark 3.1.1 and Spark 3.0.2 by running the shared code with any dataset that is large enough to take longer than a minute. Not sure whether this is related to SQL, to some Spark config that existed in 3.x but only came into action in 3.1.1, or to .transform in Spark ML. > Spark 3.1.1 is slower than 3.0.2 by 4-5 times > - > > Key: SPARK-35066 > URL: https://issues.apache.org/jira/browse/SPARK-35066 > Project: Spark > Issue Type: Bug > Components: ML, SQL > Affects Versions: 3.1.1 >
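The symptom described above (12 tasks created, but only 2 doing real work) is the kind of behavior that post-shuffle partition coalescing can produce. The following is a diagnostic configuration sketch, not a confirmed fix: it assumes PySpark 3.1.x, reuses the `./toys-cleaned` path and `reviewText` column from the snippet above, and the idea that the adaptive-execution settings are involved is only a hypothesis to test.

```python
# Diagnostic sketch: rule adaptive query execution in or out as the cause.
# Assumptions: pyspark 3.1.x is installed and ./toys-cleaned exists locally.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.driver.memory", "16G")
         # Explicitly disable AQE and its post-shuffle partition coalescing,
         # which can leave most tasks with (almost) no data to process.
         .config("spark.sql.adaptive.enabled", "false")
         .config("spark.sql.adaptive.coalescePartitions.enabled", "false")
         .getOrCreate())

toys = spark.read.parquet("./toys-cleaned").repartition(12)

# Confirm the repartition actually took effect before timing anything.
print("input partitions:", toys.rdd.getNumPartitions())  # expected: 12

# Compare the physical plan of the same query on 3.0.2 and 3.1.1;
# a difference here would localize the regression to the SQL planner.
toys.groupBy("reviewText").count().explain()
```

If the 3.1.1 run behaves like 3.0.2 with these settings off, the regression is in adaptive execution; if not, the plan comparison from `explain()` is the next place to look.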
[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times
[ https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maziyar PANAHI updated SPARK-35066: --- Attachment: image-2022-03-17-17-19-34-906.png
[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times
[ https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maziyar PANAHI updated SPARK-35066: --- Attachment: image-2022-03-17-17-19-11-655.png
[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times
[ https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maziyar PANAHI updated SPARK-35066: --- Attachment: Screenshot 2021-04-08 at 15.13.19-1.png
[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times
[ https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maziyar PANAHI updated SPARK-35066: --- Attachment: image-2022-03-17-17-18-36-793.png
[jira] [Updated] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times
[ https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maziyar PANAHI updated SPARK-35066: --- Attachment: Screenshot 2021-04-08 at 15.13.19.png Screenshot 2021-04-08 at 15.08.09.png Screenshot 2021-04-07 at 11.15.48.png
[jira] [Created] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times
Maziyar PANAHI created SPARK-35066: -- Summary: Spark 3.1.1 is slower than 3.0.2 by 4-5 times Key: SPARK-35066 URL: https://issues.apache.org/jira/browse/SPARK-35066 Project: Spark Issue Type: Bug Components: ML, SQL Affects Versions: 3.1.1 Environment: Spark/PySpark: 3.1.1 Language: Python 3.7.x / Scala 12 OS: macOS, Linux, and Windows Cloud: Databricks 7.3 for 3.0.1 and 8 for 3.1.1 Reporter: Maziyar PANAHI
[jira] [Commented] (SPARK-26101) Spark Pipe() executes the external app by yarn username not the current username
[ https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766471#comment-16766471 ] Maziyar PANAHI commented on SPARK-26101: I have worked around this issue as described here: [https://stackoverflow.com/a/53395055/1449151] However, I still believe that whichever application requests the YARN container should also be responsible for passing the user along. In this case it is Spark, not me, that asks YARN for the additional container. If the Spark YARN containers run as the user who submitted the job, RDD.pipe() should follow the same logic rather than expecting user configuration on YARN itself.

> Spark Pipe() executes the external app by yarn username not the current username
> --------------------------------------------------------------------------------
>
> Key: SPARK-26101
> URL: https://issues.apache.org/jira/browse/SPARK-26101
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 2.3.0
> Reporter: Maziyar PANAHI
> Priority: Major
>
> Hello,
> I am using *Spark 2.3.0.cloudera3* on a Cloudera cluster. When I start my Spark session (Zeppelin, shell, or spark-submit), my real username is impersonated successfully. That allows YARN to pick the right queue based on the username, and HDFS to enforce the right permissions. (All of this works without any problem, meaning the cluster has been set up and configured for user impersonation.)
> Example (running Spark as user panahi with YARN as master):
> {code:java}
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
> 18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups with view permissions: Set(); users with modify permissions: Set(panahi); groups with modify permissions: Set()
> ...
> 18/11/17 13:55:52 INFO yarn.Client:
> client token: N/A
> diagnostics: N/A
> ApplicationMaster host: N/A
> ApplicationMaster RPC port: -1
> queue: root.multivac
> start time: 1542459353040
> final status: UNDEFINED
> tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
> user: panahi
> {code}
>
> However, when I use *Spark RDD Pipe()*, the external command is executed as the `yarn` user. This makes it impossible to use an external app, such as a C/C++ application, that needs read/write access to HDFS, because the `yarn` user does not have permissions on the user's directory. (Executing all external apps under the yarn username also causes other security and resource-management issues.)
> *How to reproduce this issue:*
> {code:java}
> val test = sc.parallelize(Seq("test user")).repartition(1)
> val piped = test.pipe(Seq("whoami"))
> val c = piped.collect()
>
> // result:
> // test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at <console>:37
> // piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at <console>:37
> // c: Array[String] = Array(yarn)
> {code}
>
> I believe that since Spark is the actor invoking this execution inside the YARN cluster, Spark needs to respect the actual/current username. Or maybe there is another config for impersonation between Spark and YARN in this situation, but I haven't found any.
>
> Many thanks.
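The behavior described above follows from how `RDD.pipe()` works: the external command is launched as a child process of the executor JVM, and a child process inherits the OS identity of its parent. On a cluster whose NodeManagers run all containers as the `yarn` OS user (an assumption about this cluster's container executor), the piped command therefore reports `yarn`, because Hadoop-style impersonation operates at the HDFS/YARN protocol layer, not at the OS level. A minimal stdlib-Python illustration of the inheritance, with no Spark or YARN involved:

```python
import os
import subprocess

# A child process runs with the same OS identity as its parent -- spawning a
# subprocess performs no user switch. This mirrors why a command started by
# RDD.pipe() inside a YARN container reports the container's OS user rather
# than the Hadoop-level user who submitted the Spark job.
child_uid = int(
    subprocess.run(["id", "-u"], capture_output=True, text=True, check=True).stdout
)

assert child_uid == os.getuid()  # the piped command sees the parent's UID
```

The same reasoning explains why the workaround lives outside Spark: changing which OS user the piped command sees requires changing which user the container itself runs as.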
[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn username not the current username
[ https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maziyar PANAHI updated SPARK-26101: --- Summary: Spark Pipe() executes the external app by yarn username not the current username (was: Spark Pipe() executes the external app by yarn user not the real user)
[jira] [Created] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user
Maziyar PANAHI created SPARK-26101:
--------------------------------------

             Summary: Spark Pipe() executes the external app by yarn user not the real user
                 Key: SPARK-26101
                 URL: https://issues.apache.org/jira/browse/SPARK-26101
             Project: Spark
          Issue Type: Bug
          Components: YARN
    Affects Versions: 2.3.0
            Reporter: Maziyar PANAHI


Hello,

I am using `Spark 2.3.0.cloudera3` on a Cloudera cluster. When I start my Spark session (Zeppelin, shell, or spark-submit), my real username is impersonated successfully. That allows YARN to use the right queue based on the username, and HDFS knows the permissions.

Example (running Spark as user `panahi`):

```
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups with view permissions: Set(); users with modify permissions: Set(*panahi*); groups with modify permissions: Set()
...
18/11/17 13:55:52 INFO yarn.Client:
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: N/A
  ApplicationMaster RPC port: -1
  queue: root.multivac
  start time: 1542459353040
  final status: UNDEFINED
  tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
  user: *panahi*
```

However, when I use Spark RDD pipe(), it is executed as the `yarn` user. This makes it impossible to use a C/C++ application that needs read/write access to HDFS, because the user `yarn` does not have permissions on the user's directory.
How to produce this issue:

```scala
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
```

*result:*

```
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at :37
piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at :37
c: Array[String] = Array(*yarn*)
```

I believe that since Spark is the key actor invoking this execution inside the YARN cluster, Spark needs to respect the actual/current username. Or perhaps there is another config for impersonation between Spark and YARN in this situation, but I haven't found any. Many thanks.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
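The `Array(*yarn*)` result above follows from how pipe() works: the external command is forked from the executor JVM, and a forked child process inherits the OS identity of the JVM that spawns it, regardless of any Hadoop-level (proxy-user) impersonation Spark performs. A minimal sketch of that inheritance, outside Spark (hypothetical demo class, assumes a Unix-like system where `whoami` is on the PATH):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Sketch, not Spark source: shows that a child process spawned by a JVM
// reports the same OS user as the JVM itself. PipedRDD forks its command
// the same way, so on YARN it runs as the executor container's OS user
// (typically `yarn`), not as the impersonated Spark user.
public class PipeUserDemo {
    public static void main(String[] args) throws Exception {
        Process p = new ProcessBuilder("whoami").start();
        String childUser;
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            childUser = r.readLine().trim();   // what pipe(Seq("whoami")) would see
        }
        p.waitFor();
        String jvmUser = System.getProperty("user.name"); // OS user of this JVM
        System.out.println("child process user: " + childUser);
        System.out.println("JVM user.name:      " + jvmUser);
        if (!childUser.equals(jvmUser)) {
            throw new AssertionError("child process user should match JVM user");
        }
    }
}
```

In other words, as long as the executor containers are started as the `yarn` OS user (the default in a non-kerberized YARN cluster), anything pipe() launches will also run as `yarn`.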
[jira] [Commented] (SPARK-22380) Upgrade protobuf-java (com.google.protobuf) version from 2.5.0 to 3.4.0
[ https://issues.apache.org/jira/browse/SPARK-22380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223700#comment-16223700 ]

Maziyar PANAHI commented on SPARK-22380:

Hi Sean,

Thanks for your reply. In fact, I am looking more for a workaround for this issue than for upgrading this dependency in Hadoop, since the current version is compatible with all the builds, as you said. Could you show me how to shade dependencies for Hadoop, or how to add version 3.4 and ignore 2.5 inside the Spark app? I am using the Cloudera distribution of Spark 2, to be precise.

Thanks,
Maziyar

> Upgrade protobuf-java (com.google.protobuf) version from 2.5.0 to 3.4.0
> -----------------------------------------------------------------------
>
>                 Key: SPARK-22380
>                 URL: https://issues.apache.org/jira/browse/SPARK-22380
>             Project: Spark
>          Issue Type: Dependency upgrade
>          Components: Deploy
>    Affects Versions: 1.6.1, 2.2.0
>         Environment: Cloudera 5.13.x
> Spark 2.2.0.cloudera1-1.cdh5.12.0.p0.142354
> And anything beyond Spark 2.2.0
>            Reporter: Maziyar PANAHI
>            Priority: Blocker
>
> Hi,
> This upgrade is needed when we try to use CoreNLP 3.8 with Spark (1.6+ and
> 2.2+) due to incompatibilities between the protobuf version used by
> com.google.protobuf and the one used in the latest Stanford CoreNLP (3.8).
> The version of protobuf has been set to 2.5.0 in the global properties, as
> stated in the pom.xml file.
> The error that refers to this dependency:
> {code:java}
> java.lang.VerifyError: Bad type on operand stack
> Exception Details:
>   Location:
>     com/google/protobuf/GeneratedMessageV3$ExtendableMessage.getExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;I)Ljava/lang/Object; @3: invokevirtual
>   Reason:
>     Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
>   Current Frame:
>     bci: @3
>     flags: { }
>     locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
>     stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
>   Bytecode:
>     0x000: 2a2b 1cb6 0024 b0
>   at edu.stanford.nlp.simple.Document.<init>(Document.java:433)
>   at edu.stanford.nlp.simple.Sentence.<init>(Sentence.java:118)
>   at edu.stanford.nlp.simple.Sentence.<init>(Sentence.java:126)
>   ... 56 elided
> {code}
> Is it possible to upgrade this dependency to the latest (3.4), or is there
> any workaround besides manually removing protobuf-java-2.5.0.jar and adding
> protobuf-java-3.4.0.jar?
> You can follow the discussion of how this upgrade would fix the issue:
> https://github.com/stanfordnlp/CoreNLP/issues/556
> Many thanks,
> Maziyar

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
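For reference, the shading workaround asked about in the comment above is normally done by bundling the newer protobuf into the application jar and relocating its package so it cannot clash with the 2.5.0 copy that Spark/Hadoop put on the classpath. A minimal sketch with the Maven Shade plugin; the `shaded.com.google.protobuf` prefix and the plugin version are illustrative choices, not anything prescribed by Spark:

{code:xml}
<!-- pom.xml sketch: relocate protobuf-java 3.4.0 inside the uber-jar so it
     does not conflict with the protobuf 2.5.0 already on the Spark classpath. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.1.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.protobuf</pattern>
            <shadedPattern>shaded.com.google.protobuf</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}

Note that the relocation rewrites references only in classes that are shaded into the same jar, so CoreNLP itself must also be bundled into the uber-jar for its bytecode to point at the relocated protobuf.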
[jira] [Created] (SPARK-22380) Upgrade protobuf-java (com.google.protobuf) version from 2.5.0 to 3.4.0
Maziyar PANAHI created SPARK-22380:
--------------------------------------

             Summary: Upgrade protobuf-java (com.google.protobuf) version from 2.5.0 to 3.4.0
                 Key: SPARK-22380
                 URL: https://issues.apache.org/jira/browse/SPARK-22380
             Project: Spark
          Issue Type: Dependency upgrade
          Components: Deploy
    Affects Versions: 2.2.0, 1.6.1
         Environment: Cloudera 5.13.x
Spark 2.2.0.cloudera1-1.cdh5.12.0.p0.142354
And anything beyond Spark 2.2.0
            Reporter: Maziyar PANAHI
            Priority: Blocker


Hi,

This upgrade is needed when we try to use CoreNLP 3.8 with Spark (1.6+ and 2.2+) due to incompatibilities between the protobuf version used by com.google.protobuf and the one used in the latest Stanford CoreNLP (3.8). The version of protobuf has been set to 2.5.0 in the global properties, as stated in the pom.xml file.

The error that refers to this dependency:

{code:java}
java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    com/google/protobuf/GeneratedMessageV3$ExtendableMessage.getExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;I)Ljava/lang/Object; @3: invokevirtual
  Reason:
    Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
  Current Frame:
    bci: @3
    flags: { }
    locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
    stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
  Bytecode:
    0x000: 2a2b 1cb6 0024 b0
  at edu.stanford.nlp.simple.Document.<init>(Document.java:433)
  at edu.stanford.nlp.simple.Sentence.<init>(Sentence.java:118)
  at edu.stanford.nlp.simple.Sentence.<init>(Sentence.java:126)
  ... 56 elided
{code}

Is it possible to upgrade this dependency to the latest (3.4), or is there any workaround besides manually removing protobuf-java-2.5.0.jar and adding protobuf-java-3.4.0.jar?
You can follow the discussion of how this upgrade would fix the issue:
https://github.com/stanfordnlp/CoreNLP/issues/556

Many thanks,
Maziyar

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org