[jira] [Commented] (SPARK-22974) CountVectorModel does not attach attributes to output column
[ https://issues.apache.org/jira/browse/SPARK-22974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330235#comment-16330235 ] Xiayun Sun commented on SPARK-22974: Do we define the expected behavior here as the Interaction of the occurrence counts? `CountVectorizer` returns a `SparseVector`, for example {{(3,[0,1,2],[1.0,1.0,1.0])}}. I think it's possible to attach numeric attributes to [1.0, 1.0, 1.0], and then Interaction will work. > CountVectorModel does not attach attributes to output column > > > Key: SPARK-22974 > URL: https://issues.apache.org/jira/browse/SPARK-22974 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: William Zhang >Priority: Major > > If CountVectorModel transforms columns, the output column will not have > attributes attached to them. If later on, those columns are used in > Interaction transformer, an exception will be thrown: > {quote}"org.apache.spark.SparkException: Vector attributes must be defined > for interaction." > {quote} > To reproduce it: > {quote}import org.apache.spark.ml.feature._ > import org.apache.spark.sql.functions._ > val df = spark.createDataFrame(Seq( > (0, Array("a", "b", "c"), Array("1", "2")), > (1, Array("a", "b", "b", "c", "a", "d"), Array("1", "2", "3")) > )).toDF("id", "words", "nums") > val cvModel: CountVectorizerModel = new CountVectorizer() > .setInputCol("nums") > .setOutputCol("features2") > .setVocabSize(4) > .setMinDF(0) > .fit(df) > val cvm = new CountVectorizerModel(Array("a", "b", "c")) > .setInputCol("words") > .setOutputCol("features1") > > val df1 = cvm.transform(df) > val df2 = cvModel.transform(df1) > val interaction = new Interaction().setInputCols(Array("features1", > "features2")).setOutputCol("features") > val df3 = interaction.transform(df2) > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
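A minimal sketch of the workaround suggested in the comment above: build one NumericAttribute per vocabulary term and attach the resulting AttributeGroup metadata to the CountVectorizerModel output column, so that Interaction can find vector attributes. Column and model names follow the reproduction in the issue description, and a spark-shell-style session where {{spark}} is in scope is assumed; this illustrates the idea only and is not the project's actual fix.
{code}
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.CountVectorizerModel
import org.apache.spark.sql.functions.col

// DataFrame from the reproduction steps quoted above.
val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c"), Array("1", "2")),
  (1, Array("a", "b", "b", "c", "a", "d"), Array("1", "2", "3"))
)).toDF("id", "words", "nums")

val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features1")

// One numeric attribute per vocabulary term, attached as column metadata.
val attrs = cvm.vocabulary.map(term => NumericAttribute.defaultAttr.withName(term): Attribute)
val metadata = new AttributeGroup("features1", attrs).toMetadata()

val df1 = cvm.transform(df)
  .withColumn("features1", col("features1").as("features1", metadata))
// df1's "features1" column now carries vector attributes, so Interaction no
// longer fails with "Vector attributes must be defined for interaction."
{code}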
[jira] [Updated] (SPARK-23144) Add console sink for continuous queries
[ https://issues.apache.org/jira/browse/SPARK-23144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-23144: -- Issue Type: Improvement (was: Bug) > Add console sink for continuous queries > --- > > Key: SPARK-23144 > URL: https://issues.apache.org/jira/browse/SPARK-23144 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23130) Spark Thrift does not clean-up temporary files (/tmp/*_resources and /tmp/hive/*.pipeout)
[ https://issues.apache.org/jira/browse/SPARK-23130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330269#comment-16330269 ] Marco Gaido commented on SPARK-23130: - [~seano] there is no JIRA for the pipeout issue and there cannot be, since it is a problem in Hive codebase, not in the Spark one, so there is no fix which can be provided in Spark. SPARK-20202 tracks not relying anymore on Spark's fork of hive and using proper Hive releases instead. This will solve the issue since the fix needed by the pipeout issue is included in newer Hive versions. > Spark Thrift does not clean-up temporary files (/tmp/*_resources and > /tmp/hive/*.pipeout) > - > > Key: SPARK-23130 > URL: https://issues.apache.org/jira/browse/SPARK-23130 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.1.0, 2.2.0 > Environment: * Hadoop distributions: HDP 2.5 - 2.6.3.0 > * OS: Seen on SLES12, RHEL 7.3 & RHEL 7.4 >Reporter: Sean Roberts >Priority: Major > Labels: thrift > > Spark Thrift is not cleaning up /tmp for files & directories named like: > /tmp/hive/*.pipeout > /tmp/*_resources > There are such a large number that /tmp quickly runs out of inodes *causing > the partition to be unusable and many services to crash*. This is even true > when the only jobs submitted are routine service checks. > Used `strace` to show that Spark Thrift is responsible: > {code:java} > strace.out.118864:04:53:49 > open("/tmp/hive/55ad7fc1-f79a-4ad8-8e02-26bbeaa86bbc7288010135864174970.pipeout", > O_RDWR|O_CREAT|O_EXCL, 0666) = 134 > strace.out.118864:04:53:49 > mkdir("/tmp/b6dfbf9e-2f7c-4c25-95a1-73c44318ecf4_resources", 0777) = 0 > {code} > *Those files were left behind, even days later.* > > Example files: > {code:java} > # stat > /tmp/hive/55ad7fc1-f79a-4ad8-8e02-26bbeaa86bbc7288010135864174970.pipeout > File: > ‘/tmp/hive/55ad7fc1-f79a-4ad8-8e02-26bbeaa86bbc7288010135864174970.pipeout’ > Size: 0 Blocks: 0 IO Block: 4096 regular empty file > Device: fe09h/65033d Inode: 678 Links: 1 > Access: (0644/-rw-r--r--) Uid: ( 1000/hive) Gid: ( 1002/ hadoop) > Access: 2017-12-19 04:53:49.126777260 -0600 > Modify: 2017-12-19 04:53:49.126777260 -0600 > Change: 2017-12-19 04:53:49.126777260 -0600 > Birth: - > # stat /tmp/b6dfbf9e-2f7c-4c25-95a1-73c44318ecf4_resources > File: ‘/tmp/b6dfbf9e-2f7c-4c25-95a1-73c44318ecf4_resources’ > Size: 4096 Blocks: 8 IO Block: 4096 directory > Device: fe09h/65033d Inode: 668 Links: 2 > Access: (0700/drwx--) Uid: ( 1000/hive) Gid: ( 1002/ hadoop) > Access: 2017-12-19 04:57:38.458937635 -0600 > Modify: 2017-12-19 04:53:49.062777216 -0600 > Change: 2017-12-19 04:53:49.066777218 -0600 > Birth: - > {code} > Showing the large number: > {code:java} > # find /tmp/ -name '*_resources' | wc -l > 68340 > # find /tmp/hive -name "*.pipeout" | wc -l > 51837 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23144) Add console sink for continuous queries
Tathagata Das created SPARK-23144: - Summary: Add console sink for continuous queries Key: SPARK-23144 URL: https://issues.apache.org/jira/browse/SPARK-23144 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23144) Add console sink for continuous queries
[ https://issues.apache.org/jira/browse/SPARK-23144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23144: Assignee: Apache Spark (was: Tathagata Das) > Add console sink for continuous queries > --- > > Key: SPARK-23144 > URL: https://issues.apache.org/jira/browse/SPARK-23144 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23144) Add console sink for continuous queries
[ https://issues.apache.org/jira/browse/SPARK-23144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330276#comment-16330276 ] Apache Spark commented on SPARK-23144: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/20311 > Add console sink for continuous queries > --- > > Key: SPARK-23144 > URL: https://issues.apache.org/jira/browse/SPARK-23144 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23144) Add console sink for continuous queries
[ https://issues.apache.org/jira/browse/SPARK-23144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23144: Assignee: Tathagata Das (was: Apache Spark) > Add console sink for continuous queries > --- > > Key: SPARK-23144 > URL: https://issues.apache.org/jira/browse/SPARK-23144 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22974) CountVectorModel does not attach attributes to output column
[ https://issues.apache.org/jira/browse/SPARK-22974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22974: Assignee: Apache Spark > CountVectorModel does not attach attributes to output column > > > Key: SPARK-22974 > URL: https://issues.apache.org/jira/browse/SPARK-22974 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: William Zhang >Assignee: Apache Spark >Priority: Major > > If CountVectorModel transforms columns, the output column will not have > attributes attached to them. If later on, those columns are used in > Interaction transformer, an exception will be thrown: > {quote}"org.apache.spark.SparkException: Vector attributes must be defined > for interaction." > {quote} > To reproduce it: > {quote}import org.apache.spark.ml.feature._ > import org.apache.spark.sql.functions._ > val df = spark.createDataFrame(Seq( > (0, Array("a", "b", "c"), Array("1", "2")), > (1, Array("a", "b", "b", "c", "a", "d"), Array("1", "2", "3")) > )).toDF("id", "words", "nums") > val cvModel: CountVectorizerModel = new CountVectorizer() > .setInputCol("nums") > .setOutputCol("features2") > .setVocabSize(4) > .setMinDF(0) > .fit(df) > ]val cvm = new CountVectorizerModel(Array("a", "b", "c")) > .setInputCol("words") > .setOutputCol("features1") > > val df1 = cvm.transform(df) > val df2 = cvModel.transform(df1) > val interaction = new Interaction().setInputCols(Array("features1", > "features2")).setOutputCol("features") > val df3 = interaction.transform(df2) > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22974) CountVectorModel does not attach attributes to output column
[ https://issues.apache.org/jira/browse/SPARK-22974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22974: Assignee: (was: Apache Spark) > CountVectorModel does not attach attributes to output column > > > Key: SPARK-22974 > URL: https://issues.apache.org/jira/browse/SPARK-22974 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: William Zhang >Priority: Major > > If CountVectorModel transforms columns, the output column will not have > attributes attached to them. If later on, those columns are used in > Interaction transformer, an exception will be thrown: > {quote}"org.apache.spark.SparkException: Vector attributes must be defined > for interaction." > {quote} > To reproduce it: > {quote}import org.apache.spark.ml.feature._ > import org.apache.spark.sql.functions._ > val df = spark.createDataFrame(Seq( > (0, Array("a", "b", "c"), Array("1", "2")), > (1, Array("a", "b", "b", "c", "a", "d"), Array("1", "2", "3")) > )).toDF("id", "words", "nums") > val cvModel: CountVectorizerModel = new CountVectorizer() > .setInputCol("nums") > .setOutputCol("features2") > .setVocabSize(4) > .setMinDF(0) > .fit(df) > ]val cvm = new CountVectorizerModel(Array("a", "b", "c")) > .setInputCol("words") > .setOutputCol("features1") > > val df1 = cvm.transform(df) > val df2 = cvModel.transform(df1) > val interaction = new Interaction().setInputCols(Array("features1", > "features2")).setOutputCol("features") > val df3 = interaction.transform(df2) > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22974) CountVectorModel does not attach attributes to output column
[ https://issues.apache.org/jira/browse/SPARK-22974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330292#comment-16330292 ] Apache Spark commented on SPARK-22974: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/20313 > CountVectorModel does not attach attributes to output column > > > Key: SPARK-22974 > URL: https://issues.apache.org/jira/browse/SPARK-22974 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: William Zhang >Priority: Major > > If CountVectorModel transforms columns, the output column will not have > attributes attached to them. If later on, those columns are used in > Interaction transformer, an exception will be thrown: > {quote}"org.apache.spark.SparkException: Vector attributes must be defined > for interaction." > {quote} > To reproduce it: > {quote}import org.apache.spark.ml.feature._ > import org.apache.spark.sql.functions._ > val df = spark.createDataFrame(Seq( > (0, Array("a", "b", "c"), Array("1", "2")), > (1, Array("a", "b", "b", "c", "a", "d"), Array("1", "2", "3")) > )).toDF("id", "words", "nums") > val cvModel: CountVectorizerModel = new CountVectorizer() > .setInputCol("nums") > .setOutputCol("features2") > .setVocabSize(4) > .setMinDF(0) > .fit(df) > ]val cvm = new CountVectorizerModel(Array("a", "b", "c")) > .setInputCol("words") > .setOutputCol("features1") > > val df1 = cvm.transform(df) > val df2 = cvModel.transform(df1) > val interaction = new Interaction().setInputCols(Array("features1", > "features2")).setOutputCol("features") > val df3 = interaction.transform(df2) > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23016) Spark UI access and documentation
[ https://issues.apache.org/jira/browse/SPARK-23016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330309#comment-16330309 ] Eric Charles commented on SPARK-23016: -- Side note: the Spark UI cannot be browsed via the API server proxy (kubectl proxy, then browsing e.g. [1]) because the Spark UI redirects to [http://localhost:8001/jobs] [1] http://localhost:8001/api/v1/namespaces/default/services/http:spitfire-spitfire-spark-ui:4040/proxy > Spark UI access and documentation > - > > Key: SPARK-23016 > URL: https://issues.apache.org/jira/browse/SPARK-23016 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > We should have instructions to access the spark driver UI, or instruct users > to create a service to expose it. > Also might need an integration test to verify that the driver UI works as > expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23145) How to set Apache Spark Executor memory
Azharuddin created SPARK-23145: -- Summary: How to set Apache Spark Executor memory Key: SPARK-23145 URL: https://issues.apache.org/jira/browse/SPARK-23145 Project: Spark Issue Type: Question Components: EC2 Affects Versions: 2.1.0 Reporter: Azharuddin Fix For: 2.1.1 How can I increase the memory available for Apache Spark executor nodes? I have a 2 GB file that is suitable for loading into [Apache Spark|https://mindmajix.com/apache-spark-training]. I am running Apache Spark for the moment on 1 machine, so the driver and executor are on the same machine. The machine has 8 GB of memory. When I try to count the lines of the file after setting the file to be cached in memory I get these errors: {{2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache partition rdd_1_1 in memory! Free memory is 278099801 bytes. }} I looked at the documentation [here|http://spark.apache.org/docs/latest/configuration.html] and set {{spark.executor.memory}} to {{4g}} in {{$SPARK_HOME/conf/spark-defaults.conf}}. The UI shows this variable is set in the Spark Environment. You can find a screenshot [here|https://drive.google.com/file/d/0B0B_O5bxDDlsc3JmY0xfbjFtN0k/view?usp=sharing] However, when I go to the [Executor tab|https://drive.google.com/file/d/0B0B_O5bxDDlsV25ka2lHd1ZFNzA/view?usp=sharing] the memory limit for my single Executor is still set to 265.4 MB. I also still get the same error. I tried various things mentioned [here|https://stackoverflow.com/questions/24242060/how-to-change-memory-per-node-for-apache-spark-worker] but I still get the error and don't have a clear idea where I should change the setting. I am running my code interactively from the spark-shell. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
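One detail behind the 265.4 MB shown in the Executors tab: when running locally, the single executor lives inside the driver JVM, so {{spark.executor.memory}} from {{spark-defaults.conf}} does not change that number; the driver heap has to be raised before the JVM starts (for example via {{spark-shell --driver-memory 4g}} or {{spark.driver.memory}} in the defaults file). A quick diagnostic sketch, assuming an interactive spark-shell where {{sc}} is predefined:
{code}
// Compare what the configuration provided with what the running JVM actually
// got; in local mode the JVM max heap is the ceiling that the storage memory
// shown in the web UI is derived from.
println(sc.getConf.getOption("spark.executor.memory")) // ignored in local mode
println(sc.getConf.getOption("spark.driver.memory"))   // must be set before the JVM launches
println(s"JVM max heap: ${Runtime.getRuntime.maxMemory / (1024 * 1024)} MB")
{code}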
[jira] [Updated] (SPARK-23145) How to set Apache Spark Executor memory
[ https://issues.apache.org/jira/browse/SPARK-23145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Azharuddin updated SPARK-23145: --- Description: How can I increase the memory available for Apache spark executor nodes? I have a 2 GB file that is suitable to loading in to [Apache Spark| [https://mindmajix.com|https://mindmajix.com/apache-spark-training]]. I am running apache spark for the moment on 1 machine, so the driver and executor are on the same machine. The machine has 8 GB of memory. When I try count the lines of the file after setting the file to be cached in memory I get these errors: {\{2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache partition rdd_1_1 in memory! Free memory is 278099801 bytes. }} I looked at the documentation [here|http://spark.apache.org/docs/latest/configuration.html] and set {{spark.executor.memory}} to {{4g}} in {{$SPARK_HOME/conf/spark-defaults.conf}} The UI shows this variable is set in the Spark Environment. You can find screenshot [here|https://drive.google.com/file/d/0B0B_O5bxDDlsc3JmY0xfbjFtN0k/view?usp=sharing] However when I go to the [Executor tab|https://drive.google.com/file/d/0B0B_O5bxDDlsV25ka2lHd1ZFNzA/view?usp=sharing] the memory limit for my single Executor is still set to 265.4 MB. I also still get the same error. I tried various things mentioned [here|https://stackoverflow.com/questions/24242060/how-to-change-memory-per-node-for-apache-spark-worker] but I still get the error and don't have a clear idea where I should change the setting. I am running my code interactively from the spark-shell was: How can I increase the memory available for Apache spark executor nodes? I have a 2 GB file that is suitable to loading in to [Apache Spark|[https://mindmajix.com|https://mindmajix.com/apache-spark-training]]. I am running apache spark for the moment on 1 machine, so the driver and executor are on the same machine. The machine has 8 GB of memory. When I try count the lines of the file after setting the file to be cached in memory I get these errors: {{2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache partition rdd_1_1 in memory! Free memory is 278099801 bytes. }} I looked at the documentation [here|http://spark.apache.org/docs/latest/configuration.html] and set {{spark.executor.memory}} to {{4g}} in {{$SPARK_HOME/conf/spark-defaults.conf}} The UI shows this variable is set in the Spark Environment. You can find screenshot [here|https://drive.google.com/file/d/0B0B_O5bxDDlsc3JmY0xfbjFtN0k/view?usp=sharing] However when I go to the [Executor tab|https://drive.google.com/file/d/0B0B_O5bxDDlsV25ka2lHd1ZFNzA/view?usp=sharing] the memory limit for my single Executor is still set to 265.4 MB. I also still get the same error. I tried various things mentioned [here|https://stackoverflow.com/questions/24242060/how-to-change-memory-per-node-for-apache-spark-worker] but I still get the error and don't have a clear idea where I should change the setting. I am running my code interactively from the spark-shell > How to set Apache Spark Executor memory > --- > > Key: SPARK-23145 > URL: https://issues.apache.org/jira/browse/SPARK-23145 > Project: Spark > Issue Type: Question > Components: EC2 >Affects Versions: 2.1.0 >Reporter: Azharuddin >Priority: Major > Fix For: 2.1.1 > > Original Estimate: 954h > Remaining Estimate: 954h > > How can I increase the memory available for Apache spark executor nodes? 
> I have a 2 GB file that is suitable to loading in to [Apache Spark| > [https://mindmajix.com|https://mindmajix.com/apache-spark-training]]. I am > running apache spark for the moment on 1 machine, so the driver and executor > are on the same machine. The machine has 8 GB of memory. > When I try count the lines of the file after setting the file to be cached in > memory I get these errors: > {\{2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache > partition rdd_1_1 in memory! Free memory is 278099801 bytes. }} > I looked at the documentation > [here|http://spark.apache.org/docs/latest/configuration.html] and set > {{spark.executor.memory}} to {{4g}} in > {{$SPARK_HOME/conf/spark-defaults.conf}} > The UI shows this variable is set in the Spark Environment. You can find > screenshot > [here|https://drive.google.com/file/d/0B0B_O5bxDDlsc3JmY0xfbjFtN0k/view?usp=sharing] > However when I go to the [Executor > tab|https://drive.google.com/file/d/0B0B_O5bxDDlsV25ka2lHd1ZFNzA/view?usp=sharing] > the memory limit for my single Executor is still set to 265.4 MB. I also > still get the same error. > I tried various things mentioned > [here|https://stackoverflow.com/questions/24242060/how-to-change-memory-per-node-for-apache-spark-worker] > but I still get the error and do
[jira] [Updated] (SPARK-23145) How to set Apache Spark Executor memory
[ https://issues.apache.org/jira/browse/SPARK-23145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Azharuddin updated SPARK-23145: --- Description: How can I increase the memory available for Apache spark executor nodes? I have a 2 GB file that is suitable to loading in to [Apache Spark|https://mindmajix.com/apache-spark-training]. I am running apache spark for the moment on 1 machine, so the driver and executor are on the same machine. The machine has 8 GB of memory. When I try count the lines of the file after setting the file to be cached in memory I get these errors: {\{2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache partition rdd_1_1 in memory! Free memory is 278099801 bytes. }} I looked at the documentation [here|http://spark.apache.org/docs/latest/configuration.html] and set {{spark.executor.memory}} to {{4g}} in {{$SPARK_HOME/conf/spark-defaults.conf}} The UI shows this variable is set in the Spark Environment. You can find screenshot [here|https://drive.google.com/file/d/0B0B_O5bxDDlsc3JmY0xfbjFtN0k/view?usp=sharing] However when I go to the [Executor tab|https://drive.google.com/file/d/0B0B_O5bxDDlsV25ka2lHd1ZFNzA/view?usp=sharing] the memory limit for my single Executor is still set to 265.4 MB. I also still get the same error. I tried various things mentioned [here|https://stackoverflow.com/questions/24242060/how-to-change-memory-per-node-for-apache-spark-worker] but I still get the error and don't have a clear idea where I should change the setting. I am running my code interactively from the spark-shell was: How can I increase the memory available for Apache spark executor nodes? I have a 2 GB file that is suitable to loading in to [Apache Spark| [https://mindmajix.com|https://mindmajix.com/apache-spark-training]]. I am running apache spark for the moment on 1 machine, so the driver and executor are on the same machine. The machine has 8 GB of memory. When I try count the lines of the file after setting the file to be cached in memory I get these errors: {\{2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache partition rdd_1_1 in memory! Free memory is 278099801 bytes. }} I looked at the documentation [here|http://spark.apache.org/docs/latest/configuration.html] and set {{spark.executor.memory}} to {{4g}} in {{$SPARK_HOME/conf/spark-defaults.conf}} The UI shows this variable is set in the Spark Environment. You can find screenshot [here|https://drive.google.com/file/d/0B0B_O5bxDDlsc3JmY0xfbjFtN0k/view?usp=sharing] However when I go to the [Executor tab|https://drive.google.com/file/d/0B0B_O5bxDDlsV25ka2lHd1ZFNzA/view?usp=sharing] the memory limit for my single Executor is still set to 265.4 MB. I also still get the same error. I tried various things mentioned [here|https://stackoverflow.com/questions/24242060/how-to-change-memory-per-node-for-apache-spark-worker] but I still get the error and don't have a clear idea where I should change the setting. I am running my code interactively from the spark-shell > How to set Apache Spark Executor memory > --- > > Key: SPARK-23145 > URL: https://issues.apache.org/jira/browse/SPARK-23145 > Project: Spark > Issue Type: Question > Components: EC2 >Affects Versions: 2.1.0 >Reporter: Azharuddin >Priority: Major > Fix For: 2.1.1 > > Original Estimate: 954h > Remaining Estimate: 954h > > How can I increase the memory available for Apache spark executor nodes? > I have a 2 GB file that is suitable to loading in to [Apache > Spark|https://mindmajix.com/apache-spark-training]. 
I am running apache spark > for the moment on 1 machine, so the driver and executor are on the same > machine. The machine has 8 GB of memory. > When I try count the lines of the file after setting the file to be cached in > memory I get these errors: > {\{2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache > partition rdd_1_1 in memory! Free memory is 278099801 bytes. }} > I looked at the documentation > [here|http://spark.apache.org/docs/latest/configuration.html] and set > {{spark.executor.memory}} to {{4g}} in > {{$SPARK_HOME/conf/spark-defaults.conf}} > The UI shows this variable is set in the Spark Environment. You can find > screenshot > [here|https://drive.google.com/file/d/0B0B_O5bxDDlsc3JmY0xfbjFtN0k/view?usp=sharing] > However when I go to the [Executor > tab|https://drive.google.com/file/d/0B0B_O5bxDDlsV25ka2lHd1ZFNzA/view?usp=sharing] > the memory limit for my single Executor is still set to 265.4 MB. I also > still get the same error. > I tried various things mentioned > [here|https://stackoverflow.com/questions/24242060/how-to-change-memory-per-node-for-apache-spark-worker] > but I still get the error and don't have a clear idea where I should change > t
[jira] [Resolved] (SPARK-23145) How to set Apache Spark Executor memory
[ https://issues.apache.org/jira/browse/SPARK-23145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23145. --- Resolution: Invalid Fix Version/s: (was: 2.1.1) Please post these to the mailing list or StackOverflow > How to set Apache Spark Executor memory > --- > > Key: SPARK-23145 > URL: https://issues.apache.org/jira/browse/SPARK-23145 > Project: Spark > Issue Type: Question > Components: EC2 >Affects Versions: 2.1.0 >Reporter: Azharuddin >Priority: Major > Original Estimate: 954h > Remaining Estimate: 954h > > How can I increase the memory available for Apache spark executor nodes? > I have a 2 GB file that is suitable to loading in to [Apache > Spark|https://mindmajix.com/apache-spark-training]. I am running apache spark > for the moment on 1 machine, so the driver and executor are on the same > machine. The machine has 8 GB of memory. > When I try count the lines of the file after setting the file to be cached in > memory I get these errors: > {\{2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache > partition rdd_1_1 in memory! Free memory is 278099801 bytes. }} > I looked at the documentation > [here|http://spark.apache.org/docs/latest/configuration.html] and set > {{spark.executor.memory}} to {{4g}} in > {{$SPARK_HOME/conf/spark-defaults.conf}} > The UI shows this variable is set in the Spark Environment. You can find > screenshot > [here|https://drive.google.com/file/d/0B0B_O5bxDDlsc3JmY0xfbjFtN0k/view?usp=sharing] > However when I go to the [Executor > tab|https://drive.google.com/file/d/0B0B_O5bxDDlsV25ka2lHd1ZFNzA/view?usp=sharing] > the memory limit for my single Executor is still set to 265.4 MB. I also > still get the same error. > I tried various things mentioned > [here|https://stackoverflow.com/questions/24242060/how-to-change-memory-per-node-for-apache-spark-worker] > but I still get the error and don't have a clear idea where I should change > the setting. > I am running my code interactively from the spark-shell -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23104) Document that kubernetes is still "experimental"
[ https://issues.apache.org/jira/browse/SPARK-23104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330381#comment-16330381 ] Apache Spark commented on SPARK-23104: -- User 'foxish' has created a pull request for this issue: https://github.com/apache/spark/pull/20314 > Document that kubernetes is still "experimental" > > > Key: SPARK-23104 > URL: https://issues.apache.org/jira/browse/SPARK-23104 > Project: Spark > Issue Type: Task > Components: Documentation, Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Anirudh Ramanathan >Priority: Critical > > As discussed in the mailing list, we should document that the kubernetes > backend is still experimental. > That does not need to include any code changes. This is just meant to tell > users that they can expect changes in how the backend behaves in future > versions, and that things like configuration, the container image's layout > and entry points might change. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23104) Document that kubernetes is still "experimental"
[ https://issues.apache.org/jira/browse/SPARK-23104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23104: Assignee: Anirudh Ramanathan (was: Apache Spark) > Document that kubernetes is still "experimental" > > > Key: SPARK-23104 > URL: https://issues.apache.org/jira/browse/SPARK-23104 > Project: Spark > Issue Type: Task > Components: Documentation, Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Anirudh Ramanathan >Priority: Critical > > As discussed in the mailing list, we should document that the kubernetes > backend is still experimental. > That does not need to include any code changes. This is just meant to tell > users that they can expect changes in how the backend behaves in future > versions, and that things like configuration, the container image's layout > and entry points might change. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23104) Document that kubernetes is still "experimental"
[ https://issues.apache.org/jira/browse/SPARK-23104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23104: Assignee: Apache Spark (was: Anirudh Ramanathan) > Document that kubernetes is still "experimental" > > > Key: SPARK-23104 > URL: https://issues.apache.org/jira/browse/SPARK-23104 > Project: Spark > Issue Type: Task > Components: Documentation, Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Critical > > As discussed in the mailing list, we should document that the kubernetes > backend is still experimental. > That does not need to include any code changes. This is just meant to tell > users that they can expect changes in how the backend behaves in future > versions, and that things like configuration, the container image's layout > and entry points might change. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23146) Support client mode for Kubernetes cluster backend
Anirudh Ramanathan created SPARK-23146: -- Summary: Support client mode for Kubernetes cluster backend Key: SPARK-23146 URL: https://issues.apache.org/jira/browse/SPARK-23146 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.0 Reporter: Anirudh Ramanathan This issue tracks client mode support within Spark when running in the Kubernetes cluster backend. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23146) Support client mode for Kubernetes cluster backend
[ https://issues.apache.org/jira/browse/SPARK-23146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan updated SPARK-23146: --- Issue Type: New Feature (was: Improvement) > Support client mode for Kubernetes cluster backend > -- > > Key: SPARK-23146 > URL: https://issues.apache.org/jira/browse/SPARK-23146 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Major > > This issue tracks client mode support within Spark when running in the > Kubernetes cluster backend. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23016) Spark UI access and documentation
[ https://issues.apache.org/jira/browse/SPARK-23016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330389#comment-16330389 ] Anirudh Ramanathan commented on SPARK-23016: Good point - yeah, we don't recommend the API server proxy in the docs anymore - https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/running-on-kubernetes.html. > Spark UI access and documentation > - > > Key: SPARK-23016 > URL: https://issues.apache.org/jira/browse/SPARK-23016 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > We should have instructions to access the spark driver UI, or instruct users > to create a service to expose it. > Also might need an integration test to verify that the driver UI works as > expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23140) DataSourceV2Strategy is missing in HiveSessionStateBuilder
[ https://issues.apache.org/jira/browse/SPARK-23140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23140: --- Assignee: Saisai Shao > DataSourceV2Strategy is missing in HiveSessionStateBuilder > -- > > Key: SPARK-23140 > URL: https://issues.apache.org/jira/browse/SPARK-23140 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Fix For: 2.3.0 > > > DataSourceV2Strategy is not added into HiveSessionStateBuilder's planner, > which will lead to exception when playing continuous query: > {noformat} > ERROR ContinuousExecution: Query abc [id = > 5cb6404a-e907-4662-b5d7-20037ccd6947, runId = > 617b8dea-018e-4082-935e-98d98d473fdd] terminated with error > java.lang.AssertionError: assertion failed: No plan for WriteToDataSourceV2 > org.apache.spark.sql.execution.streaming.sources.ContinuousMemoryWriter@3dba6d7c > +- StreamingDataSourceV2Relation [timestamp#15, value#16L], > org.apache.spark.sql.execution.streaming.continuous.RateStreamContinuousReader@62ceac53 > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$2.apply(ContinuousExecution.scala:221) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$2.apply(ContinuousExecution.scala:212) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runContinuous(ContinuousExecution.scala:212) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runActivatedStream(ContinuousExecution.scala:94) > at > 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
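For anyone trying to reproduce this outside the reporter's job: the assertion above is only reachable when the session is built with Hive support, since that is the path that uses HiveSessionStateBuilder's planner. Below is a hypothetical minimal sketch using Spark 2.3 APIs (rate source, console sink, {{Trigger.Continuous}}), not the exact query from the report:
{code}
// Hypothetical minimal continuous query against a Hive-enabled session,
// the configuration in which DataSourceV2Strategy was missing from the planner.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("continuous-repro")
  .enableHiveSupport() // goes through HiveSessionStateBuilder
  .getOrCreate()

val query = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second")) // continuous processing mode
  .start()

query.awaitTermination()
{code}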
[jira] [Resolved] (SPARK-23140) DataSourceV2Strategy is missing in HiveSessionStateBuilder
[ https://issues.apache.org/jira/browse/SPARK-23140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23140. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20305 [https://github.com/apache/spark/pull/20305] > DataSourceV2Strategy is missing in HiveSessionStateBuilder > -- > > Key: SPARK-23140 > URL: https://issues.apache.org/jira/browse/SPARK-23140 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Fix For: 2.3.0 > > > DataSourceV2Strategy is not added into HiveSessionStateBuilder's planner, > which will lead to exception when playing continuous query: > {noformat} > ERROR ContinuousExecution: Query abc [id = > 5cb6404a-e907-4662-b5d7-20037ccd6947, runId = > 617b8dea-018e-4082-935e-98d98d473fdd] terminated with error > java.lang.AssertionError: assertion failed: No plan for WriteToDataSourceV2 > org.apache.spark.sql.execution.streaming.sources.ContinuousMemoryWriter@3dba6d7c > +- StreamingDataSourceV2Relation [timestamp#15, value#16L], > org.apache.spark.sql.execution.streaming.continuous.RateStreamContinuousReader@62ceac53 > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$2.apply(ContinuousExecution.scala:221) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$2.apply(ContinuousExecution.scala:212) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runContinuous(ContinuousExecution.scala:212) > at > 
org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runActivatedStream(ContinuousExecution.scala:94) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-23147: Attachment: Screen Shot 2018-01-18 at 8.50.08 PM.png > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23147) Stage page will throw exception when there's no complete tasks
Saisai Shao created SPARK-23147: --- Summary: Stage page will throw exception when there's no complete tasks Key: SPARK-23147 URL: https://issues.apache.org/jira/browse/SPARK-23147 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.0 Reporter: Saisai Shao Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png The current stage page's task table will throw an exception when there are no completed tasks. To reproduce this issue, a user can submit code like: {code} sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() {code} Then open the UI and click into the stage details. Digging into the code, it turns out the current UI can only show completed tasks, which differs from the 2.2 code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-23147: Description: Current Stage page's task table will throw an exception when there's no complete tasks, to reproduce this issue, user could submit code like: {code} sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() {code} Then open the UI and click into stage details. Below is the screenshot. !Screen Shot 2018-01-18 at 8.50.08 PM.png! Deep dive into the code, found that current UI can only show the completed tasks, it is different from 2.2 code. was: Current Stage page's task table will throw an exception when there's no complete tasks, to reproduce this issue, user could submit code like: {code} sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() {code} Then open the UI and click into stage details. Deep dive into the code, found that current UI can only show the completed tasks, it is different from 2.2 code. > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces
Bogdan Raducanu created SPARK-23148: --- Summary: spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces Key: SPARK-23148 URL: https://issues.apache.org/jira/browse/SPARK-23148 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Bogdan Raducanu Repro code: {code:java} spark.range(10).write.csv("/tmp/a b c/a.csv") spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count 10 spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count java.io.FileNotFoundException: File file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv does not exist {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23147: Assignee: (was: Apache Spark) > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23147: Assignee: Apache Spark > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Apache Spark >Priority: Major > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330496#comment-16330496 ] Apache Spark commented on SPARK-23147: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/20315 > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22036) BigDecimal multiplication sometimes returns null
[ https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22036: --- Assignee: Marco Gaido > BigDecimal multiplication sometimes returns null > > > Key: SPARK-22036 > URL: https://issues.apache.org/jira/browse/SPARK-22036 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Olivier Blanvillain >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.0 > > > The multiplication of two BigDecimal numbers sometimes returns null. Here is > a minimal reproduction: > {code:java} > object Main extends App { > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.SparkSession > import spark.implicits._ > val conf = new > SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", > "false") > val spark = > SparkSession.builder().config(conf).appName("REPL").getOrCreate() > implicit val sqlContext = spark.sqlContext > case class X2(a: BigDecimal, b: BigDecimal) > val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), > BigDecimal(-1000.1 > val result = ds.select(ds("a") * ds("b")).collect.head > println(result) // [null] > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
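Some context on why the product comes back as null: the Scala {{BigDecimal}} encoder maps to {{decimal(38,18)}}, and the inferred type of the product is capped at 38 digits of precision while keeping a very large scale, which leaves too few digits for the integer part of a value around 126.7, so before this fix the overflow surfaced as null. A self-contained sketch to inspect the inferred types (assumes a spark-shell-style session where {{spark}} and its implicits are available; the printed result type is indicative, not quoted from the issue):
{code}
// Inspect the decimal type Spark infers for the product of two BigDecimal
// columns; the large scale is what squeezes out the integer digits.
import spark.implicits._

case class X2(a: BigDecimal, b: BigDecimal)

val ds = Seq(X2(BigDecimal(-0.1267333984375), BigDecimal(-1000.1))).toDS()
ds.printSchema()                            // a, b: decimal(38,18)
ds.select(ds("a") * ds("b")).printSchema()  // e.g. (a * b): decimal(38,36)
ds.select(ds("a") * ds("b")).show()         // [null] before the fix
{code}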
[jira] [Resolved] (SPARK-22036) BigDecimal multiplication sometimes returns null
[ https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22036. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20023 [https://github.com/apache/spark/pull/20023] > BigDecimal multiplication sometimes returns null > > > Key: SPARK-22036 > URL: https://issues.apache.org/jira/browse/SPARK-22036 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Olivier Blanvillain >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.0 > > > The multiplication of two BigDecimal numbers sometimes returns null. Here is > a minimal reproduction: > {code:java} > object Main extends App { > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.SparkSession > import spark.implicits._ > val conf = new > SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", > "false") > val spark = > SparkSession.builder().config(conf).appName("REPL").getOrCreate() > implicit val sqlContext = spark.sqlContext > case class X2(a: BigDecimal, b: BigDecimal) > val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), > BigDecimal(-1000.1 > val result = ds.select(ds("a") * ds("b")).collect.head > println(result) // [null] > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23141) Support data type string as a returnType for registerJavaFunction.
[ https://issues.apache.org/jira/browse/SPARK-23141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23141: Assignee: Takuya Ueshin > Support data type string as a returnType for registerJavaFunction. > -- > > Key: SPARK-23141 > URL: https://issues.apache.org/jira/browse/SPARK-23141 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.3.0 > > > Currently {{UDFRegistration.registerJavaFunction}} doesn't support data type > string as a returnType whereas {{UDFRegistration.register}}, {{@udf}}, or > {{@pandas_udf}} does. > We can support it for {{UDFRegistration.registerJavaFunction}} as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23141) Support data type string as a returnType for registerJavaFunction.
[ https://issues.apache.org/jira/browse/SPARK-23141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23141. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20307 [https://github.com/apache/spark/pull/20307] > Support data type string as a returnType for registerJavaFunction. > -- > > Key: SPARK-23141 > URL: https://issues.apache.org/jira/browse/SPARK-23141 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.3.0 > > > Currently {{UDFRegistration.registerJavaFunction}} doesn't support data type > string as a returnType whereas {{UDFRegistration.register}}, {{@udf}}, or > {{@pandas_udf}} does. > We can support it for {{UDFRegistration.registerJavaFunction}} as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19280) Failed Recovery from checkpoint caused by the multi-threads issue in Spark Streaming scheduler
[ https://issues.apache.org/jira/browse/SPARK-19280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330558#comment-16330558 ] Riccardo Vincelli commented on SPARK-19280: --- What is the exception you are getting? For me it is: This RDD lacks a SparkContext. It could happen in the following cases: (1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063. (2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758. Not sure yet if 100% related. Thanks, > Failed Recovery from checkpoint caused by the multi-threads issue in Spark > Streaming scheduler > -- > > Key: SPARK-19280 > URL: https://issues.apache.org/jira/browse/SPARK-19280 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nan Zhu >Priority: Critical > > In one of our applications, we found the following issue, the application > recovering from a checkpoint file named "checkpoint-***16670" but with > the timestamp ***16650 will recover from the very beginning of the stream > and because our application relies on the external & periodically-cleaned > data (syncing with checkpoint cleanup), the recovery just failed > We identified a potential issue in Spark Streaming checkpoint and will > describe it with the following example. We will propose a fix in the end of > this JIRA. > 1. The application properties: Batch Duration: 2, Functionality: Single > Stream calling ReduceByKeyAndWindow and print, Window Size: 6, > SlideDuration, 2 > 2. RDD at 16650 is generated and the corresponding job is submitted to > the execution ThreadPool. Meanwhile, a DoCheckpoint message is sent to the > queue of JobGenerator > 3. Job at 16650 is finished and JobCompleted message is sent to > JobScheduler's queue, and meanwhile, Job at 16652 is submitted to the > execution ThreadPool and similarly, a DoCheckpoint is sent to the queue of > JobGenerator > 4. JobScheduler's message processing thread (I will use JS-EventLoop to > identify it) is not scheduled by the operating system for a long time, and > during this period, Jobs generated from 16652 - 16670 are generated > and completed. > 5. the message processing thread of JobGenerator (JG-EventLoop) is scheduled > and processed all DoCheckpoint messages for jobs ranging from 16652 - > 16670 and checkpoint files are successfully written. CRITICAL: at this > moment, the lastCheckpointTime would be 16670. > 6. JS-EventLoop is scheduled, and process all JobCompleted messages for jobs > ranging from 16652 - 16670. CRITICAL: a ClearMetadata message is > pushed to JobGenerator's message queue for EACH JobCompleted. > 7. The current message queue contains 20 ClearMetadata messages and > JG-EventLoop is scheduled to process them. CRITICAL: ClearMetadata will > remove all RDDs out of rememberDuration window. In our case, > ReduceyKeyAndWindow will set rememberDuration to 10 (rememberDuration of > ReducedWindowDStream (4) + windowSize) resulting that only RDDs <- > (16660, 16670] are kept. And ClearMetaData processing logic will push > a DoCheckpoint to JobGenerator's thread > 8. 
JG-EventLoop is scheduled again to process DoCheckpoint for 1665, VERY > CRITICAL: at this step, RDD no later than 16660 has been removed, and > checkpoint data is updated as > https://github.com/apache/spark/blob/a81e336f1eddc2c6245d807aae2c81ddc60eabf9/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala#L53 > and > https://github.com/apache/spark/blob/a81e336f1eddc2c6245d807aae2c81ddc60eabf9/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala#L59. > 9. After 8, a checkpoint named /path/checkpoint-16670 is created but with > the timestamp 16650. and at this moment, Application crashed > 10. Application recovers from /path/checkpoint-16670 and try to get RDD > with validTime 16650. Of course it will not find it and has to recompute. > In the case of ReduceByKeyAndWindow, it needs to recursively slice RDDs until > to the start of the stream. When the stream depends on the external data, it > will not successfully recover. In the case of Kafka, the recovered RDDs would > not be the same as the original one, as the currentOffsets has been u
[jira] [Comment Edited] (SPARK-19280) Failed Recovery from checkpoint caused by the multi-threads issue in Spark Streaming scheduler
[ https://issues.apache.org/jira/browse/SPARK-19280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330558#comment-16330558 ] Riccardo Vincelli edited comment on SPARK-19280 at 1/18/18 2:20 PM: What is the exception you are getting? For me it is: This RDD lacks a SparkContext. It could happen in the following cases: (1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063. (2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758. Not sure yet if 100% related but our program: -reads Kafka topics -uses mapWithState on an initialState Thanks, was (Author: rvincelli): What is the exception you are getting? For me it is: This RDD lacks a SparkContext. It could happen in the following cases: (1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063. (2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758. Not sure yet if 100% related. Thanks, > Failed Recovery from checkpoint caused by the multi-threads issue in Spark > Streaming scheduler > -- > > Key: SPARK-19280 > URL: https://issues.apache.org/jira/browse/SPARK-19280 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nan Zhu >Priority: Critical > > In one of our applications, we found the following issue, the application > recovering from a checkpoint file named "checkpoint-***16670" but with > the timestamp ***16650 will recover from the very beginning of the stream > and because our application relies on the external & periodically-cleaned > data (syncing with checkpoint cleanup), the recovery just failed > We identified a potential issue in Spark Streaming checkpoint and will > describe it with the following example. We will propose a fix in the end of > this JIRA. > 1. The application properties: Batch Duration: 2, Functionality: Single > Stream calling ReduceByKeyAndWindow and print, Window Size: 6, > SlideDuration, 2 > 2. RDD at 16650 is generated and the corresponding job is submitted to > the execution ThreadPool. Meanwhile, a DoCheckpoint message is sent to the > queue of JobGenerator > 3. Job at 16650 is finished and JobCompleted message is sent to > JobScheduler's queue, and meanwhile, Job at 16652 is submitted to the > execution ThreadPool and similarly, a DoCheckpoint is sent to the queue of > JobGenerator > 4. JobScheduler's message processing thread (I will use JS-EventLoop to > identify it) is not scheduled by the operating system for a long time, and > during this period, Jobs generated from 16652 - 16670 are generated > and completed. > 5. 
the message processing thread of JobGenerator (JG-EventLoop) is scheduled > and processed all DoCheckpoint messages for jobs ranging from 16652 - > 16670 and checkpoint files are successfully written. CRITICAL: at this > moment, the lastCheckpointTime would be 16670. > 6. JS-EventLoop is scheduled, and process all JobCompleted messages for jobs > ranging from 16652 - 16670. CRITICAL: a ClearMetadata message is > pushed to JobGenerator's message queue for EACH JobCompleted. > 7. The current message queue contains 20 ClearMetadata messages and > JG-EventLoop is scheduled to process them. CRITICAL: ClearMetadata will > remove all RDDs out of rememberDuration window. In our case, > ReduceyKeyAndWindow will set rememberDuration to 10 (rememberDuration of > ReducedWindowDStream (4) + windowSize) resulting that only RDDs <- > (16660, 16670] are kept. And ClearMetaData processing logic will push > a DoCheckpoint to JobGenerator's thread > 8. JG-EventLoop is scheduled again to process DoCheckpoint for 1665, VERY > CRITICAL: at this step, RDD no later than 16660 has been removed, and > checkpoint data is updated as > https://github.com/apache/spark/blob/a81e336f1eddc2c6245d807aae2c81ddc60eabf9/streaming/src/main/scala/org/apache/sp
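For anyone landing here from the quoted SPARK-5063 message, below is a small self-contained sketch of the pattern it describes; the rdd1/rdd2 names are illustrative and not taken from this ticket. An RDD cannot be referenced inside another RDD's closure, so the dependent value has to be computed on the driver (or broadcast) first.

{code:java}
import org.apache.spark.{SparkConf, SparkContext}

object NestedRddSketch extends App {
  val sc = new SparkContext(
    new SparkConf().setMaster("local[*]").setAppName("nested-rdd-sketch"))

  val rdd1 = sc.parallelize(1L to 10L)
  val rdd2 = sc.parallelize(Seq("a" -> 1, "b" -> 2))

  // Invalid: rdd2 is captured inside rdd1's map closure, which is case (1)
  // of the error message quoted above.
  // rdd1.map(x => rdd2.values.count() * x)

  // Valid: materialize the count on the driver, then close over the plain Long.
  val count = rdd2.values.count()
  val scaled = rdd1.map(x => count * x)
  println(scaled.collect().toSeq)

  sc.stop()
}
{code}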
[jira] [Created] (SPARK-23149) polish ColumnarBatch
Wenchen Fan created SPARK-23149: --- Summary: polish ColumnarBatch Key: SPARK-23149 URL: https://issues.apache.org/jira/browse/SPARK-23149 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23149) polish ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-23149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330610#comment-16330610 ] Apache Spark commented on SPARK-23149: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20316 > polish ColumnarBatch > > > Key: SPARK-23149 > URL: https://issues.apache.org/jira/browse/SPARK-23149 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23149) polish ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-23149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23149: Assignee: Apache Spark (was: Wenchen Fan) > polish ColumnarBatch > > > Key: SPARK-23149 > URL: https://issues.apache.org/jira/browse/SPARK-23149 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23149) polish ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-23149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23149: Assignee: Wenchen Fan (was: Apache Spark) > polish ColumnarBatch > > > Key: SPARK-23149 > URL: https://issues.apache.org/jira/browse/SPARK-23149 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23150) kafka job failing assertion failed: Beginning offset 5451 is after the ending offset 5435 for topic
Balu created SPARK-23150: Summary: kafka job failing assertion failed: Beginning offset 5451 is after the ending offset 5435 for topic Key: SPARK-23150 URL: https://issues.apache.org/jira/browse/SPARK-23150 Project: Spark Issue Type: Question Components: Build Affects Versions: 0.9.0 Environment: Spark Job failing with below error. Please help Kafka version 0.10.0.0 Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2.0 (TID 16, 10.3.12.16): java.lang.AssertionError: assertion failed: Beginning offset 5451 is after the ending offset 5435 for topic project1 partition 2. You either provided an invalid fromOffset, or the Kafka topic has been damaged at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.streaming.kafka.KafkaRDD.compute(KafkaRDD.scala:86) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: Reporter: Balu -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23150) kafka job failing assertion failed: Beginning offset 5451 is after the ending offset 5435 for topic
[ https://issues.apache.org/jira/browse/SPARK-23150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23150. --- Resolution: Invalid > kafka job failing assertion failed: Beginning offset 5451 is after the ending > offset 5435 for topic > --- > > Key: SPARK-23150 > URL: https://issues.apache.org/jira/browse/SPARK-23150 > Project: Spark > Issue Type: Question > Components: Build >Affects Versions: 0.9.0 > Environment: Spark Job failing with below error. Please help > Kafka version 0.10.0.0 > Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most > recent failure: Lost task 3.3 in stage 2.0 (TID 16, 10.3.12.16): > java.lang.AssertionError: assertion failed: Beginning offset 5451 is after > the ending offset 5435 for topic project1 partition 2. You either provided an > invalid fromOffset, or the Kafka topic has been damaged at > scala.Predef$.assert(Predef.scala:179) at > org.apache.spark.streaming.kafka.KafkaRDD.compute(KafkaRDD.scala:86) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at > org.apache.spark.scheduler.Task.run(Task.scala:70) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) Driver stacktrace: >Reporter: Balu >Priority: Critical > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
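Since the question was closed as invalid rather than answered, a minimal sketch of the relevant setup follows, assuming the spark-streaming-kafka 0.8 direct API that the stack trace points at. The assertion fires when the fromOffsets handed to the stream (often restored from an old checkpoint or computed externally) no longer exist in the topic, for example after retention kicked in or the topic was recreated. Letting the stream resolve its starting offsets from the broker, with auto.offset.reset as a fallback, avoids feeding it stale offsets. The broker address is a placeholder; the topic name is the one from the error message.

{code:java}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamSketch extends App {
  val conf = new SparkConf().setMaster("local[*]").setAppName("kafka-direct-sketch")
  val ssc = new StreamingContext(conf, Seconds(10))

  val kafkaParams = Map(
    "metadata.broker.list" -> "broker1:9092", // placeholder broker list
    "auto.offset.reset" -> "smallest"         // fall back to the earliest offset still available
  )

  // No explicit fromOffsets: starting offsets are resolved from the broker at startup.
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("project1"))

  stream.map(_._2).print()

  ssc.start()
  ssc.awaitTermination()
}
{code}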
[jira] [Created] (SPARK-23151) Provide a distribution Spark with Hadoop 3.0
Louis Burke created SPARK-23151: --- Summary: Provide a distribution Spark with Hadoop 3.0 Key: SPARK-23151 URL: https://issues.apache.org/jira/browse/SPARK-23151 Project: Spark Issue Type: Dependency upgrade Components: Spark Core Affects Versions: 2.2.1, 2.2.0 Reporter: Louis Burke Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark package only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication is that using up to date Kinesis libraries alongside s3 causes a clash w.r.t aws-java-sdk. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0
[ https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Louis Burke updated SPARK-23151: Summary: Provide a distribution of Spark with Hadoop 3.0 (was: Provide a distribution Spark with Hadoop 3.0) > Provide a distribution of Spark with Hadoop 3.0 > --- > > Key: SPARK-23151 > URL: https://issues.apache.org/jira/browse/SPARK-23151 > Project: Spark > Issue Type: Dependency upgrade > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1 >Reporter: Louis Burke >Priority: Major > > Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark > package > only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication > is > that using up to date Kinesis libraries alongside s3 causes a clash w.r.t > aws-java-sdk. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23147. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20315 [https://github.com/apache/spark/pull/20315] > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Fix For: 2.3.0 > > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23147: -- Assignee: Saisai Shao > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Fix For: 2.3.0 > > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel
[ https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330943#comment-16330943 ] Andrew Ash commented on SPARK-22982: [~joshrosen] do you have some example stacktraces of what this bug can cause? Several of our clusters hit what I think is this problem earlier this month, see below for details. For a few days in January (4th through 12th) on our AWS infra, we observed massively degraded disk read throughput (down to 33% of previous peaks). During this time, we also began observing intermittent exceptions coming from Spark at read time of parquet files that a previous Spark job had written. When the read throughput recovered on the 12th, we stopped observing the exceptions and haven't seen them since. At first we observed this stacktrace when reading .snappy.parquet files: {noformat} java.lang.RuntimeException: java.io.IOException: could not read page Page [bytes.size=1048641, valueCount=29945, uncompressedSize=1048641] in col [my_column] BINARY at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:493) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:486) at org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:96) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:486) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:157) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:229) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:398) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: could not read page Page [bytes.size=1048641, valueCount=29945, uncompressedSize=1048641] in col [my_column] BINARY at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV1(VectorizedColumnReader.java:562) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$000(VectorizedColumnReader.java:47) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:490) ... 31 more Caused by: java.io
[jira] [Issue Comment Deleted] (SPARK-13572) HiveContext reads avro Hive tables incorrectly
[ https://issues.apache.org/jira/browse/SPARK-13572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-13572: --- Comment: was deleted (was: Good day, Thank you for your e-mail. I am currently away from my desk and will return on 27.06.2016. Your e-mail will not be forwarded. Thank you for your e-mail. I am currently absent and will be back on 27.06.2016. Your e-mail will not be redirected. Thank you for your e-mail. I am currently away. I will be back on 27.06.2016. Your message will not be forwarded to third parties. Kind regards Russ Weedon Professional ERP Engineer SBB AG SBB Informatik Solution Center Finanzen, K-SCM, HR und Immobilien Lindenhofstrasse 1, 3000 Bern 65 Mobile +41 78 812 47 62 russ.wee...@sbb.ch / www.sbb.ch ) > HiveContext reads avro Hive tables incorrectly > --- > > Key: SPARK-13572 > URL: https://issues.apache.org/jira/browse/SPARK-13572 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2, 1.6.0, 1.6.1 > Environment: Hive 0.13.1, Spark 1.5.2 >Reporter: Zoltan Fedor >Priority: Major > Attachments: logs, table_definition > > > I am using PySpark to read avro-based tables from Hive and while the avro > tables can be read, some of the columns are incorrectly read - showing value > {{None}} instead of the actual value. > {noformat} > >>> results_df = sqlContext.sql("""SELECT * FROM trmdw_prod.opsconsole_ingest > >>> where year=2016 and month=2 and day=29 limit 3""") > >>> results_df.take(3) > [Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, > uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, > statvalue=None, displayname=None, category=None, > source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29), > Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, > uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, > statvalue=None, displayname=None, category=None, > source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29), > Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, > uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, > statvalue=None, displayname=None, category=None, > source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29)] > {noformat} > Observe the {{None}} values at most of the fields. Surprisingly not all > fields, only some of them are showing {{None}} instead of the real values. > The table definition does not show anything specific about these columns. 
> Running the same query in Hive: > {noformat} > c:hive2://xyz.com:100> SELECT * FROM trmdw_prod.opsconsole_ingest where > year=2016 and month=2 and day=29 limit 3; > +--+---++---+---+---+++-+--++-++-+--++--+ > | opsconsole_ingest.kafkaoffsetgeneration | opsconsole_ingest.kafkapartition > | opsconsole_ingest.kafkaoffset | opsconsole_ingest.uuid | > opsconsole_ingest.mid | opsconsole_ingest.iid | > opsconsole_ingest.product | opsconsole_ingest.utctime | > opsconsole_ingest.statcode | opsconsole_ingest.statvalue | > opsconsole_ingest.displayname | opsconsole_ingest.category | > opsconsole_ingest.source_filename | opsconsole_ingest.year | > opsconsole_ingest.month | opsconsole_ingest.day | > +--+---++---+---+---+++-+--++-++-+--++--+ > | 11.0 | 0.0 > | 3.83399394E8 | EF0D03C409681B98646F316CA1088973 | > 174f53fb-ca9b-d3f9-64e1-7631bf906817 | ---- > | est| 2016-01-13T06:58:19| 8 >
[jira] [Resolved] (SPARK-23029) Doc spark.shuffle.file.buffer units are kb when no units specified
[ https://issues.apache.org/jira/browse/SPARK-23029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23029. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20269 [https://github.com/apache/spark/pull/20269] > Doc spark.shuffle.file.buffer units are kb when no units specified > -- > > Key: SPARK-23029 > URL: https://issues.apache.org/jira/browse/SPARK-23029 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Fernando Pereira >Assignee: Fernando Pereira >Priority: Minor > Fix For: 2.3.0 > > > When setting the spark.shuffle.file.buffer setting, even to its default > value, shuffles fail. > This appears to affect small to medium size partitions. Strangely the error > message is OutOfMemoryError, but it works with large partitions (at least > >32MB). > {code} > pyspark --conf "spark.shuffle.file.buffer=$((32*1024))" > /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit > pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768 > version 2.2.1 > >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", > >>> mode="overwrite") > [Stage 1:>(0 + 10) / > 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 11) > java.lang.OutOfMemoryError: Java heap space > at java.io.BufferedOutputStream.(BufferedOutputStream.java:75) > at > org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:107) > at > org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23029) Doc spark.shuffle.file.buffer units are kb when no units specified
[ https://issues.apache.org/jira/browse/SPARK-23029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23029: - Assignee: Fernando Pereira > Doc spark.shuffle.file.buffer units are kb when no units specified > -- > > Key: SPARK-23029 > URL: https://issues.apache.org/jira/browse/SPARK-23029 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Fernando Pereira >Assignee: Fernando Pereira >Priority: Minor > Fix For: 2.3.0 > > > When setting the spark.shuffle.file.buffer setting, even to its default > value, shuffles fail. > This appears to affect small to medium size partitions. Strangely the error > message is OutOfMemoryError, but it works with large partitions (at least > >32MB). > {code} > pyspark --conf "spark.shuffle.file.buffer=$((32*1024))" > /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit > pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768 > version 2.2.1 > >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", > >>> mode="overwrite") > [Stage 1:>(0 + 10) / > 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 11) > java.lang.OutOfMemoryError: Java heap space > at java.io.BufferedOutputStream.(BufferedOutputStream.java:75) > at > org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:107) > at > org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
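As a small illustration of the behaviour the documentation fix describes (nothing beyond what the issue above already states): the value is interpreted in KiB when no unit is given, so passing a byte count inflates the per-writer buffer by a factor of 1024. Spelling the unit out removes the ambiguity; the value below is the default and is used only as an example.

{code:java}
import org.apache.spark.SparkConf

object ShuffleBufferConfSketch extends App {
  val conf = new SparkConf()
    // "32k" = 32 KiB per shuffle writer (the default); the bare value "32" means the same thing.
    .set("spark.shuffle.file.buffer", "32k")

  // By contrast, "32768" with no unit (as in the reproduction above) is read as 32768 KiB,
  // i.e. 32 MiB per open writer, which is what triggers the OutOfMemoryError.
  println(conf.get("spark.shuffle.file.buffer"))
}
{code}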
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Therefore the condition has to be modified to verify that. {code:java} maxLabelRow.size <= 1 {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Therefore the condition has to be modified to verify that. 
{code:java} maxLabelRow.size <= 1 {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier() > .setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} >
[jira] [Created] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
Matthew Tovbin created SPARK-23152: -- Summary: Invalid guard condition in org.apache.spark.ml.classification.Classifier Key: SPARK-23152 URL: https://issues.apache.org/jira/browse/SPARK-23152 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.1.2, 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0, 2.1.3, 2.3.0, 2.3.1 Reporter: Matthew Tovbin When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Therefore the condition has to be modified to verify that. {code:java} maxLabelRow.size <= 1 {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20928) SPIP: Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331045#comment-16331045 ] Thomas Graves commented on SPARK-20928: --- what is status of this, it looks like subtasks are finished? > SPIP: Continuous Processing Mode for Structured Streaming > - > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Priority: Major > Labels: SPIP > Attachments: Continuous Processing in Structured Streaming Design > Sketch.pdf > > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. > For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. */ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user requested change in parallelism. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Therefore the condition has to be modified to verify that. 
{code:java} maxLabelRow.size <= 1 {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier() > .setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.
[jira] [Commented] (SPARK-23133) Spark options are not passed to the Executor in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331074#comment-16331074 ] Anirudh Ramanathan commented on SPARK-23133: Thanks for submitting the PR fixing this. > Spark options are not passed to the Executor in Docker context > -- > > Key: SPARK-23133 > URL: https://issues.apache.org/jira/browse/SPARK-23133 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8s using supplied Docker image. >Reporter: Andrew Korzhuev >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > Reproduce: > # Build image with `bin/docker-image-tool.sh`. > # Submit application to k8s. Set executor options, e.g. ` --conf > "spark.executor. > extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"` > # Visit Spark UI on executor and notice that option is not set. > Expected behavior: options from spark-submit should be correctly passed to > executor. > Cause: > `SPARK_EXECUTOR_JAVA_OPTS` is not defined in `entrypoint.sh` > https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L70 > [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L44-L45] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23082) Allow separate node selectors for driver and executors in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-23082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331081#comment-16331081 ] Anirudh Ramanathan commented on SPARK-23082: This is an interesting feature request. We had some discussion about this. I think it's a bit too late for a feature request in 2.3, so, we can revisit this in the 2.4 timeframe. > Allow separate node selectors for driver and executors in Kubernetes > > > Key: SPARK-23082 > URL: https://issues.apache.org/jira/browse/SPARK-23082 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit >Affects Versions: 2.2.0, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > In YARN, we can use {{spark.yarn.am.nodeLabelExpression}} to submit the Spark > driver to a different set of nodes from its executors. In Kubernetes, we can > specify {{spark.kubernetes.node.selector.[labelKey]}}, but we can't use > separate options for the driver and executors. > This would be useful for the particular use case where executors can go on > more ephemeral nodes (eg, with cluster autoscaling, or preemptible/spot > instances), but the driver should use a more persistent machine. > The required change would be minimal, essentially just using different config > keys for the > [driver|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L90] > and > [executor|https://github.com/apache/spark/blob/0b2eefb674151a0af64806728b38d9410da552ec/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L73] > instead of {{KUBERNETES_NODE_SELECTOR_PREFIX}} for both. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23082) Allow separate node selectors for driver and executors in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-23082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331081#comment-16331081 ] Anirudh Ramanathan edited comment on SPARK-23082 at 1/18/18 7:52 PM: - This is an interesting feature request. We had some discussion about this. I think it's a bit too late for a feature request in 2.3, so, we can revisit this in the 2.4 timeframe. [~mcheah] was (Author: foxish): This is an interesting feature request. We had some discussion about this. I think it's a bit too late for a feature request in 2.3, so, we can revisit this in the 2.4 timeframe. > Allow separate node selectors for driver and executors in Kubernetes > > > Key: SPARK-23082 > URL: https://issues.apache.org/jira/browse/SPARK-23082 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit >Affects Versions: 2.2.0, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > In YARN, we can use {{spark.yarn.am.nodeLabelExpression}} to submit the Spark > driver to a different set of nodes from its executors. In Kubernetes, we can > specify {{spark.kubernetes.node.selector.[labelKey]}}, but we can't use > separate options for the driver and executors. > This would be useful for the particular use case where executors can go on > more ephemeral nodes (eg, with cluster autoscaling, or preemptible/spot > instances), but the driver should use a more persistent machine. > The required change would be minimal, essentially just using different config > keys for the > [driver|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L90] > and > [executor|https://github.com/apache/spark/blob/0b2eefb674151a0af64806728b38d9410da552ec/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L73] > instead of {{KUBERNETES_NODE_SELECTOR_PREFIX}} for both. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
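A rough Scala sketch of the proposed split for SPARK-23082, assuming hypothetical spark.kubernetes.driver.node.selector.* and spark.kubernetes.executor.node.selector.* key names (the ticket does not settle on final names); it only illustrates reading role-specific selectors out of a SparkConf instead of sharing the single KUBERNETES_NODE_SELECTOR_PREFIX:

{code:java}
import org.apache.spark.SparkConf

// Hypothetical key prefixes: the ticket only proposes that driver and executor
// selectors come from different config keys.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.node.selector.lifecycle", "persistent")
  .set("spark.kubernetes.executor.node.selector.lifecycle", "preemptible")

// getAllWithPrefix returns the (labelKey -> labelValue) pairs for each role.
val driverSelectors: Map[String, String] =
  conf.getAllWithPrefix("spark.kubernetes.driver.node.selector.").toMap
val executorSelectors: Map[String, String] =
  conf.getAllWithPrefix("spark.kubernetes.executor.node.selector.").toMap
{code}

Each map could then feed the pod spec for its role, with the existing shared prefix kept as a fallback so current configurations keep working.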
[jira] [Resolved] (SPARK-23016) Spark UI access and documentation
[ https://issues.apache.org/jira/browse/SPARK-23016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan resolved SPARK-23016. Resolution: Fixed This is resolved and we've verified it for 2.3.0. > Spark UI access and documentation > - > > Key: SPARK-23016 > URL: https://issues.apache.org/jira/browse/SPARK-23016 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > We should have instructions to access the spark driver UI, or instruct users > to create a service to expose it. > Also might need an integration test to verify that the driver UI works as > expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331085#comment-16331085 ] Anirudh Ramanathan commented on SPARK-22962: This is the resource staging server use-case. We'll upstream this in the 2.4.0 timeframe. > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan updated SPARK-22962: --- Affects Version/s: (was: 2.3.0) 2.4.0 > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23146) Support client mode for Kubernetes cluster backend
[ https://issues.apache.org/jira/browse/SPARK-23146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan updated SPARK-23146: --- Affects Version/s: (was: 2.4.0) 2.3.0 > Support client mode for Kubernetes cluster backend > -- > > Key: SPARK-23146 > URL: https://issues.apache.org/jira/browse/SPARK-23146 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Major > > This issue tracks client mode support within Spark when running in the > Kubernetes cluster backend. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan updated SPARK-22962: --- Affects Version/s: (was: 2.4.0) 2.3.0 > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23146) Support client mode for Kubernetes cluster backend
[ https://issues.apache.org/jira/browse/SPARK-23146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan updated SPARK-23146: --- Target Version/s: 2.4.0 > Support client mode for Kubernetes cluster backend > -- > > Key: SPARK-23146 > URL: https://issues.apache.org/jira/browse/SPARK-23146 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Major > > This issue tracks client mode support within Spark when running in the > Kubernetes cluster backend. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. Instead it contains a single null element WrappedArray(null). Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.isEmpty || (maxLabelRow.size == 1 && maxLabelRow(0) == null)) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(De
[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331132#comment-16331132 ] Yinan Li commented on SPARK-22962: -- I agree that before we upstream the staging server, we should fail the submission if a user uses local resources. [~vanzin], if it's not too late to get into 2.3, I'm gonna file a PR for this. > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331140#comment-16331140 ] Marcelo Vanzin commented on SPARK-22962: Yes send a PR. > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20928) SPIP: Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331141#comment-16331141 ] Michael Armbrust commented on SPARK-20928: -- There is more work to do so I might leave the umbrella open, but we are going to have an initial version that supports reading and writing from/to Kafka with very low latency in Spark 2.3! Stay tuned for updates to docs and blog posts. > SPIP: Continuous Processing Mode for Structured Streaming > - > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Priority: Major > Labels: SPIP > Attachments: Continuous Processing in Structured Streaming Design > Sketch.pdf > > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. > For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. */ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user requested change in parallelism. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
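For reference, the user-facing shape of the initial 2.3 support mentioned in the comment above is a continuous trigger on an otherwise ordinary Kafka query. A minimal sketch, assuming the spark-sql-kafka-0-10 connector is on the classpath and using placeholder broker, topic and checkpoint values:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("continuous-kafka").getOrCreate()

// Read from Kafka and write back to Kafka, processing records continuously
// instead of in micro-batches.
val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")      // placeholder
  .option("subscribe", "input-topic")                    // placeholder
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")      // placeholder
  .option("topic", "output-topic")                       // placeholder
  .option("checkpointLocation", "/tmp/continuous-ckpt")  // placeholder
  .trigger(Trigger.Continuous("1 second"))  // checkpoint interval, not a batch interval
  .start()
{code}

The trigger argument is how often offsets are checkpointed; records flow through continuously regardless of it.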
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. Instead it contains a single null element WrappedArray(null). Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. Instead it contains a single null element WrappedArray(null). Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || (maxLabelRow.size == 1 && maxLabelRow(0) == null)) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifie
[jira] [Resolved] (SPARK-23143) Add Python support for continuous trigger
[ https://issues.apache.org/jira/browse/SPARK-23143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23143. --- Resolution: Fixed Fix Version/s: 2.3.0 3.0.0 Issue resolved by pull request 20309 [https://github.com/apache/spark/pull/20309] > Add Python support for continuous trigger > - > > Key: SPARK-23143 > URL: https://issues.apache.org/jira/browse/SPARK-23143 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 3.0.0, 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331153#comment-16331153 ] Anirudh Ramanathan commented on SPARK-22962: I think this isn't super critical for this release, mostly a usability thing. If it's small enough, it makes sense, but if it introduces risk and we have to redo manual testing, I'd vote against getting this into 2.3. > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. Instead it contains a single null element WrappedArray(null). Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. Instead it contains a single null element WrappedArray(null). Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTree
[jira] [Commented] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering
[ https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331157#comment-16331157 ] Apache Spark commented on SPARK-22884: -- User 'smurakozi' has created a pull request for this issue: https://github.com/apache/spark/pull/20319 > ML test for StructuredStreaming: spark.ml.clustering > > > Key: SPARK-22884 > URL: https://issues.apache.org/jira/browse/SPARK-22884 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Priority: Major > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering
[ https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22884: Assignee: Apache Spark > ML test for StructuredStreaming: spark.ml.clustering > > > Key: SPARK-22884 > URL: https://issues.apache.org/jira/browse/SPARK-22884 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Major > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering
[ https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22884: Assignee: (was: Apache Spark) > ML test for StructuredStreaming: spark.ml.clustering > > > Key: SPARK-22884 > URL: https://issues.apache.org/jira/browse/SPARK-22884 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Priority: Major > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
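A rough sketch of the kind of streaming coverage SPARK-22884 asks for, assuming a local SparkSession and using the built-in rate source and memory sink rather than the project's actual test helpers (LinearRegressionSuite relies on dedicated utilities): fit a clustering model on static data, then transform a streaming DataFrame and inspect the sink.

{code:java}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.master("local[2]").appName("kmeans-streaming-check").getOrCreate()
import spark.implicits._

// Fit on a small static dataset.
val staticData = Seq(Vectors.dense(0.0, 0.0), Vectors.dense(5.0, 5.0))
  .map(Tuple1.apply).toDF("features")
val model = new KMeans().setK(2).setSeed(1L).fit(staticData)

// Turn the rate source's counter column into feature vectors and transform the stream.
val toVector = udf((v: Long) => Vectors.dense((v % 10).toDouble, (v % 10).toDouble))
val streaming = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .select(toVector($"value").as("features"))

val query = model.transform(streaming)
  .writeStream.format("memory").queryName("kmeans_predictions").outputMode("append")
  .start()

query.awaitTermination(5000)  // let a few batches run
spark.table("kmeans_predictions").show()
query.stop()
{code}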
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the result "maxLabelRow" array is not. Instead it contains a single Row(null) element. Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. Instead it contains a single null element WrappedArray(null). Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClass
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a misleading NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the result "maxLabelRow" array is not. Instead it contains a single Row(null) element. Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the result "maxLabelRow" array is not. Instead it contains a single Row(null) element. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.De
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a misleading NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in function getNumClasses at org.apache.spark.ml.classification.Classifier:106 {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the result "maxLabelRow" array is not. Instead it contains a single Row(null) element. Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a misleading NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the result "maxLabelRow" array is not. Instead it contains a single Row(null) element. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.
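The description edits above converge on the null check. A small self-contained sketch, assuming a local SparkSession, showing why the original isEmpty guard never fires (an aggregate over an empty dataset comes back as a single Row(null), not as zero rows) and how the proposed condition handles it:

{code:java}
import org.apache.spark.SparkException
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder.master("local[2]").appName("empty-max-guard").getOrCreate()
import spark.implicits._

val empty = spark.createDataset(Seq.empty[(Double, Double)]).toDF("label", "feature")

// max over an empty dataset yields one row holding a null, not zero rows.
val maxLabelRow: Array[Row] = empty.select(max($"label")).take(1)
assert(maxLabelRow.nonEmpty && maxLabelRow(0).get(0) == null)

// Proposed guard from the ticket: also treat a null max as an empty dataset.
try {
  if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) {
    throw new SparkException("ML algorithm was given empty dataset.")
  }
} catch {
  case e: SparkException => println(s"caught expected error: ${e.getMessage}")
}
{code}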
[jira] [Resolved] (SPARK-23144) Add console sink for continuous queries
[ https://issues.apache.org/jira/browse/SPARK-23144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23144. --- Resolution: Fixed Fix Version/s: 2.3.0 3.0.0 Issue resolved by pull request 20311 [https://github.com/apache/spark/pull/20311] > Add console sink for continuous queries > --- > > Key: SPARK-23144 > URL: https://issues.apache.org/jira/browse/SPARK-23144 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 3.0.0, 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331185#comment-16331185 ] Apache Spark commented on SPARK-22962: -- User 'liyinan926' has created a pull request for this issue: https://github.com/apache/spark/pull/20320 > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22962: Assignee: Apache Spark > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22962: Assignee: (was: Apache Spark) > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331194#comment-16331194 ] Apache Spark commented on SPARK-23152: -- User 'tovbinm' has created a pull request for this issue: https://github.com/apache/spark/pull/20321 > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} > > The problem happens due to an incorrect guard condition in function > getNumClasses at org.apache.spark.ml.classification.Classifier:106 > {code:java} > val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) > if (maxLabelRow.isEmpty) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > When the input data is empty the result "maxLabelRow" array is not. Instead > it contains a single Row(null) element. > > Proposed solution: the condition can be modified to verify that. > {code:java} > if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23152: Assignee: (was: Apache Spark) > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} > > The problem happens due to an incorrect guard condition in function > getNumClasses at org.apache.spark.ml.classification.Classifier:106 > {code:java} > val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) > if (maxLabelRow.isEmpty) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > When the input data is empty the result "maxLabelRow" array is not. Instead > it contains a single Row(null) element. > > Proposed solution: the condition can be modified to verify that. > {code:java} > if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23152: Assignee: Apache Spark > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Assignee: Apache Spark >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} > > The problem happens due to an incorrect guard condition in function > getNumClasses at org.apache.spark.ml.classification.Classifier:106 > {code:java} > val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) > if (maxLabelRow.isEmpty) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > When the input data is empty the result "maxLabelRow" array is not. Instead > it contains a single Row(null) element. > > Proposed solution: the condition can be modified to verify that. > {code:java} > if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-19185: -- Assignee: (was: Marcelo Vanzin) > ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing > - > > Key: SPARK-19185 > URL: https://issues.apache.org/jira/browse/SPARK-19185 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: Spark 2.0.2 > Spark Streaming Kafka 010 > Mesos 0.28.0 - client mode > spark.executor.cores 1 > spark.mesos.extra.cores 1 >Reporter: Kalvin Chau >Priority: Major > Labels: streaming, windowing > > We've been running into ConcurrentModificationExcpetions "KafkaConsumer is > not safe for multi-threaded access" with the CachedKafkaConsumer. I've been > working through debugging this issue and after looking through some of the > spark source code I think this is a bug. > Our set up is: > Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using > Spark-Streaming-Kafka-010 > spark.executor.cores 1 > spark.mesos.extra.cores 1 > Batch interval: 10s, window interval: 180s, and slide interval: 30s > We would see the exception when in one executor there are two task worker > threads assigned the same Topic+Partition, but a different set of offsets. > They would both get the same CachedKafkaConsumer, and whichever task thread > went first would seek and poll for all the records, and at the same time the > second thread would try to seek to its offset but fail because it is unable > to acquire the lock. > Time0 E0 Task0 - TopicPartition("abc", 0) X to Y > Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z > Time1 E0 Task0 - Seeks and starts to poll > Time1 E0 Task1 - Attempts to seek, but fails > Here are some relevant logs: > {code} > 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394204414 -> 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394238058 -> 4394257712 > 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394204414 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: > Initial fetch for spark-executor-consumer test-topic 2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Seeking to test-topic-2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting > block rdd_199_2 failed due to an exception > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block > rdd_199_2 could not be removed as it was not found on disk or in memory > 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in > task 49.0 in stage 45.0 (TID 3201) > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) > at > 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala
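For reference, below is a minimal sketch of the kind of windowed direct stream described in this report (10 second batches, 180 second window, 30 second slide). The broker address, topic name, and group id are placeholders; the point is only that overlapping windows can hand one executor two tasks covering different offset ranges of the same TopicPartition, which then contend for the shared cached consumer.
{code:java}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("windowed-kafka-example")
val ssc = new StreamingContext(conf, Seconds(10))  // 10s batch interval, as in the report

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",            // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",                   // placeholder
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("test-topic"), kafkaParams))

// 180s window sliding every 30s: consecutive windows overlap, so one executor
// can receive two tasks that read different offset ranges of the same
// TopicPartition and share the cached Kafka consumer.
stream.map(_.value).window(Seconds(180), Seconds(30)).foreachRDD { rdd =>
  rdd.count()
}

ssc.start()
ssc.awaitTermination()
{code}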
[jira] [Assigned] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-19185: -- Assignee: Marcelo Vanzin > ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing > - > > Key: SPARK-19185 > URL: https://issues.apache.org/jira/browse/SPARK-19185 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: Spark 2.0.2 > Spark Streaming Kafka 010 > Mesos 0.28.0 - client mode > spark.executor.cores 1 > spark.mesos.extra.cores 1 >Reporter: Kalvin Chau >Assignee: Marcelo Vanzin >Priority: Major > Labels: streaming, windowing > > We've been running into ConcurrentModificationExcpetions "KafkaConsumer is > not safe for multi-threaded access" with the CachedKafkaConsumer. I've been > working through debugging this issue and after looking through some of the > spark source code I think this is a bug. > Our set up is: > Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using > Spark-Streaming-Kafka-010 > spark.executor.cores 1 > spark.mesos.extra.cores 1 > Batch interval: 10s, window interval: 180s, and slide interval: 30s > We would see the exception when in one executor there are two task worker > threads assigned the same Topic+Partition, but a different set of offsets. > They would both get the same CachedKafkaConsumer, and whichever task thread > went first would seek and poll for all the records, and at the same time the > second thread would try to seek to its offset but fail because it is unable > to acquire the lock. > Time0 E0 Task0 - TopicPartition("abc", 0) X to Y > Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z > Time1 E0 Task0 - Seeks and starts to poll > Time1 E0 Task1 - Attempts to seek, but fails > Here are some relevant logs: > {code} > 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394204414 -> 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394238058 -> 4394257712 > 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394204414 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: > Initial fetch for spark-executor-consumer test-topic 2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Seeking to test-topic-2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting > block rdd_199_2 failed due to an exception > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block > rdd_199_2 could not be removed as it was not found on disk or in memory > 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in > task 49.0 in stage 45.0 (TID 3201) > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) > at > 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd
[jira] [Commented] (SPARK-23133) Spark options are not passed to the Executor in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331241#comment-16331241 ] Apache Spark commented on SPARK-23133: -- User 'foxish' has created a pull request for this issue: https://github.com/apache/spark/pull/20322 > Spark options are not passed to the Executor in Docker context > -- > > Key: SPARK-23133 > URL: https://issues.apache.org/jira/browse/SPARK-23133 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8s using supplied Docker image. >Reporter: Andrew Korzhuev >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > Reproduce: > # Build image with `bin/docker-image-tool.sh`. > # Submit application to k8s. Set executor options, e.g. ` --conf > "spark.executor. > extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"` > # Visit Spark UI on executor and notice that option is not set. > Expected behavior: options from spark-submit should be correctly passed to > executor. > Cause: > `SPARK_EXECUTOR_JAVA_OPTS` is not defined in `entrypoint.sh` > https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L70 > [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L44-L45] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
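One way to check from code whether a -D flag set through spark.executor.extraJavaOptions actually reached the executor JVMs is to read the corresponding system property inside a task. The sketch below is illustrative only; it assumes an existing SparkContext bound to `sc`, and the property name matches the jaas.conf example above.
{code:java}
// Runs a small job and reports the value of the -D property as seen on executors.
val propOnExecutors = sc
  .parallelize(1 to 4, 4)
  .map(_ => Option(System.getProperty("java.security.auth.login.config")).getOrElse("<not set>"))
  .distinct()
  .collect()

// "<not set>" everywhere indicates the executor option was dropped.
println(propOnExecutors.mkString(", "))
{code}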
[jira] [Created] (SPARK-23153) Support application dependencies in submission client's local file system
Yinan Li created SPARK-23153: Summary: Support application dependencies in submission client's local file system Key: SPARK-23153 URL: https://issues.apache.org/jira/browse/SPARK-23153 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.0 Reporter: Yinan Li -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23133) Spark options are not passed to the Executor in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23133. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20322 [https://github.com/apache/spark/pull/20322] > Spark options are not passed to the Executor in Docker context > -- > > Key: SPARK-23133 > URL: https://issues.apache.org/jira/browse/SPARK-23133 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8s using supplied Docker image. >Reporter: Andrew Korzhuev >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Reproduce: > # Build image with `bin/docker-image-tool.sh`. > # Submit application to k8s. Set executor options, e.g. ` --conf > "spark.executor. > extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"` > # Visit Spark UI on executor and notice that option is not set. > Expected behavior: options from spark-submit should be correctly passed to > executor. > Cause: > `SPARK_EXECUTOR_JAVA_OPTS` is not defined in `entrypoint.sh` > https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L70 > [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L44-L45] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23133) Spark options are not passed to the Executor in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23133: -- Assignee: Andrew Korzhuev > Spark options are not passed to the Executor in Docker context > -- > > Key: SPARK-23133 > URL: https://issues.apache.org/jira/browse/SPARK-23133 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8s using supplied Docker image. >Reporter: Andrew Korzhuev >Assignee: Andrew Korzhuev >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Reproduce: > # Build image with `bin/docker-image-tool.sh`. > # Submit application to k8s. Set executor options, e.g. ` --conf > "spark.executor. > extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"` > # Visit Spark UI on executor and notice that option is not set. > Expected behavior: options from spark-submit should be correctly passed to > executor. > Cause: > `SPARK_EXECUTOR_JAVA_OPTS` is not defined in `entrypoint.sh` > https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L70 > [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L44-L45] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22216) Improving PySpark/Pandas interoperability
[ https://issues.apache.org/jira/browse/SPARK-22216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-22216: --- Description: This is an umbrella ticket tracking the general effort to improve performance and interoperability between PySpark and Pandas. The core idea is to use Apache Arrow as the serialization format to reduce the overhead between PySpark and Pandas. (was: This is an umbrella ticket tracking the general effect of improving performance and interoperability between PySpark and Pandas. The core idea is to use Apache Arrow as the serialization format to reduce the overhead between PySpark and Pandas.) > Improving PySpark/Pandas interoperability > - > > Key: SPARK-22216 > URL: https://issues.apache.org/jira/browse/SPARK-22216 > Project: Spark > Issue Type: Epic > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin >Assignee: Li Jin >Priority: Major > > This is an umbrella ticket tracking the general effort to improve performance > and interoperability between PySpark and Pandas. The core idea is to use Apache > Arrow as the serialization format to reduce the overhead between PySpark and > Pandas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail
[ https://issues.apache.org/jira/browse/SPARK-23094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23094: Assignee: Burak Yavuz > Json Readers choose wrong encoding when bad records are present and fail > > > Key: SPARK-23094 > URL: https://issues.apache.org/jira/browse/SPARK-23094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 2.3.0 > > > The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser > code paths for expressions but not the readers. We should also cover reader > code paths reading files with bad characters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail
[ https://issues.apache.org/jira/browse/SPARK-23094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23094. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20302 [https://github.com/apache/spark/pull/20302] > Json Readers choose wrong encoding when bad records are present and fail > > > Key: SPARK-23094 > URL: https://issues.apache.org/jira/browse/SPARK-23094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 2.3.0 > > > The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser > code paths for expressions but not the readers. We should also cover reader > code paths reading files with bad characters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
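As a loose illustration of the reader code path mentioned above, the sketch below reads a JSON file containing malformed records in permissive mode, surfacing the bad rows through a corrupt-record column instead of failing the whole read. The file path is a placeholder, an existing SparkSession bound to `spark` is assumed, and the snippet is not tied to the specific encoding fix delivered in this ticket.
{code:java}
import org.apache.spark.sql.functions.col

// Placeholder path; keep unparseable records instead of failing the read.
val df = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/path/to/json-with-bad-records")

// Rows that could not be parsed carry a non-null corrupt-record column.
df.filter(col("_corrupt_record").isNotNull).show(false)
{code}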
[jira] [Resolved] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22962. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20320 [https://github.com/apache/spark/pull/20320] > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Yinan Li >Priority: Major > Fix For: 2.3.0 > > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-22962: -- Assignee: Yinan Li > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Yinan Li >Priority: Major > Fix For: 2.3.0 > > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23142) Add documentation for Continuous Processing
[ https://issues.apache.org/jira/browse/SPARK-23142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23142. --- Resolution: Fixed Fix Version/s: 2.3.0 3.0.0 Issue resolved by pull request 20308 [https://github.com/apache/spark/pull/20308] > Add documentation for Continuous Processing > --- > > Key: SPARK-23142 > URL: https://issues.apache.org/jira/browse/SPARK-23142 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 3.0.0, 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23091) Incorrect unit test for approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-23091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23091: Assignee: (was: Apache Spark) > Incorrect unit test for approxQuantile > -- > > Key: SPARK-23091 > URL: https://issues.apache.org/jira/browse/SPARK-23091 > Project: Spark > Issue Type: Improvement > Components: ML, Tests >Affects Versions: 2.2.1 >Reporter: Kuang Chen >Priority: Minor > > Currently, test for `approxQuantile` (quantile estimation algorithm) checks > whether estimated quantile is within +- 2*`relativeError` from the true > quantile. See the code below: > [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala#L157] > However, based on the original paper by Greenwald and Khanna, the estimated > quantile is guaranteed to be within +- `relativeError` from the true > quantile. Using the double "tolerance" is misleading and incorrect, and we > should fix it. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23091) Incorrect unit test for approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-23091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23091: Assignee: Apache Spark > Incorrect unit test for approxQuantile > -- > > Key: SPARK-23091 > URL: https://issues.apache.org/jira/browse/SPARK-23091 > Project: Spark > Issue Type: Improvement > Components: ML, Tests >Affects Versions: 2.2.1 >Reporter: Kuang Chen >Assignee: Apache Spark >Priority: Minor > > Currently, test for `approxQuantile` (quantile estimation algorithm) checks > whether estimated quantile is within +- 2*`relativeError` from the true > quantile. See the code below: > [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala#L157] > However, based on the original paper by Greenwald and Khanna, the estimated > quantile is guaranteed to be within +- `relativeError` from the true > quantile. Using the double "tolerance" is misleading and incorrect, and we > should fix it. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
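Below is a sketch of the tighter check the ticket argues for, stated in rank terms as in the Greenwald-Khanna guarantee: the exact rank of the returned sample should lie within relativeError * n of the target rank, with no extra factor of two. The dataset and column name are illustrative, and an existing SparkSession bound to `spark` is assumed.
{code:java}
import org.apache.spark.sql.functions.col

val n = 1000
val relativeError = 0.01
val df = spark.range(1, n + 1).toDF("value")

// Single 50th-percentile estimate with the given relative error.
val Array(q) = df.stat.approxQuantile("value", Array(0.5), relativeError)

// Rank of the returned value versus the target rank.
val rank = df.filter(col("value") <= q).count()
val target = 0.5 * n

// Tighter assertion: within relativeError * n, not 2 * relativeError * n.
assert(math.abs(rank - target) <= relativeError * n)
{code}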