[jira] [Commented] (SPARK-22974) CountVectorModel does not attach attributes to output column

2018-01-18 Thread Xiayun Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330235#comment-16330235
 ] 

Xiayun Sun commented on SPARK-22974:


Do we define the expected behavior here as the Interaction of the occurrence 
counts?

{{CountVectorizer}} returns a {{SparseVector}}, for example 
{{(3,[0,1,2],[1.0,1.0,1.0])}}.

I think it's possible to attach numeric attributes to the values 
{{[1.0, 1.0, 1.0]}}; Interaction would then work.
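As a rough illustration only (not a proposed patch), the metadata can be 
attached by hand today. The sketch below assumes the {{df}} and {{cvm}} from 
the reproduction quoted below and simply names one numeric attribute per 
vocabulary term:

{code:java}
// Minimal sketch: manually attach one NumericAttribute per vocabulary term
// to the CountVectorizerModel output column so that Interaction can read it.
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}

val transformed = cvm.transform(df)   // cvm and df as in the repro below

val attrs: Array[Attribute] = cvm.vocabulary.map(
  term => NumericAttribute.defaultAttr.withName(term): Attribute)

// Re-alias the column with an AttributeGroup carrying the numeric attributes.
val withMeta = transformed.withColumn(
  "features1",
  transformed("features1").as(
    "features1", new AttributeGroup("features1", attrs).toMetadata()))
{code}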

> CountVectorModel does not attach attributes to output column
> 
>
> Key: SPARK-22974
> URL: https://issues.apache.org/jira/browse/SPARK-22974
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: William Zhang
>Priority: Major
>
> If CountVectorizerModel transforms a column, the output column will not have 
> attributes attached to it. If those columns are later used in the Interaction 
> transformer, an exception will be thrown:
> {quote}"org.apache.spark.SparkException: Vector attributes must be defined 
> for interaction."
> {quote}
> To reproduce it:
> {code:java}
> import org.apache.spark.ml.feature._
> import org.apache.spark.sql.functions._
> val df = spark.createDataFrame(Seq(
>   (0, Array("a", "b", "c"), Array("1", "2")),
>   (1, Array("a", "b", "b", "c", "a", "d"), Array("1", "2", "3"))
> )).toDF("id", "words", "nums")
> val cvModel: CountVectorizerModel = new CountVectorizer()
>   .setInputCol("nums")
>   .setOutputCol("features2")
>   .setVocabSize(4)
>   .setMinDF(0)
>   .fit(df)
> val cvm = new CountVectorizerModel(Array("a", "b", "c"))
>   .setInputCol("words")
>   .setOutputCol("features1")
> val df1 = cvm.transform(df)
> val df2 = cvModel.transform(df1)
> val interaction = new Interaction()
>   .setInputCols(Array("features1", "features2"))
>   .setOutputCol("features")
> val df3 = interaction.transform(df2)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23144) Add console sink for continuous queries

2018-01-18 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-23144:
--
Issue Type: Improvement  (was: Bug)

> Add console sink for continuous queries
> ---
>
> Key: SPARK-23144
> URL: https://issues.apache.org/jira/browse/SPARK-23144
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23130) Spark Thrift does not clean-up temporary files (/tmp/*_resources and /tmp/hive/*.pipeout)

2018-01-18 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330269#comment-16330269
 ] 

Marco Gaido commented on SPARK-23130:
-

[~seano] there is no JIRA for the pipeout issue and there cannot be, since it 
is a problem in the Hive codebase, not in Spark's, so there is no fix that can 
be provided in Spark. SPARK-20202 tracks no longer relying on Spark's fork of 
Hive and using proper Hive releases instead. That will solve this issue, since 
the fix needed for the pipeout problem is included in newer Hive versions.

> Spark Thrift does not clean-up temporary files (/tmp/*_resources and 
> /tmp/hive/*.pipeout)
> -
>
> Key: SPARK-23130
> URL: https://issues.apache.org/jira/browse/SPARK-23130
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0, 2.2.0
> Environment: * Hadoop distributions: HDP 2.5 - 2.6.3.0
>  * OS: Seen on SLES12, RHEL 7.3 & RHEL 7.4
>Reporter: Sean Roberts
>Priority: Major
>  Labels: thrift
>
> Spark Thrift is not cleaning up /tmp for files & directories named like:
>  /tmp/hive/*.pipeout
>  /tmp/*_resources
> There are such a large number that /tmp quickly runs out of inodes *causing 
> the partition to be unusable and many services to crash*. This is even true 
> when the only jobs submitted are routine service checks.
> Used `strace` to show that Spark Thrift is responsible:
> {code:java}
> strace.out.118864:04:53:49 
> open("/tmp/hive/55ad7fc1-f79a-4ad8-8e02-26bbeaa86bbc7288010135864174970.pipeout",
>  O_RDWR|O_CREAT|O_EXCL, 0666) = 134
> strace.out.118864:04:53:49 
> mkdir("/tmp/b6dfbf9e-2f7c-4c25-95a1-73c44318ecf4_resources", 0777) = 0
> {code}
> *Those files were left behind, even days later.*
> 
> Example files:
> {code:java}
> # stat 
> /tmp/hive/55ad7fc1-f79a-4ad8-8e02-26bbeaa86bbc7288010135864174970.pipeout
>   File: 
> ‘/tmp/hive/55ad7fc1-f79a-4ad8-8e02-26bbeaa86bbc7288010135864174970.pipeout’
>   Size: 0 Blocks: 0  IO Block: 4096   regular empty file
> Device: fe09h/65033d  Inode: 678 Links: 1
> Access: (0644/-rw-r--r--)  Uid: ( 1000/hive)   Gid: ( 1002/  hadoop)
> Access: 2017-12-19 04:53:49.126777260 -0600
> Modify: 2017-12-19 04:53:49.126777260 -0600
> Change: 2017-12-19 04:53:49.126777260 -0600
>  Birth: -
> # stat /tmp/b6dfbf9e-2f7c-4c25-95a1-73c44318ecf4_resources
>   File: ‘/tmp/b6dfbf9e-2f7c-4c25-95a1-73c44318ecf4_resources’
>   Size: 4096  Blocks: 8  IO Block: 4096   directory
> Device: fe09h/65033d  Inode: 668 Links: 2
> Access: (0700/drwx--)  Uid: ( 1000/hive)   Gid: ( 1002/  hadoop)
> Access: 2017-12-19 04:57:38.458937635 -0600
> Modify: 2017-12-19 04:53:49.062777216 -0600
> Change: 2017-12-19 04:53:49.066777218 -0600
>  Birth: -
> {code}
> Showing the large number:
> {code:java}
> # find /tmp/ -name '*_resources' | wc -l
> 68340
> # find /tmp/hive -name "*.pipeout" | wc -l
> 51837
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23144) Add console sink for continuous queries

2018-01-18 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-23144:
-

 Summary: Add console sink for continuous queries
 Key: SPARK-23144
 URL: https://issues.apache.org/jira/browse/SPARK-23144
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Tathagata Das
Assignee: Tathagata Das
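The ticket body is empty, so here is a hedged sketch of the intended usage 
only (the {{rate}} source and {{Trigger.Continuous}} already exist in 2.3; the 
options shown are illustrative assumptions, not the actual patch):

{code:java}
// Hedged sketch: a continuous query writing its output to the console sink.
import org.apache.spark.sql.streaming.Trigger

val query = spark.readStream
  .format("rate")                          // built-in test source
  .load()
  .writeStream
  .format("console")                       // the sink this ticket targets
  .trigger(Trigger.Continuous("1 second")) // continuous processing mode
  .start()
{code}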






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23144) Add console sink for continuous queries

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23144:


Assignee: Apache Spark  (was: Tathagata Das)

> Add console sink for continuous queries
> ---
>
> Key: SPARK-23144
> URL: https://issues.apache.org/jira/browse/SPARK-23144
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23144) Add console sink for continuous queries

2018-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330276#comment-16330276
 ] 

Apache Spark commented on SPARK-23144:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/20311

> Add console sink for continuous queries
> ---
>
> Key: SPARK-23144
> URL: https://issues.apache.org/jira/browse/SPARK-23144
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23144) Add console sink for continuous queries

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23144:


Assignee: Tathagata Das  (was: Apache Spark)

> Add console sink for continuous queries
> ---
>
> Key: SPARK-23144
> URL: https://issues.apache.org/jira/browse/SPARK-23144
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22974) CountVectorModel does not attach attributes to output column

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22974:


Assignee: Apache Spark

> CountVectorModel does not attach attributes to output column
> 
>
> Key: SPARK-22974
> URL: https://issues.apache.org/jira/browse/SPARK-22974
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: William Zhang
>Assignee: Apache Spark
>Priority: Major
>
> If CountVectorizerModel transforms a column, the output column will not have 
> attributes attached to it. If those columns are later used in the Interaction 
> transformer, an exception will be thrown:
> {quote}"org.apache.spark.SparkException: Vector attributes must be defined 
> for interaction."
> {quote}
> To reproduce it:
> {code:java}
> import org.apache.spark.ml.feature._
> import org.apache.spark.sql.functions._
> val df = spark.createDataFrame(Seq(
>   (0, Array("a", "b", "c"), Array("1", "2")),
>   (1, Array("a", "b", "b", "c", "a", "d"), Array("1", "2", "3"))
> )).toDF("id", "words", "nums")
> val cvModel: CountVectorizerModel = new CountVectorizer()
>   .setInputCol("nums")
>   .setOutputCol("features2")
>   .setVocabSize(4)
>   .setMinDF(0)
>   .fit(df)
> val cvm = new CountVectorizerModel(Array("a", "b", "c"))
>   .setInputCol("words")
>   .setOutputCol("features1")
> val df1 = cvm.transform(df)
> val df2 = cvModel.transform(df1)
> val interaction = new Interaction()
>   .setInputCols(Array("features1", "features2"))
>   .setOutputCol("features")
> val df3 = interaction.transform(df2)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22974) CountVectorModel does not attach attributes to output column

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22974:


Assignee: (was: Apache Spark)

> CountVectorModel does not attach attributes to output column
> 
>
> Key: SPARK-22974
> URL: https://issues.apache.org/jira/browse/SPARK-22974
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: William Zhang
>Priority: Major
>
> If CountVectorizerModel transforms a column, the output column will not have 
> attributes attached to it. If those columns are later used in the Interaction 
> transformer, an exception will be thrown:
> {quote}"org.apache.spark.SparkException: Vector attributes must be defined 
> for interaction."
> {quote}
> To reproduce it:
> {code:java}
> import org.apache.spark.ml.feature._
> import org.apache.spark.sql.functions._
> val df = spark.createDataFrame(Seq(
>   (0, Array("a", "b", "c"), Array("1", "2")),
>   (1, Array("a", "b", "b", "c", "a", "d"), Array("1", "2", "3"))
> )).toDF("id", "words", "nums")
> val cvModel: CountVectorizerModel = new CountVectorizer()
>   .setInputCol("nums")
>   .setOutputCol("features2")
>   .setVocabSize(4)
>   .setMinDF(0)
>   .fit(df)
> val cvm = new CountVectorizerModel(Array("a", "b", "c"))
>   .setInputCol("words")
>   .setOutputCol("features1")
> val df1 = cvm.transform(df)
> val df2 = cvModel.transform(df1)
> val interaction = new Interaction()
>   .setInputCols(Array("features1", "features2"))
>   .setOutputCol("features")
> val df3 = interaction.transform(df2)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22974) CountVectorModel does not attach attributes to output column

2018-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330292#comment-16330292
 ] 

Apache Spark commented on SPARK-22974:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20313

> CountVectorModel does not attach attributes to output column
> 
>
> Key: SPARK-22974
> URL: https://issues.apache.org/jira/browse/SPARK-22974
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: William Zhang
>Priority: Major
>
> If CountVectorizerModel transforms a column, the output column will not have 
> attributes attached to it. If those columns are later used in the Interaction 
> transformer, an exception will be thrown:
> {quote}"org.apache.spark.SparkException: Vector attributes must be defined 
> for interaction."
> {quote}
> To reproduce it:
> {code:java}
> import org.apache.spark.ml.feature._
> import org.apache.spark.sql.functions._
> val df = spark.createDataFrame(Seq(
>   (0, Array("a", "b", "c"), Array("1", "2")),
>   (1, Array("a", "b", "b", "c", "a", "d"), Array("1", "2", "3"))
> )).toDF("id", "words", "nums")
> val cvModel: CountVectorizerModel = new CountVectorizer()
>   .setInputCol("nums")
>   .setOutputCol("features2")
>   .setVocabSize(4)
>   .setMinDF(0)
>   .fit(df)
> val cvm = new CountVectorizerModel(Array("a", "b", "c"))
>   .setInputCol("words")
>   .setOutputCol("features1")
> val df1 = cvm.transform(df)
> val df2 = cvModel.transform(df1)
> val interaction = new Interaction()
>   .setInputCols(Array("features1", "features2"))
>   .setOutputCol("features")
> val df3 = interaction.transform(df2)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23016) Spark UI access and documentation

2018-01-18 Thread Eric Charles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330309#comment-16330309
 ] 

Eric Charles commented on SPARK-23016:
--

Side note: the Spark UI cannot be browsed via a proxy (kubectl proxy, then 
browsing e.g. [1]), because the Spark UI redirects to [http://localhost:8001/jobs].

[1] 
http://localhost:8001/api/v1/namespaces/default/services/http:spitfire-spitfire-spark-ui:4040/proxy

> Spark UI access and documentation
> -
>
> Key: SPARK-23016
> URL: https://issues.apache.org/jira/browse/SPARK-23016
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> We should have instructions for accessing the Spark driver UI, or instruct 
> users to create a service to expose it.
> We also might need an integration test to verify that the driver UI works as 
> expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23145) How to set Apache Spark Executor memory

2018-01-18 Thread Azharuddin (JIRA)
Azharuddin created SPARK-23145:
--

 Summary: How to set Apache Spark Executor memory
 Key: SPARK-23145
 URL: https://issues.apache.org/jira/browse/SPARK-23145
 Project: Spark
  Issue Type: Question
  Components: EC2
Affects Versions: 2.1.0
Reporter: Azharuddin
 Fix For: 2.1.1


How can I increase the memory available to Apache Spark executor nodes?

I have a 2 GB file that is suitable for loading into [Apache 
Spark|https://mindmajix.com/apache-spark-training]. For the moment I am running 
Apache Spark on one machine, so the driver and executor are on the same 
machine. The machine has 8 GB of memory.

When I try to count the lines of the file after setting the file to be cached 
in memory, I get these errors:

{{2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache 
partition rdd_1_1 in memory! Free memory is 278099801 bytes.}}

I looked at the documentation 
[here|http://spark.apache.org/docs/latest/configuration.html] and set 
{{spark.executor.memory}} to {{4g}} in {{$SPARK_HOME/conf/spark-defaults.conf}}.

The UI shows this variable is set in the Spark Environment. You can find a 
screenshot 
[here|https://drive.google.com/file/d/0B0B_O5bxDDlsc3JmY0xfbjFtN0k/view?usp=sharing].

However, when I go to the [Executor 
tab|https://drive.google.com/file/d/0B0B_O5bxDDlsV25ka2lHd1ZFNzA/view?usp=sharing]
 the memory limit for my single executor is still set to 265.4 MB, and I still 
get the same error.

I tried various things mentioned 
[here|https://stackoverflow.com/questions/24242060/how-to-change-memory-per-node-for-apache-spark-worker]
 but I still get the error and don't have a clear idea where I should change 
the setting.

I am running my code interactively from the spark-shell.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23145) How to set Apache Spark Executor memory

2018-01-18 Thread Azharuddin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Azharuddin updated SPARK-23145:
---
Description: 
How can I increase the memory available for Apache spark executor nodes?

I have a 2 GB file that is suitable to loading in to [Apache Spark| 
[https://mindmajix.com|https://mindmajix.com/apache-spark-training]]. I am 
running apache spark for the moment on 1 machine, so the driver and executor 
are on the same machine. The machine has 8 GB of memory.

When I try count the lines of the file after setting the file to be cached in 
memory I get these errors:

{{2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache 
partition rdd_1_1 in memory! Free memory is 278099801 bytes.}}

I looked at the documentation 
[here|http://spark.apache.org/docs/latest/configuration.html] and set 
{{spark.executor.memory}} to {{4g}} in {{$SPARK_HOME/conf/spark-defaults.conf}}

The UI shows this variable is set in the Spark Environment. You can find 
screenshot 
[here|https://drive.google.com/file/d/0B0B_O5bxDDlsc3JmY0xfbjFtN0k/view?usp=sharing]

However when I go to the [Executor 
tab|https://drive.google.com/file/d/0B0B_O5bxDDlsV25ka2lHd1ZFNzA/view?usp=sharing]
 the memory limit for my single Executor is still set to 265.4 MB. I also still 
get the same error.

I tried various things mentioned 
[here|https://stackoverflow.com/questions/24242060/how-to-change-memory-per-node-for-apache-spark-worker]
 but I still get the error and don't have a clear idea where I should change 
the setting.

I am running my code interactively from the spark-shell

  was:
How can I increase the memory available for Apache spark executor nodes?

I have a 2 GB file that is suitable to loading in to [Apache 
Spark|[https://mindmajix.com|https://mindmajix.com/apache-spark-training]]. I 
am running apache spark for the moment on 1 machine, so the driver and executor 
are on the same machine. The machine has 8 GB of memory.

When I try count the lines of the file after setting the file to be cached in 
memory I get these errors:

{{2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache 
partition rdd_1_1 in memory! Free memory is 278099801 bytes. }}

I looked at the documentation 
[here|http://spark.apache.org/docs/latest/configuration.html] and set 
{{spark.executor.memory}} to {{4g}} in {{$SPARK_HOME/conf/spark-defaults.conf}}

The UI shows this variable is set in the Spark Environment. You can find 
screenshot 
[here|https://drive.google.com/file/d/0B0B_O5bxDDlsc3JmY0xfbjFtN0k/view?usp=sharing]

However when I go to the [Executor 
tab|https://drive.google.com/file/d/0B0B_O5bxDDlsV25ka2lHd1ZFNzA/view?usp=sharing]
 the memory limit for my single Executor is still set to 265.4 MB. I also still 
get the same error.

I tried various things mentioned 
[here|https://stackoverflow.com/questions/24242060/how-to-change-memory-per-node-for-apache-spark-worker]
 but I still get the error and don't have a clear idea where I should change 
the setting.

I am running my code interactively from the spark-shell


> How to set Apache Spark Executor memory
> ---
>
> Key: SPARK-23145
> URL: https://issues.apache.org/jira/browse/SPARK-23145
> Project: Spark
>  Issue Type: Question
>  Components: EC2
>Affects Versions: 2.1.0
>Reporter: Azharuddin
>Priority: Major
> Fix For: 2.1.1
>
>   Original Estimate: 954h
>  Remaining Estimate: 954h
>
> How can I increase the memory available for Apache spark executor nodes?
> I have a 2 GB file that is suitable to loading in to [Apache Spark| 
> [https://mindmajix.com|https://mindmajix.com/apache-spark-training]]. I am 
> running apache spark for the moment on 1 machine, so the driver and executor 
> are on the same machine. The machine has 8 GB of memory.
> When I try count the lines of the file after setting the file to be cached in 
> memory I get these errors:
> {{2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache 
> partition rdd_1_1 in memory! Free memory is 278099801 bytes.}}
> I looked at the documentation 
> [here|http://spark.apache.org/docs/latest/configuration.html] and set 
> {{spark.executor.memory}} to {{4g}} in 
> {{$SPARK_HOME/conf/spark-defaults.conf}}
> The UI shows this variable is set in the Spark Environment. You can find 
> screenshot 
> [here|https://drive.google.com/file/d/0B0B_O5bxDDlsc3JmY0xfbjFtN0k/view?usp=sharing]
> However when I go to the [Executor 
> tab|https://drive.google.com/file/d/0B0B_O5bxDDlsV25ka2lHd1ZFNzA/view?usp=sharing]
>  the memory limit for my single Executor is still set to 265.4 MB. I also 
> still get the same error.
> I tried various things mentioned 
> [here|https://stackoverflow.com/questions/24242060/how-to-change-memory-per-node-for-apache-spark-worker]
>  but I still get the error and don't have a clear idea where I should change 
> the setting.
> I am running my code interactively from the spark-shell

[jira] [Updated] (SPARK-23145) How to set Apache Spark Executor memory

2018-01-18 Thread Azharuddin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Azharuddin updated SPARK-23145:
---
Description: 
How can I increase the memory available for Apache spark executor nodes?

I have a 2 GB file that is suitable to loading in to [Apache 
Spark|https://mindmajix.com/apache-spark-training]. I am running apache spark 
for the moment on 1 machine, so the driver and executor are on the same 
machine. The machine has 8 GB of memory.

When I try count the lines of the file after setting the file to be cached in 
memory I get these errors:

{{2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache 
partition rdd_1_1 in memory! Free memory is 278099801 bytes.}}

I looked at the documentation 
[here|http://spark.apache.org/docs/latest/configuration.html] and set 
{{spark.executor.memory}} to {{4g}} in {{$SPARK_HOME/conf/spark-defaults.conf}}

The UI shows this variable is set in the Spark Environment. You can find 
screenshot 
[here|https://drive.google.com/file/d/0B0B_O5bxDDlsc3JmY0xfbjFtN0k/view?usp=sharing]

However when I go to the [Executor 
tab|https://drive.google.com/file/d/0B0B_O5bxDDlsV25ka2lHd1ZFNzA/view?usp=sharing]
 the memory limit for my single Executor is still set to 265.4 MB. I also still 
get the same error.

I tried various things mentioned 
[here|https://stackoverflow.com/questions/24242060/how-to-change-memory-per-node-for-apache-spark-worker]
 but I still get the error and don't have a clear idea where I should change 
the setting.

I am running my code interactively from the spark-shell

  was:
How can I increase the memory available for Apache spark executor nodes?

I have a 2 GB file that is suitable to loading in to [Apache Spark| 
[https://mindmajix.com|https://mindmajix.com/apache-spark-training]]. I am 
running apache spark for the moment on 1 machine, so the driver and executor 
are on the same machine. The machine has 8 GB of memory.

When I try count the lines of the file after setting the file to be cached in 
memory I get these errors:

{{2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache 
partition rdd_1_1 in memory! Free memory is 278099801 bytes.}}

I looked at the documentation 
[here|http://spark.apache.org/docs/latest/configuration.html] and set 
{{spark.executor.memory}} to {{4g}} in {{$SPARK_HOME/conf/spark-defaults.conf}}

The UI shows this variable is set in the Spark Environment. You can find 
screenshot 
[here|https://drive.google.com/file/d/0B0B_O5bxDDlsc3JmY0xfbjFtN0k/view?usp=sharing]

However when I go to the [Executor 
tab|https://drive.google.com/file/d/0B0B_O5bxDDlsV25ka2lHd1ZFNzA/view?usp=sharing]
 the memory limit for my single Executor is still set to 265.4 MB. I also still 
get the same error.

I tried various things mentioned 
[here|https://stackoverflow.com/questions/24242060/how-to-change-memory-per-node-for-apache-spark-worker]
 but I still get the error and don't have a clear idea where I should change 
the setting.

I am running my code interactively from the spark-shell


> How to set Apache Spark Executor memory
> ---
>
> Key: SPARK-23145
> URL: https://issues.apache.org/jira/browse/SPARK-23145
> Project: Spark
>  Issue Type: Question
>  Components: EC2
>Affects Versions: 2.1.0
>Reporter: Azharuddin
>Priority: Major
> Fix For: 2.1.1
>
>   Original Estimate: 954h
>  Remaining Estimate: 954h
>
> How can I increase the memory available for Apache spark executor nodes?
> I have a 2 GB file that is suitable to loading in to [Apache 
> Spark|https://mindmajix.com/apache-spark-training]. I am running apache spark 
> for the moment on 1 machine, so the driver and executor are on the same 
> machine. The machine has 8 GB of memory.
> When I try count the lines of the file after setting the file to be cached in 
> memory I get these errors:
> {{2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache 
> partition rdd_1_1 in memory! Free memory is 278099801 bytes.}}
> I looked at the documentation 
> [here|http://spark.apache.org/docs/latest/configuration.html] and set 
> {{spark.executor.memory}} to {{4g}} in 
> {{$SPARK_HOME/conf/spark-defaults.conf}}
> The UI shows this variable is set in the Spark Environment. You can find 
> screenshot 
> [here|https://drive.google.com/file/d/0B0B_O5bxDDlsc3JmY0xfbjFtN0k/view?usp=sharing]
> However when I go to the [Executor 
> tab|https://drive.google.com/file/d/0B0B_O5bxDDlsV25ka2lHd1ZFNzA/view?usp=sharing]
>  the memory limit for my single Executor is still set to 265.4 MB. I also 
> still get the same error.
> I tried various things mentioned 
> [here|https://stackoverflow.com/questions/24242060/how-to-change-memory-per-node-for-apache-spark-worker]
>  but I still get the error and don't have a clear idea where I should change 
> the setting.
> I am running my code interactively from the spark-shell

[jira] [Resolved] (SPARK-23145) How to set Apache Spark Executor memory

2018-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23145.
---
   Resolution: Invalid
Fix Version/s: (was: 2.1.1)

Please post these to the mailing list or StackOverflow

> How to set Apache Spark Executor memory
> ---
>
> Key: SPARK-23145
> URL: https://issues.apache.org/jira/browse/SPARK-23145
> Project: Spark
>  Issue Type: Question
>  Components: EC2
>Affects Versions: 2.1.0
>Reporter: Azharuddin
>Priority: Major
>   Original Estimate: 954h
>  Remaining Estimate: 954h
>
> How can I increase the memory available for Apache spark executor nodes?
> I have a 2 GB file that is suitable to loading in to [Apache 
> Spark|https://mindmajix.com/apache-spark-training]. I am running apache spark 
> for the moment on 1 machine, so the driver and executor are on the same 
> machine. The machine has 8 GB of memory.
> When I try count the lines of the file after setting the file to be cached in 
> memory I get these errors:
> {{2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache 
> partition rdd_1_1 in memory! Free memory is 278099801 bytes.}}
> I looked at the documentation 
> [here|http://spark.apache.org/docs/latest/configuration.html] and set 
> {{spark.executor.memory}} to {{4g}} in 
> {{$SPARK_HOME/conf/spark-defaults.conf}}
> The UI shows this variable is set in the Spark Environment. You can find 
> screenshot 
> [here|https://drive.google.com/file/d/0B0B_O5bxDDlsc3JmY0xfbjFtN0k/view?usp=sharing]
> However when I go to the [Executor 
> tab|https://drive.google.com/file/d/0B0B_O5bxDDlsV25ka2lHd1ZFNzA/view?usp=sharing]
>  the memory limit for my single Executor is still set to 265.4 MB. I also 
> still get the same error.
> I tried various things mentioned 
> [here|https://stackoverflow.com/questions/24242060/how-to-change-memory-per-node-for-apache-spark-worker]
>  but I still get the error and don't have a clear idea where I should change 
> the setting.
> I am running my code interactively from the spark-shell



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23104) Document that kubernetes is still "experimental"

2018-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330381#comment-16330381
 ] 

Apache Spark commented on SPARK-23104:
--

User 'foxish' has created a pull request for this issue:
https://github.com/apache/spark/pull/20314

> Document that kubernetes is still "experimental"
> 
>
> Key: SPARK-23104
> URL: https://issues.apache.org/jira/browse/SPARK-23104
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Anirudh Ramanathan
>Priority: Critical
>
> As discussed in the mailing list, we should document that the kubernetes 
> backend is still experimental.
> That does not need to include any code changes. This is just meant to tell 
> users that they can expect changes in how the backend behaves in future 
> versions, and that things like configuration, the container image's layout 
> and entry points might change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23104) Document that kubernetes is still "experimental"

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23104:


Assignee: Anirudh Ramanathan  (was: Apache Spark)

> Document that kubernetes is still "experimental"
> 
>
> Key: SPARK-23104
> URL: https://issues.apache.org/jira/browse/SPARK-23104
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Anirudh Ramanathan
>Priority: Critical
>
> As discussed in the mailing list, we should document that the kubernetes 
> backend is still experimental.
> That does not need to include any code changes. This is just meant to tell 
> users that they can expect changes in how the backend behaves in future 
> versions, and that things like configuration, the container image's layout 
> and entry points might change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23104) Document that kubernetes is still "experimental"

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23104:


Assignee: Apache Spark  (was: Anirudh Ramanathan)

> Document that kubernetes is still "experimental"
> 
>
> Key: SPARK-23104
> URL: https://issues.apache.org/jira/browse/SPARK-23104
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Critical
>
> As discussed in the mailing list, we should document that the kubernetes 
> backend is still experimental.
> That does not need to include any code changes. This is just meant to tell 
> users that they can expect changes in how the backend behaves in future 
> versions, and that things like configuration, the container image's layout 
> and entry points might change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23146) Support client mode for Kubernetes cluster backend

2018-01-18 Thread Anirudh Ramanathan (JIRA)
Anirudh Ramanathan created SPARK-23146:
--

 Summary: Support client mode for Kubernetes cluster backend
 Key: SPARK-23146
 URL: https://issues.apache.org/jira/browse/SPARK-23146
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Anirudh Ramanathan


This issue tracks client mode support within Spark when running in the 
Kubernetes cluster backend.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23146) Support client mode for Kubernetes cluster backend

2018-01-18 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-23146:
---
Issue Type: New Feature  (was: Improvement)

> Support client mode for Kubernetes cluster backend
> --
>
> Key: SPARK-23146
> URL: https://issues.apache.org/jira/browse/SPARK-23146
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> This issue tracks client mode support within Spark when running in the 
> Kubernetes cluster backend.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23016) Spark UI access and documentation

2018-01-18 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330389#comment-16330389
 ] 

Anirudh Ramanathan commented on SPARK-23016:


Good point - yeah, we don't recommend the API server proxy in the docs anymore 
- 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/running-on-kubernetes.html.

> Spark UI access and documentation
> -
>
> Key: SPARK-23016
> URL: https://issues.apache.org/jira/browse/SPARK-23016
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> We should have instructions for accessing the Spark driver UI, or instruct 
> users to create a service to expose it.
> We also might need an integration test to verify that the driver UI works as 
> expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23140) DataSourceV2Strategy is missing in HiveSessionStateBuilder

2018-01-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23140:
---

Assignee: Saisai Shao

> DataSourceV2Strategy is missing in HiveSessionStateBuilder
> --
>
> Key: SPARK-23140
> URL: https://issues.apache.org/jira/browse/SPARK-23140
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
> Fix For: 2.3.0
>
>
> DataSourceV2Strategy is not added to HiveSessionStateBuilder's planner, 
> which leads to an exception when running a continuous query:
> {noformat}
> ERROR ContinuousExecution: Query abc [id = 
> 5cb6404a-e907-4662-b5d7-20037ccd6947, runId = 
> 617b8dea-018e-4082-935e-98d98d473fdd] terminated with error
> java.lang.AssertionError: assertion failed: No plan for WriteToDataSourceV2 
> org.apache.spark.sql.execution.streaming.sources.ContinuousMemoryWriter@3dba6d7c
> +- StreamingDataSourceV2Relation [timestamp#15, value#16L], 
> org.apache.spark.sql.execution.streaming.continuous.RateStreamContinuousReader@62ceac53
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$2.apply(ContinuousExecution.scala:221)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$2.apply(ContinuousExecution.scala:212)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runContinuous(ContinuousExecution.scala:212)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runActivatedStream(ContinuousExecution.scala:94)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> {noformat}
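For context, a hedged sketch of the kind of query that hits this code path 
(assuming a Hive-enabled session plus the {{rate}} source and memory sink 
named in the stack trace above; illustrative only, not the reported setup):

{code:java}
// With enableHiveSupport(), planning goes through HiveSessionStateBuilder,
// which lacked DataSourceV2Strategy, so the continuous query fails to plan.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("continuous-repro")
  .enableHiveSupport()   // selects HiveSessionStateBuilder
  .getOrCreate()

val query = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .format("memory")
  .queryName("abc")
  .trigger(Trigger.Continuous("1 second"))
  .start()
{code}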



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23140) DataSourceV2Strategy is missing in HiveSessionStateBuilder

2018-01-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23140.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20305
[https://github.com/apache/spark/pull/20305]

> DataSourceV2Strategy is missing in HiveSessionStateBuilder
> --
>
> Key: SPARK-23140
> URL: https://issues.apache.org/jira/browse/SPARK-23140
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
> Fix For: 2.3.0
>
>
> DataSourceV2Strategy is not added to HiveSessionStateBuilder's planner, 
> which leads to an exception when running a continuous query:
> {noformat}
> ERROR ContinuousExecution: Query abc [id = 
> 5cb6404a-e907-4662-b5d7-20037ccd6947, runId = 
> 617b8dea-018e-4082-935e-98d98d473fdd] terminated with error
> java.lang.AssertionError: assertion failed: No plan for WriteToDataSourceV2 
> org.apache.spark.sql.execution.streaming.sources.ContinuousMemoryWriter@3dba6d7c
> +- StreamingDataSourceV2Relation [timestamp#15, value#16L], 
> org.apache.spark.sql.execution.streaming.continuous.RateStreamContinuousReader@62ceac53
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$2.apply(ContinuousExecution.scala:221)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$2.apply(ContinuousExecution.scala:212)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runContinuous(ContinuousExecution.scala:212)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runActivatedStream(ContinuousExecution.scala:94)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23147) Stage page will throw exception when there's no complete tasks

2018-01-18 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-23147:

Attachment: Screen Shot 2018-01-18 at 8.50.08 PM.png

> Stage page will throw exception when there's no complete tasks
> --
>
> Key: SPARK-23147
> URL: https://issues.apache.org/jira/browse/SPARK-23147
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
> Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png
>
>
> Current Stage page's task table will throw an exception when there's no 
> complete tasks, to reproduce this issue, user could submit code like:
> {code}
> sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect()
> {code}
> Then open the UI and click into stage details.
> Deep dive into the code, found that current UI can only show the completed 
> tasks, it is different from 2.2 code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23147) Stage page will throw exception when there's no complete tasks

2018-01-18 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-23147:
---

 Summary: Stage page will throw exception when there's no complete 
tasks
 Key: SPARK-23147
 URL: https://issues.apache.org/jira/browse/SPARK-23147
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.0
Reporter: Saisai Shao
 Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png

The current Stage page's task table will throw an exception when there are no 
completed tasks. To reproduce this issue, a user could submit code like:

{code}
sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect()
{code}

Then open the UI and click into the stage details.

Deep diving into the code, I found that the current UI can only show completed 
tasks, which is different from the 2.2 code.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23147) Stage page will throw exception when there's no complete tasks

2018-01-18 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-23147:

Description: 
The current Stage page's task table will throw an exception when there are no 
completed tasks. To reproduce this issue, a user could submit code like:

{code}
sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect()
{code}

Then open the UI and click into the stage details. Below is the screenshot.

 !Screen Shot 2018-01-18 at 8.50.08 PM.png! 

Deep diving into the code, I found that the current UI can only show completed 
tasks, which is different from the 2.2 code.


  was:
Current Stage page's task table will throw an exception when there's no 
complete tasks, to reproduce this issue, user could submit code like:

{code}
sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect()
{code}

Then open the UI and click into stage details.

Deep dive into the code, found that current UI can only show the completed 
tasks, it is different from 2.2 code.



> Stage page will throw exception when there's no complete tasks
> --
>
> Key: SPARK-23147
> URL: https://issues.apache.org/jira/browse/SPARK-23147
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
> Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png
>
>
> Current Stage page's task table will throw an exception when there's no 
> complete tasks, to reproduce this issue, user could submit code like:
> {code}
> sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect()
> {code}
> Then open the UI and click into stage details. Below is the screenshot.
>  !Screen Shot 2018-01-18 at 8.50.08 PM.png! 
> Deep dive into the code, found that current UI can only show the completed 
> tasks, it is different from 2.2 code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2018-01-18 Thread Bogdan Raducanu (JIRA)
Bogdan Raducanu created SPARK-23148:
---

 Summary: spark.read.csv with multiline=true gives 
FileNotFoundException if path contains spaces
 Key: SPARK-23148
 URL: https://issues.apache.org/jira/browse/SPARK-23148
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Bogdan Raducanu


Repro code:
{code:java}
spark.range(10).write.csv("/tmp/a b c/a.csv")
spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count
10
spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count
java.io.FileNotFoundException: File 
file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv
 does not exist
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23147) Stage page will throw exception when there's no complete tasks

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23147:


Assignee: (was: Apache Spark)

> Stage page will throw exception when there's no complete tasks
> --
>
> Key: SPARK-23147
> URL: https://issues.apache.org/jira/browse/SPARK-23147
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
> Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png
>
>
> Current Stage page's task table will throw an exception when there's no 
> complete tasks, to reproduce this issue, user could submit code like:
> {code}
> sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect()
> {code}
> Then open the UI and click into stage details. Below is the screenshot.
>  !Screen Shot 2018-01-18 at 8.50.08 PM.png! 
> Deep dive into the code, found that current UI can only show the completed 
> tasks, it is different from 2.2 code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23147) Stage page will throw exception when there's no complete tasks

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23147:


Assignee: Apache Spark

> Stage page will throw exception when there's no complete tasks
> --
>
> Key: SPARK-23147
> URL: https://issues.apache.org/jira/browse/SPARK-23147
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Major
> Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png
>
>
> Current Stage page's task table will throw an exception when there's no 
> complete tasks, to reproduce this issue, user could submit code like:
> {code}
> sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect()
> {code}
> Then open the UI and click into stage details. Below is the screenshot.
>  !Screen Shot 2018-01-18 at 8.50.08 PM.png! 
> Deep dive into the code, found that current UI can only show the completed 
> tasks, it is different from 2.2 code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23147) Stage page will throw exception when there's no complete tasks

2018-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330496#comment-16330496
 ] 

Apache Spark commented on SPARK-23147:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/20315

> Stage page will throw exception when there's no complete tasks
> --
>
> Key: SPARK-23147
> URL: https://issues.apache.org/jira/browse/SPARK-23147
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
> Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png
>
>
> Current Stage page's task table will throw an exception when there's no 
> complete tasks, to reproduce this issue, user could submit code like:
> {code}
> sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect()
> {code}
> Then open the UI and click into stage details. Below is the screenshot.
>  !Screen Shot 2018-01-18 at 8.50.08 PM.png! 
> Deep dive into the code, found that current UI can only show the completed 
> tasks, it is different from 2.2 code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22036) BigDecimal multiplication sometimes returns null

2018-01-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22036:
---

Assignee: Marco Gaido

> BigDecimal multiplication sometimes returns null
> 
>
> Key: SPARK-22036
> URL: https://issues.apache.org/jira/browse/SPARK-22036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Olivier Blanvillain
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.0
>
>
> The multiplication of two BigDecimal numbers sometimes returns null. Here is 
> a minimal reproduction:
> {code:java}
> object Main extends App {
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   val conf = new SparkConf()
>     .setMaster("local[*]")
>     .setAppName("REPL")
>     .set("spark.ui.enabled", "false")
>   val spark = SparkSession.builder().config(conf).getOrCreate()
>   implicit val sqlContext = spark.sqlContext
>   import spark.implicits._
>   case class X2(a: BigDecimal, b: BigDecimal)
>   val ds = sqlContext.createDataset(
>     List(X2(BigDecimal(-0.1267333984375), BigDecimal(-1000.1))))
>   val result = ds.select(ds("a") * ds("b")).collect.head
>   println(result) // [null]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22036) BigDecimal multiplication sometimes returns null

2018-01-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22036.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20023
[https://github.com/apache/spark/pull/20023]

> BigDecimal multiplication sometimes returns null
> 
>
> Key: SPARK-22036
> URL: https://issues.apache.org/jira/browse/SPARK-22036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Olivier Blanvillain
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.0
>
>
> The multiplication of two BigDecimal numbers sometimes returns null. Here is 
> a minimal reproduction:
> {code:java}
> object Main extends App {
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.SparkSession
>   import spark.implicits._
>   val conf = new 
> SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", 
> "false")
>   val spark = 
> SparkSession.builder().config(conf).appName("REPL").getOrCreate()
>   implicit val sqlContext = spark.sqlContext
>   case class X2(a: BigDecimal, b: BigDecimal)
>   val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), 
> BigDecimal(-1000.1
>   val result = ds.select(ds("a") * ds("b")).collect.head
>   println(result) // [null]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23141) Support data type string as a returnType for registerJavaFunction.

2018-01-18 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23141:


Assignee: Takuya Ueshin

> Support data type string as a returnType for registerJavaFunction.
> --
>
> Key: SPARK-23141
> URL: https://issues.apache.org/jira/browse/SPARK-23141
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently {{UDFRegistration.registerJavaFunction}} doesn't support data type 
> string as a returnType whereas {{UDFRegistration.register}}, {{@udf}}, or 
> {{@pandas_udf}} does.
>  We can support it for {{UDFRegistration.registerJavaFunction}} as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23141) Support data type string as a returnType for registerJavaFunction.

2018-01-18 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23141.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20307
[https://github.com/apache/spark/pull/20307]

> Support data type string as a returnType for registerJavaFunction.
> --
>
> Key: SPARK-23141
> URL: https://issues.apache.org/jira/browse/SPARK-23141
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently {{UDFRegistration.registerJavaFunction}} doesn't support data type 
> string as a returnType whereas {{UDFRegistration.register}}, {{@udf}}, or 
> {{@pandas_udf}} does.
>  We can support it for {{UDFRegistration.registerJavaFunction}} as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19280) Failed Recovery from checkpoint caused by the multi-threads issue in Spark Streaming scheduler

2018-01-18 Thread Riccardo Vincelli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330558#comment-16330558
 ] 

Riccardo Vincelli commented on SPARK-19280:
---

What is the exception you are getting? For me it is:

This RDD lacks a SparkContext. It could happen in the following cases: 
(1) RDD transformations and actions are NOT invoked by the driver, but inside 
of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) 
is invalid because the values transformation and count action cannot be 
performed inside of the rdd1.map transformation. For more information, see 
SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be 
hit if a reference to an RDD not defined by the streaming job is used in 
DStream operations. For more information, See SPARK-13758.

Not sure yet if 100% related.

Thanks,

> Failed Recovery from checkpoint caused by the multi-threads issue in Spark 
> Streaming scheduler
> --
>
> Key: SPARK-19280
> URL: https://issues.apache.org/jira/browse/SPARK-19280
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nan Zhu
>Priority: Critical
>
> In one of our applications, we found the following issue, the application 
> recovering from a checkpoint file named "checkpoint-***16670" but with 
> the timestamp ***16650 will recover from the very beginning of the stream 
> and because our application relies on the external & periodically-cleaned 
> data (syncing with checkpoint cleanup), the recovery just failed
> We identified a potential issue in Spark Streaming checkpoint and will 
> describe it with the following example. We will propose a fix in the end of 
> this JIRA.
> 1. The application properties: Batch Duration: 2, Functionality: Single 
> Stream calling ReduceByKeyAndWindow and print, Window Size: 6, 
> SlideDuration, 2
> 2. RDD at 16650 is generated and the corresponding job is submitted to 
> the execution ThreadPool. Meanwhile, a DoCheckpoint message is sent to the 
> queue of JobGenerator
> 3. Job at 16650 is finished and JobCompleted message is sent to 
> JobScheduler's queue, and meanwhile, Job at 16652 is submitted to the 
> execution ThreadPool and similarly, a DoCheckpoint is sent to the queue of 
> JobGenerator
> 4. JobScheduler's message processing thread (I will use JS-EventLoop to 
> identify it) is not scheduled by the operating system for a long time, and 
> during this period, Jobs generated from 16652 - 16670 are generated 
> and completed.
> 5. the message processing thread of JobGenerator (JG-EventLoop) is scheduled 
> and processed all DoCheckpoint messages for jobs ranging from 16652 - 
> 16670 and checkpoint files are successfully written. CRITICAL: at this 
> moment, the lastCheckpointTime would be 16670.
> 6. JS-EventLoop is scheduled, and process all JobCompleted messages for jobs 
> ranging from 16652 - 16670. CRITICAL: a ClearMetadata message is 
> pushed to JobGenerator's message queue for EACH JobCompleted.
> 7. The current message queue contains 20 ClearMetadata messages and 
> JG-EventLoop is scheduled to process them. CRITICAL: ClearMetadata will 
> remove all RDDs out of rememberDuration window. In our case, 
> ReduceByKeyAndWindow will set rememberDuration to 10 (rememberDuration of 
> ReducedWindowDStream (4) + windowSize), with the result that only RDDs in 
> (16660, 16670] are kept. And ClearMetadata processing logic will push 
> a DoCheckpoint to JobGenerator's thread
> 8. JG-EventLoop is scheduled again to process DoCheckpoint for 1665, VERY 
> CRITICAL: at this step, RDD no later than 16660 has been removed, and 
> checkpoint data is updated as  
> https://github.com/apache/spark/blob/a81e336f1eddc2c6245d807aae2c81ddc60eabf9/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala#L53
>  and 
> https://github.com/apache/spark/blob/a81e336f1eddc2c6245d807aae2c81ddc60eabf9/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala#L59.
> 9. After 8, a checkpoint named /path/checkpoint-16670 is created but with 
> the timestamp 16650. and at this moment, Application crashed
> 10. Application recovers from /path/checkpoint-16670 and try to get RDD 
> with validTime 16650. Of course it will not find it and has to recompute. 
> In the case of ReduceByKeyAndWindow, it needs to recursively slice RDDs until 
> to the start of the stream. When the stream depends on the external data, it 
> will not successfully recover. In the case of Kafka, the recovered RDDs would 
> not be the same as the original one, as the currentOffsets has been u

[jira] [Comment Edited] (SPARK-19280) Failed Recovery from checkpoint caused by the multi-threads issue in Spark Streaming scheduler

2018-01-18 Thread Riccardo Vincelli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330558#comment-16330558
 ] 

Riccardo Vincelli edited comment on SPARK-19280 at 1/18/18 2:20 PM:


What is the exception you are getting? For me it is:

This RDD lacks a SparkContext. It could happen in the following cases: 
 (1) RDD transformations and actions are NOT invoked by the driver, but inside 
of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) 
is invalid because the values transformation and count action cannot be 
performed inside of the rdd1.map transformation. For more information, see 
SPARK-5063.
 (2) When a Spark Streaming job recovers from checkpoint, this exception will 
be hit if a reference to an RDD not defined by the streaming job is used in 
DStream operations. For more information, See SPARK-13758.

Not sure yet if 100% related, but our program:
- reads Kafka topics
- uses mapWithState on an initialState

Thanks,


was (Author: rvincelli):
What is the exception you are getting? For me it is:

This RDD lacks a SparkContext. It could happen in the following cases: 
(1) RDD transformations and actions are NOT invoked by the driver, but inside 
of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) 
is invalid because the values transformation and count action cannot be 
performed inside of the rdd1.map transformation. For more information, see 
SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be 
hit if a reference to an RDD not defined by the streaming job is used in 
DStream operations. For more information, See SPARK-13758.

Not sure yet if 100% related.

Thanks,

> Failed Recovery from checkpoint caused by the multi-threads issue in Spark 
> Streaming scheduler
> --
>
> Key: SPARK-19280
> URL: https://issues.apache.org/jira/browse/SPARK-19280
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Nan Zhu
>Priority: Critical
>
> In one of our applications, we found the following issue, the application 
> recovering from a checkpoint file named "checkpoint-***16670" but with 
> the timestamp ***16650 will recover from the very beginning of the stream 
> and because our application relies on the external & periodically-cleaned 
> data (syncing with checkpoint cleanup), the recovery just failed
> We identified a potential issue in Spark Streaming checkpoint and will 
> describe it with the following example. We will propose a fix in the end of 
> this JIRA.
> 1. The application properties: Batch Duration: 2, Functionality: Single 
> Stream calling ReduceByKeyAndWindow and print, Window Size: 6, 
> SlideDuration, 2
> 2. RDD at 16650 is generated and the corresponding job is submitted to 
> the execution ThreadPool. Meanwhile, a DoCheckpoint message is sent to the 
> queue of JobGenerator
> 3. Job at 16650 is finished and JobCompleted message is sent to 
> JobScheduler's queue, and meanwhile, Job at 16652 is submitted to the 
> execution ThreadPool and similarly, a DoCheckpoint is sent to the queue of 
> JobGenerator
> 4. JobScheduler's message processing thread (I will use JS-EventLoop to 
> identify it) is not scheduled by the operating system for a long time, and 
> during this period, Jobs generated from 16652 - 16670 are generated 
> and completed.
> 5. the message processing thread of JobGenerator (JG-EventLoop) is scheduled 
> and processed all DoCheckpoint messages for jobs ranging from 16652 - 
> 16670 and checkpoint files are successfully written. CRITICAL: at this 
> moment, the lastCheckpointTime would be 16670.
> 6. JS-EventLoop is scheduled, and process all JobCompleted messages for jobs 
> ranging from 16652 - 16670. CRITICAL: a ClearMetadata message is 
> pushed to JobGenerator's message queue for EACH JobCompleted.
> 7. The current message queue contains 20 ClearMetadata messages and 
> JG-EventLoop is scheduled to process them. CRITICAL: ClearMetadata will 
> remove all RDDs out of rememberDuration window. In our case, 
> ReduceByKeyAndWindow will set rememberDuration to 10 (rememberDuration of 
> ReducedWindowDStream (4) + windowSize), with the result that only RDDs in 
> (16660, 16670] are kept. And ClearMetadata processing logic will push 
> a DoCheckpoint to JobGenerator's thread
> 8. JG-EventLoop is scheduled again to process DoCheckpoint for 1665, VERY 
> CRITICAL: at this step, RDD no later than 16660 has been removed, and 
> checkpoint data is updated as  
> https://github.com/apache/spark/blob/a81e336f1eddc2c6245d807aae2c81ddc60eabf9/streaming/src/main/scala/org/apache/sp

[jira] [Created] (SPARK-23149) polish ColumnarBatch

2018-01-18 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-23149:
---

 Summary: polish ColumnarBatch
 Key: SPARK-23149
 URL: https://issues.apache.org/jira/browse/SPARK-23149
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23149) polish ColumnarBatch

2018-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330610#comment-16330610
 ] 

Apache Spark commented on SPARK-23149:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/20316

> polish ColumnarBatch
> 
>
> Key: SPARK-23149
> URL: https://issues.apache.org/jira/browse/SPARK-23149
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23149) polish ColumnarBatch

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23149:


Assignee: Apache Spark  (was: Wenchen Fan)

> polish ColumnarBatch
> 
>
> Key: SPARK-23149
> URL: https://issues.apache.org/jira/browse/SPARK-23149
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23149) polish ColumnarBatch

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23149:


Assignee: Wenchen Fan  (was: Apache Spark)

> polish ColumnarBatch
> 
>
> Key: SPARK-23149
> URL: https://issues.apache.org/jira/browse/SPARK-23149
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23150) kafka job failing assertion failed: Beginning offset 5451 is after the ending offset 5435 for topic

2018-01-18 Thread Balu (JIRA)
Balu created SPARK-23150:


 Summary: kafka job failing assertion failed: Beginning offset 5451 
is after the ending offset 5435 for topic
 Key: SPARK-23150
 URL: https://issues.apache.org/jira/browse/SPARK-23150
 Project: Spark
  Issue Type: Question
  Components: Build
Affects Versions: 0.9.0
 Environment: Spark Job failing with below error. Please help

Kafka version 0.10.0.0

Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most 
recent failure: Lost task 3.3 in stage 2.0 (TID 16, 10.3.12.16): 
java.lang.AssertionError: assertion failed: Beginning offset 5451 is after the 
ending offset 5435 for topic project1 partition 2. You either provided an 
invalid fromOffset, or the Kafka topic has been damaged at 
scala.Predef$.assert(Predef.scala:179) at 
org.apache.spark.streaming.kafka.KafkaRDD.compute(KafkaRDD.scala:86) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at 
org.apache.spark.scheduler.Task.run(Task.scala:70) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) Driver stacktrace:
Reporter: Balu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23150) kafka job failing assertion failed: Beginning offset 5451 is after the ending offset 5435 for topic

2018-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23150.
---
Resolution: Invalid

> kafka job failing assertion failed: Beginning offset 5451 is after the ending 
> offset 5435 for topic
> ---
>
> Key: SPARK-23150
> URL: https://issues.apache.org/jira/browse/SPARK-23150
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Affects Versions: 0.9.0
> Environment: Spark Job failing with below error. Please help
> Kafka version 0.10.0.0
> Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most 
> recent failure: Lost task 3.3 in stage 2.0 (TID 16, 10.3.12.16): 
> java.lang.AssertionError: assertion failed: Beginning offset 5451 is after 
> the ending offset 5435 for topic project1 partition 2. You either provided an 
> invalid fromOffset, or the Kafka topic has been damaged at 
> scala.Predef$.assert(Predef.scala:179) at 
> org.apache.spark.streaming.kafka.KafkaRDD.compute(KafkaRDD.scala:86) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at 
> org.apache.spark.scheduler.Task.run(Task.scala:70) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745) Driver stacktrace:
>Reporter: Balu
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23151) Provide a distribution Spark with Hadoop 3.0

2018-01-18 Thread Louis Burke (JIRA)
Louis Burke created SPARK-23151:
---

 Summary: Provide a distribution Spark with Hadoop 3.0
 Key: SPARK-23151
 URL: https://issues.apache.org/jira/browse/SPARK-23151
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Spark Core
Affects Versions: 2.2.1, 2.2.0
Reporter: Louis Burke


Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark package
only supports Hadoop 2.7, i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication is
that using up-to-date Kinesis libraries alongside S3 causes a clash with respect
to aws-java-sdk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0

2018-01-18 Thread Louis Burke (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Louis Burke updated SPARK-23151:

Summary: Provide a distribution of Spark with Hadoop 3.0  (was: Provide a 
distribution Spark with Hadoop 3.0)

> Provide a distribution of Spark with Hadoop 3.0
> ---
>
> Key: SPARK-23151
> URL: https://issues.apache.org/jira/browse/SPARK-23151
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Louis Burke
>Priority: Major
>
> Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark package
> only supports Hadoop 2.7, i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication is
> that using up-to-date Kinesis libraries alongside S3 causes a clash with respect
> to aws-java-sdk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23147) Stage page will throw exception when there's no complete tasks

2018-01-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23147.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20315
[https://github.com/apache/spark/pull/20315]

> Stage page will throw exception when there's no complete tasks
> --
>
> Key: SPARK-23147
> URL: https://issues.apache.org/jira/browse/SPARK-23147
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png
>
>
> The current Stage page's task table will throw an exception when there are no 
> completed tasks. To reproduce this issue, a user could submit code like:
> {code}
> sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect()
> {code}
> Then open the UI and click into stage details. Below is the screenshot.
>  !Screen Shot 2018-01-18 at 8.50.08 PM.png! 
> A deep dive into the code found that the current UI can only show completed 
> tasks, which differs from the 2.2 code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23147) Stage page will throw exception when there's no complete tasks

2018-01-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23147:
--

Assignee: Saisai Shao

> Stage page will throw exception when there's no complete tasks
> --
>
> Key: SPARK-23147
> URL: https://issues.apache.org/jira/browse/SPARK-23147
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png
>
>
> The current Stage page's task table will throw an exception when there are no 
> completed tasks. To reproduce this issue, a user could submit code like:
> {code}
> sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect()
> {code}
> Then open the UI and click into stage details. Below is the screenshot.
>  !Screen Shot 2018-01-18 at 8.50.08 PM.png! 
> A deep dive into the code found that the current UI can only show completed 
> tasks, which differs from the 2.2 code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel

2018-01-18 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330943#comment-16330943
 ] 

Andrew Ash commented on SPARK-22982:


[~joshrosen] do you have some example stacktraces of what this bug can cause?  
Several of our clusters hit what I think is this problem earlier this month, 
see below for details.

 

For a few days in January (4th through 12th) on our AWS infra, we observed 
massively degraded disk read throughput (down to 33% of previous peaks).  
During this time, we also began observing intermittent exceptions coming from 
Spark at read time of parquet files that a previous Spark job had written.  
When the read throughput recovered on the 12th, we stopped observing the 
exceptions and haven't seen them since.

At first we observed this stacktrace when reading .snappy.parquet files:
{noformat}
java.lang.RuntimeException: java.io.IOException: could not read page Page 
[bytes.size=1048641, valueCount=29945, uncompressedSize=1048641] in col 
[my_column] BINARY
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:493)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:486)
at org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:96)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:486)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:157)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:229)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:398)
at 
org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191)
at 
org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80)
at 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
at 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: could not read page Page [bytes.size=1048641, 
valueCount=29945, uncompressedSize=1048641] in col [my_column] BINARY
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV1(VectorizedColumnReader.java:562)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$000(VectorizedColumnReader.java:47)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:490)
... 31 more
Caused by: java.io

[jira] [Issue Comment Deleted] (SPARK-13572) HiveContext reads avro Hive tables incorrectly

2018-01-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-13572:
---
Comment: was deleted

(was: Guten Tag, Bonjour, Buongiorno
Thank you for your e-mail. I am currently out of the office and will return on 
27.06.2016. Your e-mail will not be forwarded.
Kind regards
Russ Weedon
Professional ERP Engineer

SBB AG
SBB Informatik Solution Center Finanzen, K-SCM, HR und Immobilien
Lindenhofstrasse 1, 3000 Bern 65
Mobil +41 78 812 47 62
russ.wee...@sbb.ch / www.sbb.ch

)

> HiveContext reads avro Hive tables incorrectly 
> ---
>
> Key: SPARK-13572
> URL: https://issues.apache.org/jira/browse/SPARK-13572
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2, 1.6.0, 1.6.1
> Environment: Hive 0.13.1, Spark 1.5.2
>Reporter: Zoltan Fedor
>Priority: Major
> Attachments: logs, table_definition
>
>
> I am using PySpark to read avro-based tables from Hive and while the avro 
> tables can be read, some of the columns are incorrectly read - showing value 
> {{None}} instead of the actual value.
> {noformat}
> >>> results_df = sqlContext.sql("""SELECT * FROM trmdw_prod.opsconsole_ingest 
> >>> where year=2016 and month=2 and day=29 limit 3""")
> >>> results_df.take(3)
> [Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, 
> uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, 
> statvalue=None, displayname=None, category=None, 
> source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29),
>  Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, 
> uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, 
> statvalue=None, displayname=None, category=None, 
> source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29),
>  Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, 
> uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, 
> statvalue=None, displayname=None, category=None, 
> source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29)]
> {noformat}
> Observe the {{None}} values at most of the fields. Surprisingly not all 
> fields, only some of them are showing {{None}} instead of the real values. 
> The table definition does not show anything specific about these columns.
> Running the same query in Hive:
> {noformat}
> c:hive2://xyz.com:100> SELECT * FROM trmdw_prod.opsconsole_ingest where 
> year=2016 and month=2 and day=29 limit 3;
> +--+---++---+---+---+++-+--++-++-+--++--+
> | opsconsole_ingest.kafkaoffsetgeneration  | opsconsole_ingest.kafkapartition 
>  | opsconsole_ingest.kafkaoffset  |  opsconsole_ingest.uuid   |   
>   opsconsole_ingest.mid | opsconsole_ingest.iid | 
> opsconsole_ingest.product  | opsconsole_ingest.utctime  | 
> opsconsole_ingest.statcode  | opsconsole_ingest.statvalue  | 
> opsconsole_ingest.displayname  | opsconsole_ingest.category  | 
> opsconsole_ingest.source_filename  | opsconsole_ingest.year  | 
> opsconsole_ingest.month  | opsconsole_ingest.day  |
> +--+---++---+---+---+++-+--++-++-+--++--+
> | 11.0 | 0.0  
>  | 3.83399394E8   | EF0D03C409681B98646F316CA1088973  | 
> 174f53fb-ca9b-d3f9-64e1-7631bf906817  | ----  
> | est| 2016-01-13T06:58:19| 8 
>

[jira] [Resolved] (SPARK-23029) Doc spark.shuffle.file.buffer units are kb when no units specified

2018-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23029.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20269
[https://github.com/apache/spark/pull/20269]

> Doc spark.shuffle.file.buffer units are kb when no units specified
> --
>
> Key: SPARK-23029
> URL: https://issues.apache.org/jira/browse/SPARK-23029
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Fernando Pereira
>Assignee: Fernando Pereira
>Priority: Minor
> Fix For: 2.3.0
>
>
> When setting the spark.shuffle.file.buffer option, even to its default 
> value, shuffles fail.
> This appears to affect small- to medium-sized partitions. Strangely, the error 
> message is OutOfMemoryError, but it works with large partitions (at least 
> >32MB).
> {code}
> pyspark --conf "spark.shuffle.file.buffer=$((32*1024))"
> /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit
>  pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768
> version 2.2.1
> >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", 
> >>> mode="overwrite")
> [Stage 1:>(0 + 10) / 
> 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 11)
> java.lang.OutOfMemoryError: Java heap space
>   at java.io.BufferedOutputStream.(BufferedOutputStream.java:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:107)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
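
A minimal sketch for illustration (assumed from the summary above, not quoted from the fix): without a unit suffix the value is read as KiB, so a bare 32768 configures a 32768 KiB (32 MiB) buffer for every open DiskBlockObjectWriter rather than the intended 32 KiB, which is what shows up as the OutOfMemoryError in the stack trace. Passing an explicit unit avoids it:

{code}
// Minimal sketch (assumed): give the buffer size an explicit unit suffix.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.file.buffer", "32k")      // 32 KiB per shuffle writer (the default)
  // .set("spark.shuffle.file.buffer", "32768") // unit-less: read as 32768 KiB = 32 MiB per writer
{code}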



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23029) Doc spark.shuffle.file.buffer units are kb when no units specified

2018-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23029:
-

Assignee: Fernando Pereira

> Doc spark.shuffle.file.buffer units are kb when no units specified
> --
>
> Key: SPARK-23029
> URL: https://issues.apache.org/jira/browse/SPARK-23029
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Fernando Pereira
>Assignee: Fernando Pereira
>Priority: Minor
> Fix For: 2.3.0
>
>
> When setting the spark.shuffle.file.buffer option, even to its default 
> value, shuffles fail.
> This appears to affect small- to medium-sized partitions. Strangely, the error 
> message is OutOfMemoryError, but it works with large partitions (at least 
> >32MB).
> {code}
> pyspark --conf "spark.shuffle.file.buffer=$((32*1024))"
> /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit
>  pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768
> version 2.2.1
> >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", 
> >>> mode="overwrite")
> [Stage 1:>(0 + 10) / 
> 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 11)
> java.lang.OutOfMemoryError: Java heap space
>   at java.io.BufferedOutputStream.(BufferedOutputStream.java:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:107)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier

2018-01-18 Thread Matthew Tovbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Tovbin updated SPARK-23152:
---
Description: 
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])

new DecisionTreeClassifier()
 .setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the "maxLabelRow" array is not. It contains a 
single element with no columns in it. Therefore the condition has to be 
modified to verify that.
{code:java}
maxLabelRow.size <= 1 
{code}
 

 

  was:
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce:

 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])

new DecisionTreeClassifier()
 .setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 

The error:

 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
 

 

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses

 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the "maxLabelRow" array is not. It contains a 
single element with no columns in it. Therefore the condition has to be 
modified to verify that.
{code:java}
maxLabelRow.size <= 1 
{code}
 

 

 


> Invalid guard condition in org.apache.spark.ml.classification.Classifier
> 
>
> Key: SPARK-23152
> URL: https://issues.apache.org/jira/browse/SPARK-23152
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 
> 2.3.1
>Reporter: Matthew Tovbin
>Priority: Minor
>  Labels: easyfix
>
> When fitting a classifier that extends 
> "org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
> DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
> thrown.
> Steps to reproduce: 
> {code:java}
> val data = spark.createDataset(Seq.empty[(Double, 
> org.apache.spark.ml.linalg.Vector)])
> new DecisionTreeClassifier()
>  .setLabelCol("_1").setFeaturesCol("_2").fit(data)
> {code}
>  The error: 
> {code:java}
> java.lang.NullPointerException: Value at index 0 is null
> at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
> at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
> at 
> org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
>   

[jira] [Created] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier

2018-01-18 Thread Matthew Tovbin (JIRA)
Matthew Tovbin created SPARK-23152:
--

 Summary: Invalid guard condition in 
org.apache.spark.ml.classification.Classifier
 Key: SPARK-23152
 URL: https://issues.apache.org/jira/browse/SPARK-23152
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 2.1.2, 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0, 2.1.3, 2.3.0, 
2.3.1
Reporter: Matthew Tovbin


When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce:

 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])

new DecisionTreeClassifier()
 .setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 

The error:

 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
 

 

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses

 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the "maxLabelRow" array is not. It contains a 
single element with no columns in it. Therefore the condition has to be 
modified to verify that.
{code:java}
maxLabelRow.size <= 1 
{code}
 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20928) SPIP: Continuous Processing Mode for Structured Streaming

2018-01-18 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331045#comment-16331045
 ] 

Thomas Graves commented on SPARK-20928:
---

What is the status of this? It looks like the subtasks are finished.

> SPIP: Continuous Processing Mode for Structured Streaming
> -
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>Priority: Major
>  Labels: SPIP
> Attachments: Continuous Processing in Structured Streaming Design 
> Sketch.pdf
>
>
> Given the current Source API, the minimum possible latency for any record is 
> bounded by the amount of time that it takes to launch a task.  This 
> limitation is a result of the fact that {{getBatch}} requires us to know both 
> the starting and the ending offset, before any tasks are launched.  In the 
> worst case, the end-to-end latency is actually closer to the average batch 
> time + task launching time.
> For applications where latency is more important than exactly-once output 
> however, it would be useful if processing could happen continuously.  This 
> would allow us to achieve fully pipelined reading and writing from sources 
> such as Kafka.  This kind of architecture would make it possible to process 
> records with end-to-end latencies on the order of 1 ms, rather than the 
> 10-100ms that is possible today.
> One possible architecture here would be to change the Source API to look like 
> the following rough sketch:
> {code}
>   trait Epoch {
> def data: DataFrame
> /** The exclusive starting position for `data`. */
> def startOffset: Offset
> /** The inclusive ending position for `data`.  Incrementally updated 
> during processing, but not complete until execution of the query plan in 
> `data` is finished. */
> def endOffset: Offset
>   }
>   def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], 
> limits: Limits): Epoch
> {code}
> The above would allow us to build an alternative implementation of 
> {{StreamExecution}} that processes continuously with much lower latency and 
> only stops processing when needing to reconfigure the stream (either due to a 
> failure or a user requested change in parallelism.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier

2018-01-18 Thread Matthew Tovbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Tovbin updated SPARK-23152:
---
Description: 
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])

new DecisionTreeClassifier()
 .setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the "maxLabelRow" array is not. It contains a 
single element with no columns in it.

 

Proposed solution: the condition can be modified to verify that.
{code:java}
if (maxLabelRow.size <= 1) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 

  was:
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])

new DecisionTreeClassifier()
 .setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the "maxLabelRow" array is not. It contains a 
single element with no columns in it. Therefore the condition has to be 
modified to verify that.
{code:java}
maxLabelRow.size <= 1 
{code}
 

 


> Invalid guard condition in org.apache.spark.ml.classification.Classifier
> 
>
> Key: SPARK-23152
> URL: https://issues.apache.org/jira/browse/SPARK-23152
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 
> 2.3.1
>Reporter: Matthew Tovbin
>Priority: Minor
>  Labels: easyfix
>
> When fitting a classifier that extends 
> "org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
> DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
> thrown.
> Steps to reproduce: 
> {code:java}
> val data = spark.createDataset(Seq.empty[(Double, 
> org.apache.spark.ml.linalg.Vector)])
> new DecisionTreeClassifier()
>  .setLabelCol("_1").setFeaturesCol("_2").fit(data)
> {code}
>  The error: 
> {code:java}
> java.lang.NullPointerException: Value at index 0 is null
> at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
> at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
> at 
> org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)

[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier

2018-01-18 Thread Matthew Tovbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Tovbin updated SPARK-23152:
---
Description: 
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])
new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the "maxLabelRow" array is not. It contains a 
single element with no columns in it.

 

Proposed solution: the condition can be modified to verify that.
{code:java}
if (maxLabelRow.size <= 1) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 

  was:
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])

new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the "maxLabelRow" array is not. It contains a 
single element with no columns in it.

 

Proposed solution: the condition can be modified to verify that.
{code:java}
if (maxLabelRow.size <= 1) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 


> Invalid guard condition in org.apache.spark.ml.classification.Classifier
> 
>
> Key: SPARK-23152
> URL: https://issues.apache.org/jira/browse/SPARK-23152
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 
> 2.3.1
>Reporter: Matthew Tovbin
>Priority: Minor
>  Labels: easyfix
>
> When fitting a classifier that extends 
> "org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
> DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
> thrown.
> Steps to reproduce: 
> {code:java}
> val data = spark.createDataset(Seq.empty[(Double, 
> org.apache.spark.ml.linalg.Vector)])
> new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
> {code}
>  The error: 
> {code:java}
> java.lang.NullPointerException: Value at index 0 is null
> at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
> at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
> at 
> org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
> at 
> org.apache.spark.ml.

[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier

2018-01-18 Thread Matthew Tovbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Tovbin updated SPARK-23152:
---
Description: 
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])

new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the "maxLabelRow" array is not. It contains a 
single element with no columns in it.

 

Proposed solution: the condition can be modified to verify that.
{code:java}
if (maxLabelRow.size <= 1) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 

  was:
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])

new DecisionTreeClassifier()
 .setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the "maxLabelRow" array is not. It contains a 
single element with no columns in it.

 

Proposed solution: the condition can be modified to verify that.
{code:java}
if (maxLabelRow.size <= 1) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 


> Invalid guard condition in org.apache.spark.ml.classification.Classifier
> 
>
> Key: SPARK-23152
> URL: https://issues.apache.org/jira/browse/SPARK-23152
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 
> 2.3.1
>Reporter: Matthew Tovbin
>Priority: Minor
>  Labels: easyfix
>
> When fitting a classifier that extends 
> "org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
> DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
> thrown.
> Steps to reproduce: 
> {code:java}
> val data = spark.createDataset(Seq.empty[(Double, 
> org.apache.spark.ml.linalg.Vector)])
> new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
> {code}
>  The error: 
> {code:java}
> java.lang.NullPointerException: Value at index 0 is null
> at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
> at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
> at 
> org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
> at 
> org.apache.spark.

[jira] [Commented] (SPARK-23133) Spark options are not passed to the Executor in Docker context

2018-01-18 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331074#comment-16331074
 ] 

Anirudh Ramanathan commented on SPARK-23133:


Thanks for submitting the PR fixing this.

> Spark options are not passed to the Executor in Docker context
> --
>
> Key: SPARK-23133
> URL: https://issues.apache.org/jira/browse/SPARK-23133
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: Running Spark on K8s using supplied Docker image.
>Reporter: Andrew Korzhuev
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Reproduce:
>  # Build image with `bin/docker-image-tool.sh`.
>  # Submit application to k8s. Set executor options, e.g. ` --conf 
> "spark.executor. 
> extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"`
>  # Visit Spark UI on executor and notice that option is not set.
> Expected behavior: options from spark-submit should be correctly passed to 
> executor.
> Cause:
> `SPARK_EXECUTOR_JAVA_OPTS` is not defined in `entrypoint.sh`
> https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L70
> [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L44-L45]






[jira] [Commented] (SPARK-23082) Allow separate node selectors for driver and executors in Kubernetes

2018-01-18 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331081#comment-16331081
 ] 

Anirudh Ramanathan commented on SPARK-23082:


This is an interesting feature request. We had some discussion about this. I 
think it's a bit too late for a feature request in 2.3, so we can revisit this 
in the 2.4 timeframe.

> Allow separate node selectors for driver and executors in Kubernetes
> 
>
> Key: SPARK-23082
> URL: https://issues.apache.org/jira/browse/SPARK-23082
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Submit
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Oz Ben-Ami
>Priority: Minor
>
> In YARN, we can use {{spark.yarn.am.nodeLabelExpression}} to submit the Spark 
> driver to a different set of nodes from its executors. In Kubernetes, we can 
> specify {{spark.kubernetes.node.selector.[labelKey]}}, but we can't use 
> separate options for the driver and executors.
> This would be useful for the particular use case where executors can go on 
> more ephemeral nodes (eg, with cluster autoscaling, or preemptible/spot 
> instances), but the driver should use a more persistent machine.
> The required change would be minimal, essentially just using different config 
> keys for the 
> [driver|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L90]
>  and 
> [executor|https://github.com/apache/spark/blob/0b2eefb674151a0af64806728b38d9410da552ec/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L73]
>  instead of {{KUBERNETES_NODE_SELECTOR_PREFIX}} for both.
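
As a rough sketch of what this would look like from the user side (the shared 
selector key shown is the existing 2.3 behavior; the per-role keys are 
hypothetical names implied by this request, not an existing API):
{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Existing behavior: one selector applies to both driver and executor pods.
  .set("spark.kubernetes.node.selector.disktype", "ssd")
  // Hypothetical per-role keys this issue asks for (names are illustrative only):
  // .set("spark.kubernetes.driver.node.selector.pool", "persistent")
  // .set("spark.kubernetes.executor.node.selector.pool", "preemptible")
{code}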






[jira] [Comment Edited] (SPARK-23082) Allow separate node selectors for driver and executors in Kubernetes

2018-01-18 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331081#comment-16331081
 ] 

Anirudh Ramanathan edited comment on SPARK-23082 at 1/18/18 7:52 PM:
-

This is an interesting feature request. We had some discussion about this. I 
think it's a bit too late for a feature request in 2.3, so we can revisit this 
in the 2.4 timeframe.

 

[~mcheah]


was (Author: foxish):
This is an interesting feature request. We had some discussion about this. I 
think it's a bit too late for a feature request in 2.3, so, we can revisit this 
in the 2.4 timeframe.

> Allow separate node selectors for driver and executors in Kubernetes
> 
>
> Key: SPARK-23082
> URL: https://issues.apache.org/jira/browse/SPARK-23082
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Submit
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Oz Ben-Ami
>Priority: Minor
>
> In YARN, we can use {{spark.yarn.am.nodeLabelExpression}} to submit the Spark 
> driver to a different set of nodes from its executors. In Kubernetes, we can 
> specify {{spark.kubernetes.node.selector.[labelKey]}}, but we can't use 
> separate options for the driver and executors.
> This would be useful for the particular use case where executors can go on 
> more ephemeral nodes (eg, with cluster autoscaling, or preemptible/spot 
> instances), but the driver should use a more persistent machine.
> The required change would be minimal, essentially just using different config 
> keys for the 
> [driver|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L90]
>  and 
> [executor|https://github.com/apache/spark/blob/0b2eefb674151a0af64806728b38d9410da552ec/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L73]
>  instead of {{KUBERNETES_NODE_SELECTOR_PREFIX}} for both.






[jira] [Resolved] (SPARK-23016) Spark UI access and documentation

2018-01-18 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan resolved SPARK-23016.

Resolution: Fixed

This is resolved and we've verified it for 2.3.0.

> Spark UI access and documentation
> -
>
> Key: SPARK-23016
> URL: https://issues.apache.org/jira/browse/SPARK-23016
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> We should have instructions to access the spark driver UI, or instruct users 
> to create a service to expose it.
> Also might need an integration test to verify that the driver UI works as 
> expected.






[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used

2018-01-18 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331085#comment-16331085
 ] 

Anirudh Ramanathan commented on SPARK-22962:


This is the resource staging server use-case. We'll upstream this in the 2.4.0 
timeframe.

> Kubernetes app fails if local files are used
> 
>
> Key: SPARK-22962
> URL: https://issues.apache.org/jira/browse/SPARK-22962
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> If you try to start a Spark app on kubernetes using a local file as the app 
> resource, for example, it will fail:
> {code}
> ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar
> {code}
> {noformat}
> + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && 
> env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'
> \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < 
> /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x}
>  ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&   
>   if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP
> ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if 
> ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK
> _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java 
> "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR
> Y -Xmx$SPARK_DRIVER_MEMORY 
> -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS 
> $SPARK_DRIVER_ARGS'
> Error: Could not find or load main class com.cloudera.spark.tests.Sleeper
> {noformat}
> Using an http server to provide the app jar solves the problem.
> The k8s backend should either somehow make these files available to the 
> cluster or error out with a more user-friendly message if that feature is not 
> yet available.






[jira] [Updated] (SPARK-22962) Kubernetes app fails if local files are used

2018-01-18 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-22962:
---
Affects Version/s: (was: 2.3.0)
   2.4.0

> Kubernetes app fails if local files are used
> 
>
> Key: SPARK-22962
> URL: https://issues.apache.org/jira/browse/SPARK-22962
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> If you try to start a Spark app on kubernetes using a local file as the app 
> resource, for example, it will fail:
> {code}
> ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar
> {code}
> {noformat}
> + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && 
> env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'
> \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < 
> /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x}
>  ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&   
>   if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP
> ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if 
> ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK
> _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java 
> "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR
> Y -Xmx$SPARK_DRIVER_MEMORY 
> -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS 
> $SPARK_DRIVER_ARGS'
> Error: Could not find or load main class com.cloudera.spark.tests.Sleeper
> {noformat}
> Using an http server to provide the app jar solves the problem.
> The k8s backend should either somehow make these files available to the 
> cluster or error out with a more user-friendly message if that feature is not 
> yet available.






[jira] [Updated] (SPARK-23146) Support client mode for Kubernetes cluster backend

2018-01-18 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-23146:
---
Affects Version/s: (was: 2.4.0)
   2.3.0

> Support client mode for Kubernetes cluster backend
> --
>
> Key: SPARK-23146
> URL: https://issues.apache.org/jira/browse/SPARK-23146
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> This issue tracks client mode support within Spark when running in the 
> Kubernetes cluster backend.






[jira] [Updated] (SPARK-22962) Kubernetes app fails if local files are used

2018-01-18 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-22962:
---
Affects Version/s: (was: 2.4.0)
   2.3.0

> Kubernetes app fails if local files are used
> 
>
> Key: SPARK-22962
> URL: https://issues.apache.org/jira/browse/SPARK-22962
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> If you try to start a Spark app on kubernetes using a local file as the app 
> resource, for example, it will fail:
> {code}
> ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar
> {code}
> {noformat}
> + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && 
> env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'
> \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < 
> /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x}
>  ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&   
>   if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP
> ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if 
> ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK
> _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java 
> "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR
> Y -Xmx$SPARK_DRIVER_MEMORY 
> -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS 
> $SPARK_DRIVER_ARGS'
> Error: Could not find or load main class com.cloudera.spark.tests.Sleeper
> {noformat}
> Using an http server to provide the app jar solves the problem.
> The k8s backend should either somehow make these files available to the 
> cluster or error out with a more user-friendly message if that feature is not 
> yet available.






[jira] [Updated] (SPARK-23146) Support client mode for Kubernetes cluster backend

2018-01-18 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-23146:
---
Target Version/s: 2.4.0

> Support client mode for Kubernetes cluster backend
> --
>
> Key: SPARK-23146
> URL: https://issues.apache.org/jira/browse/SPARK-23146
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> This issue tracks client mode support within Spark when running in the 
> Kubernetes cluster backend.






[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier

2018-01-18 Thread Matthew Tovbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Tovbin updated SPARK-23152:
---
Description: 
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])
new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the "maxLabelRow" array is not. Instead it 
contains a single null element WrappedArray(null).

 

Proposed solution: the condition can be modified to verify that.
{code:java}
if (maxLabelRow.isEmpty || (maxLabelRow.size == 1 && maxLabelRow(0) == null)) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 

  was:
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])
new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the "maxLabelRow" array is not. It contains a 
single element with no columns in it.

 

Proposed solution: the condition can be modified to verify that.
{code:java}
if (maxLabelRow.size <= 1) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 


> Invalid guard condition in org.apache.spark.ml.classification.Classifier
> 
>
> Key: SPARK-23152
> URL: https://issues.apache.org/jira/browse/SPARK-23152
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 
> 2.3.1
>Reporter: Matthew Tovbin
>Priority: Minor
>  Labels: easyfix
>
> When fitting a classifier that extends 
> "org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
> DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
> thrown.
> Steps to reproduce: 
> {code:java}
> val data = spark.createDataset(Seq.empty[(Double, 
> org.apache.spark.ml.linalg.Vector)])
> new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
> {code}
>  The error: 
> {code:java}
> java.lang.NullPointerException: Value at index 0 is null
> at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
> at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
> at 
> org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(De

[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used

2018-01-18 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331132#comment-16331132
 ] 

Yinan Li commented on SPARK-22962:
--

I agree that before we upstream the staging server, we should fail the 
submission if a user uses local resources. [~vanzin], if it's not too late to 
get into 2.3, I'm gonna file a PR for this.
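
For illustration, a hypothetical sketch of the kind of fail-fast check being 
discussed; the helper name and its placement are invented here, not Spark's 
actual submission code:
{code:java}
import java.net.URI

// Reject app resources that live only on the client machine: without a resource
// staging server they cannot be made visible to the driver/executor pods.
def requireNoClientLocalResources(resources: Seq[String]): Unit = {
  val clientLocal = resources.filter { r =>
    val scheme = Option(new URI(r).getScheme).getOrElse("file")
    scheme == "file"  // "local://" (inside the image) and remote schemes are fine
  }
  require(clientLocal.isEmpty,
    s"Client-local resources are not yet supported on Kubernetes: ${clientLocal.mkString(", ")}")
}
{code}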

> Kubernetes app fails if local files are used
> 
>
> Key: SPARK-22962
> URL: https://issues.apache.org/jira/browse/SPARK-22962
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> If you try to start a Spark app on kubernetes using a local file as the app 
> resource, for example, it will fail:
> {code}
> ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar
> {code}
> {noformat}
> + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && 
> env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'
> \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < 
> /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x}
>  ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&   
>   if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP
> ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if 
> ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK
> _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java 
> "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR
> Y -Xmx$SPARK_DRIVER_MEMORY 
> -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS 
> $SPARK_DRIVER_ARGS'
> Error: Could not find or load main class com.cloudera.spark.tests.Sleeper
> {noformat}
> Using an http server to provide the app jar solves the problem.
> The k8s backend should either somehow make these files available to the 
> cluster or error out with a more user-friendly message if that feature is not 
> yet available.






[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used

2018-01-18 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331140#comment-16331140
 ] 

Marcelo Vanzin commented on SPARK-22962:


Yes send a PR.

> Kubernetes app fails if local files are used
> 
>
> Key: SPARK-22962
> URL: https://issues.apache.org/jira/browse/SPARK-22962
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> If you try to start a Spark app on kubernetes using a local file as the app 
> resource, for example, it will fail:
> {code}
> ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar
> {code}
> {noformat}
> + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && 
> env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'
> \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < 
> /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x}
>  ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&   
>   if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP
> ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if 
> ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK
> _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java 
> "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR
> Y -Xmx$SPARK_DRIVER_MEMORY 
> -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS 
> $SPARK_DRIVER_ARGS'
> Error: Could not find or load main class com.cloudera.spark.tests.Sleeper
> {noformat}
> Using an http server to provide the app jar solves the problem.
> The k8s backend should either somehow make these files available to the 
> cluster or error out with a more user-friendly message if that feature is not 
> yet available.






[jira] [Commented] (SPARK-20928) SPIP: Continuous Processing Mode for Structured Streaming

2018-01-18 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331141#comment-16331141
 ] 

Michael Armbrust commented on SPARK-20928:
--

There is more work to do so I might leave the umbrella open, but we are going 
to have an initial version that supports reading and writing from/to Kafka with 
very low latency in Spark 2.3!  Stay tuned for updates to docs and blog posts.
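
As a rough sketch of the user-facing shape this takes in 2.3 (topic names, 
servers, and the checkpoint path below are placeholders):
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("continuous-kafka").getOrCreate()

// Kafka-to-Kafka pipeline running with a continuous trigger instead of micro-batches.
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "input-topic")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/tmp/continuous-checkpoint")
  .trigger(Trigger.Continuous("1 second"))  // checkpoint interval, not a batch interval
  .start()
{code}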

> SPIP: Continuous Processing Mode for Structured Streaming
> -
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>Priority: Major
>  Labels: SPIP
> Attachments: Continuous Processing in Structured Streaming Design 
> Sketch.pdf
>
>
> Given the current Source API, the minimum possible latency for any record is 
> bounded by the amount of time that it takes to launch a task.  This 
> limitation is a result of the fact that {{getBatch}} requires us to know both 
> the starting and the ending offset, before any tasks are launched.  In the 
> worst case, the end-to-end latency is actually closer to the average batch 
> time + task launching time.
> For applications where latency is more important than exactly-once output 
> however, it would be useful if processing could happen continuously.  This 
> would allow us to achieve fully pipelined reading and writing from sources 
> such as Kafka.  This kind of architecture would make it possible to process 
> records with end-to-end latencies on the order of 1 ms, rather than the 
> 10-100ms that is possible today.
> One possible architecture here would be to change the Source API to look like 
> the following rough sketch:
> {code}
>   trait Epoch {
> def data: DataFrame
> /** The exclusive starting position for `data`. */
> def startOffset: Offset
> /** The inclusive ending position for `data`.  Incrementally updated 
> during processing, but not complete until execution of the query plan in 
> `data` is finished. */
> def endOffset: Offset
>   }
>   def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], 
> limits: Limits): Epoch
> {code}
> The above would allow us to build an alternative implementation of 
> {{StreamExecution}} that processes continuously with much lower latency and 
> only stops processing when needing to reconfigure the stream (either due to a 
> failure or a user-requested change in parallelism).






[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier

2018-01-18 Thread Matthew Tovbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Tovbin updated SPARK-23152:
---
Description: 
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])
new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the "maxLabelRow" array is not. Instead it 
contains a single null element WrappedArray(null).

 

Proposed solution: the condition can be modified to verify that.
{code:java}
if (maxLabelRow.size <= 1) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 

  was:
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])
new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the "maxLabelRow" array is not. Instead it 
contains a single null element WrappedArray(null).

 

Proposed solution: the condition can be modified to verify that.
{code:java}
if (maxLabelRow.isEmpty || (maxLabelRow.size == 1 && maxLabelRow(0) == null)) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 


> Invalid guard condition in org.apache.spark.ml.classification.Classifier
> 
>
> Key: SPARK-23152
> URL: https://issues.apache.org/jira/browse/SPARK-23152
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 
> 2.3.1
>Reporter: Matthew Tovbin
>Priority: Minor
>  Labels: easyfix
>
> When fitting a classifier that extends 
> "org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
> DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
> thrown.
> Steps to reproduce: 
> {code:java}
> val data = spark.createDataset(Seq.empty[(Double, 
> org.apache.spark.ml.linalg.Vector)])
> new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
> {code}
>  The error: 
> {code:java}
> java.lang.NullPointerException: Value at index 0 is null
> at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
> at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
> at 
> org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifie

[jira] [Resolved] (SPARK-23143) Add Python support for continuous trigger

2018-01-18 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-23143.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   3.0.0

Issue resolved by pull request 20309
[https://github.com/apache/spark/pull/20309]

> Add Python support for continuous trigger
> -
>
> Key: SPARK-23143
> URL: https://issues.apache.org/jira/browse/SPARK-23143
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 3.0.0, 2.3.0
>
>







[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used

2018-01-18 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331153#comment-16331153
 ] 

Anirudh Ramanathan commented on SPARK-22962:


I think this isn't super critical for this release, mostly a usability thing. 
If it's small enough, it makes sense, but if it introduces risk and we have to 
redo manual testing, I'd vote against getting this into 2.3.

> Kubernetes app fails if local files are used
> 
>
> Key: SPARK-22962
> URL: https://issues.apache.org/jira/browse/SPARK-22962
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> If you try to start a Spark app on kubernetes using a local file as the app 
> resource, for example, it will fail:
> {code}
> ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar
> {code}
> {noformat}
> + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && 
> env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'
> \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < 
> /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x}
>  ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&   
>   if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP
> ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if 
> ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK
> _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java 
> "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR
> Y -Xmx$SPARK_DRIVER_MEMORY 
> -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS 
> $SPARK_DRIVER_ARGS'
> Error: Could not find or load main class com.cloudera.spark.tests.Sleeper
> {noformat}
> Using an http server to provide the app jar solves the problem.
> The k8s backend should either somehow make these files available to the 
> cluster or error out with a more user-friendly message if that feature is not 
> yet available.






[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier

2018-01-18 Thread Matthew Tovbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Tovbin updated SPARK-23152:
---
Description: 
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])
new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the "maxLabelRow" array is not. Instead it 
contains a single null element WrappedArray(null).

 

Proposed solution: the condition can be modified to verify that.
{code:java}
if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 

  was:
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])
new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the "maxLabelRow" array is not. Instead it 
contains a single null element WrappedArray(null).

 

Proposed solution: the condition can be modified to verify that.
{code:java}
if (maxLabelRow.size <= 1) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 


> Invalid guard condition in org.apache.spark.ml.classification.Classifier
> 
>
> Key: SPARK-23152
> URL: https://issues.apache.org/jira/browse/SPARK-23152
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 
> 2.3.1
>Reporter: Matthew Tovbin
>Priority: Minor
>  Labels: easyfix
>
> When fitting a classifier that extends 
> "org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
> DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
> thrown.
> Steps to reproduce: 
> {code:java}
> val data = spark.createDataset(Seq.empty[(Double, 
> org.apache.spark.ml.linalg.Vector)])
> new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
> {code}
>  The error: 
> {code:java}
> java.lang.NullPointerException: Value at index 0 is null
> at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
> at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
> at 
> org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTree

[jira] [Commented] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering

2018-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331157#comment-16331157
 ] 

Apache Spark commented on SPARK-22884:
--

User 'smurakozi' has created a pull request for this issue:
https://github.com/apache/spark/pull/20319

> ML test for StructuredStreaming: spark.ml.clustering
> 
>
> Key: SPARK-22884
> URL: https://issues.apache.org/jira/browse/SPARK-22884
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Assigned] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22884:


Assignee: Apache Spark

> ML test for StructuredStreaming: spark.ml.clustering
> 
>
> Key: SPARK-22884
> URL: https://issues.apache.org/jira/browse/SPARK-22884
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Major
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Assigned] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22884:


Assignee: (was: Apache Spark)

> ML test for StructuredStreaming: spark.ml.clustering
> 
>
> Key: SPARK-22884
> URL: https://issues.apache.org/jira/browse/SPARK-22884
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier

2018-01-18 Thread Matthew Tovbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Tovbin updated SPARK-23152:
---
Description: 
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])
new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the result "maxLabelRow" array is not. Instead it 
contains a single Row(null) element.

 

Proposed solution: the condition can be modified to verify that.
{code:java}
if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 

  was:
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])
new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty the "maxLabelRow" array is not. Instead it 
contains a single null element WrappedArray(null).

 

Proposed solution: the condition can be modified to verify that.
{code:java}
if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 


> Invalid guard condition in org.apache.spark.ml.classification.Classifier
> 
>
> Key: SPARK-23152
> URL: https://issues.apache.org/jira/browse/SPARK-23152
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 
> 2.3.1
>Reporter: Matthew Tovbin
>Priority: Minor
>  Labels: easyfix
>
> When fitting a classifier that extends 
> "org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
> DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
> thrown.
> Steps to reproduce: 
> {code:java}
> val data = spark.createDataset(Seq.empty[(Double, 
> org.apache.spark.ml.linalg.Vector)])
> new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
> {code}
>  The error: 
> {code:java}
> java.lang.NullPointerException: Value at index 0 is null
> at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
> at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
> at 
> org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
> at 
> org.apache.spark.ml.classification.DecisionTreeClass

[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier

2018-01-18 Thread Matthew Tovbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Tovbin updated SPARK-23152:
---
Description: 
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a misleading 
NullPointerException is thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])
new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty, the resulting "maxLabelRow" array is not empty. 
Instead it contains a single Row(null) element.

 

Proposed solution: the condition can be modified to also verify that the 
retrieved value is not null:
{code:java}
if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 

  was:
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is 
thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])
new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty, the resulting "maxLabelRow" array is not empty. 
Instead it contains a single Row(null) element.

 

Proposed solution: the condition can be modified to also verify that the 
retrieved value is not null:
{code:java}
if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 


> Invalid guard condition in org.apache.spark.ml.classification.Classifier
> 
>
> Key: SPARK-23152
> URL: https://issues.apache.org/jira/browse/SPARK-23152
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 
> 2.3.1
>Reporter: Matthew Tovbin
>Priority: Minor
>  Labels: easyfix
>
> When fitting a classifier that extends 
> "org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
> DecisionTreeClassifier, RandomForestClassifier) a misleading 
> NullPointerException is thrown.
> Steps to reproduce: 
> {code:java}
> val data = spark.createDataset(Seq.empty[(Double, 
> org.apache.spark.ml.linalg.Vector)])
> new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
> {code}
>  The error: 
> {code:java}
> java.lang.NullPointerException: Value at index 0 is null
> at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
> at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
> at 
> org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
> at 
> org.apache.spark.ml.classification.De

[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier

2018-01-18 Thread Matthew Tovbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Tovbin updated SPARK-23152:
---
Description: 
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a misleading 
NullPointerException is thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])
new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in function 
getNumClasses at org.apache.spark.ml.classification.Classifier:106
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty, the resulting "maxLabelRow" array is not empty. 
Instead it contains a single Row(null) element.

 

Proposed solution: the condition can be modified to also verify that the 
retrieved value is not null:
{code:java}
if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 

  was:
When fitting a classifier that extends 
"org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
DecisionTreeClassifier, RandomForestClassifier) a misleading 
NullPointerException is thrown.

Steps to reproduce: 
{code:java}
val data = spark.createDataset(Seq.empty[(Double, 
org.apache.spark.ml.linalg.Vector)])
new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
{code}
 The error: 
{code:java}
java.lang.NullPointerException: Value at index 0 is null

at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
at 
org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
at 
org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
  

The problem happens due to an incorrect guard condition in 
org.apache.spark.ml.classification.Classifier:getNumClasses 
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
When the input data is empty, the resulting "maxLabelRow" array is not empty. 
Instead it contains a single Row(null) element.

 

Proposed solution: the condition can be modified to also verify that the 
retrieved value is not null:
{code:java}
if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}
 

 


> Invalid guard condition in org.apache.spark.ml.classification.Classifier
> 
>
> Key: SPARK-23152
> URL: https://issues.apache.org/jira/browse/SPARK-23152
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 
> 2.3.1
>Reporter: Matthew Tovbin
>Priority: Minor
>  Labels: easyfix
>
> When fitting a classifier that extends 
> "org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
> DecisionTreeClassifier, RandomForestClassifier) a misleading 
> NullPointerException is thrown.
> Steps to reproduce: 
> {code:java}
> val data = spark.createDataset(Seq.empty[(Double, 
> org.apache.spark.ml.linalg.Vector)])
> new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
> {code}
>  The error: 
> {code:java}
> java.lang.NullPointerException: Value at index 0 is null
> at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
> at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
> at 
> org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
> at 
> org.apache.

[jira] [Resolved] (SPARK-23144) Add console sink for continuous queries

2018-01-18 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-23144.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   3.0.0

Issue resolved by pull request 20311
[https://github.com/apache/spark/pull/20311]

> Add console sink for continuous queries
> ---
>
> Key: SPARK-23144
> URL: https://issues.apache.org/jira/browse/SPARK-23144
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 3.0.0, 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used

2018-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331185#comment-16331185
 ] 

Apache Spark commented on SPARK-22962:
--

User 'liyinan926' has created a pull request for this issue:
https://github.com/apache/spark/pull/20320

> Kubernetes app fails if local files are used
> 
>
> Key: SPARK-22962
> URL: https://issues.apache.org/jira/browse/SPARK-22962
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> If you try to start a Spark app on kubernetes using a local file as the app 
> resource, for example, it will fail:
> {code}
> ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar
> {code}
> {noformat}
> + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && 
> env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'
> \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < 
> /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x}
>  ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&   
>   if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP
> ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if 
> ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK
> _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java 
> "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR
> Y -Xmx$SPARK_DRIVER_MEMORY 
> -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS 
> $SPARK_DRIVER_ARGS'
> Error: Could not find or load main class com.cloudera.spark.tests.Sleeper
> {noformat}
> Using an http server to provide the app jar solves the problem.
> The k8s backend should either somehow make these files available to the 
> cluster or error out with a more user-friendly message if that feature is not 
> yet available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22962) Kubernetes app fails if local files are used

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22962:


Assignee: Apache Spark

> Kubernetes app fails if local files are used
> 
>
> Key: SPARK-22962
> URL: https://issues.apache.org/jira/browse/SPARK-22962
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Major
>
> If you try to start a Spark app on kubernetes using a local file as the app 
> resource, for example, it will fail:
> {code}
> ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar
> {code}
> {noformat}
> + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && 
> env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'
> \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < 
> /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x}
>  ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&   
>   if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP
> ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if 
> ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK
> _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java 
> "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR
> Y -Xmx$SPARK_DRIVER_MEMORY 
> -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS 
> $SPARK_DRIVER_ARGS'
> Error: Could not find or load main class com.cloudera.spark.tests.Sleeper
> {noformat}
> Using an http server to provide the app jar solves the problem.
> The k8s backend should either somehow make these files available to the 
> cluster or error out with a more user-friendly message if that feature is not 
> yet available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22962) Kubernetes app fails if local files are used

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22962:


Assignee: (was: Apache Spark)

> Kubernetes app fails if local files are used
> 
>
> Key: SPARK-22962
> URL: https://issues.apache.org/jira/browse/SPARK-22962
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> If you try to start a Spark app on kubernetes using a local file as the app 
> resource, for example, it will fail:
> {code}
> ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar
> {code}
> {noformat}
> + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && 
> env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'
> \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < 
> /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x}
>  ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&   
>   if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP
> ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if 
> ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK
> _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java 
> "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR
> Y -Xmx$SPARK_DRIVER_MEMORY 
> -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS 
> $SPARK_DRIVER_ARGS'
> Error: Could not find or load main class com.cloudera.spark.tests.Sleeper
> {noformat}
> Using an http server to provide the app jar solves the problem.
> The k8s backend should either somehow make these files available to the 
> cluster or error out with a more user-friendly message if that feature is not 
> yet available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier

2018-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331194#comment-16331194
 ] 

Apache Spark commented on SPARK-23152:
--

User 'tovbinm' has created a pull request for this issue:
https://github.com/apache/spark/pull/20321

> Invalid guard condition in org.apache.spark.ml.classification.Classifier
> 
>
> Key: SPARK-23152
> URL: https://issues.apache.org/jira/browse/SPARK-23152
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 
> 2.3.1
>Reporter: Matthew Tovbin
>Priority: Minor
>  Labels: easyfix
>
> When fitting a classifier that extends 
> "org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
> DecisionTreeClassifier, RandomForestClassifier) a misleading 
> NullPointerException is thrown.
> Steps to reproduce: 
> {code:java}
> val data = spark.createDataset(Seq.empty[(Double, 
> org.apache.spark.ml.linalg.Vector)])
> new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
> {code}
>  The error: 
> {code:java}
> java.lang.NullPointerException: Value at index 0 is null
> at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
> at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
> at 
> org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
>   
> The problem happens due to an incorrect guard condition in function 
> getNumClasses at org.apache.spark.ml.classification.Classifier:106
> {code:java}
> val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
> if (maxLabelRow.isEmpty) {
>   throw new SparkException("ML algorithm was given empty dataset.")
> }
> {code}
> When the input data is empty, the resulting "maxLabelRow" array is not empty. 
> Instead it contains a single Row(null) element.
>  
> Proposed solution: the condition can be modified to also verify that the 
> retrieved value is not null:
> {code:java}
> if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) {
>   throw new SparkException("ML algorithm was given empty dataset.")
> }
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23152:


Assignee: (was: Apache Spark)

> Invalid guard condition in org.apache.spark.ml.classification.Classifier
> 
>
> Key: SPARK-23152
> URL: https://issues.apache.org/jira/browse/SPARK-23152
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 
> 2.3.1
>Reporter: Matthew Tovbin
>Priority: Minor
>  Labels: easyfix
>
> When fitting a classifier that extends 
> "org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
> DecisionTreeClassifier, RandomForestClassifier) a misleading 
> NullPointerException is thrown.
> Steps to reproduce: 
> {code:java}
> val data = spark.createDataset(Seq.empty[(Double, 
> org.apache.spark.ml.linalg.Vector)])
> new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
> {code}
>  The error: 
> {code:java}
> java.lang.NullPointerException: Value at index 0 is null
> at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
> at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
> at 
> org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
>   
> The problem happens due to an incorrect guard condition in function 
> getNumClasses at org.apache.spark.ml.classification.Classifier:106
> {code:java}
> val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
> if (maxLabelRow.isEmpty) {
>   throw new SparkException("ML algorithm was given empty dataset.")
> }
> {code}
> When the input data is empty, the resulting "maxLabelRow" array is not empty. 
> Instead it contains a single Row(null) element.
>  
> Proposed solution: the condition can be modified to also verify that the 
> retrieved value is not null:
> {code:java}
> if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) {
>   throw new SparkException("ML algorithm was given empty dataset.")
> }
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23152:


Assignee: Apache Spark

> Invalid guard condition in org.apache.spark.ml.classification.Classifier
> 
>
> Key: SPARK-23152
> URL: https://issues.apache.org/jira/browse/SPARK-23152
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 
> 2.3.1
>Reporter: Matthew Tovbin
>Assignee: Apache Spark
>Priority: Minor
>  Labels: easyfix
>
> When fitting a classifier that extends 
> "org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
> DecisionTreeClassifier, RandomForestClassifier) a misleading 
> NullPointerException is thrown.
> Steps to reproduce: 
> {code:java}
> val data = spark.createDataset(Seq.empty[(Double, 
> org.apache.spark.ml.linalg.Vector)])
> new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
> {code}
>  The error: 
> {code:java}
> java.lang.NullPointerException: Value at index 0 is null
> at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
> at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
> at 
> org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
>   
> The problem happens due to an incorrect guard condition in function 
> getNumClasses at org.apache.spark.ml.classification.Classifier:106
> {code:java}
> val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
> if (maxLabelRow.isEmpty) {
>   throw new SparkException("ML algorithm was given empty dataset.")
> }
> {code}
> When the input data is empty, the resulting "maxLabelRow" array is not empty. 
> Instead it contains a single Row(null) element.
>  
> Proposed solution: the condition can be modified to also verify that the 
> retrieved value is not null:
> {code:java}
> if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) {
>   throw new SparkException("ML algorithm was given empty dataset.")
> }
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2018-01-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-19185:
--

Assignee: (was: Marcelo Vanzin)

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>Priority: Major
>  Labels: streaming, windowing
>
> We've been running into ConcurrentModificationExceptions "KafkaConsumer is 
> not safe for multi-threaded access" with the CachedKafkaConsumer. I've been 
> working through debugging this issue and after looking through some of the 
> Spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala

[jira] [Assigned] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2018-01-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-19185:
--

Assignee: Marcelo Vanzin

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>Assignee: Marcelo Vanzin
>Priority: Major
>  Labels: streaming, windowing
>
> We've been running into ConcurrentModificationExceptions "KafkaConsumer is 
> not safe for multi-threaded access" with the CachedKafkaConsumer. I've been 
> working through debugging this issue and after looking through some of the 
> Spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd

[jira] [Commented] (SPARK-23133) Spark options are not passed to the Executor in Docker context

2018-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331241#comment-16331241
 ] 

Apache Spark commented on SPARK-23133:
--

User 'foxish' has created a pull request for this issue:
https://github.com/apache/spark/pull/20322

> Spark options are not passed to the Executor in Docker context
> --
>
> Key: SPARK-23133
> URL: https://issues.apache.org/jira/browse/SPARK-23133
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: Running Spark on K8s using supplied Docker image.
>Reporter: Andrew Korzhuev
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Reproduce:
>  # Build image with `bin/docker-image-tool.sh`.
>  # Submit application to k8s. Set executor options, e.g. ` --conf 
> "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"`
>  # Visit Spark UI on executor and notice that option is not set.
> Expected behavior: options from spark-submit should be correctly passed to 
> executor.
> Cause:
> `SPARK_EXECUTOR_JAVA_OPTS` is not defined in `entrypoint.sh`
> https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L70
> [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L44-L45]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23153) Support application dependencies in submission client's local file system

2018-01-18 Thread Yinan Li (JIRA)
Yinan Li created SPARK-23153:


 Summary: Support application dependencies in submission client's 
local file system
 Key: SPARK-23153
 URL: https://issues.apache.org/jira/browse/SPARK-23153
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Yinan Li






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23133) Spark options are not passed to the Executor in Docker context

2018-01-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23133.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20322
[https://github.com/apache/spark/pull/20322]

> Spark options are not passed to the Executor in Docker context
> --
>
> Key: SPARK-23133
> URL: https://issues.apache.org/jira/browse/SPARK-23133
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: Running Spark on K8s using supplied Docker image.
>Reporter: Andrew Korzhuev
>Priority: Minor
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Reproduce:
>  # Build image with `bin/docker-image-tool.sh`.
>  # Submit application to k8s. Set executor options, e.g. ` --conf 
> "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"`
>  # Visit Spark UI on executor and notice that option is not set.
> Expected behavior: options from spark-submit should be correctly passed to 
> executor.
> Cause:
> `SPARK_EXECUTOR_JAVA_OPTS` is not defined in `entrypoint.sh`
> https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L70
> [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L44-L45]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23133) Spark options are not passed to the Executor in Docker context

2018-01-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23133:
--

Assignee: Andrew Korzhuev

> Spark options are not passed to the Executor in Docker context
> --
>
> Key: SPARK-23133
> URL: https://issues.apache.org/jira/browse/SPARK-23133
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: Running Spark on K8s using supplied Docker image.
>Reporter: Andrew Korzhuev
>Assignee: Andrew Korzhuev
>Priority: Minor
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Reproduce:
>  # Build image with `bin/docker-image-tool.sh`.
>  # Submit application to k8s. Set executor options, e.g. ` --conf 
> "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"`
>  # Visit Spark UI on executor and notice that option is not set.
> Expected behavior: options from spark-submit should be correctly passed to 
> executor.
> Cause:
> `SPARK_EXECUTOR_JAVA_OPTS` is not defined in `entrypoint.sh`
> https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L70
> [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L44-L45]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22216) Improving PySpark/Pandas interoperability

2018-01-18 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-22216:
---
Description: This is an umbrella ticket tracking the general effort to 
improve performance and interoperability between PySpark and Pandas. The core 
idea is to use Apache Arrow as the serialization format to reduce the overhead 
between PySpark and Pandas.  (was: This is an umbrella ticket tracking the general 
effect of improving performance and interoperability between PySpark and 
Pandas. The core idea is to Apache Arrow as serialization format to reduce the 
overhead between PySpark and Pandas.)

> Improving PySpark/Pandas interoperability
> -
>
> Key: SPARK-22216
> URL: https://issues.apache.org/jira/browse/SPARK-22216
> Project: Spark
>  Issue Type: Epic
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
>
> This is an umbrella ticket tracking the general effort to improve performance 
> and interoperability between PySpark and Pandas. The core idea is to use 
> Apache Arrow as the serialization format to reduce the overhead between 
> PySpark and Pandas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail

2018-01-18 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23094:


Assignee: Burak Yavuz

> Json Readers choose wrong encoding when bad records are present and fail
> 
>
> Key: SPARK-23094
> URL: https://issues.apache.org/jira/browse/SPARK-23094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 2.3.0
>
>
> The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser 
> code paths for expressions but not the readers. We should also cover reader 
> code paths reading files with bad characters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail

2018-01-18 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23094.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20302
[https://github.com/apache/spark/pull/20302]

> Json Readers choose wrong encoding when bad records are present and fail
> 
>
> Key: SPARK-23094
> URL: https://issues.apache.org/jira/browse/SPARK-23094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 2.3.0
>
>
> The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser 
> code paths for expressions but not the readers. We should also cover reader 
> code paths reading files with bad characters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22962) Kubernetes app fails if local files are used

2018-01-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22962.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20320
[https://github.com/apache/spark/pull/20320]

> Kubernetes app fails if local files are used
> 
>
> Key: SPARK-22962
> URL: https://issues.apache.org/jira/browse/SPARK-22962
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Yinan Li
>Priority: Major
> Fix For: 2.3.0
>
>
> If you try to start a Spark app on kubernetes using a local file as the app 
> resource, for example, it will fail:
> {code}
> ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar
> {code}
> {noformat}
> + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && 
> env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'
> \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < 
> /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x}
>  ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&   
>   if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP
> ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if 
> ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK
> _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java 
> "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR
> Y -Xmx$SPARK_DRIVER_MEMORY 
> -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS 
> $SPARK_DRIVER_ARGS'
> Error: Could not find or load main class com.cloudera.spark.tests.Sleeper
> {noformat}
> Using an http server to provide the app jar solves the problem.
> The k8s backend should either somehow make these files available to the 
> cluster or error out with a more user-friendly message if that feature is not 
> yet available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22962) Kubernetes app fails if local files are used

2018-01-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22962:
--

Assignee: Yinan Li

> Kubernetes app fails if local files are used
> 
>
> Key: SPARK-22962
> URL: https://issues.apache.org/jira/browse/SPARK-22962
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Yinan Li
>Priority: Major
> Fix For: 2.3.0
>
>
> If you try to start a Spark app on kubernetes using a local file as the app 
> resource, for example, it will fail:
> {code}
> ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar
> {code}
> {noformat}
> + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && 
> env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'
> \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < 
> /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x}
>  ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&   
>   if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP
> ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if 
> ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK
> _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java 
> "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR
> Y -Xmx$SPARK_DRIVER_MEMORY 
> -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS 
> $SPARK_DRIVER_ARGS'
> Error: Could not find or load main class com.cloudera.spark.tests.Sleeper
> {noformat}
> Using an http server to provide the app jar solves the problem.
> The k8s backend should either somehow make these files available to the 
> cluster or error out with a more user-friendly message if that feature is not 
> yet available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23142) Add documentation for Continuous Processing

2018-01-18 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-23142.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   3.0.0

Issue resolved by pull request 20308
[https://github.com/apache/spark/pull/20308]

> Add documentation for Continuous Processing
> ---
>
> Key: SPARK-23142
> URL: https://issues.apache.org/jira/browse/SPARK-23142
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 3.0.0, 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23091) Incorrect unit test for approxQuantile

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23091:


Assignee: (was: Apache Spark)

> Incorrect unit test for approxQuantile
> --
>
> Key: SPARK-23091
> URL: https://issues.apache.org/jira/browse/SPARK-23091
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Tests
>Affects Versions: 2.2.1
>Reporter: Kuang Chen
>Priority: Minor
>
> Currently, the test for `approxQuantile` (quantile estimation algorithm) checks 
> whether the estimated quantile is within +- 2*`relativeError` of the true 
> quantile. See the code below:
> [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala#L157]
> However, based on the original paper by Greenwald and Khanna, the estimated 
> quantile is guaranteed to be within +- `relativeError` of the true quantile. 
> Using double the "tolerance" is misleading and incorrect, and we should fix it.
>  
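
As a concrete sketch of the tightened check (an illustration only, not the 
actual test code; it assumes a SparkSession named "spark" and uses made-up 
uniformly spaced data, where the Greenwald-Khanna rank guarantee translates 
directly into a value bound of +- relativeError * n):
{code:java}
// Uniformly spaced data 1..n: the true q-quantile is approximately q * n.
val n = 1000
val df = spark.range(1, n + 1).toDF("value")   // assumes a SparkSession named `spark`
val q = 0.5
val relativeError = 0.01

val Array(estimate) = df.stat.approxQuantile("value", Array(q), relativeError)

// Proposed single-relativeError bound instead of the current 2 * relativeError.
assert(math.abs(estimate - q * n) <= relativeError * n)
{code}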



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23091) Incorrect unit test for approxQuantile

2018-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23091:


Assignee: Apache Spark

> Incorrect unit test for approxQuantile
> --
>
> Key: SPARK-23091
> URL: https://issues.apache.org/jira/browse/SPARK-23091
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Tests
>Affects Versions: 2.2.1
>Reporter: Kuang Chen
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, the test for `approxQuantile` (quantile estimation algorithm) checks 
> whether the estimated quantile is within +- 2*`relativeError` of the true 
> quantile. See the code below:
> [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala#L157]
> However, based on the original paper by Greenwald and Khanna, the estimated 
> quantile is guaranteed to be within +- `relativeError` of the true quantile. 
> Using double the "tolerance" is misleading and incorrect, and we should fix it.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


