[jira] [Comment Edited] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-03-23 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412458#comment-16412458
 ] 

Felix Cheung edited comment on SPARK-23780 at 3/24/18 6:53 AM:
---

here

[https://github.com/mages/googleVis/blob/master/R/zzz.R#L39]

 or here

[https://github.com/jeroen/jsonlite/blob/master/R/toJSON.R#L2] 


was (Author: felixcheung):
here

[https://github.com/mages/googleVis/blob/master/R/zzz.R#L39]

 

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with the googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML output, as was the case with 
> Spark 2.1.0.






[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-03-23 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412458#comment-16412458
 ] 

Felix Cheung commented on SPARK-23780:
--

here

[https://github.com/mages/googleVis/blob/master/R/zzz.R#L39]

 

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with the googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML output, as was the case with 
> Spark 2.1.0.






[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-03-23 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412457#comment-16412457
 ] 

Felix Cheung commented on SPARK-23780:
--

hmm, I think the cause of this is an incompatibility in the method signature 
of toJSON

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with the googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML output, as was the case with 
> Spark 2.1.0.






[jira] [Comment Edited] (SPARK-23716) Change SHA512 style in release artifacts to play nicely with shasum utility

2018-03-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412423#comment-16412423
 ] 

Nicholas Chammas edited comment on SPARK-23716 at 3/24/18 5:13 AM:
---

For my use case, there is no value in updating the Spark release code if we're 
not going to also update the release hashes for all prior releases, which it 
sounds like we don't want to do.

I wrote my own code to convert the GPG-style hashes to shasum style hashes, and 
that satisfies [my need|https://github.com/nchammas/flintrock/issues/238], 
which is focused on syncing Spark releases from the Apache archive to an S3 
bucket.
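
For reference, a minimal sketch of that kind of conversion, assuming the 
GPG-style file is a single "filename: HEX HEX ..." record with uppercase hex 
wrapped over several lines (this is not the actual flintrock code):

{code:scala}
import scala.io.Source

// Hypothetical converter: turn a GPG-style digest ("spark-x.y.z.tgz: ABCD 1234 ...")
// into the "<lowercase hex>  <filename>" line that `shasum -a 512 -c` understands.
// The exact input layout is an assumption made for this illustration.
def gpgToShasum(path: String): String = {
  val raw = Source.fromFile(path).mkString
  val Array(fileName, hexPart) = raw.split(":", 2)
  val hex = hexPart.replaceAll("\\s+", "").toLowerCase
  s"$hex  ${fileName.trim}"
}
{code}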

Closing this as "Won't Fix".


was (Author: nchammas):
For my use case, there is no value in updating the Spark release code if we're 
not going to also update the release hashes for all prior releases, which it 
sounds like we don't want to do.

I wrote my own code to convert the GPG-style hashes to shasum style hashes, and 
that satisfies my need. I am syncing Spark releases from the Apache 
distribution archive to a personal S3 bucket and need a way to verify the 
integrity of the files.

> Change SHA512 style in release artifacts to play nicely with shasum utility
> ---
>
> Key: SPARK-23716
> URL: https://issues.apache.org/jira/browse/SPARK-23716
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> As [discussed 
> here|http://apache-spark-developers-list.1001551.n3.nabble.com/Changing-how-we-compute-release-hashes-td23599.html].






[jira] [Resolved] (SPARK-23716) Change SHA512 style in release artifacts to play nicely with shasum utility

2018-03-23 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-23716.
--
Resolution: Won't Fix

For my use case, there is no value in updating the Spark release code if we're 
not going to also update the release hashes for all prior releases, which it 
sounds like we don't want to do.

I wrote my own code to convert the GPG-style hashes to shasum style hashes, and 
that satisfies my need. I am syncing Spark releases from the Apache 
distribution archive to a personal S3 bucket and need a way to verify the 
integrity of the files.

> Change SHA512 style in release artifacts to play nicely with shasum utility
> ---
>
> Key: SPARK-23716
> URL: https://issues.apache.org/jira/browse/SPARK-23716
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> As [discussed 
> here|http://apache-spark-developers-list.1001551.n3.nabble.com/Changing-how-we-compute-release-hashes-td23599.html].






[jira] [Comment Edited] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2018-03-23 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412379#comment-16412379
 ] 

MIN-FU YANG edited comment on SPARK-22876 at 3/24/18 3:00 AM:
--

I found that the current YARN implementation doesn't expose the number of failed 
app attempts in its API; maybe this feature should be put on hold until it 
exposes this number.


was (Author: lucasmf):
I found that current Yarn implementation doesn't expose number of failed app 
attempts number in their API, maybe this feature should be pending until they 
expose this number.

> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> -
>
> Key: SPARK-22876
> URL: https://issues.apache.org/jira/browse/SPARK-22876
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
> Environment: hadoop version 2.7.3
>Reporter: Jinhan Zhong
>Priority: Minor
>
> I assume we can use spark.yarn.maxAppAttempts together with 
> spark.yarn.am.attemptFailuresValidityInterval to keep a long-running 
> application from stopping after an acceptable number of failures.
> But after testing, I found that the application always stops after failing n 
> times (n is the minimum of spark.yarn.maxAppAttempts and 
> yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).
> For example, the following setup should allow the application master to fail 
> 20 times:
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> After checking the source code, I found that in 
> ApplicationMaster.scala 
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
> there is a ShutdownHook that checks the attempt id against maxAppAttempts: 
> if attempt id >= maxAppAttempts, it tries to unregister the application 
> and the application finishes.
> Is this the expected design or a bug?
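
To make the reported behaviour concrete, here is a tiny sketch of the check 
described above (illustrative only, not the actual ApplicationMaster code):

{code:scala}
// Illustrative only. The limit the reporter observes is the smaller of the two
// max-attempts settings, and the shutdown hook compares the attempt id against
// it without ever consulting spark.yarn.am.attemptFailuresValidityInterval.
def effectiveMaxAttempts(sparkMaxAppAttempts: Int, rmMaxAttempts: Int): Int =
  math.min(sparkMaxAppAttempts, rmMaxAttempts)

def shouldUnregister(attemptId: Int, maxAppAttempts: Int): Boolean =
  attemptId >= maxAppAttempts
{code}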






[jira] [Commented] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2018-03-23 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412379#comment-16412379
 ] 

MIN-FU YANG commented on SPARK-22876:
-

I found that the current YARN implementation doesn't expose the number of failed 
app attempts in its API; maybe this feature should be put on hold until it 
exposes this number.

> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> -
>
> Key: SPARK-22876
> URL: https://issues.apache.org/jira/browse/SPARK-22876
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
> Environment: hadoop version 2.7.3
>Reporter: Jinhan Zhong
>Priority: Minor
>
> I assume we can use spark.yarn.maxAppAttempts together with 
> spark.yarn.am.attemptFailuresValidityInterval to keep a long-running 
> application from stopping after an acceptable number of failures.
> But after testing, I found that the application always stops after failing n 
> times (n is the minimum of spark.yarn.maxAppAttempts and 
> yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).
> For example, the following setup should allow the application master to fail 
> 20 times:
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> After checking the source code, I found that in 
> ApplicationMaster.scala 
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
> there is a ShutdownHook that checks the attempt id against maxAppAttempts: 
> if attempt id >= maxAppAttempts, it tries to unregister the application 
> and the application finishes.
> Is this the expected design or a bug?






[jira] [Commented] (SPARK-14352) approxQuantile should support multi columns

2018-03-23 Thread Walt Elder (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412370#comment-16412370
 ] 

Walt Elder commented on SPARK-14352:


Seems like this should be marked closed as of 2.2, right?

> approxQuantile should support multi columns
> ---
>
> Key: SPARK-14352
> URL: https://issues.apache.org/jira/browse/SPARK-14352
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: zhengruifeng
>Priority: Major
>
> It will be convenient and efficient to calculate quantiles of multi-columns 
> with approxQuantile.






[jira] [Commented] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state

2018-03-23 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412361#comment-16412361
 ] 

Sahil Takiar commented on SPARK-23785:
--

Updated the PR.
{quote} in LauncherBackend.BackendConnection, set "isDisconnected" before 
calling super.close()
{quote}
Done
{quote} The race you describe also exists; it's not what the exception in the 
Hive bug shows, though.
{quote}
I tried to write a test to replicate this issue, but it seems it's already 
handled in {{LauncherConnection}}: if the {{SparkAppHandle}} calls 
{{disconnect}}, the client connection automatically gets closed because it gets 
an {{EOFException}}, which triggers {{close()}} - the logic is in 
{{LauncherConnection#run}}.

So I guess it's already handled? I added my test case to the PR; it should be 
useful since it covers what happens when {{SparkAppHandle#disconnect}} is 
called.

> LauncherBackend doesn't check state of connection before setting state
> --
>
> Key: SPARK-23785
> URL: https://issues.apache.org/jira/browse/SPARK-23785
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Priority: Major
>
> Found in HIVE-18533 while trying to integrate with the 
> {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its 
> connection to the {{LauncherServer}} before trying to run {{setState}} - 
> which sends a {{SetState}} message on the connection.






[jira] [Commented] (SPARK-23788) Race condition in StreamingQuerySuite

2018-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412360#comment-16412360
 ] 

Apache Spark commented on SPARK-23788:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/20896

> Race condition in StreamingQuerySuite
> -
>
> Key: SPARK-23788
> URL: https://issues.apache.org/jira/browse/SPARK-23788
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
>
> The serializability test uses the same MemoryStream instance for 3 different 
> queries. If any of those queries ask it to commit before the others have run, 
> the rest will see empty dataframes. This can fail the test if q3 is affected.






[jira] [Assigned] (SPARK-23788) Race condition in StreamingQuerySuite

2018-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23788:


Assignee: (was: Apache Spark)

> Race condition in StreamingQuerySuite
> -
>
> Key: SPARK-23788
> URL: https://issues.apache.org/jira/browse/SPARK-23788
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
>
> The serializability test uses the same MemoryStream instance for 3 different 
> queries. If any of those queries ask it to commit before the others have run, 
> the rest will see empty dataframes. This can fail the test if q3 is affected.






[jira] [Assigned] (SPARK-23788) Race condition in StreamingQuerySuite

2018-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23788:


Assignee: Apache Spark

> Race condition in StreamingQuerySuite
> -
>
> Key: SPARK-23788
> URL: https://issues.apache.org/jira/browse/SPARK-23788
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Minor
>
> The serializability test uses the same MemoryStream instance for 3 different 
> queries. If any of those queries ask it to commit before the others have run, 
> the rest will see empty dataframes. This can fail the test if q3 is affected.






[jira] [Created] (SPARK-23788) Race condition in StreamingQuerySuite

2018-03-23 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23788:
---

 Summary: Race condition in StreamingQuerySuite
 Key: SPARK-23788
 URL: https://issues.apache.org/jira/browse/SPARK-23788
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres


The serializability test uses the same MemoryStream instance for 3 different 
queries. If any of those queries ask it to commit before the others have run, 
the rest will see empty dataframes. This can fail the test if q3 is affected.






[jira] [Commented] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2018-03-23 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412228#comment-16412228
 ] 

MIN-FU YANG commented on SPARK-22876:
-

Hi, I also encountered this problem.

Please assign it to me; I can fix this issue.

> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> -
>
> Key: SPARK-22876
> URL: https://issues.apache.org/jira/browse/SPARK-22876
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
> Environment: hadoop version 2.7.3
>Reporter: Jinhan Zhong
>Priority: Minor
>
> I assume we can use spark.yarn.maxAppAttempts together with 
> spark.yarn.am.attemptFailuresValidityInterval to keep a long-running 
> application from stopping after an acceptable number of failures.
> But after testing, I found that the application always stops after failing n 
> times (n is the minimum of spark.yarn.maxAppAttempts and 
> yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).
> For example, the following setup should allow the application master to fail 
> 20 times:
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> After checking the source code, I found that in 
> ApplicationMaster.scala 
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
> there is a ShutdownHook that checks the attempt id against maxAppAttempts: 
> if attempt id >= maxAppAttempts, it tries to unregister the application 
> and the application finishes.
> Is this the expected design or a bug?






[jira] [Resolved] (SPARK-23615) Add maxDF Parameter to Python CountVectorizer

2018-03-23 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23615.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20777
[https://github.com/apache/spark/pull/20777]

> Add maxDF Parameter to Python CountVectorizer
> -
>
> Key: SPARK-23615
> URL: https://issues.apache.org/jira/browse/SPARK-23615
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.0
>
>
> The maxDF parameter is for filtering out frequently occurring terms.  This 
> param was recently added to the Scala CountVectorizer and needs to be added 
> to Python also.
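
For context, the Scala-side usage that needs mirroring looks roughly like this 
(assuming Spark 2.4+, where the parameter exists; column names are placeholders):

{code:scala}
import org.apache.spark.ml.feature.CountVectorizer

// maxDF accepts either a fraction of documents (0 <= x < 1) or an absolute count (>= 1).
val cv = new CountVectorizer()
  .setInputCol("words")      // placeholder column names
  .setOutputCol("features")
  .setMinDF(2)               // keep terms appearing in at least 2 documents
  .setMaxDF(0.8)             // drop terms appearing in more than 80% of documents
{code}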






[jira] [Assigned] (SPARK-23615) Add maxDF Parameter to Python CountVectorizer

2018-03-23 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-23615:


Assignee: Huaxin Gao

> Add maxDF Parameter to Python CountVectorizer
> -
>
> Key: SPARK-23615
> URL: https://issues.apache.org/jira/browse/SPARK-23615
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.0
>
>
> The maxDF parameter is for filtering out frequently occurring terms.  This 
> param was recently added to the Scala CountVectorizer and needs to be added 
> to Python also.






[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8

2018-03-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412218#comment-16412218
 ] 

Nicholas Chammas commented on SPARK-22513:
--

Fair enough.

Just as an alternate confirmation, [~ste...@apache.org] can you comment on 
whether there might be any issues running Spark built against Hadoop 2.7 with, 
say, HDFS 2.8?

> Provide build profile for hadoop 2.8
> 
>
> Key: SPARK-22513
> URL: https://issues.apache.org/jira/browse/SPARK-22513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Christine Koppelt
>Priority: Major
>
> Hadoop 2.8 comes with a patch that is necessary to make it run on NixOS [1]. 
> Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8.
> [1] 
> https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d






[jira] [Assigned] (SPARK-23787) SparkSubmitSuite::"download remote resource if it is not supported by yarn" fails on Hadoop 2.9

2018-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23787:


Assignee: Apache Spark

> SparkSubmitSuite::"download remote resource if it is not supported by yarn" 
> fails on Hadoop 2.9
> ---
>
> Key: SPARK-23787
> URL: https://issues.apache.org/jira/browse/SPARK-23787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> {noformat}
> [info] - download list of files to local (10 milliseconds)
> [info]   
> "http:///work/apache/spark/target/tmp/spark-a17bc160-641b-41e1-95be-a2e31b175e09/testJar3393247632492201277.jar";
>  did not start with substring "file:" (SparkSubmitSuite.scala:1022)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340)
> [info]   at 
> org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668)
> [info]   at 
> org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite.org$apache$spark$deploy$SparkSubmitSuite$$testRemoteResources(SparkSubmitSuite.scala:1022)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply$mcV$sp(SparkSubmitSuite.scala:962)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962)
> {noformat}
> That's because Hadoop 2.9 supports http as a file system, and the test 
> expects the Hadoop libraries not to. I also found a couple of other bugs in 
> the test (although the code itself for the feature is fine).






[jira] [Commented] (SPARK-23787) SparkSubmitSuite::"download remote resource if it is not supported by yarn" fails on Hadoop 2.9

2018-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412202#comment-16412202
 ] 

Apache Spark commented on SPARK-23787:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20895

> SparkSubmitSuite::"download remote resource if it is not supported by yarn" 
> fails on Hadoop 2.9
> ---
>
> Key: SPARK-23787
> URL: https://issues.apache.org/jira/browse/SPARK-23787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> {noformat}
> [info] - download list of files to local (10 milliseconds)
> [info]   
> "http:///work/apache/spark/target/tmp/spark-a17bc160-641b-41e1-95be-a2e31b175e09/testJar3393247632492201277.jar";
>  did not start with substring "file:" (SparkSubmitSuite.scala:1022)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340)
> [info]   at 
> org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668)
> [info]   at 
> org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite.org$apache$spark$deploy$SparkSubmitSuite$$testRemoteResources(SparkSubmitSuite.scala:1022)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply$mcV$sp(SparkSubmitSuite.scala:962)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962)
> {noformat}
> That's because Hadoop 2.9 supports http as a file system, and the test 
> expects the Hadoop libraries not to. I also found a couple of other bugs in 
> the test (although the code itself for the feature is fine).






[jira] [Assigned] (SPARK-23787) SparkSubmitSuite::"download remote resource if it is not supported by yarn" fails on Hadoop 2.9

2018-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23787:


Assignee: (was: Apache Spark)

> SparkSubmitSuite::"download remote resource if it is not supported by yarn" 
> fails on Hadoop 2.9
> ---
>
> Key: SPARK-23787
> URL: https://issues.apache.org/jira/browse/SPARK-23787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> {noformat}
> [info] - download list of files to local (10 milliseconds)
> [info]   
> "http:///work/apache/spark/target/tmp/spark-a17bc160-641b-41e1-95be-a2e31b175e09/testJar3393247632492201277.jar";
>  did not start with substring "file:" (SparkSubmitSuite.scala:1022)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340)
> [info]   at 
> org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668)
> [info]   at 
> org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite.org$apache$spark$deploy$SparkSubmitSuite$$testRemoteResources(SparkSubmitSuite.scala:1022)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply$mcV$sp(SparkSubmitSuite.scala:962)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962)
> {noformat}
> That's because Hadoop 2.9 supports http as a file system, and the test 
> expects the Hadoop libraries not to. I also found a couple of other bugs in 
> the test (although the code itself for the feature is fine).






[jira] [Updated] (SPARK-23787) SparkSubmitSuite::"download remote resource if it is not supported by yarn" fails on Hadoop 2.9

2018-03-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23787:
---
Summary: SparkSubmitSuite::"download remote resource if it is not supported 
by yarn" fails on Hadoop 2.9  (was: SparkSubmitSuite::"download list of files 
to local" fails on Hadoop 2.9)

> SparkSubmitSuite::"download remote resource if it is not supported by yarn" 
> fails on Hadoop 2.9
> ---
>
> Key: SPARK-23787
> URL: https://issues.apache.org/jira/browse/SPARK-23787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> {noformat}
> [info] - download list of files to local (10 milliseconds)
> [info]   
> "http:///work/apache/spark/target/tmp/spark-a17bc160-641b-41e1-95be-a2e31b175e09/testJar3393247632492201277.jar";
>  did not start with substring "file:" (SparkSubmitSuite.scala:1022)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340)
> [info]   at 
> org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668)
> [info]   at 
> org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite.org$apache$spark$deploy$SparkSubmitSuite$$testRemoteResources(SparkSubmitSuite.scala:1022)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply$mcV$sp(SparkSubmitSuite.scala:962)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962)
> {noformat}
> That's because Hadoop 2.9 supports http as a file system, and the test 
> expects the Hadoop libraries not to. I also found a couple of other bugs in 
> the test (although the code itself for the feature is fine).






[jira] [Created] (SPARK-23787) SparkSubmitSuite::"" fails on Hadoop 2.9

2018-03-23 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-23787:
--

 Summary: SparkSubmitSuite::"" fails on Hadoop 2.9
 Key: SPARK-23787
 URL: https://issues.apache.org/jira/browse/SPARK-23787
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


{noformat}
[info] - download list of files to local (10 milliseconds)
[info]   
"http:///work/apache/spark/target/tmp/spark-a17bc160-641b-41e1-95be-a2e31b175e09/testJar3393247632492201277.jar";
 did not start with substring "file:" (SparkSubmitSuite.scala:1022)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340)
[info]   at 
org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668)
[info]   at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704)
[info]   at 
org.apache.spark.deploy.SparkSubmitSuite.org$apache$spark$deploy$SparkSubmitSuite$$testRemoteResources(SparkSubmitSuite.scala:1022)
[info]   at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply$mcV$sp(SparkSubmitSuite.scala:962)
[info]   at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962)
[info]   at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962)
{noformat}

That's because Hadoop 2.9 supports http as a file system, and the test expects 
the Hadoop libraries not to. I also found a couple of other bugs in the test 
(although the code itself for the feature is fine).






[jira] [Updated] (SPARK-23787) SparkSubmitSuite::"download list of files to local" fails on Hadoop 2.9

2018-03-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23787:
---
Summary: SparkSubmitSuite::"download list of files to local" fails on 
Hadoop 2.9  (was: SparkSubmitSuite::"" fails on Hadoop 2.9)

> SparkSubmitSuite::"download list of files to local" fails on Hadoop 2.9
> ---
>
> Key: SPARK-23787
> URL: https://issues.apache.org/jira/browse/SPARK-23787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> {noformat}
> [info] - download list of files to local (10 milliseconds)
> [info]   
> "http:///work/apache/spark/target/tmp/spark-a17bc160-641b-41e1-95be-a2e31b175e09/testJar3393247632492201277.jar";
>  did not start with substring "file:" (SparkSubmitSuite.scala:1022)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340)
> [info]   at 
> org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668)
> [info]   at 
> org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite.org$apache$spark$deploy$SparkSubmitSuite$$testRemoteResources(SparkSubmitSuite.scala:1022)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply$mcV$sp(SparkSubmitSuite.scala:962)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962)
> {noformat}
> That's because Hadoop 2.9 supports http as a file system, and the test 
> expects the Hadoop libraries not to. I also found a couple of other bugs in 
> the test (although the code itself for the feature is fine).






[jira] [Commented] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state

2018-03-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412130#comment-16412130
 ] 

Marcelo Vanzin commented on SPARK-23785:


This is a little trickier than just the checks you have in the PR.

The check that is triggering in Hive is on the {{LauncherBackend}} side. So it 
has somehow already been closed, and a {{setState}} call happens. That can 
happen if there are two calls to {{LocalSchedulerBackend.stop}}, which can 
happen if someone with a launcher handle calls {{stop()}} on the handle. But 
the code should be safe against that and just ignore subsequent calls.

The race you describe also exists; it's not what the exception in the Hive bug 
shows, though.

So perhaps it's better to do a few different things:

- add the checks in your PR
- in LauncherBackend.BackendConnection, set "isDisconnected" before calling 
super.close()
- in that same class, override the "send()" method to ignore "SocketException", 
to handle the second race (a rough sketch of this follows below).
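
A rough illustration of that last point, with invented names (this is not the 
actual launcher code, just the shape of the defensive write being suggested):

{code:scala}
import java.net.SocketException

// Invented names for illustration; the real classes live in the launcher module.
trait StateReportingConnection {
  @volatile protected var isDisconnected = false

  // Stand-in for the real write-to-socket call.
  protected def writeMessage(msg: AnyRef): Unit

  protected def safeSend(msg: AnyRef): Unit = {
    if (!isDisconnected) {
      try writeMessage(msg)
      catch {
        case _: SocketException =>
          // The other side tore the socket down between the check and the
          // write; treat it as a disconnect rather than an error.
          isDisconnected = true
      }
    }
  }
}
{code}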


> LauncherBackend doesn't check state of connection before setting state
> --
>
> Key: SPARK-23785
> URL: https://issues.apache.org/jira/browse/SPARK-23785
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Priority: Major
>
> Found in HIVE-18533 while trying to integrate with the 
> {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its 
> connection to the {{LauncherServer}} before trying to run {{setState}} - 
> which sends a {{SetState}} message on the connection.






[jira] [Assigned] (SPARK-23786) CSV schema validation - column names are not checked

2018-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23786:


Assignee: Apache Spark

> CSV schema validation - column names are not checked
> 
>
> Key: SPARK-23786
> URL: https://issues.apache.org/jira/browse/SPARK-23786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Here is a csv file that contains two columns of the same type:
> {code}
> $cat marina.csv
> depth, temperature
> 10.2, 9.0
> 5.5, 12.3
> {code}
> If we define the schema with correct types but wrong column names (reversed 
> order):
> {code:scala}
> val schema = new StructType().add("temperature", DoubleType).add("depth", 
> DoubleType)
> {code}
> Spark reads the csv file without any errors:
> {code:scala}
> val ds = spark.read.schema(schema).option("header", "true").csv("marina.csv")
> ds.show
> {code}
> and outputs wrong result:
> {code}
> +---+-+
> |temperature|depth|
> +---+-+
> |   10.2|  9.0|
> |5.5| 12.3|
> +---+-+
> {code}
> The correct behavior would be either to output an error or to read the 
> columns according to their names in the schema.






[jira] [Commented] (SPARK-23786) CSV schema validation - column names are not checked

2018-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412088#comment-16412088
 ] 

Apache Spark commented on SPARK-23786:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20894

> CSV schema validation - column names are not checked
> 
>
> Key: SPARK-23786
> URL: https://issues.apache.org/jira/browse/SPARK-23786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Here is a csv file that contains two columns of the same type:
> {code}
> $cat marina.csv
> depth, temperature
> 10.2, 9.0
> 5.5, 12.3
> {code}
> If we define the schema with correct types but wrong column names (reversed 
> order):
> {code:scala}
> val schema = new StructType().add("temperature", DoubleType).add("depth", 
> DoubleType)
> {code}
> Spark reads the csv file without any errors:
> {code:scala}
> val ds = spark.read.schema(schema).option("header", "true").csv("marina.csv")
> ds.show
> {code}
> and outputs wrong result:
> {code}
> +---+-+
> |temperature|depth|
> +---+-+
> |   10.2|  9.0|
> |5.5| 12.3|
> +---+-+
> {code}
> The correct behavior would be either to output an error or to read the 
> columns according to their names in the schema.






[jira] [Assigned] (SPARK-23786) CSV schema validation - column names are not checked

2018-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23786:


Assignee: (was: Apache Spark)

> CSV schema validation - column names are not checked
> 
>
> Key: SPARK-23786
> URL: https://issues.apache.org/jira/browse/SPARK-23786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Here is a csv file that contains two columns of the same type:
> {code}
> $cat marina.csv
> depth, temperature
> 10.2, 9.0
> 5.5, 12.3
> {code}
> If we define the schema with correct types but wrong column names (reversed 
> order):
> {code:scala}
> val schema = new StructType().add("temperature", DoubleType).add("depth", 
> DoubleType)
> {code}
> Spark reads the csv file without any errors:
> {code:scala}
> val ds = spark.read.schema(schema).option("header", "true").csv("marina.csv")
> ds.show
> {code}
> and outputs wrong result:
> {code}
> +---+-+
> |temperature|depth|
> +---+-+
> |   10.2|  9.0|
> |5.5| 12.3|
> +---+-+
> {code}
> The correct behavior would be either to output an error or to read the 
> columns according to their names in the schema.






[jira] [Created] (SPARK-23786) CSV schema validation - column names are not checked

2018-03-23 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-23786:
--

 Summary: CSV schema validation - column names are not checked
 Key: SPARK-23786
 URL: https://issues.apache.org/jira/browse/SPARK-23786
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


Here is a csv file that contains two columns of the same type:
{code}
$cat marina.csv
depth, temperature
10.2, 9.0
5.5, 12.3
{code}

If we define the schema with correct types but wrong column names (reversed 
order):
{code:scala}
val schema = new StructType().add("temperature", DoubleType).add("depth", 
DoubleType)
{code}

Spark reads the csv file without any errors:
{code:scala}
val ds = spark.read.schema(schema).option("header", "true").csv("marina.csv")
ds.show
{code}
and outputs wrong result:
{code}
+---+-+
|temperature|depth|
+---+-+
|   10.2|  9.0|
|5.5| 12.3|
+---+-+
{code}
The correct behavior would be either to output an error or to read the columns 
according to their names in the schema.
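
Until such validation exists, a caller-side guard along these lines is one 
possible workaround (a sketch that reuses the {{spark}} session, the {{schema}} 
value and the file name from the example above):

{code:scala}
// Let Spark read the column names from the header line (no schema applied) and
// compare them with the user-supplied schema before trusting positional assignment.
val header = spark.read.option("header", "true").csv("marina.csv").columns.map(_.trim)
val expected = schema.fieldNames.map(_.trim)
require(header.sameElements(expected),
  s"CSV header [${header.mkString(", ")}] does not match schema [${expected.mkString(", ")}]")
{code}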






[jira] [Commented] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state

2018-03-23 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412042#comment-16412042
 ] 

Sahil Takiar commented on SPARK-23785:
--

[~vanzin] opened a PR that just checks if {{isConnected}} is true before 
writing to the connection, but I'm not sure this will fix the issue from 
HIVE-18533. Looking through the code, it seems like this flag only gets set when 
the {{LauncherBackend}} closes the connection; it doesn't detect when the 
{{LauncherServer}} closes the connection. Unless I'm missing something.

> LauncherBackend doesn't check state of connection before setting state
> --
>
> Key: SPARK-23785
> URL: https://issues.apache.org/jira/browse/SPARK-23785
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Priority: Major
>
> Found in HIVE-18533 while trying to integrate with the 
> {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its 
> connection to the {{LauncherServer}} before trying to run {{setState}} - 
> which sends a {{SetState}} message on the connection.






[jira] [Assigned] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state

2018-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23785:


Assignee: Apache Spark

> LauncherBackend doesn't check state of connection before setting state
> --
>
> Key: SPARK-23785
> URL: https://issues.apache.org/jira/browse/SPARK-23785
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Assignee: Apache Spark
>Priority: Major
>
> Found in HIVE-18533 while trying to integrate with the 
> {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its 
> connection to the {{LauncherServer}} before trying to run {{setState}} - 
> which sends a {{SetState}} message on the connection.






[jira] [Commented] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state

2018-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412004#comment-16412004
 ] 

Apache Spark commented on SPARK-23785:
--

User 'sahilTakiar' has created a pull request for this issue:
https://github.com/apache/spark/pull/20893

> LauncherBackend doesn't check state of connection before setting state
> --
>
> Key: SPARK-23785
> URL: https://issues.apache.org/jira/browse/SPARK-23785
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Priority: Major
>
> Found in HIVE-18533 while trying to integrate with the 
> {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its 
> connection to the {{LauncherServer}} before trying to run {{setState}} - 
> which sends a {{SetState}} message on the connection.






[jira] [Commented] (SPARK-23772) Provide an option to ignore column of all null values or empty map/array during JSON schema inference

2018-03-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412007#comment-16412007
 ] 

Reynold Xin commented on SPARK-23772:
-

This is a good change to do!

 

> Provide an option to ignore column of all null values or empty map/array 
> during JSON schema inference
> -
>
> Key: SPARK-23772
> URL: https://issues.apache.org/jira/browse/SPARK-23772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> It is common to convert data from a JSON source to a structured format 
> periodically. In the initial batch of JSON data, if a field's values are 
> always null, Spark infers this field as StringType. However, in the second 
> batch, a non-null value appears in this field and its type turns out not to 
> be StringType. Schema merging then fails because of this inconsistency.
> This also applies to empty arrays and empty objects. My proposal is to provide 
> an option in the Spark JSON source to omit those fields until we see a 
> non-null value.
> This is similar to SPARK-12436 but the proposed solution is different.
> cc: [~rxin] [~smilegator]
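
A small self-contained sketch of the drift being described, using in-memory 
JSON strings rather than real files (names are illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("json-drift").getOrCreate()
import spark.implicits._

// Batch 1: field "a" is always null, so it is inferred as StringType.
val batch1 = spark.read.json(Seq("""{"id": 1, "a": null}""").toDS())
// Batch 2: the same field now carries a number, so it is inferred as LongType.
val batch2 = spark.read.json(Seq("""{"id": 2, "a": 42}""").toDS())

batch1.printSchema()   // a: string
batch2.printSchema()   // a: bigint
// Appending both batches to one Parquet dataset and reading it back with
// mergeSchema=true then fails, because StringType and LongType cannot be merged.
{code}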






[jira] [Assigned] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state

2018-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23785:


Assignee: (was: Apache Spark)

> LauncherBackend doesn't check state of connection before setting state
> --
>
> Key: SPARK-23785
> URL: https://issues.apache.org/jira/browse/SPARK-23785
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Priority: Major
>
> Found in HIVE-18533 while trying to integrate with the 
> {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its 
> connection to the {{LauncherServer}} before trying to run {{setState}} - 
> which sends a {{SetState}} message on the connection.






[jira] [Updated] (SPARK-23654) Cut jets3t as a dependency of spark-core; exclude it from hadoop-cloud module as incompatible

2018-03-23 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-23654:
---
Summary: Cut jets3t as a dependency of spark-core; exclude it from 
hadoop-cloud module as incompatible  (was: cut jets3t as a dependency of 
spark-core; exclude it from hadoop-cloud module as incompatible)

> Cut jets3t as a dependency of spark-core; exclude it from hadoop-cloud module 
> as incompatible
> -
>
> Key: SPARK-23654
> URL: https://issues.apache.org/jira/browse/SPARK-23654
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Spark core declares a dependency on Jets3t, which pulls in other cruft:
> # the hadoop-cloud module pulls in the hadoop-aws module with the 
> jets3t-compatible connectors and the relevant dependencies: the spark-core 
> dependency is incomplete if that module isn't built, and superfluous or 
> inconsistent if it is.
> # We've cut out s3n/s3 and all dependencies on jets3t entirely from Hadoop 
> 3.x, in favour of what we're willing to maintain.
> JetS3t was wonderful when it came out, but now the Amazon SDKs massively 
> exceed it in functionality, albeit at the expense of week-to-week stability 
> and JAR binary compatibility.






[jira] [Created] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state

2018-03-23 Thread Sahil Takiar (JIRA)
Sahil Takiar created SPARK-23785:


 Summary: LauncherBackend doesn't check state of connection before 
setting state
 Key: SPARK-23785
 URL: https://issues.apache.org/jira/browse/SPARK-23785
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Sahil Takiar


Found in HIVE-18533 while trying to integrate with the {{InProcessLauncher}}. 
{{LauncherBackend}} doesn't check the state of its connection to the 
{{LauncherServer}} before trying to run {{setState}} - which sends a 
{{SetState}} message on the connection.






[jira] [Created] (SPARK-23784) Cannot use custom Aggregator with groupBy/agg

2018-03-23 Thread Joshua Howard (JIRA)
Joshua Howard created SPARK-23784:
-

 Summary: Cannot use custom Aggregator with groupBy/agg 
 Key: SPARK-23784
 URL: https://issues.apache.org/jira/browse/SPARK-23784
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 2.3.0
Reporter: Joshua Howard


I have code 
[here|https://stackoverflow.com/questions/49440766/trouble-getting-spark-aggregators-to-work] 
where I am trying to use an Aggregator with both the `select` and `agg` 
functions. I cannot seem to get this to work in Spark 2.3.0. 
[Here|https://docs.cloud.databricks.com/docs/spark/1.6/examples/Dataset%20Aggregator.html] 
is a blog post that appears to use this functionality in Spark 1.6, but it 
appears to no longer work.






[jira] [Resolved] (SPARK-23783) Add new generic export trait for ML pipelines

2018-03-23 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-23783.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Add new generic export trait for ML pipelines
> -
>
> Key: SPARK-23783
> URL: https://issues.apache.org/jira/browse/SPARK-23783
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Major
> Fix For: 2.4.0
>
>
> Add a new generic export trait for ML pipelines so that we can support more 
> than just our internal format. API design based off of the 
> DataFrameReader/Writer design



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23783) Add new generic export trait for ML pipelines

2018-03-23 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-23783:
---

Assignee: holdenk

> Add new generic export trait for ML pipelines
> -
>
> Key: SPARK-23783
> URL: https://issues.apache.org/jira/browse/SPARK-23783
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Major
> Fix For: 2.4.0
>
>
> Add a new generic export trait for ML pipelines so that we can support more 
> than just our internal format. API design based off of the 
> DataFrameReader/Writer design



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11239) PMML export for ML linear regression

2018-03-23 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-11239:
---

Assignee: holdenk

> PMML export for ML linear regression
> 
>
> Key: SPARK-11239
> URL: https://issues.apache.org/jira/browse/SPARK-11239
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: holdenk
>Assignee: holdenk
>Priority: Major
> Fix For: 2.4.0
>
>
> Add PMML export for linear regression models from the ML pipeline.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11239) PMML export for ML linear regression

2018-03-23 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-11239.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> PMML export for ML linear regression
> 
>
> Key: SPARK-11239
> URL: https://issues.apache.org/jira/browse/SPARK-11239
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: holdenk
>Assignee: holdenk
>Priority: Major
> Fix For: 2.4.0
>
>
> Add PMML export for linear regression models from the ML pipeline.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21834) Incorrect executor request in case of dynamic allocation

2018-03-23 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411897#comment-16411897
 ] 

Imran Rashid commented on SPARK-21834:
--

SPARK-23365 is basically a duplicate of this, though both have changes 
associated with them. (I didn't realize it at the time, but SPARK-23365 is not 
strictly necessary on top of this; it does improve code clarity, though.)

> Incorrect executor request in case of dynamic allocation
> 
>
> Key: SPARK-21834
> URL: https://issues.apache.org/jira/browse/SPARK-21834
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Sital Kedia
>Priority: Major
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> killExecutor api currently does not allow killing an executor without 
> updating the total number of executors needed. In case of dynamic allocation 
> is turned on and the allocator tries to kill an executor, the scheduler 
> reduces the total number of executors needed ( see 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635)
>  which is incorrect because the allocator already takes care of setting the 
> required number of executors itself. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job

2018-03-23 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411894#comment-16411894
 ] 

Imran Rashid commented on SPARK-23365:
--

This is mostly a duplicate of SPARK-21834, though I'm not marking it as such 
since both had committed changes, and I think this change is still useful as 
cleanup.

The change from SPARK-21834 is actually sufficient to solve the problem 
described above.

> DynamicAllocation with failure in straggler task can lead to a hung spark job
> -
>
> Key: SPARK-23365
> URL: https://issues.apache.org/jira/browse/SPARK-23365
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> Dynamic Allocation can lead to a spark app getting stuck with 0 executors 
> requested when the executors in the last tasks of a taskset fail (eg. with an 
> OOM).
> This happens when {{ExecutorAllocationManager}} s internal target number of 
> executors gets out of sync with {{CoarseGrainedSchedulerBackend}} s target 
> number.  {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many 
> tasks are active or pending in submitted stages, and computes how many 
> executors would be needed for them.  And as tasks finish, it will actively 
> decrease that count, informing the {{CGSB}} along the way.  (2) When it 
> decides executors are inactive for long enough, then it requests that 
> {{CGSB}} kill the executors -- this also tells the {{CGSB}} to update its 
> target number of executors: 
> https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622
> So when there is just one task left, you could have the following sequence of 
> events:
> (1) the {{EAM}} sets the desired number of executors to 1, and updates the 
> {{CGSB}} too
> (2) while that final task is still running, the other executors cross the 
> idle timeout, and the {{EAM}} requests the {{CGSB}} kill them
> (3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target 
> of 0 executors
> If the final task completed normally now, everything would be OK; the next 
> taskset would get submitted, the {{EAM}} would increase the target number of 
> executors and it would update the {{CGSB}}.
> But if the executor for that final task failed (eg. an OOM), then the {{EAM}} 
> thinks it [doesn't need to update 
> anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386],
>  because its target is already 1, which is all it needs for that final task; 
> and the {{CGSB}} doesn't update anything either since its target is 0.
> I think you can determine if this is the cause of a stuck app by looking for
> {noformat}
> yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
> {noformat}
> in the logs of the ApplicationMaster (at least on yarn).
> You can reproduce this with this test app, run with {{--conf 
> "spark.dynamicAllocation.minExecutors=1" --conf 
> "spark.dynamicAllocation.maxExecutors=5" --conf 
> "spark.dynamicAllocation.executorIdleTimeout=5s"}}
> {code}
> import org.apache.spark.SparkEnv
> sc.setLogLevel("INFO")
> sc.parallelize(1 to 1, 1000).count()
> val execs = sc.parallelize(1 to 1000, 1000).map { _ => 
> SparkEnv.get.executorId}.collect().toSet
> val badExec = execs.head
> println("will kill exec " + badExec)
> sc.parallelize(1 to 5, 5).mapPartitions { itr =>
>   val exec = SparkEnv.get.executorId
>   if (exec == badExec) {
> Thread.sleep(2) // long enough that all the other tasks finish, and 
> the executors cross the idle timeout
> // now cause the executor to oom
> var buffers = Seq[Array[Byte]]()
> while(true) {
>   buffers :+= new Array[Byte](1e8.toInt)
> }
> itr
>   } else {
> itr
>   }
> }.collect()
> {code}
> *EDIT*: I adjusted the repro to cause an OOM on the bad executor, since 
> {{sc.killExecutor}} doesn't play nice with dynamic allocation in other ways.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23783) Add new generic export trait for ML pipelines

2018-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23783:


Assignee: Apache Spark

> Add new generic export trait for ML pipelines
> -
>
> Key: SPARK-23783
> URL: https://issues.apache.org/jira/browse/SPARK-23783
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Major
>
> Add a new generic export trait for ML pipelines so that we can support more 
> than just our internal format. API design based off of the 
> DataFrameReader/Writer design



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job

2018-03-23 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-23365:
-
Description: 
Dynamic Allocation can lead to a spark app getting stuck with 0 executors 
requested when the executors in the last tasks of a taskset fail (eg. with an 
OOM).

This happens when {{ExecutorAllocationManager}} s internal target number of 
executors gets out of sync with {{CoarseGrainedSchedulerBackend}} s target 
number.  {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many tasks 
are active or pending in submitted stages, and computes how many executors 
would be needed for them.  And as tasks finish, it will actively decrease that 
count, informing the {{CGSB}} along the way.  (2) When it decides executors are 
inactive for long enough, then it requests that {{CGSB}} kill the executors -- 
this also tells the {{CGSB}} to update its target number of executors: 
https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622

So when there is just one task left, you could have the following sequence of 
events:
(1) the {{EAM}} sets the desired number of executors to 1, and updates the 
{{CGSB}} too
(2) while that final task is still running, the other executors cross the idle 
timeout, and the {{EAM}} requests the {{CGSB}} kill them
(3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target 
of 0 executors

If the final task completed normally now, everything would be OK; the next 
taskset would get submitted, the {{EAM}} would increase the target number of 
executors and it would update the {{CGSB}}.

But if the executor for that final task failed (eg. an OOM), then the {{EAM}} 
thinks it [doesn't need to update 
anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386],
 because its target is already 1, which is all it needs for that final task; 
and the {{CGSB}} doesn't update anything either since its target is 0.

I think you can determine if this is the cause of a stuck app by looking for
{noformat}
yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
{noformat}
in the logs of the ApplicationMaster (at least on yarn).

You can reproduce this with this test app, run with {{--conf 
"spark.dynamicAllocation.minExecutors=1" --conf 
"spark.dynamicAllocation.maxExecutors=5" --conf 
"spark.dynamicAllocation.executorIdleTimeout=5s"}}

{code}
import org.apache.spark.SparkEnv

sc.setLogLevel("INFO")

sc.parallelize(1 to 1, 1000).count()

val execs = sc.parallelize(1 to 1000, 1000).map { _ => 
SparkEnv.get.executorId}.collect().toSet
val badExec = execs.head
println("will kill exec " + badExec)

sc.parallelize(1 to 5, 5).mapPartitions { itr =>
  val exec = SparkEnv.get.executorId
  if (exec == badExec) {
Thread.sleep(2) // long enough that all the other tasks finish, and the 
executors cross the idle timeout
// now cause the executor to oom
var buffers = Seq[Array[Byte]]()
while(true) {
  buffers :+= new Array[Byte](1e8.toInt)
}


itr
  } else {
itr
  }
}.collect()
{code}

*EDIT*: I adjusted the repro to cause an OOM on the bad executor, since 
{{sc.killExecutor}} doesn't play nice with dynamic allocation in other ways.

  was:
Dynamic Allocation can lead to a spark app getting stuck with 0 executors 
requested when the executors in the last tasks of a taskset fail (eg. with an 
OOM).

This happens when {{ExecutorAllocationManager}} s internal target number of 
executors gets out of sync with {{CoarseGrainedSchedulerBackend}} s target 
number.  {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many tasks 
are active or pending in submitted stages, and computes how many executors 
would be needed for them.  And as tasks finish, it will actively decrease that 
count, informing the {{CGSB}} along the way.  (2) When it decides executors are 
inactive for long enough, then it requests that {{CGSB}} kill the executors -- 
this also tells the {{CGSB}} to update its target number of executors: 
https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622

So when there is just one task left, you could have the following sequence of 
events:
(1) the {{EAM}} sets the desired number of executors to 1, and updates the 
{{CGSB}} too
(2) while that final task is still running, the other executors cross the idle 
timeout, and the {{EAM}} requests the {{CGSB}} kill them
(3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target 
of 0 executors

If the final task completed normally now, everything would be OK; the next 
taskset would get submitted, the {{EAM}} would increase th

[jira] [Assigned] (SPARK-23783) Add new generic export trait for ML pipelines

2018-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23783:


Assignee: (was: Apache Spark)

> Add new generic export trait for ML pipelines
> -
>
> Key: SPARK-23783
> URL: https://issues.apache.org/jira/browse/SPARK-23783
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Major
>
> Add a new generic export trait for ML pipelines so that we can support more 
> than just our internal format. API design based off of the 
> DataFrameReader/Writer design



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23783) Add new generic export trait for ML pipelines

2018-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411891#comment-16411891
 ] 

Apache Spark commented on SPARK-23783:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19876

> Add new generic export trait for ML pipelines
> -
>
> Key: SPARK-23783
> URL: https://issues.apache.org/jira/browse/SPARK-23783
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Major
>
> Add a new generic export trait for ML pipelines so that we can support more 
> than just our internal format. API design based off of the 
> DataFrameReader/Writer design



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23783) Add new generic export trait for ML pipelines

2018-03-23 Thread holdenk (JIRA)
holdenk created SPARK-23783:
---

 Summary: Add new generic export trait for ML pipelines
 Key: SPARK-23783
 URL: https://issues.apache.org/jira/browse/SPARK-23783
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.4.0
Reporter: holdenk


Add a new generic export trait for ML pipelines so that we can support more 
than just our internal format. API design based off of the 
DataFrameReader/Writer design



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21685) Params isSet in scala Transformer triggered by _setDefault in pyspark

2018-03-23 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-21685:
---

Assignee: Bryan Cutler

> Params isSet in scala Transformer triggered by _setDefault in pyspark
> -
>
> Key: SPARK-21685
> URL: https://issues.apache.org/jira/browse/SPARK-21685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Ratan Rai Sur
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> I'm trying to write a PySpark wrapper for a Transformer whose transform 
> method includes the line
> {code:java}
> require(!(isSet(outputNodeName) && isSet(outputNodeIndex)), "Can't set both 
> outputNodeName and outputNodeIndex")
> {code}
> This should only throw an exception when both of these parameters are 
> explicitly set.
> In the PySpark wrapper for the Transformer, there is this line in {{__init__}}
> {code:java}
> self._setDefault(outputNodeIndex=0)
> {code}
> Here is the line in the main python script showing how it is being configured
> {code:java}
> cntkModel = 
> CNTKModel().setInputCol("images").setOutputCol("output").setModelLocation(spark,
>  model.uri).setOutputNodeName("z")
> {code}
> As you can see, only setOutputNodeName is being explicitly set but the 
> exception is still being thrown.
> If you need more context, 
> https://github.com/RatanRSur/mmlspark/tree/default-cntkmodel-output is the 
> branch with the code, the files I'm referring to here that are tracked are 
> the following:
> src/cntk-model/src/main/scala/CNTKModel.scala
> notebooks/tests/301 - CIFAR10 CNTK CNN Evaluation.ipynb
> The pyspark wrapper code is autogenerated
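
As background, a minimal Scala sketch (class and param names are illustrative, 
not the CNTKModel code from the report) of the JVM-side semantics the 
{{require}} check relies on: a default set via {{setDefault}} leaves {{isSet}} 
false, and only an explicit set should flip it.

{code}
import org.apache.spark.ml.param.{Param, ParamMap, Params}
import org.apache.spark.ml.util.Identifiable

// Hedged sketch only.
class ParamsSketch(override val uid: String) extends Params {
  def this() = this(Identifiable.randomUID("sketch"))

  val outputNodeIndex = new Param[Int](this, "outputNodeIndex", "output node index")
  setDefault(outputNodeIndex -> 0)

  override def copy(extra: ParamMap): ParamsSketch = defaultCopy(extra)
}

val p = new ParamsSketch()
println(p.isSet(p.outputNodeIndex))     // false: a default alone is not an explicit set
println(p.isDefined(p.outputNodeIndex)) // true: a default still counts as defined
{code}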



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21685) Params isSet in scala Transformer triggered by _setDefault in pyspark

2018-03-23 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-21685.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Params isSet in scala Transformer triggered by _setDefault in pyspark
> -
>
> Key: SPARK-21685
> URL: https://issues.apache.org/jira/browse/SPARK-21685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Ratan Rai Sur
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> I'm trying to write a PySpark wrapper for a Transformer whose transform 
> method includes the line
> {code:java}
> require(!(isSet(outputNodeName) && isSet(outputNodeIndex)), "Can't set both 
> outputNodeName and outputNodeIndex")
> {code}
> This should only throw an exception when both of these parameters are 
> explicitly set.
> In the PySpark wrapper for the Transformer, there is this line in {{__init__}}
> {code:java}
> self._setDefault(outputNodeIndex=0)
> {code}
> Here is the line in the main python script showing how it is being configured
> {code:java}
> cntkModel = 
> CNTKModel().setInputCol("images").setOutputCol("output").setModelLocation(spark,
>  model.uri).setOutputNodeName("z")
> {code}
> As you can see, only setOutputNodeName is being explicitly set but the 
> exception is still being thrown.
> If you need more context, 
> https://github.com/RatanRSur/mmlspark/tree/default-cntkmodel-output is the 
> branch with the code, the files I'm referring to here that are tracked are 
> the following:
> src/cntk-model/src/main/scala/CNTKModel.scala
> notebooks/tests/301 - CIFAR10 CNTK CNN Evaluation.ipynb
> The pyspark wrapper code is autogenerated



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23700) Cleanup unused imports

2018-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411860#comment-16411860
 ] 

Apache Spark commented on SPARK-23700:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/20892

> Cleanup unused imports
> --
>
> Key: SPARK-23700
> URL: https://issues.apache.org/jira/browse/SPARK-23700
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
>
> I've noticed a fair amount of unused imports in pyspark, I'll take a look 
> through and try to clean them up



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23700) Cleanup unused imports

2018-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23700:


Assignee: Apache Spark

> Cleanup unused imports
> --
>
> Key: SPARK-23700
> URL: https://issues.apache.org/jira/browse/SPARK-23700
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Major
>
> I've noticed a fair amount of unused imports in pyspark, I'll take a look 
> through and try to clean them up



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23700) Cleanup unused imports

2018-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23700:


Assignee: (was: Apache Spark)

> Cleanup unused imports
> --
>
> Key: SPARK-23700
> URL: https://issues.apache.org/jira/browse/SPARK-23700
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
>
> I've noticed a fair amount of unused imports in pyspark, I'll take a look 
> through and try to clean them up



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23776) pyspark-sql tests should display build instructions when components are missing

2018-03-23 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411854#comment-16411854
 ] 

Bruce Robbins commented on SPARK-23776:
---

As it turns out, the building-spark page does have Maven instructions to build 
with Hive support before running the pyspark tests. However, it does not cover 
building with sbt, which needs more than just building with Hive support (you 
also need to compile the test classes).

> pyspark-sql tests should display build instructions when components are 
> missing
> ---
>
> Key: SPARK-23776
> URL: https://issues.apache.org/jira/browse/SPARK-23776
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> This is a follow up to SPARK-23417.
> The pyspark-streaming tests print useful build instructions when certain 
> components are missing in the build.
> pyspark-sql's udf and readwrite tests also have specific build requirements: 
> the build must compile test scala files, and the build must also create the 
> Hive assembly. When those class or jar files are not created, the tests throw 
> only partially helpful exceptions, e.g.:
> {noformat}
> AnalysisException: u'Can not load class 
> test.org.apache.spark.sql.JavaStringLength, please make sure it is on the 
> classpath;'
> {noformat}
> or
> {noformat}
> IllegalArgumentException: u"Error while instantiating 
> 'org.apache.spark.sql.hive.HiveExternalCatalog':"
> {noformat}
> You end up in this situation when you follow Spark's build instructions and 
> then attempt to run the pyspark tests.
> It would be nice if pyspark-sql tests provide helpful build instructions in 
> these cases.
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23782) SHS should not show applications to user without read permission

2018-03-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411851#comment-16411851
 ] 

Marcelo Vanzin commented on SPARK-23782:


More discussion at: https://github.com/apache/spark/pull/17582

> SHS should not show applications to user without read permission
> 
>
> Key: SPARK-23782
> URL: https://issues.apache.org/jira/browse/SPARK-23782
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> The History Server shows all the applications to all the users, even though 
> they have no permission to read them. They cannot read the details of the 
> applications they cannot access, but still anybody can list all the 
> applications submitted by all users.
> For instance, if we have an admin user {{admin}} and two normal users {{u1}} 
> and {{u2}}, and each of them submitted one application, all of them can see 
> in the main page of SHS:
> ||App ID||App Name|| ... ||Spark User|| ... ||
> |app-123456789|The Admin App| .. |admin| ... |
> |app-123456790|u1 secret app| .. |u1| ... |
> |app-123456791|u2 secret app| .. |u2| ... |
> Then clicking on each application, the proper permissions are applied and 
> each user can see only the applications he has the read permission for.
> Instead, each user should see only the applications he has the permission to 
> read and he/she should not be able to see applications he has not the 
> permissions for.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23782) SHS should not show applications to user without read permission

2018-03-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411843#comment-16411843
 ] 

Marcelo Vanzin edited comment on SPARK-23782 at 3/23/18 6:23 PM:
-

bq. This seems a security hole to me

What sensitive information is being exposed to users that should not see it?

Won't you get that same info if you go to the resource manager's page and look 
at what applications have run?


was (Author: vanzin):
bq. This seems a security hole to me

What sensitive information if being exposed to users that should not see it?

Won't you get that same info if you go to the resource manager's page and look 
at what applications have run?

> SHS should not show applications to user without read permission
> 
>
> Key: SPARK-23782
> URL: https://issues.apache.org/jira/browse/SPARK-23782
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> The History Server shows all the applications to all the users, even though 
> they have no permission to read them. They cannot read the details of the 
> applications they cannot access, but still anybody can list all the 
> applications submitted by all users.
> For instance, if we have an admin user {{admin}} and two normal users {{u1}} 
> and {{u2}}, and each of them submitted one application, all of them can see 
> in the main page of SHS:
> ||App ID||App Name|| ... ||Spark User|| ... ||
> |app-123456789|The Admin App| .. |admin| ... |
> |app-123456790|u1 secret app| .. |u1| ... |
> |app-123456791|u2 secret app| .. |u2| ... |
> Then clicking on each application, the proper permissions are applied and 
> each user can see only the applications he has the read permission for.
> Instead, each user should see only the applications he has the permission to 
> read and he/she should not be able to see applications he has not the 
> permissions for.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23782) SHS should not show applications to user without read permission

2018-03-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411843#comment-16411843
 ] 

Marcelo Vanzin commented on SPARK-23782:


bq. This seems a security hole to me

What sensitive information is being exposed to users that should not see it?

Won't you get that same info if you go to the resource manager's page and look 
at what applications have run?

> SHS should not show applications to user without read permission
> 
>
> Key: SPARK-23782
> URL: https://issues.apache.org/jira/browse/SPARK-23782
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> The History Server shows all the applications to all the users, even though 
> they have no permission to read them. They cannot read the details of the 
> applications they cannot access, but still anybody can list all the 
> applications submitted by all users.
> For instance, if we have an admin user {{admin}} and two normal users {{u1}} 
> and {{u2}}, and each of them submitted one application, all of them can see 
> in the main page of SHS:
> ||App ID||App Name|| ... ||Spark User|| ... ||
> |app-123456789|The Admin App| .. |admin| ... |
> |app-123456790|u1 secret app| .. |u1| ... |
> |app-123456791|u2 secret app| .. |u2| ... |
> Then clicking on each application, the proper permissions are applied and 
> each user can see only the applications he has the read permission for.
> Instead, each user should see only the applications he has the permission to 
> read and he/she should not be able to see applications he has not the 
> permissions for.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23782) SHS should not show applications to user without read permission

2018-03-23 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411836#comment-16411836
 ] 

Marco Gaido commented on SPARK-23782:
-

[~vanzin] sorry but I have not been able to find any JIRA related to this. 
Probably it is my fault and I just missed it.

Anyway, it sounds pretty weird to me that the application listing doesn't need 
to be filtered. This seems like a security hole to me. I can't think of any 
reason, or any other tool, that lets users list things they have no read 
permission for.

> SHS should not show applications to user without read permission
> 
>
> Key: SPARK-23782
> URL: https://issues.apache.org/jira/browse/SPARK-23782
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> The History Server shows all the applications to all the users, even though 
> they have no permission to read them. They cannot read the details of the 
> applications they cannot access, but still anybody can list all the 
> applications submitted by all users.
> For instance, if we have an admin user {{admin}} and two normal users {{u1}} 
> and {{u2}}, and each of them submitted one application, all of them can see 
> in the main page of SHS:
> ||App ID||App Name|| ... ||Spark User|| ... ||
> |app-123456789|The Admin App| .. |admin| ... |
> |app-123456790|u1 secret app| .. |u1| ... |
> |app-123456791|u2 secret app| .. |u2| ... |
> Then clicking on each application, the proper permissions are applied and 
> each user can see only the applications he has the read permission for.
> Instead, each user should see only the applications he has the permission to 
> read and he/she should not be able to see applications he has not the 
> permissions for.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23782) SHS should not show applications to user without read permission

2018-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23782:


Assignee: Apache Spark

> SHS should not show applications to user without read permission
> 
>
> Key: SPARK-23782
> URL: https://issues.apache.org/jira/browse/SPARK-23782
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Major
>
> The History Server shows all the applications to all the users, even though 
> they have no permission to read them. They cannot read the details of the 
> applications they cannot access, but still anybody can list all the 
> applications submitted by all users.
> For instance, if we have an admin user {{admin}} and two normal users {{u1}} 
> and {{u2}}, and each of them submitted one application, all of them can see 
> in the main page of SHS:
> ||App ID||App Name|| ... ||Spark User|| ... ||
> |app-123456789|The Admin App| .. |admin| ... |
> |app-123456790|u1 secret app| .. |u1| ... |
> |app-123456791|u2 secret app| .. |u2| ... |
> Then clicking on each application, the proper permissions are applied and 
> each user can see only the applications he has the read permission for.
> Instead, each user should see only the applications he has the permission to 
> read and he/she should not be able to see applications he has not the 
> permissions for.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23782) SHS should not show applications to user without read permission

2018-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411824#comment-16411824
 ] 

Apache Spark commented on SPARK-23782:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20891

> SHS should not show applications to user without read permission
> 
>
> Key: SPARK-23782
> URL: https://issues.apache.org/jira/browse/SPARK-23782
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> The History Server shows all the applications to all the users, even though 
> they have no permission to read them. They cannot read the details of the 
> applications they cannot access, but still anybody can list all the 
> applications submitted by all users.
> For instance, if we have an admin user {{admin}} and two normal users {{u1}} 
> and {{u2}}, and each of them submitted one application, all of them can see 
> in the main page of SHS:
> ||App ID||App Name|| ... ||Spark User|| ... ||
> |app-123456789|The Admin App| .. |admin| ... |
> |app-123456790|u1 secret app| .. |u1| ... |
> |app-123456791|u2 secret app| .. |u2| ... |
> Then clicking on each application, the proper permissions are applied and 
> each user can see only the applications he has the read permission for.
> Instead, each user should see only the applications he has the permission to 
> read and he/she should not be able to see applications he has not the 
> permissions for.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23782) SHS should not show applications to user without read permission

2018-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23782:


Assignee: (was: Apache Spark)

> SHS should not show applications to user without read permission
> 
>
> Key: SPARK-23782
> URL: https://issues.apache.org/jira/browse/SPARK-23782
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> The History Server shows all the applications to all the users, even though 
> they have no permission to read them. They cannot read the details of the 
> applications they cannot access, but still anybody can list all the 
> applications submitted by all users.
> For instance, if we have an admin user {{admin}} and two normal users {{u1}} 
> and {{u2}}, and each of them submitted one application, all of them can see 
> in the main page of SHS:
> ||App ID||App Name|| ... ||Spark User|| ... ||
> |app-123456789|The Admin App| .. |admin| ... |
> |app-123456790|u1 secret app| .. |u1| ... |
> |app-123456791|u2 secret app| .. |u2| ... |
> Then clicking on each application, the proper permissions are applied and 
> each user can see only the applications he has the read permission for.
> Instead, each user should see only the applications he has the permission to 
> read and he/she should not be able to see applications he has not the 
> permissions for.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23782) SHS should not show applications to user without read permission

2018-03-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411811#comment-16411811
 ] 

Marcelo Vanzin commented on SPARK-23782:


I'm pretty sure this was discussed before and the decision was that listing 
information does not need to be filtered per user, but I'm too lazy to do a 
jira search right now.

> SHS should not show applications to user without read permission
> 
>
> Key: SPARK-23782
> URL: https://issues.apache.org/jira/browse/SPARK-23782
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> The History Server shows all the applications to all the users, even though 
> they have no permission to read them. They cannot read the details of the 
> applications they cannot access, but still anybody can list all the 
> applications submitted by all users.
> For instance, if we have an admin user {{admin}} and two normal users {{u1}} 
> and {{u2}}, and each of them submitted one application, all of them can see 
> in the main page of SHS:
> ||App ID||App Name|| ... ||Spark User|| ... ||
> |app-123456789|The Admin App| .. |admin| ... |
> |app-123456790|u1 secret app| .. |u1| ... |
> |app-123456791|u2 secret app| .. |u2| ... |
> Then clicking on each application, the proper permissions are applied and 
> each user can see only the applications he has the read permission for.
> Instead, each user should see only the applications he has the permission to 
> read and he/she should not be able to see applications he has not the 
> permissions for.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23782) SHS should not show applications to user without read permission

2018-03-23 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-23782:
---

 Summary: SHS should not show applications to user without read 
permission
 Key: SPARK-23782
 URL: https://issues.apache.org/jira/browse/SPARK-23782
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.0
Reporter: Marco Gaido


The History Server shows all the applications to all the users, even though 
they have no permission to read them. They cannot read the details of the 
applications they cannot access, but still anybody can list all the 
applications submitted by all users.

For instance, if we have an admin user {{admin}} and two normal users {{u1}} 
and {{u2}}, and each of them submitted one application, all of them can see in 
the main page of SHS:

||App ID||App Name|| ... ||Spark User|| ... ||
|app-123456789|The Admin App| .. |admin| ... |
|app-123456790|u1 secret app| .. |u1| ... |
|app-123456791|u2 secret app| .. |u2| ... |

When clicking on an individual application, the proper permissions are applied, 
and each user can see the details only of the applications he or she has read 
permission for.

Instead, each user should see in the listing only the applications he or she 
has permission to read, and should not be able to see applications he or she 
lacks permission for.
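
A minimal sketch of the proposed listing-level filtering, with hypothetical 
types and names ({{AppInfo}}, {{canRead}}); the idea is simply to apply the 
same read check used on the detail page to the listing itself:

{code}
// Hedged sketch only -- not the actual SHS code.
case class AppInfo(id: String, name: String, sparkUser: String, viewAcls: Set[String])

def canRead(requestUser: String, app: AppInfo, admins: Set[String]): Boolean =
  admins.contains(requestUser) ||
    requestUser == app.sparkUser ||
    app.viewAcls.contains(requestUser)

// The listing would then return only what the requesting user may read.
def visibleApps(requestUser: String, apps: Seq[AppInfo], admins: Set[String]): Seq[AppInfo] =
  apps.filter(canRead(requestUser, _, admins))
{code}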



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22342) refactor schedulerDriver registration

2018-03-23 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411780#comment-16411780
 ] 

Susan X. Huynh commented on SPARK-22342:


The multiple re-registration issue can lead to blacklisting and starvation when 
there are multiple executors per host. For example, suppose I have a host with 
8 CPUs and I specify spark.executor.cores=4. Then two executors could 
potentially get allocated on that host. If they both receive a TASK_LOST, that 
host will get blacklisted (since MAX_SLAVE_FAILURES=2). If this happens on 
every host, the app will be starved. I have hit this bug a lot when running on 
large machines (16-64 CPUs) and specifying a small executor size, 
spark.executor.cores=4.

> refactor schedulerDriver registration
> -
>
> Key: SPARK-22342
> URL: https://issues.apache.org/jira/browse/SPARK-22342
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This is an umbrella issue for working on:
> https://github.com/apache/spark/pull/13143
> and handle the multiple re-registration issue which invalidates an offer.
> To test:
>  dcos spark run --verbose --name=spark-nohive  --submit-args="--driver-cores 
> 1 --conf spark.cores.max=1 --driver-memory 512M --class 
> org.apache.spark.examples.SparkPi http://.../spark-examples_2.11-2.2.0.jar";
> master log:
> I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3085 hierarchical.cpp:303] Added framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3085 hierarchical.cpp:412] Deactivated framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3090 hierarchical.cpp:380] Activated framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [  ]
> I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00 3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00 3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c

[jira] [Assigned] (SPARK-23759) Unable to bind Spark UI to specific host name / IP

2018-03-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23759:
--

Assignee: Felix

> Unable to bind Spark UI to specific host name / IP
> --
>
> Key: SPARK-23759
> URL: https://issues.apache.org/jira/browse/SPARK-23759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.2.0
>Reporter: Felix
>Assignee: Felix
>Priority: Major
> Fix For: 2.2.2, 2.3.1, 2.4.0
>
>
> Ideally, exporting SPARK_LOCAL_IP= in the spark2 
> environment should allow the Spark2 History Server to bind to a private 
> interface; however, this is not working in Spark 2.2.0
>  
> Spark2 history server still listens on 0.0.0.0
> {code:java}
> [root@sparknode1 ~]# netstat -tulapn|grep 18081
> tcp0  0 0.0.0.0:18081   0.0.0.0:*   
> LISTEN  21313/java
> tcp0  0 172.26.104.151:39126172.26.104.151:18081
> TIME_WAIT   -
> {code}
> On earlier versions this change was working fine:
> {code:java}
> [root@dwphive1 ~]# netstat -tulapn|grep 18081
> tcp0  0 172.26.113.55:18081 0.0.0.0:*   
> LISTEN  2565/java
> {code}
>  
> This issue not only affects SHS but also Spark UI in general
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23759) Unable to bind Spark UI to specific host name / IP

2018-03-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23759.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0
   2.2.2

Issue resolved by pull request 20883
[https://github.com/apache/spark/pull/20883]

> Unable to bind Spark UI to specific host name / IP
> --
>
> Key: SPARK-23759
> URL: https://issues.apache.org/jira/browse/SPARK-23759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.2.0
>Reporter: Felix
>Priority: Major
> Fix For: 2.2.2, 2.4.0, 2.3.1
>
>
> Ideally, exporting SPARK_LOCAL_IP= in the spark2 
> environment should allow the Spark2 History Server to bind to a private 
> interface; however, this is not working in Spark 2.2.0
>  
> Spark2 history server still listens on 0.0.0.0
> {code:java}
> [root@sparknode1 ~]# netstat -tulapn|grep 18081
> tcp0  0 0.0.0.0:18081   0.0.0.0:*   
> LISTEN  21313/java
> tcp0  0 172.26.104.151:39126172.26.104.151:18081
> TIME_WAIT   -
> {code}
> On earlier versions this change was working fine:
> {code:java}
> [root@dwphive1 ~]# netstat -tulapn|grep 18081
> tcp0  0 172.26.113.55:18081 0.0.0.0:*   
> LISTEN  2565/java
> {code}
>  
> This issue not only affects SHS but also Spark UI in general
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23781) Merge YARN and Mesos token renewal code

2018-03-23 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-23781:
--

 Summary: Merge YARN and Mesos token renewal code
 Key: SPARK-23781
 URL: https://issues.apache.org/jira/browse/SPARK-23781
 Project: Spark
  Issue Type: Improvement
  Components: YARN, Mesos
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


With the fix for SPARK-23361, the code that handles delegation tokens in Mesos 
and YARN ends up being very similar.

We should refactor that code so that both backends share the same code, which 
would also make it easier for other cluster managers to use it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job

2018-03-23 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-23365:
-
Description: 
Dynamic Allocation can lead to a spark app getting stuck with 0 executors 
requested when the executors in the last tasks of a taskset fail (eg. with an 
OOM).

This happens when {{ExecutorAllocationManager}} s internal target number of 
executors gets out of sync with {{CoarseGrainedSchedulerBackend}} s target 
number.  {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many tasks 
are active or pending in submitted stages, and computes how many executors 
would be needed for them.  And as tasks finish, it will actively decrease that 
count, informing the {{CGSB}} along the way.  (2) When it decides executors are 
inactive for long enough, then it requests that {{CGSB}} kill the executors -- 
this also tells the {{CGSB}} to update its target number of executors: 
https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622

So when there is just one task left, you could have the following sequence of 
events:
(1) the {{EAM}} sets the desired number of executors to 1, and updates the 
{{CGSB}} too
(2) while that final task is still running, the other executors cross the idle 
timeout, and the {{EAM}} requests the {{CGSB}} kill them
(3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target 
of 0 executors

If the final task completed normally now, everything would be OK; the next 
taskset would get submitted, the {{EAM}} would increase the target number of 
executors and it would update the {{CGSB}}.

But if the executor for that final task failed (eg. an OOM), then the {{EAM}} 
thinks it [doesn't need to update 
anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386],
 because its target is already 1, which is all it needs for that final task; 
and the {{CGSB}} doesn't update anything either since its target is 0.

I think you can determine if this is the cause of a stuck app by looking for
{noformat}
yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
{noformat}
in the logs of the ApplicationMaster (at least on yarn).

You can reproduce this with this test app, run with {{--conf 
"spark.dynamicAllocation.minExecutors=1" --conf 
"spark.dynamicAllocation.maxExecutors=5" --conf 
"spark.dynamicAllocation.executorIdleTimeout=5s"}}

{code}
import org.apache.spark.SparkEnv

sc.setLogLevel("INFO")

sc.parallelize(1 to 1, 1000).count()

val execs = sc.parallelize(1 to 1000, 1000).map { _ => 
SparkEnv.get.executorId}.collect().toSet
val badExec = execs.head
println("will kill exec " + badExec)

new Thread() {
  override def run(): Unit = {
Thread.sleep(1)
println("about to kill exec " + badExec)
sc.killExecutor(badExec)
  }
}.start()

sc.parallelize(1 to 5, 5).mapPartitions { itr =>
  val exec = SparkEnv.get.executorId
  if (exec == badExec) {
Thread.sleep(2) // long enough that all the other tasks finish, and the 
executors cross the idle timeout
// meanwhile, something else should kill this executor
itr
  } else {
itr
  }
}.collect()
{code}

  was:
Dynamic Allocation can lead to a spark app getting stuck with 0 executors 
requested when the executors in the last tasks of a taskset fail (eg. with an 
OOM).

This happens when {{ExecutorAllocationManager}} s internal target number of 
executors gets out of sync with {{CoarseGrainedSchedulerBackend}} s target 
number.  {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many tasks 
are active or pending in submitted stages, and computes how many executors 
would be needed for them.  And as tasks finish, it will actively decrease that 
count, informing the {{CGSB}} along the way.  (2) When it decides executors are 
inactive for long enough, then it requests that {{CGSB}} kill the executors -- 
this also tells the {{CGSB}} to update its target number of executors: 
https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622

So when there is just one task left, you could have the following sequence of 
events:
(1) the {{EAM}} sets the desired number of executors to 1, and updates the 
{{CGSB}} too
(2) while that final task is still running, the other executors cross the idle 
timeout, and the {{EAM}} requests the {{CGSB}} kill them
(3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target 
of 0 executors

If the final task completed normally now, everything would be OK; the next 
taskset would get submitted, the {{EAM}} would increase the target number of 
executors and it would update the {{CGSB}}.

But if the ex

[jira] [Created] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-03-23 Thread Ivan Dzikovsky (JIRA)
Ivan Dzikovsky created SPARK-23780:
--

 Summary: Failed to use googleVis library with new SparkR
 Key: SPARK-23780
 URL: https://issues.apache.org/jira/browse/SPARK-23780
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.1
Reporter: Ivan Dzikovsky


I've tried to use the googleVis library with Spark 2.2.1, and ran into a problem.

Steps to reproduce:
# Install R with googleVis library.
# Run SparkR:
{code}
sparkR --master yarn --deploy-mode client
{code}
# Run code that uses googleVis:
{code}
library(googleVis)
df=data.frame(country=c("US", "GB", "BR"), 
  val1=c(10,13,14), 
  val2=c(23,12,32))
Bar <- gvisBarChart(df)
cat("%html ", Bar$html$chart)
{code}

Then I got the following error message:
{code}
Error : .onLoad failed in loadNamespace() for 'googleVis', details:
  call: rematchDefinition(definition, fdef, mnames, fnames, signature)
  error: methods can add arguments to the generic 'toJSON' only if '...' is an 
argument to the generic
Error : package or namespace load failed for 'googleVis'
{code}

But the expected result is to get some HTML code output, as it was with Spark 2.1.0.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem

2018-03-23 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411503#comment-16411503
 ] 

Cody Koeninger commented on SPARK-23739:


I meant the version of the org.apache.kafka kafka-clients artifact,
not the org.apache.spark artifact.

You can also unzip the assembly jar, it should have a license file for
kafka-clients that has the version number appended to it, and that way
you can also verify that the class file for that class is in the
assembly.
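
As a complement to the license-file check, here is a quick runtime sketch 
(assuming you can open a spark-shell against the same assembly; the snippet is 
only illustrative, not part of Spark) that prints which kafka-clients version 
and jar were actually loaded:

{code}
import org.apache.kafka.common.utils.AppInfoParser

// Version string baked into the kafka-clients jar on the classpath.
val clientsVersion = AppInfoParser.getVersion

// Which jar the consumer class was actually loaded from.
val loadedFrom = classOf[org.apache.kafka.clients.consumer.KafkaConsumer[_, _]]
  .getProtectionDomain.getCodeSource.getLocation

println(s"kafka-clients $clientsVersion loaded from $loadedFrom")
{code}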



> Spark structured streaming long running problem
> ---
>
> Key: SPARK-23739
> URL: https://issues.apache.org/jira/browse/SPARK-23739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Florencio
>Priority: Critical
>  Labels: spark, streaming, structured
>
> I had a problem with long running spark structured streaming in spark 2.1. 
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.kafka.common.requests.LeaveGroupResponse.
> The detailed error is the following:
> 18/03/16 16:10:57 INFO StreamExecution: Committed offsets for batch 2110. 
> Metadata OffsetSeqMetadata(0,1521216656590)
> 18/03/16 16:10:57 INFO KafkaSource: GetBatch called with start = 
> Some(\{"TopicName":{"2":5520197,"1":5521045,"3":5522054,"0":5527915}}), end = 
> \{"TopicName":{"2":5522730,"1":5523577,"3":5524586,"0":5530441}}
> 18/03/16 16:10:57 INFO KafkaSource: Partitions added: Map()
> 18/03/16 16:10:57 ERROR StreamExecution: Query [id = 
> a233b9ff-cc39-44d3-b953-a255986c04bf, runId = 
> 8520e3c0-2455-4ac1-9021-8518fb58b3f8] terminated with error
> java.util.zip.ZipException: invalid code lengths set
>  at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>  at java.io.FilterInputStream.read(FilterInputStream.java:133)
>  at java.io.FilterInputStream.read(FilterInputStream.java:107)
>  at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354)
>  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>  at org.apache.spark.util.Utils$.copyStream(Utils.scala:362)
>  at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:45)
>  at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:83)
>  at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>  at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
>  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
>  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>  at org.apache.spark.rdd.RDD.map(RDD.scala:369)
>  at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:287)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:503)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:499)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>  at 
> org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
>  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> 18/03/16 16:10:57 ERROR ClientUtils: Failed to close coordinator
> java.lang.NoClassDefFoundError: 
> org/apache/kafka/common/requests/LeaveGroupResponse
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendLeaveGroupRequest(AbstractCoordinator.java:575)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.maybeLeaveGroup(AbstractCoordinator.java:566)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:555)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:377)
>  at org.apache.kaf

[jira] [Commented] (SPARK-23655) Add support for type aclitem (PostgresDialect)

2018-03-23 Thread Diego da Silva Colombo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411485#comment-16411485
 ] 

Diego da Silva Colombo commented on SPARK-23655:


[~maropu] I feel it too; it does look like an awkward type for Spark, but 
unfortunately the table I'm trying to access is managed by Postgres, so I 
cannot change its type. I've made a workaround by creating a custom JdbcDialect 
and returning a StringType from toCatalystType. For my case it worked, but if 
you try to access the aclitem column an exception occurs.
If you don't think this is useful for Spark, feel free to close the issue.

Thanks for the attention.
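
For reference, a rough sketch of that workaround (my own illustration, not the 
actual patch; the object name is made up, and the overridable hook on a custom 
{{JdbcDialect}} is {{getCatalystType}}):

{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

// Maps the PostgreSQL "aclitem" type to StringType so the column can at least be read.
object PostgresAclItemDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:postgresql")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    // Treat aclitem (and aclitem[], reported as "_aclitem") as plain strings;
    // return None for everything else so the default mapping still applies.
    if (typeName == "aclitem" || typeName == "_aclitem") Some(StringType) else None
  }
}

// Register it before reading the table, e.g.:
// JdbcDialects.registerDialect(PostgresAclItemDialect)
{code}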

> Add support for type aclitem (PostgresDialect)
> --
>
> Key: SPARK-23655
> URL: https://issues.apache.org/jira/browse/SPARK-23655
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Diego da Silva Colombo
>Priority: Major
>
> When I try to load the data of pg_database, an exception occurs:
> `java.lang.RuntimeException: java.sql.SQLException: Unsupported type 2003`
> It happens because the typeName of the column is *aclitem*, and there is no 
> match case for this type in toCatalystType.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22239) User-defined window functions with pandas udf

2018-03-23 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411465#comment-16411465
 ] 

Li Jin commented on SPARK-22239:


Yeah unbounded windows are really just "groupby" in this case. I need to think 
more about bounded windows. I will send out some doc/ideas before implementing 
bounded windows. For now I will just do unbounded windows.
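
To illustrate the "unbounded windows are really just groupby" point, here is a 
small sketch against the existing Scala DataFrame API (made-up data, and not 
the proposed pandas_udf API): the same per-id mean comes out of an unbounded 
window spec and out of a groupBy joined back onto the original rows.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object UnboundedWindowVsGroupBy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("demo").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 1.0), (1, 3.0), (2, 4.0)).toDF("id", "v")

    // Unbounded window: every row in the partition contributes to every row's result.
    val w = Window.partitionBy("id")
      .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
    val viaWindow = df.withColumn("v_mean", avg($"v").over(w))

    // Equivalent result via groupBy + join back onto the original rows.
    val viaGroupBy = df.join(df.groupBy("id").agg(avg($"v").as("v_mean")), "id")

    viaWindow.show()
    viaGroupBy.show()
    spark.stop()
  }
}
{code}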

> User-defined window functions with pandas udf
> -
>
> Key: SPARK-22239
> URL: https://issues.apache.org/jira/browse/SPARK-22239
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: 
>Reporter: Li Jin
>Priority: Major
>
> Window function is another place we can benefit from vectored udf and add 
> another useful function to the pandas_udf suite.
> Example usage (preliminary):
> {code:java}
> w = Window.partitionBy('id').orderBy('time').rangeBetween(-200, 0)
> @pandas_udf(DoubleType())
> def ema(v1):
>     return v1.ewm(alpha=0.5).mean().iloc[-1]
> df.withColumn('v1_ema', ema(df.v1).over(w))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23779) TaskMemoryManager and UnsafeSorter use MemoryBlock

2018-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411425#comment-16411425
 ] 

Apache Spark commented on SPARK-23779:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20890

> TaskMemoryManager and UnsafeSorter use MemoryBlock
> --
>
> Key: SPARK-23779
> URL: https://issues.apache.org/jira/browse/SPARK-23779
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry tries to use {{MemoryBlock}} in {{TaskMemoryManager}} and 
> classes related to {{UnsafeSorter}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23779) TaskMemoryManager and UnsafeSorter use MemoryBlock

2018-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23779:


Assignee: Apache Spark

> TaskMemoryManager and UnsafeSorter use MemoryBlock
> --
>
> Key: SPARK-23779
> URL: https://issues.apache.org/jira/browse/SPARK-23779
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Major
>
> This JIRA entry tries to use {{MemoryBlock}} in {{TaskMemoryManager}} and 
> classes related to {{UnsafeSorter}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23779) TaskMemoryManager and UnsafeSorter use MemoryBlock

2018-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23779:


Assignee: (was: Apache Spark)

> TaskMemoryManager and UnsafeSorter use MemoryBlock
> --
>
> Key: SPARK-23779
> URL: https://issues.apache.org/jira/browse/SPARK-23779
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry tries to use {{MemoryBlock}} in {{TaskMemoryManager}} and 
> classes related to {{UnsafeSorter}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23779) TaskMemoryManager and UnsafeSorter use MemoryBlock

2018-03-23 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-23779:


 Summary: TaskMemoryManager and UnsafeSorter use MemoryBlock
 Key: SPARK-23779
 URL: https://issues.apache.org/jira/browse/SPARK-23779
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


This JIRA entry tries to use {{MemoryBlock}} in {{TaskMemoryManager}} and classes 
related to {{UnsafeSorter}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem

2018-03-23 Thread Florencio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411406#comment-16411406
 ] 

Florencio commented on SPARK-23739:
---

Thanks for the information. The kafka version is "org.apache.spark" % 
"spark-sql-kafka-0-10_2.11" % "2.1.0"

> Spark structured streaming long running problem
> ---
>
> Key: SPARK-23739
> URL: https://issues.apache.org/jira/browse/SPARK-23739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Florencio
>Priority: Critical
>  Labels: spark, streaming, structured
>
> I had a problem with long running spark structured streaming in spark 2.1. 
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.kafka.common.requests.LeaveGroupResponse.
> The detailed error is the following:
> 18/03/16 16:10:57 INFO StreamExecution: Committed offsets for batch 2110. 
> Metadata OffsetSeqMetadata(0,1521216656590)
> 18/03/16 16:10:57 INFO KafkaSource: GetBatch called with start = 
> Some(\{"TopicName":{"2":5520197,"1":5521045,"3":5522054,"0":5527915}}), end = 
> \{"TopicName":{"2":5522730,"1":5523577,"3":5524586,"0":5530441}}
> 18/03/16 16:10:57 INFO KafkaSource: Partitions added: Map()
> 18/03/16 16:10:57 ERROR StreamExecution: Query [id = 
> a233b9ff-cc39-44d3-b953-a255986c04bf, runId = 
> 8520e3c0-2455-4ac1-9021-8518fb58b3f8] terminated with error
> java.util.zip.ZipException: invalid code lengths set
>  at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>  at java.io.FilterInputStream.read(FilterInputStream.java:133)
>  at java.io.FilterInputStream.read(FilterInputStream.java:107)
>  at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354)
>  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>  at org.apache.spark.util.Utils$.copyStream(Utils.scala:362)
>  at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:45)
>  at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:83)
>  at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>  at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
>  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
>  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>  at org.apache.spark.rdd.RDD.map(RDD.scala:369)
>  at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:287)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:503)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:499)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>  at 
> org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
>  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> 18/03/16 16:10:57 ERROR ClientUtils: Failed to close coordinator
> java.lang.NoClassDefFoundError: 
> org/apache/kafka/common/requests/LeaveGroupResponse
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendLeaveGroupRequest(AbstractCoordinator.java:575)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.maybeLeaveGroup(AbstractCoordinator.java:566)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:555)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:377)
>  at org.apache.kafka.clients.ClientUtils.closeQuietly(ClientUtils.java:66)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1383)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.ja

[jira] [Comment Edited] (SPARK-23650) Slow SparkR udf (dapply)

2018-03-23 Thread Deepansh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411246#comment-16411246
 ] 

Deepansh edited comment on SPARK-23650 at 3/23/18 1:43 PM:
---

The R environment inside the thread that applies the UDF is not getting reused 
(I think "cached" is not the right word in this context). It is created and 
destroyed with each query.

{code:R}
kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = 
"10.117.172.48:9092", topic = "source")
lines<- select(kafka, cast(kafka$value, "string"))
schema<-schema(lines)
library(caret)

df4<-dapply(lines,function(x){
  print(system.time(library(caret)))
  x
},schema)

q2 <- write.stream(df4,"kafka", checkpointLocation = loc, topic = "sink", 
kafka.bootstrap.servers = "10.117.172.48:9092")
awaitTermination(q2)
{code}

For the above code, for every new stream my output is,
18/03/23 11:08:10 INFO BufferedStreamThread: Loading required package: lattice
18/03/23 11:08:10 INFO BufferedStreamThread: 
18/03/23 11:08:10 INFO BufferedStreamThread: Attaching package: ‘lattice’
18/03/23 11:08:10 INFO BufferedStreamThread: 
18/03/23 11:08:10 INFO BufferedStreamThread: The following object is masked 
from ‘package:SparkR’:
18/03/23 11:08:10 INFO BufferedStreamThread: 
18/03/23 11:08:10 INFO BufferedStreamThread: histogram
18/03/23 11:08:10 INFO BufferedStreamThread: 
18/03/23 11:08:10 INFO BufferedStreamThread: Loading required package: ggplot2
18/03/23 11:08:12 INFO BufferedStreamThread:user  system elapsed 
18/03/23 11:08:12 INFO BufferedStreamThread:   1.937   0.062   1.999 
18/03/23 11:08:12 INFO RRunner: Times: boot = 0.009 s, init = 0.017 s, 
broadcast = 0.001 s, read-input = 0.001 s, compute = 2.064 s, write-output = 
0.001 s, total = 2.093 s

PFA: rest log file.

For every new incoming stream, the packages are loaded again inside the thread, 
which means the R environment inside the thread is not getting reused; it is 
created and destroyed every time.

The model (an iris model) on which I am testing requires the caret package. So, 
when I use the readRDS method, the caret package is also loaded, which adds an 
overhead of ~2s every time.

The same problem exists with broadcast: broadcasting the model doesn't take 
time, but when the model is deserialized it loads the caret package, which adds 
the 2s overhead.

Ideally, the packages shouldn't be loaded again. Is there a way around this 
problem?


was (Author: litup):
R environment inside the thread for applying UDF is not getting cached. It is 
created and destroyed with each query.

{code:R}
kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = 
"10.117.172.48:9092", topic = "source")
lines<- select(kafka, cast(kafka$value, "string"))
schema<-schema(lines)
library(caret)

df4<-dapply(lines,function(x){
  print(system.time(library(caret)))
  x
},schema)

q2 <- write.stream(df4,"kafka", checkpointLocation = loc, topic = "sink", 
kafka.bootstrap.servers = "10.117.172.48:9092")
awaitTermination(q2)
{code}

For the above code, for every new stream my output is,
18/03/23 11:08:10 INFO BufferedStreamThread: Loading required package: lattice
18/03/23 11:08:10 INFO BufferedStreamThread: 
18/03/23 11:08:10 INFO BufferedStreamThread: Attaching package: ‘lattice’
18/03/23 11:08:10 INFO BufferedStreamThread: 
18/03/23 11:08:10 INFO BufferedStreamThread: The following object is masked 
from ‘package:SparkR’:
18/03/23 11:08:10 INFO BufferedStreamThread: 
18/03/23 11:08:10 INFO BufferedStreamThread: histogram
18/03/23 11:08:10 INFO BufferedStreamThread: 
18/03/23 11:08:10 INFO BufferedStreamThread: Loading required package: ggplot2
18/03/23 11:08:12 INFO BufferedStreamThread:user  system elapsed 
18/03/23 11:08:12 INFO BufferedStreamThread:   1.937   0.062   1.999 
18/03/23 11:08:12 INFO RRunner: Times: boot = 0.009 s, init = 0.017 s, 
broadcast = 0.001 s, read-input = 0.001 s, compute = 2.064 s, write-output = 
0.001 s, total = 2.093 s

PFA: rest log file.

For every new coming stream, the packages are loaded again inside the thread, 
which means R environment inside the thread is not getting reused, it is 
created and destroyed every time.

The model(iris model), on which I am testing requires caret package. So, when I 
use the readRDS method, caret package is also loaded, which adds an overhead of 
(~2s) every time. 

The same problem is with the broadcast. Broadcasting the model doesn't take 
time, but when it deserializes the model it loads caret package which adds 2s 
overhead.

Ideally, the packages shouldn't load again. Is there a way around to this 
problem?

> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects V

[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem

2018-03-23 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411396#comment-16411396
 ] 

Cody Koeninger commented on SPARK-23739:


What version of the org.apache.kafka artifact is in the assembly?

> Spark structured streaming long running problem
> ---
>
> Key: SPARK-23739
> URL: https://issues.apache.org/jira/browse/SPARK-23739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Florencio
>Priority: Critical
>  Labels: spark, streaming, structured
>
> I had a problem with long running spark structured streaming in spark 2.1. 
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.kafka.common.requests.LeaveGroupResponse.
> The detailed error is the following:
> 18/03/16 16:10:57 INFO StreamExecution: Committed offsets for batch 2110. 
> Metadata OffsetSeqMetadata(0,1521216656590)
> 18/03/16 16:10:57 INFO KafkaSource: GetBatch called with start = 
> Some(\{"TopicName":{"2":5520197,"1":5521045,"3":5522054,"0":5527915}}), end = 
> \{"TopicName":{"2":5522730,"1":5523577,"3":5524586,"0":5530441}}
> 18/03/16 16:10:57 INFO KafkaSource: Partitions added: Map()
> 18/03/16 16:10:57 ERROR StreamExecution: Query [id = 
> a233b9ff-cc39-44d3-b953-a255986c04bf, runId = 
> 8520e3c0-2455-4ac1-9021-8518fb58b3f8] terminated with error
> java.util.zip.ZipException: invalid code lengths set
>  at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>  at java.io.FilterInputStream.read(FilterInputStream.java:133)
>  at java.io.FilterInputStream.read(FilterInputStream.java:107)
>  at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354)
>  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>  at org.apache.spark.util.Utils$.copyStream(Utils.scala:362)
>  at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:45)
>  at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:83)
>  at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>  at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
>  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
>  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>  at org.apache.spark.rdd.RDD.map(RDD.scala:369)
>  at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:287)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:503)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:499)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>  at 
> org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
>  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> 18/03/16 16:10:57 ERROR ClientUtils: Failed to close coordinator
> java.lang.NoClassDefFoundError: 
> org/apache/kafka/common/requests/LeaveGroupResponse
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendLeaveGroupRequest(AbstractCoordinator.java:575)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.maybeLeaveGroup(AbstractCoordinator.java:566)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:555)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:377)
>  at org.apache.kafka.clients.ClientUtils.closeQuietly(ClientUtils.java:66)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1383)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1364)
>  at org.apache.spark.s

[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem

2018-03-23 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411377#comment-16411377
 ] 

Marco Gaido commented on SPARK-23739:
-

[~zsxwing] [~joseph.torres] [~c...@koeninger.org] I am not very familiar with 
Structured Streaming, but this seems like a bug to me. Do you have any idea? Thanks.

> Spark structured streaming long running problem
> ---
>
> Key: SPARK-23739
> URL: https://issues.apache.org/jira/browse/SPARK-23739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Florencio
>Priority: Critical
>  Labels: spark, streaming, structured
>
> I had a problem with long running spark structured streaming in spark 2.1. 
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.kafka.common.requests.LeaveGroupResponse.
> The detailed error is the following:
> 18/03/16 16:10:57 INFO StreamExecution: Committed offsets for batch 2110. 
> Metadata OffsetSeqMetadata(0,1521216656590)
> 18/03/16 16:10:57 INFO KafkaSource: GetBatch called with start = 
> Some(\{"TopicName":{"2":5520197,"1":5521045,"3":5522054,"0":5527915}}), end = 
> \{"TopicName":{"2":5522730,"1":5523577,"3":5524586,"0":5530441}}
> 18/03/16 16:10:57 INFO KafkaSource: Partitions added: Map()
> 18/03/16 16:10:57 ERROR StreamExecution: Query [id = 
> a233b9ff-cc39-44d3-b953-a255986c04bf, runId = 
> 8520e3c0-2455-4ac1-9021-8518fb58b3f8] terminated with error
> java.util.zip.ZipException: invalid code lengths set
>  at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>  at java.io.FilterInputStream.read(FilterInputStream.java:133)
>  at java.io.FilterInputStream.read(FilterInputStream.java:107)
>  at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354)
>  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>  at org.apache.spark.util.Utils$.copyStream(Utils.scala:362)
>  at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:45)
>  at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:83)
>  at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>  at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
>  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
>  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>  at org.apache.spark.rdd.RDD.map(RDD.scala:369)
>  at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:287)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:503)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:499)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>  at 
> org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
>  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> 18/03/16 16:10:57 ERROR ClientUtils: Failed to close coordinator
> java.lang.NoClassDefFoundError: 
> org/apache/kafka/common/requests/LeaveGroupResponse
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendLeaveGroupRequest(AbstractCoordinator.java:575)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.maybeLeaveGroup(AbstractCoordinator.java:566)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:555)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:377)
>  at org.apache.kafka.clients.ClientUtils.closeQuietly(ClientUtils.java:66)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1383)
>  at 
> org.apache.kafka

[jira] [Updated] (SPARK-23778) SparkContext.emptyRDD confuses SparkContext.union

2018-03-23 Thread Stefano Pettini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefano Pettini updated SPARK-23778:

Attachment: as_it_should_be.png

> SparkContext.emptyRDD confuses SparkContext.union
> -
>
> Key: SPARK-23778
> URL: https://issues.apache.org/jira/browse/SPARK-23778
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 2.3.0
>Reporter: Stefano Pettini
>Priority: Minor
> Attachments: as_it_should_be.png, 
> partitioner_lost_and_unneeded_extra_stage.png
>
>
> SparkContext.emptyRDD is an unpartitioned RDD. Clearly it's empty so whether 
> it's partitioned or not should be just a academic debate. Unfortunately it 
> doesn't seem to be like this and the issue has side effects.
> Namely, it confuses the RDD union.
> When there are N classic RDDs partitioned the same way, the union is 
> implemented with the optimized PartitionerAwareUnionRDD, that retains the 
> common partitioner in the result. If one of the N RDDs happens to be an 
> emptyRDD, as it doesn't have a partitioner, the union is implemented by just 
> appending all the partitions of the N RDDs, dropping the partitioner. But 
> there's no need for this, as the emptyRDD contains no elements. This results 
> in further unneeded shuffles once the result of the union is used.
> See for example:
> {{val p = new HashPartitioner(3)}}
> {{val a = context.parallelize(List(10, 20, 30, 40, 50)).keyBy(_ / 
> 10).partitionBy(p)}}
> {{val b1 = a.mapValues(_ + 1)}}
> {{val b2 = a.mapValues(_ - 1)}}
> {{val e = context.emptyRDD[(Int, Int)]}}
> {{val x = context.union(a, b1, b2, e)}}
> {{val y = x.reduceByKey(_ + _)}}
> {{assert(x.partitioner.contains(p))}}
> {{y.collect()}}
> The assert fails. Disabling it, it's possible to see that reduceByKey 
> introduced a shuffles, although all the input RDDs are already partitioned 
> the same way, but the emptyRDD.
> Forcing a partitioner on the emptyRDD:
> {{val e = context.emptyRDD[(Int, Int)].partitionBy(p)}}
> solves the problem with the assert and doesn't introduce the unneeded extra 
> stage and shuffle.
> Union implementation should be changed to ignore the partitioner of emptyRDDs 
> and consider those as _partitioned in a way compatible with any partitioner_, 
> basically ignoring them.
> Present since 1.3 at least.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23778) SparkContext.emptyRDD confuses SparkContext.union

2018-03-23 Thread Stefano Pettini (JIRA)
Stefano Pettini created SPARK-23778:
---

 Summary: SparkContext.emptyRDD confuses SparkContext.union
 Key: SPARK-23778
 URL: https://issues.apache.org/jira/browse/SPARK-23778
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0, 1.3.0
Reporter: Stefano Pettini
 Attachments: partitioner_lost_and_unneeded_extra_stage.png

SparkContext.emptyRDD is an unpartitioned RDD. Clearly it's empty, so whether 
it's partitioned or not should be just an academic debate. Unfortunately that 
doesn't seem to be the case, and the issue has side effects.

Namely, it confuses the RDD union.

When there are N classic RDDs partitioned the same way, the union is 
implemented with the optimized PartitionerAwareUnionRDD, which retains the 
common partitioner in the result. If one of the N RDDs happens to be an 
emptyRDD, then, as it doesn't have a partitioner, the union is implemented by 
just appending all the partitions of the N RDDs, dropping the partitioner. But 
there's no need for this, as the emptyRDD contains no elements. This results in 
further unneeded shuffles once the result of the union is used.

See for example:

{{val p = new HashPartitioner(3)}}
{{val a = context.parallelize(List(10, 20, 30, 40, 50)).keyBy(_ / 
10).partitionBy(p)}}
{{val b1 = a.mapValues(_ + 1)}}
{{val b2 = a.mapValues(_ - 1)}}
{{val e = context.emptyRDD[(Int, Int)]}}
{{val x = context.union(a, b1, b2, e)}}
{{val y = x.reduceByKey(_ + _)}}
{{assert(x.partitioner.contains(p))}}
{{y.collect()}}

The assert fails. Disabling it, it's possible to see that reduceByKey 
introduced a shuffle, although all the input RDDs except the emptyRDD are 
already partitioned the same way.

Forcing a partitioner on the emptyRDD:

{{val e = context.emptyRDD[(Int, Int)].partitionBy(p)}}

solves the problem with the assert and doesn't introduce the unneeded extra 
stage and shuffle.

Union implementation should be changed to ignore the partitioner of emptyRDDs 
and consider those as _partitioned in a way compatible with any partitioner_, 
basically ignoring them.

Present since 1.3 at least.
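
For convenience, here is a self-contained version of the snippet above (my own 
packaging of it, assuming a local SparkContext; the object name is arbitrary):

{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object EmptyRddUnionDemo {
  def main(args: Array[String]): Unit = {
    val context = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("EmptyRddUnionDemo"))

    val p = new HashPartitioner(3)
    val a = context.parallelize(List(10, 20, 30, 40, 50)).keyBy(_ / 10).partitionBy(p)
    val b1 = a.mapValues(_ + 1)
    val b2 = a.mapValues(_ - 1)

    val e = context.emptyRDD[(Int, Int)]                    // no partitioner: the union drops p
    // val e = context.emptyRDD[(Int, Int)].partitionBy(p)  // workaround: the union keeps p

    val x = context.union(a, b1, b2, e)
    println(s"union partitioner: ${x.partitioner}")         // None without the workaround
    x.reduceByKey(_ + _).collect()                          // extra shuffle stage without the workaround

    context.stop()
  }
}
{code}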



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23778) SparkContext.emptyRDD confuses SparkContext.union

2018-03-23 Thread Stefano Pettini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefano Pettini updated SPARK-23778:

Attachment: partitioner_lost_and_unneeded_extra_stage.png

> SparkContext.emptyRDD confuses SparkContext.union
> -
>
> Key: SPARK-23778
> URL: https://issues.apache.org/jira/browse/SPARK-23778
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 2.3.0
>Reporter: Stefano Pettini
>Priority: Minor
> Attachments: partitioner_lost_and_unneeded_extra_stage.png
>
>
> SparkContext.emptyRDD is an unpartitioned RDD. Clearly it's empty so whether 
> it's partitioned or not should be just a academic debate. Unfortunately it 
> doesn't seem to be like this and the issue has side effects.
> Namely, it confuses the RDD union.
> When there are N classic RDDs partitioned the same way, the union is 
> implemented with the optimized PartitionerAwareUnionRDD, that retains the 
> common partitioner in the result. If one of the N RDDs happens to be an 
> emptyRDD, as it doesn't have a partitioner, the union is implemented by just 
> appending all the partitions of the N RDDs, dropping the partitioner. But 
> there's no need for this, as the emptyRDD contains no elements. This results 
> in further unneeded shuffles once the result of the union is used.
> See for example:
> {{val p = new HashPartitioner(3)}}
> {{val a = context.parallelize(List(10, 20, 30, 40, 50)).keyBy(_ / 
> 10).partitionBy(p)}}
> {{val b1 = a.mapValues(_ + 1)}}
> {{val b2 = a.mapValues(_ - 1)}}
> {{val e = context.emptyRDD[(Int, Int)]}}
> {{val x = context.union(a, b1, b2, e)}}
> {{val y = x.reduceByKey(_ + _)}}
> {{assert(x.partitioner.contains(p))}}
> {{y.collect()}}
> The assert fails. Disabling it, it's possible to see that reduceByKey 
> introduced a shuffles, although all the input RDDs are already partitioned 
> the same way, but the emptyRDD.
> Forcing a partitioner on the emptyRDD:
> {{val e = context.emptyRDD[(Int, Int)].partitionBy(p)}}
> solves the problem with the assert and doesn't introduce the unneeded extra 
> stage and shuffle.
> Union implementation should be changed to ignore the partitioner of emptyRDDs 
> and consider those as _partitioned in a way compatible with any partitioner_, 
> basically ignoring them.
> Present since 1.3 at least.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23769) Remove unnecessary scalastyle check disabling

2018-03-23 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23769.
--
   Resolution: Fixed
 Assignee: Riaas Mokiem
Fix Version/s: 2.4.0
   2.3.1

Fixed in https://github.com/apache/spark/pull/20880

> Remove unnecessary scalastyle check disabling
> -
>
> Key: SPARK-23769
> URL: https://issues.apache.org/jira/browse/SPARK-23769
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Riaas Mokiem
>Assignee: Riaas Mokiem
>Priority: Minor
> Fix For: 2.3.1, 2.4.0
>
>
> In `org/apache/spark/util/CompletionIterator.scala` the Scalastyle checker is 
> disabled for 1 line of code. However, this line of code doesn't seem to 
> violate any of the currently active rules for Scalastyle. So the Scalastyle 
> checker doesn't need to be disabled there. 
> I've tested this by removing the comments that disable the checker and 
> running the checker with `build/mvn scalastyle:check`. With the comments 
> removed (so with the checker active for that line) the build still succeeds 
> and no violations are shown in `core/target/scalastyle-output.xml`. I'll 
> create a pull request to remove the comments that disable the checker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23777) Missing DAG arrows between stages

2018-03-23 Thread Stefano Pettini (JIRA)
Stefano Pettini created SPARK-23777:
---

 Summary: Missing DAG arrows between stages
 Key: SPARK-23777
 URL: https://issues.apache.org/jira/browse/SPARK-23777
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.0, 1.3.0
Reporter: Stefano Pettini
 Attachments: Screenshot-2018-3-23 RDDTestApp - Details for Job 0.png

In the Spark UI DAGs, sometimes there are missing arrows between stages. It 
seems to happen when the same RDD is shuffled twice.

For example in this case:

{{val a = context.parallelize(List(10, 20, 30, 40, 50)).keyBy(_ / 10)}}
{{val b = a join a}}
{{b.collect()}}

There's a missing arrow from stage 1 to 2.

_This is an old one, since 1.3.0 at least, still reproducible in 2.3.0._



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23777) Missing DAG arrows between stages

2018-03-23 Thread Stefano Pettini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefano Pettini updated SPARK-23777:

Attachment: Screenshot-2018-3-23 RDDTestApp - Details for Job 0.png

> Missing DAG arrows between stages
> -
>
> Key: SPARK-23777
> URL: https://issues.apache.org/jira/browse/SPARK-23777
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0, 2.3.0
>Reporter: Stefano Pettini
>Priority: Trivial
> Attachments: Screenshot-2018-3-23 RDDTestApp - Details for Job 0.png
>
>
> In the Spark UI DAGs, sometimes there are missing arrows between stages. It 
> seems to happen when the same RDD is shuffled twice.
> For example in this case:
> {{val a = context.parallelize(List(10, 20, 30, 40, 50)).keyBy(_ / 10)}}
> {{val b = a join a}}
> {{b.collect()}}
> There's a missing arrow from stage 1 to 2.
> _This is an old one, since 1.3.0 at least, still reproducible in 2.3.0._



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23650) Slow SparkR udf (dapply)

2018-03-23 Thread Deepansh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepansh updated SPARK-23650:
-
Attachment: packageReload.txt

> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Deepansh
>Priority: Major
> Attachments: packageReload.txt, read_model_in_udf.txt, 
> sparkR_log2.txt, sparkRlag.txt
>
>
> For eg, I am getting streams from Kafka and I want to implement a model made 
> in R for those streams. For this, I am using dapply.
> My code is:
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = 
> "localhost:9092", topic = "source")
> lines<- select(kafka, cast(kafka$value, "string"))
> schema<-schema(lines)
> df1 <- dapply(lines, function(x) {
>   i_model <- SparkR:::value(randomBr)
>   for (row in 1:nrow(x)) {
>     y <- fromJSON(as.character(x[row, "value"]))
>     y$predict <- predict(i_model, y)
>     y <- toJSON(y)
>     x[row, "value"] <- y
>   }
>   x
> }, schema)
> Every time when Kafka streams are fetched the dapply method creates new 
> runner thread and ships the variables again, which causes a huge lag(~2s for 
> shipping model) every time. I even tried without broadcast variables but it 
> takes same time to ship variables. Can some other techniques be applied to 
> improve its performance?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23650) Slow SparkR udf (dapply)

2018-03-23 Thread Deepansh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411246#comment-16411246
 ] 

Deepansh commented on SPARK-23650:
--

R environment inside the thread for applying UDF is not getting cached. It is 
created and destroyed with each query.

{code:R}
kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = 
"10.117.172.48:9092", topic = "source")
lines<- select(kafka, cast(kafka$value, "string"))
schema<-schema(lines)
library(caret)

df4<-dapply(lines,function(x){
  print(system.time(library(caret)))
  x
},schema)

q2 <- write.stream(df4,"kafka", checkpointLocation = loc, topic = "sink", 
kafka.bootstrap.servers = "10.117.172.48:9092")
awaitTermination(q2)
{code}

For the above code, for every new stream my output is,
18/03/23 11:08:10 INFO BufferedStreamThread: Loading required package: lattice
18/03/23 11:08:10 INFO BufferedStreamThread: 
18/03/23 11:08:10 INFO BufferedStreamThread: Attaching package: ‘lattice’
18/03/23 11:08:10 INFO BufferedStreamThread: 
18/03/23 11:08:10 INFO BufferedStreamThread: The following object is masked 
from ‘package:SparkR’:
18/03/23 11:08:10 INFO BufferedStreamThread: 
18/03/23 11:08:10 INFO BufferedStreamThread: histogram
18/03/23 11:08:10 INFO BufferedStreamThread: 
18/03/23 11:08:10 INFO BufferedStreamThread: Loading required package: ggplot2
18/03/23 11:08:12 INFO BufferedStreamThread:user  system elapsed 
18/03/23 11:08:12 INFO BufferedStreamThread:   1.937   0.062   1.999 
18/03/23 11:08:12 INFO RRunner: Times: boot = 0.009 s, init = 0.017 s, 
broadcast = 0.001 s, read-input = 0.001 s, compute = 2.064 s, write-output = 
0.001 s, total = 2.093 s

PFA: rest log file.

For every new coming stream, the packages are loaded again inside the thread, 
which means R environment inside the thread is not getting reused, it is 
created and destroyed every time.

The model(iris model), on which I am testing requires caret package. So, when I 
use the readRDS method, caret package is also loaded, which adds an overhead of 
(~2s) every time. 

The same problem is with the broadcast. Broadcasting the model doesn't take 
time, but when it deserializes the model it loads caret package which adds 2s 
overhead.

Ideally, the packages shouldn't load again. Is there a way around to this 
problem?

> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Deepansh
>Priority: Major
> Attachments: read_model_in_udf.txt, sparkR_log2.txt, sparkRlag.txt
>
>
> For eg, I am getting streams from Kafka and I want to implement a model made 
> in R for those streams. For this, I am using dapply.
> My code is:
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = 
> "localhost:9092", topic = "source")
> lines<- select(kafka, cast(kafka$value, "string"))
> schema<-schema(lines)
> df1 <- dapply(lines, function(x) {
>   i_model <- SparkR:::value(randomBr)
>   for (row in 1:nrow(x)) {
>     y <- fromJSON(as.character(x[row, "value"]))
>     y$predict <- predict(i_model, y)
>     y <- toJSON(y)
>     x[row, "value"] <- y
>   }
>   x
> }, schema)
> Every time when Kafka streams are fetched the dapply method creates new 
> runner thread and ships the variables again, which causes a huge lag(~2s for 
> shipping model) every time. I even tried without broadcast variables but it 
> takes same time to ship variables. Can some other techniques be applied to 
> improve its performance?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2018-03-23 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410939#comment-16410939
 ] 

Gabor Somogyi commented on SPARK-23685:
---

Jira assignment is not required. You can write a comment when you're working on 
a jira but your PR is not ready yet. A PR also makes it clear that you're 
dealing with it.

> Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive 
> Offsets (i.e. Log Compaction)
> -
>
> Key: SPARK-23685
> URL: https://issues.apache.org/jira/browse/SPARK-23685
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: sirisha
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always 
> be just an increment of 1. If not, it throws the below exception:
>  
> "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). 
> Some data may have been lost because they are not available in Kafka any 
> more; either the data was aged out by Kafka or the topic may have been 
> deleted before all the data in the topic was processed. If you don't want 
> your streaming query to fail on such cases, set the source option 
> "failOnDataLoss" to "false". "
>  
> FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23734) InvalidSchemaException While Saving ALSModel

2018-03-23 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410914#comment-16410914
 ] 

Liang-Chi Hsieh commented on SPARK-23734:
-

I use the latest master branch and can't reproduce the reported issue.

> InvalidSchemaException While Saving ALSModel
> 
>
> Key: SPARK-23734
> URL: https://issues.apache.org/jira/browse/SPARK-23734
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
> Environment: macOS 10.13.2
> Scala 2.11.8
> Spark 2.3.0
>Reporter: Stanley Poon
>Priority: Major
>  Labels: ALS, parquet, persistence
>
> After fitting an ALSModel, I get the following error while saving the model:
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can 
> not be empty. Parquet does not support empty group without leaves. Empty 
> group: spark_schema
> Exactly the same code ran ok on 2.2.1.
> Same issue also occurs on other ALSModels we have.
> h2. *To reproduce*
> Get ALSExample: 
> [https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala]
>  and add the following line to save the model right before "spark.stop".
> {quote}   model.write.overwrite().save("SparkExampleALSModel") 
> {quote}
> h2. Stack Trace
> Exception in thread "main" java.lang.ExceptionInInitializerError
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:112)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
> at 
> org.apache.spark.ml.recommendation.ALSModel$ALSModelWriter.saveImpl(ALS.scala:510)
> at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:103)
> at com.vitalmove.model.ALSExample$.main(ALSExample.scala:83)
> at com.vitalmove.model.ALSExample.main(ALSExample.scala)
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can 
> not be empty. Parquet does not support empty group without leaves. Empty 
> group: spark_schema
> at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
> at org.apache.parquet.schema.GroupType.<init>(GroupType.java:48)
> at org.apache.parquet.schema.MessageType.<init>(MessageType.java:50)
> at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.<init>(ParquetSchemaConverter.scala:567)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.<clinit>(ParquetSchemaConverter.scala)
>  



--
This