[jira] [Comment Edited] (SPARK-23780) Failed to use googleVis library with new SparkR
[ https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412458#comment-16412458 ] Felix Cheung edited comment on SPARK-23780 at 3/24/18 6:53 AM: --- here [https://github.com/mages/googleVis/blob/master/R/zzz.R#L39] or here [https://github.com/jeroen/jsonlite/blob/master/R/toJSON.R#L2] was (Author: felixcheung): here [https://github.com/mages/googleVis/blob/master/R/zzz.R#L39] > Failed to use googleVis library with new SparkR > --- > > Key: SPARK-23780 > URL: https://issues.apache.org/jira/browse/SPARK-23780 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1 >Reporter: Ivan Dzikovsky >Priority: Major > > I've tried to use googleVis library with Spark 2.2.1, and faced with problem. > Steps to reproduce: > # Install R with googleVis library. > # Run SparkR: > {code} > sparkR --master yarn --deploy-mode client > {code} > # Run code that uses googleVis: > {code} > library(googleVis) > df=data.frame(country=c("US", "GB", "BR"), > val1=c(10,13,14), > val2=c(23,12,32)) > Bar <- gvisBarChart(df) > cat("%html ", Bar$html$chart) > {code} > Than I got following error message: > {code} > Error : .onLoad failed in loadNamespace() for 'googleVis', details: > call: rematchDefinition(definition, fdef, mnames, fnames, signature) > error: methods can add arguments to the generic 'toJSON' only if '...' is > an argument to the generic > Error : package or namespace load failed for 'googleVis' > {code} > But expected result is to get some HTML code output, as it was with Spark > 2.1.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR
[ https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412458#comment-16412458 ] Felix Cheung commented on SPARK-23780: -- here [https://github.com/mages/googleVis/blob/master/R/zzz.R#L39] > Failed to use googleVis library with new SparkR > --- > > Key: SPARK-23780 > URL: https://issues.apache.org/jira/browse/SPARK-23780 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1 >Reporter: Ivan Dzikovsky >Priority: Major > > I've tried to use googleVis library with Spark 2.2.1, and faced with problem. > Steps to reproduce: > # Install R with googleVis library. > # Run SparkR: > {code} > sparkR --master yarn --deploy-mode client > {code} > # Run code that uses googleVis: > {code} > library(googleVis) > df=data.frame(country=c("US", "GB", "BR"), > val1=c(10,13,14), > val2=c(23,12,32)) > Bar <- gvisBarChart(df) > cat("%html ", Bar$html$chart) > {code} > Than I got following error message: > {code} > Error : .onLoad failed in loadNamespace() for 'googleVis', details: > call: rematchDefinition(definition, fdef, mnames, fnames, signature) > error: methods can add arguments to the generic 'toJSON' only if '...' is > an argument to the generic > Error : package or namespace load failed for 'googleVis' > {code} > But expected result is to get some HTML code output, as it was with Spark > 2.1.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR
[ https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412457#comment-16412457 ] Felix Cheung commented on SPARK-23780: -- hmm, I think the cause of this is the incompatibility of the method signature of toJSON > Failed to use googleVis library with new SparkR > --- > > Key: SPARK-23780 > URL: https://issues.apache.org/jira/browse/SPARK-23780 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1 >Reporter: Ivan Dzikovsky >Priority: Major > > I've tried to use googleVis library with Spark 2.2.1, and faced with problem. > Steps to reproduce: > # Install R with googleVis library. > # Run SparkR: > {code} > sparkR --master yarn --deploy-mode client > {code} > # Run code that uses googleVis: > {code} > library(googleVis) > df=data.frame(country=c("US", "GB", "BR"), > val1=c(10,13,14), > val2=c(23,12,32)) > Bar <- gvisBarChart(df) > cat("%html ", Bar$html$chart) > {code} > Than I got following error message: > {code} > Error : .onLoad failed in loadNamespace() for 'googleVis', details: > call: rematchDefinition(definition, fdef, mnames, fnames, signature) > error: methods can add arguments to the generic 'toJSON' only if '...' is > an argument to the generic > Error : package or namespace load failed for 'googleVis' > {code} > But expected result is to get some HTML code output, as it was with Spark > 2.1.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23716) Change SHA512 style in release artifacts to play nicely with shasum utility
[ https://issues.apache.org/jira/browse/SPARK-23716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412423#comment-16412423 ] Nicholas Chammas edited comment on SPARK-23716 at 3/24/18 5:13 AM: --- For my use case, there is no value in updating the Spark release code if we're not going to also update the release hashes for all prior releases, which it sounds like we don't want to do. I wrote my own code to convert the GPG-style hashes to shasum style hashes, and that satisfies [my need|https://github.com/nchammas/flintrock/issues/238], which is focused on syncing Spark releases from the Apache archive to an S3 bucket. Closing this as "Won't Fix". was (Author: nchammas): For my use case, there is no value in updating the Spark release code if we're not going to also update the release hashes for all prior releases, which it sounds like we don't want to do. I wrote my own code to convert the GPG-style hashes to shasum style hashes, and that satisfies my need. I am syncing Spark releases from the Apache distribution archive to a personal S3 bucket and need a way to verify the integrity of the files. > Change SHA512 style in release artifacts to play nicely with shasum utility > --- > > Key: SPARK-23716 > URL: https://issues.apache.org/jira/browse/SPARK-23716 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.3.0 >Reporter: Nicholas Chammas >Priority: Minor > > As [discussed > here|http://apache-spark-developers-list.1001551.n3.nabble.com/Changing-how-we-compute-release-hashes-td23599.html]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23716) Change SHA512 style in release artifacts to play nicely with shasum utility
[ https://issues.apache.org/jira/browse/SPARK-23716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-23716. -- Resolution: Won't Fix For my use case, there is no value in updating the Spark release code if we're not going to also update the release hashes for all prior releases, which it sounds like we don't want to do. I wrote my own code to convert the GPG-style hashes to shasum style hashes, and that satisfies my need. I am syncing Spark releases from the Apache distribution archive to a personal S3 bucket and need a way to verify the integrity of the files. > Change SHA512 style in release artifacts to play nicely with shasum utility > --- > > Key: SPARK-23716 > URL: https://issues.apache.org/jira/browse/SPARK-23716 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.3.0 >Reporter: Nicholas Chammas >Priority: Minor > > As [discussed > here|http://apache-spark-developers-list.1001551.n3.nabble.com/Changing-how-we-compute-release-hashes-td23599.html]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
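The conversion described above is straightforward to script. The sketch below is not the author's actual code (that lives in the linked Flintrock issue); it is a minimal illustration of turning a GPG-style {{file: HEX HEX ...}} SHA-512 block into the {{hash  file}} form that {{shasum -a 512 -c}} accepts. The digest in the example is made up and shortened for readability.

{code:scala}
// Minimal sketch of the GPG-style -> shasum-style conversion described above.
// Not the author's actual code; the digest in the example is abbreviated for readability.
object GpgToShasum {
  def convert(gpgStyle: String): String = {
    // GPG-style output: "<file name>: <uppercase hex groups, possibly over several lines>"
    val Array(fileName, digestPart) = gpgStyle.split(":", 2)
    val digest = digestPart.replaceAll("\\s", "").toLowerCase
    // shasum-style line: "<lowercase hex><two spaces><file name>"
    s"$digest  ${fileName.trim}"
  }

  def main(args: Array[String]): Unit = {
    val gpgStyle =
      """spark-2.3.0-bin-hadoop2.7.tgz: 258683D6 08480FEA
        |1234ABCD 5678EF90""".stripMargin
    println(convert(gpgStyle))
    // 258683d608480fea1234abcd5678ef90  spark-2.3.0-bin-hadoop2.7.tgz
  }
}
{code}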
[jira] [Comment Edited] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412379#comment-16412379 ] MIN-FU YANG edited comment on SPARK-22876 at 3/24/18 3:00 AM: -- I found that current Yarn implementation doesn't expose the number of failed app attempts in their API, maybe this feature should be pending until they expose this number. was (Author: lucasmf): I found that current Yarn implementation doesn't expose number of failed app attempts number in their API, maybe this feature should be pending until they expose this number. > spark.yarn.am.attemptFailuresValidityInterval does not work correctly > - > > Key: SPARK-22876 > URL: https://issues.apache.org/jira/browse/SPARK-22876 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 > Environment: hadoop version 2.7.3 >Reporter: Jinhan Zhong >Priority: Minor > > I assume we can use spark.yarn.maxAppAttempts together with > spark.yarn.am.attemptFailuresValidityInterval to make a long running > application avoid stopping after acceptable number of failures. > But after testing, I found that the application always stops after failing n > times ( n is minimum value of spark.yarn.maxAppAttempts and > yarn.resourcemanager.am.max-attempts from client yarn-site.xml) > for example, following setup will allow the application master to fail 20 > times. > * spark.yarn.am.attemptFailuresValidityInterval=1s > * spark.yarn.maxAppAttempts=20 > * yarn client: yarn.resourcemanager.am.max-attempts=20 > * yarn resource manager: yarn.resourcemanager.am.max-attempts=3 > And after checking the source code, I found in source file > ApplicationMaster.scala > https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293 > there's a ShutdownHook that checks the attempt id against the maxAppAttempts, > if attempt id >= maxAppAttempts, it will try to unregister the application > and the application will finish. > is this a expected design or a bug? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412379#comment-16412379 ] MIN-FU YANG commented on SPARK-22876: - I found that current Yarn implementation doesn't expose number of failed app attempts number in their API, maybe this feature should be pending until they expose this number. > spark.yarn.am.attemptFailuresValidityInterval does not work correctly > - > > Key: SPARK-22876 > URL: https://issues.apache.org/jira/browse/SPARK-22876 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 > Environment: hadoop version 2.7.3 >Reporter: Jinhan Zhong >Priority: Minor > > I assume we can use spark.yarn.maxAppAttempts together with > spark.yarn.am.attemptFailuresValidityInterval to make a long running > application avoid stopping after acceptable number of failures. > But after testing, I found that the application always stops after failing n > times ( n is minimum value of spark.yarn.maxAppAttempts and > yarn.resourcemanager.am.max-attempts from client yarn-site.xml) > for example, following setup will allow the application master to fail 20 > times. > * spark.yarn.am.attemptFailuresValidityInterval=1s > * spark.yarn.maxAppAttempts=20 > * yarn client: yarn.resourcemanager.am.max-attempts=20 > * yarn resource manager: yarn.resourcemanager.am.max-attempts=3 > And after checking the source code, I found in source file > ApplicationMaster.scala > https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293 > there's a ShutdownHook that checks the attempt id against the maxAppAttempts, > if attempt id >= maxAppAttempts, it will try to unregister the application > and the application will finish. > is this a expected design or a bug? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
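To make the reported behaviour easier to follow: the check described in the issue compares the raw attempt id against the effective maximum number of attempts, and the validity interval never enters that decision. The sketch below is a hypothetical, self-contained restatement of that logic; all names are assumptions for illustration, not the actual ApplicationMaster code.

{code:scala}
// Hypothetical restatement of the shutdown-hook check described in the issue.
// Names are assumptions; see ApplicationMaster.scala for the real code.
object AttemptCheckSketch {
  def shouldUnregister(attemptId: Int,
                       sparkMaxAppAttempts: Option[Int],
                       yarnMaxAttempts: Int): Boolean = {
    // Effective limit: minimum of spark.yarn.maxAppAttempts (if set) and
    // yarn.resourcemanager.am.max-attempts.
    val maxAppAttempts = sparkMaxAppAttempts.fold(yarnMaxAttempts)(math.min(_, yarnMaxAttempts))
    // spark.yarn.am.attemptFailuresValidityInterval is not consulted here, which is
    // why old failures still count towards the limit in the reported scenario.
    attemptId >= maxAppAttempts
  }

  def main(args: Array[String]): Unit = {
    // The resource manager caps attempts at 3, so the 3rd attempt unregisters even
    // though the client asked for 20 attempts and a 1s validity interval.
    println(shouldUnregister(attemptId = 3, sparkMaxAppAttempts = Some(20), yarnMaxAttempts = 3)) // true
  }
}
{code}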
[jira] [Commented] (SPARK-14352) approxQuantile should support multi columns
[ https://issues.apache.org/jira/browse/SPARK-14352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412370#comment-16412370 ] Walt Elder commented on SPARK-14352: Seems like this should be marked closed as of 2.2, right? > approxQuantile should support multi columns > --- > > Key: SPARK-14352 > URL: https://issues.apache.org/jira/browse/SPARK-14352 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: zhengruifeng >Priority: Major > > It will be convenient and efficient to calculate quantiles of multi-columns > with approxQuantile. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state
[ https://issues.apache.org/jira/browse/SPARK-23785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412361#comment-16412361 ] Sahil Takiar commented on SPARK-23785: -- Updated the PR. {quote} in LauncherBackend.BackendConnection, set "isDisconnected" before calling super.close() {quote} Done {quote} The race you describe also exists; it's not what the exception in the Hive bug shows, though. {quote} I tried to write a test to replicate this issue, but it seems it's already handled in {{LauncherConnection}}: if the {{SparkAppHandle}} calls {{disconnect}}, the client connection automatically gets closed because it gets an {{EOFException}}, which triggers {{close()}} - the logic is in {{LauncherConnection#run}}. So I guess it's already handled? I added my test case to the PR; it should be useful since it covers what happens when {{SparkAppHandle#disconnect}} is called. > LauncherBackend doesn't check state of connection before setting state > -- > > Key: SPARK-23785 > URL: https://issues.apache.org/jira/browse/SPARK-23785 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Sahil Takiar >Priority: Major > > Found in HIVE-18533 while trying to integration with the > {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its > connection to the {{LauncherServer}} before trying to run {{setState}} - > which sends a {{SetState}} message on the connection. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23788) Race condition in StreamingQuerySuite
[ https://issues.apache.org/jira/browse/SPARK-23788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412360#comment-16412360 ] Apache Spark commented on SPARK-23788: -- User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/20896 > Race condition in StreamingQuerySuite > - > > Key: SPARK-23788 > URL: https://issues.apache.org/jira/browse/SPARK-23788 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Minor > > The serializability test uses the same MemoryStream instance for 3 different > queries. If any of those queries ask it to commit before the others have run, > the rest will see empty dataframes. This can fail the test if q3 is affected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23788) Race condition in StreamingQuerySuite
[ https://issues.apache.org/jira/browse/SPARK-23788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23788: Assignee: (was: Apache Spark) > Race condition in StreamingQuerySuite > - > > Key: SPARK-23788 > URL: https://issues.apache.org/jira/browse/SPARK-23788 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Minor > > The serializability test uses the same MemoryStream instance for 3 different > queries. If any of those queries ask it to commit before the others have run, > the rest will see empty dataframes. This can fail the test if q3 is affected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23788) Race condition in StreamingQuerySuite
[ https://issues.apache.org/jira/browse/SPARK-23788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23788: Assignee: Apache Spark > Race condition in StreamingQuerySuite > - > > Key: SPARK-23788 > URL: https://issues.apache.org/jira/browse/SPARK-23788 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Apache Spark >Priority: Minor > > The serializability test uses the same MemoryStream instance for 3 different > queries. If any of those queries ask it to commit before the others have run, > the rest will see empty dataframes. This can fail the test if q3 is affected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23788) Race condition in StreamingQuerySuite
Jose Torres created SPARK-23788: --- Summary: Race condition in StreamingQuerySuite Key: SPARK-23788 URL: https://issues.apache.org/jira/browse/SPARK-23788 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Jose Torres The serializability test uses the same MemoryStream instance for 3 different queries. If any of those queries ask it to commit before the others have run, the rest will see empty dataframes. This can fail the test if q3 is affected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
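A sketch of the direction a fix could take, assuming the problem is exactly as described: give each query its own {{MemoryStream}} so a commit triggered by one query cannot leave the others reading an already-consumed source. This is a spark-shell-style illustration with assumed names, not the actual StreamingQuerySuite code.

{code:scala}
// Spark-shell-style sketch (assumed names, not the actual StreamingQuerySuite code):
// one MemoryStream per query avoids the shared-commit race described above.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder().master("local[2]").appName("memory-stream-sketch").getOrCreate()
import spark.implicits._
implicit val sqlContext: org.apache.spark.sql.SQLContext = spark.sqlContext

def startQuery(name: String) = {
  val input = MemoryStream[Int]              // private source for this query only
  input.addData(1, 2, 3)
  val query = input.toDS().writeStream
    .format("memory")
    .queryName(name)
    .start()
  (input, query)
}

val (_, q1) = startQuery("q1")
val (_, q2) = startQuery("q2")
q1.processAllAvailable()                      // committing q1 cannot starve q2
q2.processAllAvailable()
spark.table("q2").show()
{code}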
[jira] [Commented] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412228#comment-16412228 ] MIN-FU YANG commented on SPARK-22876: - Hi, I also encounter this problem. Please assign it to me, I can fix this issue. > spark.yarn.am.attemptFailuresValidityInterval does not work correctly > - > > Key: SPARK-22876 > URL: https://issues.apache.org/jira/browse/SPARK-22876 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 > Environment: hadoop version 2.7.3 >Reporter: Jinhan Zhong >Priority: Minor > > I assume we can use spark.yarn.maxAppAttempts together with > spark.yarn.am.attemptFailuresValidityInterval to make a long running > application avoid stopping after acceptable number of failures. > But after testing, I found that the application always stops after failing n > times ( n is minimum value of spark.yarn.maxAppAttempts and > yarn.resourcemanager.am.max-attempts from client yarn-site.xml) > for example, following setup will allow the application master to fail 20 > times. > * spark.yarn.am.attemptFailuresValidityInterval=1s > * spark.yarn.maxAppAttempts=20 > * yarn client: yarn.resourcemanager.am.max-attempts=20 > * yarn resource manager: yarn.resourcemanager.am.max-attempts=3 > And after checking the source code, I found in source file > ApplicationMaster.scala > https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293 > there's a ShutdownHook that checks the attempt id against the maxAppAttempts, > if attempt id >= maxAppAttempts, it will try to unregister the application > and the application will finish. > is this a expected design or a bug? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23615) Add maxDF Parameter to Python CountVectorizer
[ https://issues.apache.org/jira/browse/SPARK-23615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-23615. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20777 [https://github.com/apache/spark/pull/20777] > Add maxDF Parameter to Python CountVectorizer > - > > Key: SPARK-23615 > URL: https://issues.apache.org/jira/browse/SPARK-23615 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Assignee: Huaxin Gao >Priority: Minor > Fix For: 2.4.0 > > > The maxDF parameter is for filtering out frequently occurring terms. This > param was recently added to the Scala CountVectorizer and needs to be added > to Python also. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23615) Add maxDF Parameter to Python CountVectorizer
[ https://issues.apache.org/jira/browse/SPARK-23615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-23615: Assignee: Huaxin Gao > Add maxDF Parameter to Python CountVectorizer > - > > Key: SPARK-23615 > URL: https://issues.apache.org/jira/browse/SPARK-23615 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Assignee: Huaxin Gao >Priority: Minor > Fix For: 2.4.0 > > > The maxDF parameter is for filtering out frequently occurring terms. This > param was recently added to the Scala CountVectorizer and needs to be added > to Python also. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
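For reference, this is what the corresponding Scala-side parameter (which the resolved ticket mirrors in Python) looks like in use. The example assumes a running SparkSession named {{spark}}, e.g. in spark-shell on 2.4.0; the data is made up.

{code:scala}
// Scala CountVectorizer with maxDF, the parameter this ticket adds to PySpark.
// Assumes a running SparkSession named `spark`; data is illustrative only.
import org.apache.spark.ml.feature.CountVectorizer

val docs = spark.createDataFrame(Seq(
  (0, Array("spark", "jira", "bug")),
  (1, Array("spark", "ml", "feature")),
  (2, Array("spark", "python", "api"))
)).toDF("id", "words")

val cv = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setMaxDF(2)   // drop terms that appear in more than 2 documents

val model = cv.fit(docs)
model.vocabulary.foreach(println)   // "spark" appears in all 3 documents, so it is filtered out
{code}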
[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8
[ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412218#comment-16412218 ] Nicholas Chammas commented on SPARK-22513: -- Fair enough. Just as an alternate confirmation, [~ste...@apache.org] can you comment on whether there might be any issues running Spark built against Hadoop 2.7 with, say, HDFS 2.8? > Provide build profile for hadoop 2.8 > > > Key: SPARK-22513 > URL: https://issues.apache.org/jira/browse/SPARK-22513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Christine Koppelt >Priority: Major > > hadoop 2.8 comes with a patch which is necessary to make it run on NixOS [1]. > Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8. > [1] > https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23787) SparkSubmitSuite::"download remote resource if it is not supported by yarn" fails on Hadoop 2.9
[ https://issues.apache.org/jira/browse/SPARK-23787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23787: Assignee: Apache Spark > SparkSubmitSuite::"download remote resource if it is not supported by yarn" > fails on Hadoop 2.9 > --- > > Key: SPARK-23787 > URL: https://issues.apache.org/jira/browse/SPARK-23787 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > {noformat} > [info] - download list of files to local (10 milliseconds) > [info] > "http:///work/apache/spark/target/tmp/spark-a17bc160-641b-41e1-95be-a2e31b175e09/testJar3393247632492201277.jar"; > did not start with substring "file:" (SparkSubmitSuite.scala:1022) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > [info] at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > [info] at > org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > [info] at > org.apache.spark.deploy.SparkSubmitSuite.org$apache$spark$deploy$SparkSubmitSuite$$testRemoteResources(SparkSubmitSuite.scala:1022) > [info] at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply$mcV$sp(SparkSubmitSuite.scala:962) > [info] at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962) > [info] at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962) > {noformat} > That's because Hadoop 2.9 supports http as a file system, and the test > expects the Hadoop libraries not to. I also found a couple of other bugs in > the test (although the code itself for the feature is fine). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23787) SparkSubmitSuite::"download remote resource if it is not supported by yarn" fails on Hadoop 2.9
[ https://issues.apache.org/jira/browse/SPARK-23787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412202#comment-16412202 ] Apache Spark commented on SPARK-23787: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/20895 > SparkSubmitSuite::"download remote resource if it is not supported by yarn" > fails on Hadoop 2.9 > --- > > Key: SPARK-23787 > URL: https://issues.apache.org/jira/browse/SPARK-23787 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > {noformat} > [info] - download list of files to local (10 milliseconds) > [info] > "http:///work/apache/spark/target/tmp/spark-a17bc160-641b-41e1-95be-a2e31b175e09/testJar3393247632492201277.jar"; > did not start with substring "file:" (SparkSubmitSuite.scala:1022) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > [info] at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > [info] at > org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > [info] at > org.apache.spark.deploy.SparkSubmitSuite.org$apache$spark$deploy$SparkSubmitSuite$$testRemoteResources(SparkSubmitSuite.scala:1022) > [info] at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply$mcV$sp(SparkSubmitSuite.scala:962) > [info] at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962) > [info] at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962) > {noformat} > That's because Hadoop 2.9 supports http as a file system, and the test > expects the Hadoop libraries not to. I also found a couple of other bugs in > the test (although the code itself for the feature is fine). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23787) SparkSubmitSuite::"download remote resource if it is not supported by yarn" fails on Hadoop 2.9
[ https://issues.apache.org/jira/browse/SPARK-23787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23787: Assignee: (was: Apache Spark) > SparkSubmitSuite::"download remote resource if it is not supported by yarn" > fails on Hadoop 2.9 > --- > > Key: SPARK-23787 > URL: https://issues.apache.org/jira/browse/SPARK-23787 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > {noformat} > [info] - download list of files to local (10 milliseconds) > [info] > "http:///work/apache/spark/target/tmp/spark-a17bc160-641b-41e1-95be-a2e31b175e09/testJar3393247632492201277.jar"; > did not start with substring "file:" (SparkSubmitSuite.scala:1022) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > [info] at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > [info] at > org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > [info] at > org.apache.spark.deploy.SparkSubmitSuite.org$apache$spark$deploy$SparkSubmitSuite$$testRemoteResources(SparkSubmitSuite.scala:1022) > [info] at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply$mcV$sp(SparkSubmitSuite.scala:962) > [info] at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962) > [info] at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962) > {noformat} > That's because Hadoop 2.9 supports http as a file system, and the test > expects the Hadoop libraries not to. I also found a couple of other bugs in > the test (although the code itself for the feature is fine). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23787) SparkSubmitSuite::"download remote resource if it is not supported by yarn" fails on Hadoop 2.9
[ https://issues.apache.org/jira/browse/SPARK-23787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-23787: --- Summary: SparkSubmitSuite::"download remote resource if it is not supported by yarn" fails on Hadoop 2.9 (was: SparkSubmitSuite::"download list of files to local" fails on Hadoop 2.9) > SparkSubmitSuite::"download remote resource if it is not supported by yarn" > fails on Hadoop 2.9 > --- > > Key: SPARK-23787 > URL: https://issues.apache.org/jira/browse/SPARK-23787 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > {noformat} > [info] - download list of files to local (10 milliseconds) > [info] > "http:///work/apache/spark/target/tmp/spark-a17bc160-641b-41e1-95be-a2e31b175e09/testJar3393247632492201277.jar"; > did not start with substring "file:" (SparkSubmitSuite.scala:1022) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > [info] at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > [info] at > org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > [info] at > org.apache.spark.deploy.SparkSubmitSuite.org$apache$spark$deploy$SparkSubmitSuite$$testRemoteResources(SparkSubmitSuite.scala:1022) > [info] at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply$mcV$sp(SparkSubmitSuite.scala:962) > [info] at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962) > [info] at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962) > {noformat} > That's because Hadoop 2.9 supports http as a file system, and the test > expects the Hadoop libraries not to. I also found a couple of other bugs in > the test (although the code itself for the feature is fine). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23787) SparkSubmitSuite::"" fails on Hadoop 2.9
Marcelo Vanzin created SPARK-23787: -- Summary: SparkSubmitSuite::"" fails on Hadoop 2.9 Key: SPARK-23787 URL: https://issues.apache.org/jira/browse/SPARK-23787 Project: Spark Issue Type: Bug Components: Spark Core, Tests Affects Versions: 2.4.0 Reporter: Marcelo Vanzin {noformat} [info] - download list of files to local (10 milliseconds) [info] "http:///work/apache/spark/target/tmp/spark-a17bc160-641b-41e1-95be-a2e31b175e09/testJar3393247632492201277.jar"; did not start with substring "file:" (SparkSubmitSuite.scala:1022) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) [info] at org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) [info] at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) [info] at org.apache.spark.deploy.SparkSubmitSuite.org$apache$spark$deploy$SparkSubmitSuite$$testRemoteResources(SparkSubmitSuite.scala:1022) [info] at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply$mcV$sp(SparkSubmitSuite.scala:962) [info] at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962) [info] at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962) {noformat} That's because Hadoop 2.9 supports http as a file system, and the test expects the Hadoop libraries not to. I also found a couple of other bugs in the test (although the code itself for the feature is fine). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23787) SparkSubmitSuite::"download list of files to local" fails on Hadoop 2.9
[ https://issues.apache.org/jira/browse/SPARK-23787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-23787: --- Summary: SparkSubmitSuite::"download list of files to local" fails on Hadoop 2.9 (was: SparkSubmitSuite::"" fails on Hadoop 2.9) > SparkSubmitSuite::"download list of files to local" fails on Hadoop 2.9 > --- > > Key: SPARK-23787 > URL: https://issues.apache.org/jira/browse/SPARK-23787 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > {noformat} > [info] - download list of files to local (10 milliseconds) > [info] > "http:///work/apache/spark/target/tmp/spark-a17bc160-641b-41e1-95be-a2e31b175e09/testJar3393247632492201277.jar"; > did not start with substring "file:" (SparkSubmitSuite.scala:1022) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > [info] at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > [info] at > org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > [info] at > org.apache.spark.deploy.SparkSubmitSuite.org$apache$spark$deploy$SparkSubmitSuite$$testRemoteResources(SparkSubmitSuite.scala:1022) > [info] at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply$mcV$sp(SparkSubmitSuite.scala:962) > [info] at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962) > [info] at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$18.apply(SparkSubmitSuite.scala:962) > {noformat} > That's because Hadoop 2.9 supports http as a file system, and the test > expects the Hadoop libraries not to. I also found a couple of other bugs in > the test (although the code itself for the feature is fine). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state
[ https://issues.apache.org/jira/browse/SPARK-23785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412130#comment-16412130 ] Marcelo Vanzin commented on SPARK-23785: This is a little trickier than just the checks you have in the PR. The check that is triggering in Hive is on the {{LauncherBackend}} side. So it has somehow already been closed, and a {{setState}} call happens. That can happen if there are two calls to {{LocalSchedulerBackend.stop}}, which can happen if someone with a launcher handle calls {{stop()}} on the handle. But the code should be safe against that and just ignore subsequent calls. The race you describe also exists; it's not what the exception in the Hive bug shows, though. So perhaps it's better to do a few different things: - add the checks in your PR - in LauncherBackend.BackendConnection, set "isDisconnected" before calling super.close() - in that same class, override the "send()" method to ignore "SocketException", to handle the second race. > LauncherBackend doesn't check state of connection before setting state > -- > > Key: SPARK-23785 > URL: https://issues.apache.org/jira/browse/SPARK-23785 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Sahil Takiar >Priority: Major > > Found in HIVE-18533 while trying to integration with the > {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its > connection to the {{LauncherServer}} before trying to run {{setState}} - > which sends a {{SetState}} message on the connection. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
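A compact sketch of the guarding discussed in the comment above. The class and member names are assumptions for illustration, not the actual {{LauncherBackend}}/{{BackendConnection}} source: mark the connection disconnected before closing it, and treat socket errors from a racing close on the other end as a no-op.

{code:scala}
// Illustrative sketch only; names are assumed and this is not the Spark launcher source.
import java.io.Closeable
import java.net.SocketException

class GuardedConnection(underlying: Closeable) {
  @volatile private var disconnected = false

  def isConnected: Boolean = !disconnected

  def close(): Unit = {
    // Flip the flag before closing so concurrent setState()/send() callers
    // see the connection as gone (one of the suggestions above).
    disconnected = true
    underlying.close()
  }

  def send(message: AnyRef): Unit = {
    if (isConnected) {
      try {
        doSend(message)
      } catch {
        // The other end may have closed the socket already; ignore the late write.
        case _: SocketException =>
      }
    }
  }

  // Placeholder for the real serialization + socket write.
  private def doSend(message: AnyRef): Unit = println(s"sending $message")
}
{code}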
[jira] [Assigned] (SPARK-23786) CSV schema validation - column names are not checked
[ https://issues.apache.org/jira/browse/SPARK-23786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23786: Assignee: Apache Spark > CSV schema validation - column names are not checked > > > Key: SPARK-23786 > URL: https://issues.apache.org/jira/browse/SPARK-23786 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Here is a csv file contains two columns of the same type: > {code} > $cat marina.csv > depth, temperature > 10.2, 9.0 > 5.5, 12.3 > {code} > If we define the schema with correct types but wrong column names (reversed > order): > {code:scala} > val schema = new StructType().add("temperature", DoubleType).add("depth", > DoubleType) > {code} > Spark reads the csv file without any errors: > {code:scala} > val ds = spark.read.schema(schema).option("header", "true").csv("marina.csv") > ds.show > {code} > and outputs wrong result: > {code} > +---+-+ > |temperature|depth| > +---+-+ > | 10.2| 9.0| > |5.5| 12.3| > +---+-+ > {code} > The correct behavior would be either output error or read columns according > its names in the schema. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23786) CSV schema validation - column names are not checked
[ https://issues.apache.org/jira/browse/SPARK-23786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412088#comment-16412088 ] Apache Spark commented on SPARK-23786: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/20894 > CSV schema validation - column names are not checked > > > Key: SPARK-23786 > URL: https://issues.apache.org/jira/browse/SPARK-23786 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Here is a csv file contains two columns of the same type: > {code} > $cat marina.csv > depth, temperature > 10.2, 9.0 > 5.5, 12.3 > {code} > If we define the schema with correct types but wrong column names (reversed > order): > {code:scala} > val schema = new StructType().add("temperature", DoubleType).add("depth", > DoubleType) > {code} > Spark reads the csv file without any errors: > {code:scala} > val ds = spark.read.schema(schema).option("header", "true").csv("marina.csv") > ds.show > {code} > and outputs wrong result: > {code} > +---+-+ > |temperature|depth| > +---+-+ > | 10.2| 9.0| > |5.5| 12.3| > +---+-+ > {code} > The correct behavior would be either output error or read columns according > its names in the schema. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23786) CSV schema validation - column names are not checked
[ https://issues.apache.org/jira/browse/SPARK-23786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23786: Assignee: (was: Apache Spark) > CSV schema validation - column names are not checked > > > Key: SPARK-23786 > URL: https://issues.apache.org/jira/browse/SPARK-23786 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Here is a csv file contains two columns of the same type: > {code} > $cat marina.csv > depth, temperature > 10.2, 9.0 > 5.5, 12.3 > {code} > If we define the schema with correct types but wrong column names (reversed > order): > {code:scala} > val schema = new StructType().add("temperature", DoubleType).add("depth", > DoubleType) > {code} > Spark reads the csv file without any errors: > {code:scala} > val ds = spark.read.schema(schema).option("header", "true").csv("marina.csv") > ds.show > {code} > and outputs wrong result: > {code} > +---+-+ > |temperature|depth| > +---+-+ > | 10.2| 9.0| > |5.5| 12.3| > +---+-+ > {code} > The correct behavior would be either output error or read columns according > its names in the schema. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23786) CSV schema validation - column names are not checked
Maxim Gekk created SPARK-23786: -- Summary: CSV schema validation - column names are not checked Key: SPARK-23786 URL: https://issues.apache.org/jira/browse/SPARK-23786 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Maxim Gekk Here is a csv file that contains two columns of the same type: {code} $cat marina.csv depth, temperature 10.2, 9.0 5.5, 12.3 {code} If we define the schema with correct types but wrong column names (reversed order): {code:scala} val schema = new StructType().add("temperature", DoubleType).add("depth", DoubleType) {code} Spark reads the csv file without any errors: {code:scala} val ds = spark.read.schema(schema).option("header", "true").csv("marina.csv") ds.show {code} and outputs a wrong result: {code} +---+-+ |temperature|depth| +---+-+ | 10.2| 9.0| |5.5| 12.3| +---+-+ {code} The correct behavior would be either to output an error or to read the columns according to their names in the schema. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
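Until the validation proposed here exists, one possible user-side workaround is to read the header row first and compare it with the supplied schema before trusting the positional mapping. The sketch below assumes a running SparkSession named {{spark}} and the marina.csv file from the description; on that file the check fails loudly instead of silently mislabelling the columns.

{code:scala}
// User-side workaround sketch, not part of the Spark API: verify the CSV header
// against the supplied schema before reading with it.
import org.apache.spark.sql.types.{DoubleType, StructType}

val schema = new StructType().add("temperature", DoubleType).add("depth", DoubleType)

// Read the raw header row as strings.
val header = spark.read.option("header", "false").csv("marina.csv")
  .first().toSeq.map(_.toString.trim)

// For marina.csv this throws, because the header is (depth, temperature)
// while the schema declares (temperature, depth).
require(header == schema.fieldNames.toSeq,
  s"CSV header $header does not match schema columns ${schema.fieldNames.toSeq}")

val ds = spark.read.schema(schema).option("header", "true").csv("marina.csv")
{code}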
[jira] [Commented] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state
[ https://issues.apache.org/jira/browse/SPARK-23785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412042#comment-16412042 ] Sahil Takiar commented on SPARK-23785: -- [~vanzin] opened a PR that just checks if {{isConnected}} is true before writing to the connection, but not sure this will fix the issue from HIVE-18533. Looking through the code it seems like this only gets set when the {{LauncherBackend}} closes the connection, but doesn't detect when the {{LauncherServer}} closes the connection. Unless, I'm missing something. > LauncherBackend doesn't check state of connection before setting state > -- > > Key: SPARK-23785 > URL: https://issues.apache.org/jira/browse/SPARK-23785 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Sahil Takiar >Priority: Major > > Found in HIVE-18533 while trying to integration with the > {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its > connection to the {{LauncherServer}} before trying to run {{setState}} - > which sends a {{SetState}} message on the connection. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state
[ https://issues.apache.org/jira/browse/SPARK-23785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23785: Assignee: Apache Spark > LauncherBackend doesn't check state of connection before setting state > -- > > Key: SPARK-23785 > URL: https://issues.apache.org/jira/browse/SPARK-23785 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Sahil Takiar >Assignee: Apache Spark >Priority: Major > > Found in HIVE-18533 while trying to integration with the > {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its > connection to the {{LauncherServer}} before trying to run {{setState}} - > which sends a {{SetState}} message on the connection. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state
[ https://issues.apache.org/jira/browse/SPARK-23785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412004#comment-16412004 ] Apache Spark commented on SPARK-23785: -- User 'sahilTakiar' has created a pull request for this issue: https://github.com/apache/spark/pull/20893 > LauncherBackend doesn't check state of connection before setting state > -- > > Key: SPARK-23785 > URL: https://issues.apache.org/jira/browse/SPARK-23785 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Sahil Takiar >Priority: Major > > Found in HIVE-18533 while trying to integration with the > {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its > connection to the {{LauncherServer}} before trying to run {{setState}} - > which sends a {{SetState}} message on the connection. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23772) Provide an option to ignore column of all null values or empty map/array during JSON schema inference
[ https://issues.apache.org/jira/browse/SPARK-23772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412007#comment-16412007 ] Reynold Xin commented on SPARK-23772: - This is a good change to do! > Provide an option to ignore column of all null values or empty map/array > during JSON schema inference > - > > Key: SPARK-23772 > URL: https://issues.apache.org/jira/browse/SPARK-23772 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiangrui Meng >Priority: Major > > It is common that we convert data from JSON source to structured format > periodically. In the initial batch of JSON data, if a field's values are > always null, Spark infers this field as StringType. However, in the second > batch, one non-null value appears in this field and its type turns out to be > not StringType. Then merge schema failed because schema inconsistency. > This also applies to empty arrays and empty objects. My proposal is providing > an option in Spark JSON source to omit those fields until we see a non-null > value. > This is similar to SPARK-12436 but the proposed solution is different. > cc: [~rxin] [~smilegator] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
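To make the failure mode concrete: the snippet below reproduces the inference mismatch described above with two tiny in-memory "batches". It assumes a running SparkSession named {{spark}} (spark-shell style); the proposed option would let the first batch simply omit the all-null field instead of pinning it to StringType.

{code:scala}
// Reproduces the schema-inference mismatch described above. Assumes a running
// SparkSession named `spark`; the JSON strings stand in for two batches of files.
import spark.implicits._

val batch1 = spark.read.json(Seq("""{"id": 1, "score": null}""").toDS())
val batch2 = spark.read.json(Seq("""{"id": 2, "score": 42}""").toDS())

batch1.printSchema()   // score: string  (inferred from nothing but nulls)
batch2.printSchema()   // score: long

// Downstream schema merging between the two batches then fails or needs casts,
// which is what an "ignore all-null fields during inference" option would avoid.
{code}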
[jira] [Assigned] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state
[ https://issues.apache.org/jira/browse/SPARK-23785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23785: Assignee: (was: Apache Spark) > LauncherBackend doesn't check state of connection before setting state > -- > > Key: SPARK-23785 > URL: https://issues.apache.org/jira/browse/SPARK-23785 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Sahil Takiar >Priority: Major > > Found in HIVE-18533 while trying to integration with the > {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its > connection to the {{LauncherServer}} before trying to run {{setState}} - > which sends a {{SetState}} message on the connection. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23654) Cut jets3t as a dependency of spark-core; exclude it from hadoop-cloud module as incompatible
[ https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-23654: --- Summary: Cut jets3t as a dependency of spark-core; exclude it from hadoop-cloud module as incompatible (was: cut jets3t as a dependency of spark-core; exclude it from hadoop-cloud module as incompatible) > Cut jets3t as a dependency of spark-core; exclude it from hadoop-cloud module > as incompatible > - > > Key: SPARK-23654 > URL: https://issues.apache.org/jira/browse/SPARK-23654 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Priority: Minor > > Spark core declares a dependency on Jets3t, which pulls in other cruft > # the hadoop-cloud module pulls in the hadoop-aws module with the > jets3t-compatible connectors, and the relevant dependencies: the spark-core > dependency is incomplete if that module isn't built, and superflous or > inconsistent if it is. > # We've cut out s3n/s3 and all dependencies on jets3t entirely from hadoop > 3.x in favour we're willing to maintain. > JetS3t was wonderful when it came out, but now the amazon SDKs massively > exceed it in functionality, albeit at the expense of week-to-week stability > and JAR binary compatibility -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state
Sahil Takiar created SPARK-23785: Summary: LauncherBackend doesn't check state of connection before setting state Key: SPARK-23785 URL: https://issues.apache.org/jira/browse/SPARK-23785 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Sahil Takiar Found in HIVE-18533 while trying to integrate with the {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its connection to the {{LauncherServer}} before trying to run {{setState}} - which sends a {{SetState}} message on the connection. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23784) Cannot use custom Aggregator with groupBy/agg
Joshua Howard created SPARK-23784: - Summary: Cannot use custom Aggregator with groupBy/agg Key: SPARK-23784 URL: https://issues.apache.org/jira/browse/SPARK-23784 Project: Spark Issue Type: Bug Components: Project Infra Affects Versions: 2.3.0 Reporter: Joshua Howard I have code [here|https://stackoverflow.com/questions/49440766/trouble-getting-spark-aggregators-to-work] where I am trying to use an Aggregator with both the `select` and `agg` functions. I cannot seem to get this to work in Spark 2.3.0. [Here|https://docs.cloud.databricks.com/docs/spark/1.6/examples/Dataset%20Aggregator.html] is a blog post that appears to be using this functionality in Spark 1.6, but it appears to no longer work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
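Without seeing the code in the linked question it is hard to say what exactly breaks, but for reference the following self-contained example shows one way a custom {{Aggregator}} is wired into {{agg}} on Spark 2.x: define it over the row type and pass {{toColumn}} to {{groupByKey(...).agg(...)}}. The data and names are made up; whether this covers the poster's exact {{select}}/{{agg}} usage depends on their code.

{code:scala}
// Self-contained example (made-up data) of a typed Aggregator used with groupByKey/agg.
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class Sale(shop: String, amount: Double)

object SumAmount extends Aggregator[Sale, Double, Double] {
  def zero: Double = 0.0
  def reduce(acc: Double, s: Sale): Double = acc + s.amount
  def merge(a: Double, b: Double): Double = a + b
  def finish(acc: Double): Double = acc
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object AggregatorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("aggregator-example").getOrCreate()
    import spark.implicits._

    val sales = Seq(Sale("a", 1.0), Sale("a", 2.5), Sale("b", 4.0)).toDS()
    // The typed column produced by toColumn is what agg() accepts.
    sales.groupByKey(_.shop).agg(SumAmount.toColumn.name("total")).show()

    spark.stop()
  }
}
{code}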
[jira] [Resolved] (SPARK-23783) Add new generic export trait for ML pipelines
[ https://issues.apache.org/jira/browse/SPARK-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-23783. - Resolution: Fixed Fix Version/s: 2.4.0 > Add new generic export trait for ML pipelines > - > > Key: SPARK-23783 > URL: https://issues.apache.org/jira/browse/SPARK-23783 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.4.0 >Reporter: holdenk >Assignee: holdenk >Priority: Major > Fix For: 2.4.0 > > > Add a new generic export trait for ML pipelines so that we can support more > than just our internal format. API design based off of the > DataFrameReader/Writer design -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23783) Add new generic export trait for ML pipelines
[ https://issues.apache.org/jira/browse/SPARK-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-23783: --- Assignee: holdenk > Add new generic export trait for ML pipelines > - > > Key: SPARK-23783 > URL: https://issues.apache.org/jira/browse/SPARK-23783 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.4.0 >Reporter: holdenk >Assignee: holdenk >Priority: Major > Fix For: 2.4.0 > > > Add a new generic export trait for ML pipelines so that we can support more > than just our internal format. API design based off of the > DataFrameReader/Writer design -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11239) PMML export for ML linear regression
[ https://issues.apache.org/jira/browse/SPARK-11239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-11239: --- Assignee: holdenk > PMML export for ML linear regression > > > Key: SPARK-11239 > URL: https://issues.apache.org/jira/browse/SPARK-11239 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: holdenk >Assignee: holdenk >Priority: Major > Fix For: 2.4.0 > > > Add PMML export for linear regression models form the ML pipeline. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11239) PMML export for ML linear regression
[ https://issues.apache.org/jira/browse/SPARK-11239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-11239. - Resolution: Fixed Fix Version/s: 2.4.0 > PMML export for ML linear regression > > > Key: SPARK-11239 > URL: https://issues.apache.org/jira/browse/SPARK-11239 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: holdenk >Assignee: holdenk >Priority: Major > Fix For: 2.4.0 > > > Add PMML export for linear regression models form the ML pipeline. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21834) Incorrect executor request in case of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-21834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411897#comment-16411897 ] Imran Rashid commented on SPARK-21834: -- SPARK-23365 is basically a duplicate of this, though both have changes associated with them. (I didn't realize it at the time, but SPARK-23365 is not strictly necessary on top of this; it does, however, improve code clarity.) > Incorrect executor request in case of dynamic allocation > > > Key: SPARK-21834 > URL: https://issues.apache.org/jira/browse/SPARK-21834 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Sital Kedia >Priority: Major > Fix For: 2.1.2, 2.2.1, 2.3.0 > > > The killExecutor API currently does not allow killing an executor without > updating the total number of executors needed. When dynamic allocation > is turned on and the allocator tries to kill an executor, the scheduler > reduces the total number of executors needed (see > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635) > which is incorrect because the allocator already takes care of setting the > required number of executors itself. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
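A small sketch of the interaction described above, using SparkContext's developer API; the helper name is made up, and the real fix belongs inside the scheduler backend rather than in user code:
{code:scala}
import org.apache.spark.SparkContext

object KillWithoutLoweringTarget {
  // killExecutors also lowers the backend's target number of executors, so a
  // caller that manages the target itself (like the dynamic allocation manager)
  // has to restore the desired total explicitly afterwards.
  def kill(sc: SparkContext, execId: String, desiredTotal: Int): Unit = {
    sc.killExecutors(Seq(execId))
    sc.requestTotalExecutors(desiredTotal, 0, Map.empty)
  }
}
{code}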
[jira] [Commented] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job
[ https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411894#comment-16411894 ] Imran Rashid commented on SPARK-23365: -- This is mostly a duplicate of SPARK-21834, though I'm not marking it as such since both had committed changes, and I think this change is still good as useful cleanup. The change from SPARK-21834 is actually sufficient to solve the problem described above. > DynamicAllocation with failure in straggler task can lead to a hung spark job > - > > Key: SPARK-23365 > URL: https://issues.apache.org/jira/browse/SPARK-23365 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.2, 2.2.1, 2.3.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > Dynamic Allocation can lead to a spark app getting stuck with 0 executors > requested when the executors in the last tasks of a taskset fail (eg. with an > OOM). > This happens when {{ExecutorAllocationManager}} s internal target number of > executors gets out of sync with {{CoarseGrainedSchedulerBackend}} s target > number. {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many > tasks are active or pending in submitted stages, and computes how many > executors would be needed for them. And as tasks finish, it will actively > decrease that count, informing the {{CGSB}} along the way. (2) When it > decides executors are inactive for long enough, then it requests that > {{CGSB}} kill the executors -- this also tells the {{CGSB}} to update its > target number of executors: > https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622 > So when there is just one task left, you could have the following sequence of > events: > (1) the {{EAM}} sets the desired number of executors to 1, and updates the > {{CGSB}} too > (2) while that final task is still running, the other executors cross the > idle timeout, and the {{EAM}} requests the {{CGSB}} kill them > (3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target > of 0 executors > If the final task completed normally now, everything would be OK; the next > taskset would get submitted, the {{EAM}} would increase the target number of > executors and it would update the {{CGSB}}. > But if the executor for that final task failed (eg. an OOM), then the {{EAM}} > thinks it [doesn't need to update > anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386], > because its target is already 1, which is all it needs for that final task; > and the {{CGSB}} doesn't update anything either since its target is 0. > I think you can determine if this is the cause of a stuck app by looking for > {noformat} > yarn.YarnAllocator: Driver requested a total number of 0 executor(s). > {noformat} > in the logs of the ApplicationMaster (at least on yarn). 
> You can reproduce this with this test app, run with {{--conf > "spark.dynamicAllocation.minExecutors=1" --conf > "spark.dynamicAllocation.maxExecutors=5" --conf > "spark.dynamicAllocation.executorIdleTimeout=5s"}} > {code} > import org.apache.spark.SparkEnv > sc.setLogLevel("INFO") > sc.parallelize(1 to 1, 1000).count() > val execs = sc.parallelize(1 to 1000, 1000).map { _ => > SparkEnv.get.executorId}.collect().toSet > val badExec = execs.head > println("will kill exec " + badExec) > sc.parallelize(1 to 5, 5).mapPartitions { itr => > val exec = SparkEnv.get.executorId > if (exec == badExec) { > Thread.sleep(2) // long enough that all the other tasks finish, and > the executors cross the idle timeout > // now cause the executor to oom > var buffers = Seq[Array[Byte]]() > while(true) { > buffers :+= new Array[Byte](1e8.toInt) > } > itr > } else { > itr > } > }.collect() > {code} > *EDIT*: I adjusted the repro to cause an OOM on the bad executor, since > {{sc.killExecutor}} doesn't play nice with dynamic allocation in other ways. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23783) Add new generic export trait for ML pipelines
[ https://issues.apache.org/jira/browse/SPARK-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23783: Assignee: Apache Spark > Add new generic export trait for ML pipelines > - > > Key: SPARK-23783 > URL: https://issues.apache.org/jira/browse/SPARK-23783 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.4.0 >Reporter: holdenk >Assignee: Apache Spark >Priority: Major > > Add a new generic export trait for ML pipelines so that we can support more > than just our internal format. API design based off of the > DataFrameReader/Writer design -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job
[ https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-23365: - Description: Dynamic Allocation can lead to a spark app getting stuck with 0 executors requested when the executors in the last tasks of a taskset fail (eg. with an OOM). This happens when {{ExecutorAllocationManager}} s internal target number of executors gets out of sync with {{CoarseGrainedSchedulerBackend}} s target number. {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many tasks are active or pending in submitted stages, and computes how many executors would be needed for them. And as tasks finish, it will actively decrease that count, informing the {{CGSB}} along the way. (2) When it decides executors are inactive for long enough, then it requests that {{CGSB}} kill the executors -- this also tells the {{CGSB}} to update its target number of executors: https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622 So when there is just one task left, you could have the following sequence of events: (1) the {{EAM}} sets the desired number of executors to 1, and updates the {{CGSB}} too (2) while that final task is still running, the other executors cross the idle timeout, and the {{EAM}} requests the {{CGSB}} kill them (3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target of 0 executors If the final task completed normally now, everything would be OK; the next taskset would get submitted, the {{EAM}} would increase the target number of executors and it would update the {{CGSB}}. But if the executor for that final task failed (eg. an OOM), then the {{EAM}} thinks it [doesn't need to update anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386], because its target is already 1, which is all it needs for that final task; and the {{CGSB}} doesn't update anything either since its target is 0. I think you can determine if this is the cause of a stuck app by looking for {noformat} yarn.YarnAllocator: Driver requested a total number of 0 executor(s). {noformat} in the logs of the ApplicationMaster (at least on yarn). You can reproduce this with this test app, run with {{--conf "spark.dynamicAllocation.minExecutors=1" --conf "spark.dynamicAllocation.maxExecutors=5" --conf "spark.dynamicAllocation.executorIdleTimeout=5s"}} {code} import org.apache.spark.SparkEnv sc.setLogLevel("INFO") sc.parallelize(1 to 1, 1000).count() val execs = sc.parallelize(1 to 1000, 1000).map { _ => SparkEnv.get.executorId}.collect().toSet val badExec = execs.head println("will kill exec " + badExec) sc.parallelize(1 to 5, 5).mapPartitions { itr => val exec = SparkEnv.get.executorId if (exec == badExec) { Thread.sleep(2) // long enough that all the other tasks finish, and the executors cross the idle timeout // now cause the executor to oom var buffers = Seq[Array[Byte]]() while(true) { buffers :+= new Array[Byte](1e8.toInt) } itr } else { itr } }.collect() {code} *EDIT*: I adjusted the repro to cause an OOM on the bad executor, since {{sc.killExecutor}} doesn't play nice with dynamic allocation in other ways. was: Dynamic Allocation can lead to a spark app getting stuck with 0 executors requested when the executors in the last tasks of a taskset fail (eg. with an OOM). 
This happens when {{ExecutorAllocationManager}} s internal target number of executors gets out of sync with {{CoarseGrainedSchedulerBackend}} s target number. {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many tasks are active or pending in submitted stages, and computes how many executors would be needed for them. And as tasks finish, it will actively decrease that count, informing the {{CGSB}} along the way. (2) When it decides executors are inactive for long enough, then it requests that {{CGSB}} kill the executors -- this also tells the {{CGSB}} to update its target number of executors: https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622 So when there is just one task left, you could have the following sequence of events: (1) the {{EAM}} sets the desired number of executors to 1, and updates the {{CGSB}} too (2) while that final task is still running, the other executors cross the idle timeout, and the {{EAM}} requests the {{CGSB}} kill them (3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target of 0 executors If the final task completed normally now, everything would be OK; the next taskset would get submitted, the {{EAM}} would increase th
[jira] [Assigned] (SPARK-23783) Add new generic export trait for ML pipelines
[ https://issues.apache.org/jira/browse/SPARK-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23783: Assignee: (was: Apache Spark) > Add new generic export trait for ML pipelines > - > > Key: SPARK-23783 > URL: https://issues.apache.org/jira/browse/SPARK-23783 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Major > > Add a new generic export trait for ML pipelines so that we can support more > than just our internal format. API design based off of the > DataFrameReader/Writer design -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23783) Add new generic export trait for ML pipelines
[ https://issues.apache.org/jira/browse/SPARK-23783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411891#comment-16411891 ] Apache Spark commented on SPARK-23783: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/19876 > Add new generic export trait for ML pipelines > - > > Key: SPARK-23783 > URL: https://issues.apache.org/jira/browse/SPARK-23783 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Major > > Add a new generic export trait for ML pipelines so that we can support more > than just our internal format. API design based off of the > DataFrameReader/Writer design -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23783) Add new generic export trait for ML pipelines
holdenk created SPARK-23783: --- Summary: Add new generic export trait for ML pipelines Key: SPARK-23783 URL: https://issues.apache.org/jira/browse/SPARK-23783 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.4.0 Reporter: holdenk Add a new generic export trait for ML pipelines so that we can support more than just our internal format. API design based off of the DataFrameReader/Writer design -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
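For illustration, one hypothetical shape such a trait could take if it mirrors the DataFrameReader/Writer pattern; the trait and method names below are assumptions, not the merged API:
{code:scala}
import org.apache.spark.ml.PipelineStage
import org.apache.spark.sql.SparkSession

// Hypothetical: one implementation per on-disk format (internal, PMML, ...).
trait MLExportFormat {
  /** Short name used to select the format, e.g. writer.format("pmml"). */
  def shortName(): String

  /** Persist the given pipeline stage at `path` using this format. */
  def write(path: String, session: SparkSession, stage: PipelineStage,
            options: Map[String, String]): Unit
}
{code}
A caller would then pick the implementation much like {{DataFrameWriter.format(...)}}, with the internal format remaining the default.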
[jira] [Assigned] (SPARK-21685) Params isSet in scala Transformer triggered by _setDefault in pyspark
[ https://issues.apache.org/jira/browse/SPARK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-21685: --- Assignee: Bryan Cutler > Params isSet in scala Transformer triggered by _setDefault in pyspark > - > > Key: SPARK-21685 > URL: https://issues.apache.org/jira/browse/SPARK-21685 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Ratan Rai Sur >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.4.0 > > > I'm trying to write a PySpark wrapper for a Transformer whose transform > method includes the line > {code:java} > require(!(isSet(outputNodeName) && isSet(outputNodeIndex)), "Can't set both > outputNodeName and outputNodeIndex") > {code} > This should only throw an exception when both of these parameters are > explicitly set. > In the PySpark wrapper for the Transformer, there is this line in ___init___ > {code:java} > self._setDefault(outputNodeIndex=0) > {code} > Here is the line in the main python script showing how it is being configured > {code:java} > cntkModel = > CNTKModel().setInputCol("images").setOutputCol("output").setModelLocation(spark, > model.uri).setOutputNodeName("z") > {code} > As you can see, only setOutputNodeName is being explicitly set but the > exception is still being thrown. > If you need more context, > https://github.com/RatanRSur/mmlspark/tree/default-cntkmodel-output is the > branch with the code, the files I'm referring to here that are tracked are > the following: > src/cntk-model/src/main/scala/CNTKModel.scala > notebooks/tests/301 - CIFAR10 CNTK CNN Evaluation.ipynb > The pyspark wrapper code is autogenerated -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21685) Params isSet in scala Transformer triggered by _setDefault in pyspark
[ https://issues.apache.org/jira/browse/SPARK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-21685. - Resolution: Fixed Fix Version/s: 2.4.0 > Params isSet in scala Transformer triggered by _setDefault in pyspark > - > > Key: SPARK-21685 > URL: https://issues.apache.org/jira/browse/SPARK-21685 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Ratan Rai Sur >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.4.0 > > > I'm trying to write a PySpark wrapper for a Transformer whose transform > method includes the line > {code:java} > require(!(isSet(outputNodeName) && isSet(outputNodeIndex)), "Can't set both > outputNodeName and outputNodeIndex") > {code} > This should only throw an exception when both of these parameters are > explicitly set. > In the PySpark wrapper for the Transformer, there is this line in ___init___ > {code:java} > self._setDefault(outputNodeIndex=0) > {code} > Here is the line in the main python script showing how it is being configured > {code:java} > cntkModel = > CNTKModel().setInputCol("images").setOutputCol("output").setModelLocation(spark, > model.uri).setOutputNodeName("z") > {code} > As you can see, only setOutputNodeName is being explicitly set but the > exception is still being thrown. > If you need more context, > https://github.com/RatanRSur/mmlspark/tree/default-cntkmodel-output is the > branch with the code, the files I'm referring to here that are tracked are > the following: > src/cntk-model/src/main/scala/CNTKModel.scala > notebooks/tests/301 - CIFAR10 CNTK CNN Evaluation.ipynb > The pyspark wrapper code is autogenerated -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
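To illustrate the distinction the fix relies on, here is a minimal Scala-side sketch (hypothetical class and param names): a value supplied via setDefault should leave isSet false, whereas the bug was that _setDefault in the PySpark wrapper ended up marking the param as explicitly set on the JVM side.
{code:scala}
import org.apache.spark.ml.param.{IntParam, ParamMap, Params}
import org.apache.spark.ml.util.Identifiable

class OutputNodeParams(override val uid: String) extends Params {
  def this() = this(Identifiable.randomUID("outputNodeParams"))

  val outputNodeIndex = new IntParam(this, "outputNodeIndex", "index of the output node")
  setDefault(outputNodeIndex -> 0)

  override def copy(extra: ParamMap): OutputNodeParams = defaultCopy(extra)
}

object ParamsDemo {
  def main(args: Array[String]): Unit = {
    val p = new OutputNodeParams()
    // A default is not an explicit set, so guards such as
    // require(!(isSet(a) && isSet(b)), ...) should not fire.
    println(p.isSet(p.outputNodeIndex))        // false
    println(p.getOrDefault(p.outputNodeIndex)) // 0
  }
}
{code}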
[jira] [Commented] (SPARK-23700) Cleanup unused imports
[ https://issues.apache.org/jira/browse/SPARK-23700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411860#comment-16411860 ] Apache Spark commented on SPARK-23700: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/20892 > Cleanup unused imports > -- > > Key: SPARK-23700 > URL: https://issues.apache.org/jira/browse/SPARK-23700 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Priority: Major > > I've noticed a fair amount of unused imports in pyspark, I'll take a look > through and try to clean them up -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23700) Cleanup unused imports
[ https://issues.apache.org/jira/browse/SPARK-23700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23700: Assignee: Apache Spark > Cleanup unused imports > -- > > Key: SPARK-23700 > URL: https://issues.apache.org/jira/browse/SPARK-23700 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Major > > I've noticed a fair amount of unused imports in pyspark, I'll take a look > through and try to clean them up -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23700) Cleanup unused imports
[ https://issues.apache.org/jira/browse/SPARK-23700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23700: Assignee: (was: Apache Spark) > Cleanup unused imports > -- > > Key: SPARK-23700 > URL: https://issues.apache.org/jira/browse/SPARK-23700 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Priority: Major > > I've noticed a fair amount of unused imports in pyspark, I'll take a look > through and try to clean them up -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23776) pyspark-sql tests should display build instructions when components are missing
[ https://issues.apache.org/jira/browse/SPARK-23776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411854#comment-16411854 ] Bruce Robbins commented on SPARK-23776: --- As it turns out, the building-spark page does have maven instructions to build with hive support before running pyspark tests. However, it does not include instructions if you are building with sbt, which needs more than just building with hive support (you need to also compile the test classes). > pyspark-sql tests should display build instructions when components are > missing > --- > > Key: SPARK-23776 > URL: https://issues.apache.org/jira/browse/SPARK-23776 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > This is a follow up to SPARK-23417. > The pyspark-streaming tests print useful build instructions when certain > components are missing in the build. > pyspark-sql's udf and readwrite tests also have specific build requirements: > the build must compile test scala files, and the build must also create the > Hive assembly. When those class or jar files are not created, the tests throw > only partially helpful exceptions, e.g.: > {noformat} > AnalysisException: u'Can not load class > test.org.apache.spark.sql.JavaStringLength, please make sure it is on the > classpath;' > {noformat} > or > {noformat} > IllegalArgumentException: u"Error while instantiating > 'org.apache.spark.sql.hive.HiveExternalCatalog':" > {noformat} > You end up in this situation when you follow Spark's build instructions and > then attempt to run the pyspark tests. > It would be nice if pyspark-sql tests provide helpful build instructions in > these cases. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23782) SHS should not show applications to user without read permission
[ https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411851#comment-16411851 ] Marcelo Vanzin commented on SPARK-23782: More discussion at: https://github.com/apache/spark/pull/17582 > SHS should not show applications to user without read permission > > > Key: SPARK-23782 > URL: https://issues.apache.org/jira/browse/SPARK-23782 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > The History Server shows all the applications to all the users, even though > they have no permission to read them. They cannot read the details of the > applications they cannot access, but still anybody can list all the > applications submitted by all users. > For instance, if we have an admin user {{admin}} and two normal users {{u1}} > and {{u2}}, and each of them submitted one application, all of them can see > in the main page of SHS: > ||App ID||App Name|| ... ||Spark User|| ... || > |app-123456789|The Admin App| .. |admin| ... | > |app-123456790|u1 secret app| .. |u1| ... | > |app-123456791|u2 secret app| .. |u2| ... | > Then clicking on each application, the proper permissions are applied and > each user can see only the applications he has the read permission for. > Instead, each user should see only the applications he has the permission to > read and he/she should not be able to see applications he has not the > permissions for. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23782) SHS should not show applications to user without read permission
[ https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411843#comment-16411843 ] Marcelo Vanzin edited comment on SPARK-23782 at 3/23/18 6:23 PM: - bq. This seems a security hole to me What sensitive information is being exposed to users that should not see it? Won't you get that same info if you go to the resource manager's page and look at what applications have run? was (Author: vanzin): bq. This seems a security hole to me What sensitive information if being exposed to users that should not see it? Won't you get that same info if you go to the resource manager's page and look at what applications have run? > SHS should not show applications to user without read permission > > > Key: SPARK-23782 > URL: https://issues.apache.org/jira/browse/SPARK-23782 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > The History Server shows all the applications to all the users, even though > they have no permission to read them. They cannot read the details of the > applications they cannot access, but still anybody can list all the > applications submitted by all users. > For instance, if we have an admin user {{admin}} and two normal users {{u1}} > and {{u2}}, and each of them submitted one application, all of them can see > in the main page of SHS: > ||App ID||App Name|| ... ||Spark User|| ... || > |app-123456789|The Admin App| .. |admin| ... | > |app-123456790|u1 secret app| .. |u1| ... | > |app-123456791|u2 secret app| .. |u2| ... | > Then clicking on each application, the proper permissions are applied and > each user can see only the applications he has the read permission for. > Instead, each user should see only the applications he has the permission to > read and he/she should not be able to see applications he has not the > permissions for. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23782) SHS should not show applications to user without read permission
[ https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411843#comment-16411843 ] Marcelo Vanzin commented on SPARK-23782: bq. This seems a security hole to me What sensitive information if being exposed to users that should not see it? Won't you get that same info if you go to the resource manager's page and look at what applications have run? > SHS should not show applications to user without read permission > > > Key: SPARK-23782 > URL: https://issues.apache.org/jira/browse/SPARK-23782 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > The History Server shows all the applications to all the users, even though > they have no permission to read them. They cannot read the details of the > applications they cannot access, but still anybody can list all the > applications submitted by all users. > For instance, if we have an admin user {{admin}} and two normal users {{u1}} > and {{u2}}, and each of them submitted one application, all of them can see > in the main page of SHS: > ||App ID||App Name|| ... ||Spark User|| ... || > |app-123456789|The Admin App| .. |admin| ... | > |app-123456790|u1 secret app| .. |u1| ... | > |app-123456791|u2 secret app| .. |u2| ... | > Then clicking on each application, the proper permissions are applied and > each user can see only the applications he has the read permission for. > Instead, each user should see only the applications he has the permission to > read and he/she should not be able to see applications he has not the > permissions for. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23782) SHS should not show applications to user without read permission
[ https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411836#comment-16411836 ] Marco Gaido commented on SPARK-23782: - [~vanzin] sorry but I have not been able to find any JIRA related to this. Probably it is my fault and I just missed it. Anyway, it sounds pretty weird to me that listing applications doesn't need to be filtered. This seems like a security hole to me. I can't think of any reason/any tool which lets users list things they have no read permission for. > SHS should not show applications to user without read permission > > > Key: SPARK-23782 > URL: https://issues.apache.org/jira/browse/SPARK-23782 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > The History Server shows all the applications to all the users, even though > they have no permission to read them. They cannot read the details of the > applications they cannot access, but still anybody can list all the > applications submitted by all users. > For instance, if we have an admin user {{admin}} and two normal users {{u1}} > and {{u2}}, and each of them submitted one application, all of them can see > in the main page of SHS: > ||App ID||App Name|| ... ||Spark User|| ... || > |app-123456789|The Admin App| .. |admin| ... | > |app-123456790|u1 secret app| .. |u1| ... | > |app-123456791|u2 secret app| .. |u2| ... | > Then clicking on each application, the proper permissions are applied and > each user can see only the applications he has the read permission for. > Instead, each user should see only the applications he has the permission to > read and he/she should not be able to see applications he has not the > permissions for. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23782) SHS should not show applications to user without read permission
[ https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23782: Assignee: Apache Spark > SHS should not show applications to user without read permission > > > Key: SPARK-23782 > URL: https://issues.apache.org/jira/browse/SPARK-23782 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Apache Spark >Priority: Major > > The History Server shows all the applications to all the users, even though > they have no permission to read them. They cannot read the details of the > applications they cannot access, but still anybody can list all the > applications submitted by all users. > For instance, if we have an admin user {{admin}} and two normal users {{u1}} > and {{u2}}, and each of them submitted one application, all of them can see > in the main page of SHS: > ||App ID||App Name|| ... ||Spark User|| ... || > |app-123456789|The Admin App| .. |admin| ... | > |app-123456790|u1 secret app| .. |u1| ... | > |app-123456791|u2 secret app| .. |u2| ... | > Then clicking on each application, the proper permissions are applied and > each user can see only the applications he has the read permission for. > Instead, each user should see only the applications he has the permission to > read and he/she should not be able to see applications he has not the > permissions for. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23782) SHS should not show applications to user without read permission
[ https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411824#comment-16411824 ] Apache Spark commented on SPARK-23782: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/20891 > SHS should not show applications to user without read permission > > > Key: SPARK-23782 > URL: https://issues.apache.org/jira/browse/SPARK-23782 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > The History Server shows all the applications to all the users, even though > they have no permission to read them. They cannot read the details of the > applications they cannot access, but still anybody can list all the > applications submitted by all users. > For instance, if we have an admin user {{admin}} and two normal users {{u1}} > and {{u2}}, and each of them submitted one application, all of them can see > in the main page of SHS: > ||App ID||App Name|| ... ||Spark User|| ... || > |app-123456789|The Admin App| .. |admin| ... | > |app-123456790|u1 secret app| .. |u1| ... | > |app-123456791|u2 secret app| .. |u2| ... | > Then clicking on each application, the proper permissions are applied and > each user can see only the applications he has the read permission for. > Instead, each user should see only the applications he has the permission to > read and he/she should not be able to see applications he has not the > permissions for. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23782) SHS should not show applications to user without read permission
[ https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23782: Assignee: (was: Apache Spark) > SHS should not show applications to user without read permission > > > Key: SPARK-23782 > URL: https://issues.apache.org/jira/browse/SPARK-23782 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > The History Server shows all the applications to all the users, even though > they have no permission to read them. They cannot read the details of the > applications they cannot access, but still anybody can list all the > applications submitted by all users. > For instance, if we have an admin user {{admin}} and two normal users {{u1}} > and {{u2}}, and each of them submitted one application, all of them can see > in the main page of SHS: > ||App ID||App Name|| ... ||Spark User|| ... || > |app-123456789|The Admin App| .. |admin| ... | > |app-123456790|u1 secret app| .. |u1| ... | > |app-123456791|u2 secret app| .. |u2| ... | > Then clicking on each application, the proper permissions are applied and > each user can see only the applications he has the read permission for. > Instead, each user should see only the applications he has the permission to > read and he/she should not be able to see applications he has not the > permissions for. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23782) SHS should not show applications to user without read permission
[ https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411811#comment-16411811 ] Marcelo Vanzin commented on SPARK-23782: I'm pretty sure this was discussed before and the decision was that listing information does not need to be filtered per user, but I'm too lazy to do a jira search right now. > SHS should not show applications to user without read permission > > > Key: SPARK-23782 > URL: https://issues.apache.org/jira/browse/SPARK-23782 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > The History Server shows all the applications to all the users, even though > they have no permission to read them. They cannot read the details of the > applications they cannot access, but still anybody can list all the > applications submitted by all users. > For instance, if we have an admin user {{admin}} and two normal users {{u1}} > and {{u2}}, and each of them submitted one application, all of them can see > in the main page of SHS: > ||App ID||App Name|| ... ||Spark User|| ... || > |app-123456789|The Admin App| .. |admin| ... | > |app-123456790|u1 secret app| .. |u1| ... | > |app-123456791|u2 secret app| .. |u2| ... | > Then clicking on each application, the proper permissions are applied and > each user can see only the applications he has the read permission for. > Instead, each user should see only the applications he has the permission to > read and he/she should not be able to see applications he has not the > permissions for. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23782) SHS should not show applications to user without read permission
Marco Gaido created SPARK-23782: --- Summary: SHS should not show applications to user without read permission Key: SPARK-23782 URL: https://issues.apache.org/jira/browse/SPARK-23782 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.4.0 Reporter: Marco Gaido The History Server shows all the applications to all the users, even though they have no permission to read them. They cannot read the details of the applications they cannot access, but still anybody can list all the applications submitted by all users. For instance, if we have an admin user {{admin}} and two normal users {{u1}} and {{u2}}, and each of them submitted one application, all of them can see in the main page of SHS: ||App ID||App Name|| ... ||Spark User|| ... || |app-123456789|The Admin App| .. |admin| ... | |app-123456790|u1 secret app| .. |u1| ... | |app-123456791|u2 secret app| .. |u2| ... | Then clicking on each application, the proper permissions are applied and each user can see only the applications he has the read permission for. Instead, each user should see only the applications he has the permission to read and he/she should not be able to see applications he has not the permissions for. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
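A generic sketch of the kind of listing filter the report asks for (not the actual SHS code path; the helper and field names are made up):
{code:scala}
// Each entry in the history listing, reduced to the fields that matter here.
case class AppListingEntry(appId: String, name: String, sparkUser: String)

object ListingFilter {
  /** Keep only the applications the requesting user is allowed to read. */
  def visibleTo(entries: Seq[AppListingEntry], requester: String,
                canRead: (String, AppListingEntry) => Boolean): Seq[AppListingEntry] =
    entries.filter(entry => canRead(requester, entry))

  def main(args: Array[String]): Unit = {
    val all = Seq(
      AppListingEntry("app-123456789", "The Admin App", "admin"),
      AppListingEntry("app-123456790", "u1 secret app", "u1"),
      AppListingEntry("app-123456791", "u2 secret app", "u2"))
    // Toy ACL: users can read their own apps, admin can read everything.
    val canRead = (user: String, e: AppListingEntry) => user == "admin" || user == e.sparkUser
    println(visibleTo(all, "u1", canRead)) // only u1's app remains
  }
}
{code}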
[jira] [Commented] (SPARK-22342) refactor schedulerDriver registration
[ https://issues.apache.org/jira/browse/SPARK-22342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411780#comment-16411780 ] Susan X. Huynh commented on SPARK-22342: The multiple re-registration issue can lead to blacklisting and starvation when there are multiple executors per host. For example, suppose I have a host with 8 cpu, and I specify spark.executor.cores=4. Then 2 Executors could potentially get allocated on that host. If they both receive a TASK_LOST, that host will get blacklisted (since MAX_SLAVE_FAILURES=2). If this happens on every host, the app will be starved. I have hit this bug a lot when running on large machines (16-64 cpus) and specifying a small executor size, spark.executor.cores=4. > refactor schedulerDriver registration > - > > Key: SPARK-22342 > URL: https://issues.apache.org/jira/browse/SPARK-22342 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.2.0 >Reporter: Stavros Kontopoulos >Priority: Major > > This is an umbrella issue for working on: > https://github.com/apache/spark/pull/13143 > and handle the multiple re-registration issue which invalidates an offer. > To test: > dcos spark run --verbose --name=spark-nohive --submit-args="--driver-cores > 1 --conf spark.cores.max=1 --driver-memory 512M --class > org.apache.spark.examples.SparkPi http://.../spark-examples_2.11-2.2.0.jar"; > master log: > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3085 hierarchical.cpp:303] Added framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3085 hierarchical.cpp:412] Deactivated framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3090 hierarchical.cpp:380] Activated framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:3083] Framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed > over > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for > framework 'Spark Pi' at > scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:3083] Framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at 
scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed > over > I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 > I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi > with checkpointing disabled and capabilities [ ] > I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 > I1020 13:49:05.00 3087 master.cpp:3083] Framework > 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark > Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed > over > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037 > I1020 13:49:05.00 3087 master.cpp:9159] Removing offer > 9764beab-c90a-4b4f-b0ff-44c
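A tiny worked example of the starvation arithmetic described in the comment above (MAX_SLAVE_FAILURES = 2 is taken from the comment; the host size is illustrative):
{code:scala}
object BlacklistArithmetic {
  def main(args: Array[String]): Unit = {
    val hostCores        = 16 // one of the larger hosts mentioned in the comment
    val executorCores    = 4  // spark.executor.cores=4
    val executorsPerHost = hostCores / executorCores // 4 executors can land on one host

    val maxSlaveFailures = 2 // MAX_SLAVE_FAILURES, per the comment
    // With at least two executors per host, two TASK_LOST events on that host
    // are enough to blacklist it, even though it still has healthy capacity;
    // repeated across all hosts, the app is starved.
    val hostCanBeBlacklisted = executorsPerHost >= maxSlaveFailures
    println(s"executorsPerHost=$executorsPerHost blacklistable=$hostCanBeBlacklisted")
  }
}
{code}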
[jira] [Assigned] (SPARK-23759) Unable to bind Spark UI to specific host name / IP
[ https://issues.apache.org/jira/browse/SPARK-23759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23759: -- Assignee: Felix > Unable to bind Spark UI to specific host name / IP > -- > > Key: SPARK-23759 > URL: https://issues.apache.org/jira/browse/SPARK-23759 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.2.0 >Reporter: Felix >Assignee: Felix >Priority: Major > Fix For: 2.2.2, 2.3.1, 2.4.0 > > > Ideally, exporting SPARK_LOCAL_IP= in spark2 > environment should allow Spark2 History server to bind to private interface > however this is not working in spark 2.2.0 > > Spark2 history server still listens on 0.0.0.0 > {code:java} > [root@sparknode1 ~]# netstat -tulapn|grep 18081 > tcp0 0 0.0.0.0:18081 0.0.0.0:* > LISTEN 21313/java > tcp0 0 172.26.104.151:39126172.26.104.151:18081 > TIME_WAIT - > {code} > On earlier versions this change was working fine: > {code:java} > [root@dwphive1 ~]# netstat -tulapn|grep 18081 > tcp0 0 172.26.113.55:18081 0.0.0.0:* > LISTEN 2565/java > {code} > > This issue not only affects SHS but also Spark UI in general > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23759) Unable to bind Spark UI to specific host name / IP
[ https://issues.apache.org/jira/browse/SPARK-23759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23759. Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 2.2.2 Issue resolved by pull request 20883 [https://github.com/apache/spark/pull/20883] > Unable to bind Spark UI to specific host name / IP > -- > > Key: SPARK-23759 > URL: https://issues.apache.org/jira/browse/SPARK-23759 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.2.0 >Reporter: Felix >Priority: Major > Fix For: 2.2.2, 2.4.0, 2.3.1 > > > Ideally, exporting SPARK_LOCAL_IP= in spark2 > environment should allow Spark2 History server to bind to private interface > however this is not working in spark 2.2.0 > > Spark2 history server still listens on 0.0.0.0 > {code:java} > [root@sparknode1 ~]# netstat -tulapn|grep 18081 > tcp0 0 0.0.0.0:18081 0.0.0.0:* > LISTEN 21313/java > tcp0 0 172.26.104.151:39126172.26.104.151:18081 > TIME_WAIT - > {code} > On earlier versions this change was working fine: > {code:java} > [root@dwphive1 ~]# netstat -tulapn|grep 18081 > tcp0 0 172.26.113.55:18081 0.0.0.0:* > LISTEN 2565/java > {code} > > This issue not only affects SHS but also Spark UI in general > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23781) Merge YARN and Mesos token renewal code
Marcelo Vanzin created SPARK-23781: -- Summary: Merge YARN and Mesos token renewal code Key: SPARK-23781 URL: https://issues.apache.org/jira/browse/SPARK-23781 Project: Spark Issue Type: Improvement Components: YARN, Mesos Affects Versions: 2.4.0 Reporter: Marcelo Vanzin With the fix for SPARK-23361, the code that handles delegation tokens in Mesos and YARN ends up being very similar. We should refactor that code so that both backends share the same code, which would also make it easier for other cluster managers to use it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
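A hypothetical sketch of the kind of shared abstraction the refactoring could extract; the trait and method names are assumptions, not Spark's actual classes:
{code:scala}
import org.apache.hadoop.security.Credentials

// Hypothetical: token-renewal logic common to the YARN and Mesos backends.
trait HadoopDelegationTokenRenewer {
  /** Obtain fresh delegation tokens into `creds`; returns the next renewal time in ms. */
  def obtainTokens(creds: Credentials): Long

  /** Schedule the next renewal and the distribution of new tokens to running executors. */
  def scheduleRenewal(nextRenewalTimeMs: Long): Unit
}
{code}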
[jira] [Updated] (SPARK-23365) DynamicAllocation with failure in straggler task can lead to a hung spark job
[ https://issues.apache.org/jira/browse/SPARK-23365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-23365: - Description: Dynamic Allocation can lead to a spark app getting stuck with 0 executors requested when the executors in the last tasks of a taskset fail (eg. with an OOM). This happens when {{ExecutorAllocationManager}} s internal target number of executors gets out of sync with {{CoarseGrainedSchedulerBackend}} s target number. {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many tasks are active or pending in submitted stages, and computes how many executors would be needed for them. And as tasks finish, it will actively decrease that count, informing the {{CGSB}} along the way. (2) When it decides executors are inactive for long enough, then it requests that {{CGSB}} kill the executors -- this also tells the {{CGSB}} to update its target number of executors: https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622 So when there is just one task left, you could have the following sequence of events: (1) the {{EAM}} sets the desired number of executors to 1, and updates the {{CGSB}} too (2) while that final task is still running, the other executors cross the idle timeout, and the {{EAM}} requests the {{CGSB}} kill them (3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target of 0 executors If the final task completed normally now, everything would be OK; the next taskset would get submitted, the {{EAM}} would increase the target number of executors and it would update the {{CGSB}}. But if the executor for that final task failed (eg. an OOM), then the {{EAM}} thinks it [doesn't need to update anything|https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L384-L386], because its target is already 1, which is all it needs for that final task; and the {{CGSB}} doesn't update anything either since its target is 0. I think you can determine if this is the cause of a stuck app by looking for {noformat} yarn.YarnAllocator: Driver requested a total number of 0 executor(s). {noformat} in the logs of the ApplicationMaster (at least on yarn). You can reproduce this with this test app, run with {{--conf "spark.dynamicAllocation.minExecutors=1" --conf "spark.dynamicAllocation.maxExecutors=5" --conf "spark.dynamicAllocation.executorIdleTimeout=5s"}} {code} import org.apache.spark.SparkEnv sc.setLogLevel("INFO") sc.parallelize(1 to 1, 1000).count() val execs = sc.parallelize(1 to 1000, 1000).map { _ => SparkEnv.get.executorId}.collect().toSet val badExec = execs.head println("will kill exec " + badExec) new Thread() { override def run(): Unit = { Thread.sleep(1) println("about to kill exec " + badExec) sc.killExecutor(badExec) } }.start() sc.parallelize(1 to 5, 5).mapPartitions { itr => val exec = SparkEnv.get.executorId if (exec == badExec) { Thread.sleep(2) // long enough that all the other tasks finish, and the executors cross the idle timeout // meanwhile, something else should kill this executor itr } else { itr } }.collect() {code} was: Dynamic Allocation can lead to a spark app getting stuck with 0 executors requested when the executors in the last tasks of a taskset fail (eg. with an OOM). 
This happens when {{ExecutorAllocationManager}} s internal target number of executors gets out of sync with {{CoarseGrainedSchedulerBackend}} s target number. {{EAM}} updates the {{CGSB}} in two ways: (1) it tracks how many tasks are active or pending in submitted stages, and computes how many executors would be needed for them. And as tasks finish, it will actively decrease that count, informing the {{CGSB}} along the way. (2) When it decides executors are inactive for long enough, then it requests that {{CGSB}} kill the executors -- this also tells the {{CGSB}} to update its target number of executors: https://github.com/apache/spark/blob/4df84c3f818aa536515729b442601e08c253ed35/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L622 So when there is just one task left, you could have the following sequence of events: (1) the {{EAM}} sets the desired number of executors to 1, and updates the {{CGSB}} too (2) while that final task is still running, the other executors cross the idle timeout, and the {{EAM}} requests the {{CGSB}} kill them (3) now the {{EAM}} has a target of 1 executor, and the {{CGSB}} has a target of 0 executors If the final task completed normally now, everything would be OK; the next taskset would get submitted, the {{EAM}} would increase the target number of executors and it would update the {{CGSB}}. But if the ex
[jira] [Created] (SPARK-23780) Failed to use googleVis library with new SparkR
Ivan Dzikovsky created SPARK-23780: -- Summary: Failed to use googleVis library with new SparkR Key: SPARK-23780 URL: https://issues.apache.org/jira/browse/SPARK-23780 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.1 Reporter: Ivan Dzikovsky I've tried to use the googleVis library with Spark 2.2.1 and ran into a problem. Steps to reproduce: # Install R with googleVis library. # Run SparkR: {code} sparkR --master yarn --deploy-mode client {code} # Run code that uses googleVis: {code} library(googleVis) df=data.frame(country=c("US", "GB", "BR"), val1=c(10,13,14), val2=c(23,12,32)) Bar <- gvisBarChart(df) cat("%html ", Bar$html$chart) {code} Then I got the following error message: {code} Error : .onLoad failed in loadNamespace() for 'googleVis', details: call: rematchDefinition(definition, fdef, mnames, fnames, signature) error: methods can add arguments to the generic 'toJSON' only if '...' is an argument to the generic Error : package or namespace load failed for 'googleVis' {code} But the expected result is to get some HTML code output, as it was with Spark 2.1.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem
[ https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411503#comment-16411503 ] Cody Koeninger commented on SPARK-23739: I meant the version of the org.apache.kafka kafka-clients artifact, not the org.apache.spark artifact. You can also unzip the assembly jar, it should have a license file for kafka-clients that has the version number appended to it, and that way you can also verify that the class file for that class is in the assembly. > Spark structured streaming long running problem > --- > > Key: SPARK-23739 > URL: https://issues.apache.org/jira/browse/SPARK-23739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Florencio >Priority: Critical > Labels: spark, streaming, structured > > I had a problem with long running spark structured streaming in spark 2.1. > Caused by: java.lang.ClassNotFoundException: > org.apache.kafka.common.requests.LeaveGroupResponse. > The detailed error is the following: > 18/03/16 16:10:57 INFO StreamExecution: Committed offsets for batch 2110. > Metadata OffsetSeqMetadata(0,1521216656590) > 18/03/16 16:10:57 INFO KafkaSource: GetBatch called with start = > Some(\{"TopicName":{"2":5520197,"1":5521045,"3":5522054,"0":5527915}}), end = > \{"TopicName":{"2":5522730,"1":5523577,"3":5524586,"0":5530441}} > 18/03/16 16:10:57 INFO KafkaSource: Partitions added: Map() > 18/03/16 16:10:57 ERROR StreamExecution: Query [id = > a233b9ff-cc39-44d3-b953-a255986c04bf, runId = > 8520e3c0-2455-4ac1-9021-8518fb58b3f8] terminated with error > java.util.zip.ZipException: invalid code lengths set > at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) > at org.apache.spark.util.Utils$.copyStream(Utils.scala:362) > at > org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:45) > at > org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:83) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2101) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.map(RDD.scala:369) > at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:287) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:503) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:499) > at > 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > 18/03/16 16:10:57 ERROR ClientUtils: Failed to close coordinator > java.lang.NoClassDefFoundError: > org/apache/kafka/common/requests/LeaveGroupResponse > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendLeaveGroupRequest(AbstractCoordinator.java:575) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.maybeLeaveGroup(AbstractCoordinator.java:566) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:555) > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:377) > at org.apache.kaf
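In addition to unzipping the assembly jar as suggested above, the kafka-clients version that is actually on the classpath can be checked at runtime. The following is only a sketch under the assumption that it is run (e.g. from spark-shell) with the same classpath/--jars as the failing streaming job; AppInfoParser is the class kafka-clients uses to report its own version.

{code:scala}
// Sketch: print the kafka-clients version and the jar it was loaded from,
// assuming this runs on the same classpath as the streaming job.
import org.apache.kafka.common.utils.AppInfoParser

val kafkaClientsVersion = AppInfoParser.getVersion
val kafkaClientsJar = classOf[org.apache.kafka.clients.consumer.KafkaConsumer[_, _]]
  .getProtectionDomain.getCodeSource.getLocation
println(s"kafka-clients $kafkaClientsVersion loaded from $kafkaClientsJar")

// The class from the NoClassDefFoundError can be probed the same way.
val leaveGroupPresent =
  scala.util.Try(Class.forName("org.apache.kafka.common.requests.LeaveGroupResponse")).isSuccess
println(s"LeaveGroupResponse on classpath: $leaveGroupPresent")
{code}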
[jira] [Commented] (SPARK-23655) Add support for type aclitem (PostgresDialect)
[ https://issues.apache.org/jira/browse/SPARK-23655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411485#comment-16411485 ] Diego da Silva Colombo commented on SPARK-23655: [~maropu] I feel it too, it looks like an awkward type for Spark, but unfortunately the table I'm trying to access is managed by Postgres, so I cannot change its type. I've made a workaround by creating a custom JdbcDialect and returning a StringType from toCatalystType. It worked for my case, but if you try to access the aclItem column an exception still occurs. If you don't think this is useful for Spark, feel free to close the issue. Thanks for the attention. > Add support for type aclitem (PostgresDialect) > -- > > Key: SPARK-23655 > URL: https://issues.apache.org/jira/browse/SPARK-23655 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Diego da Silva Colombo >Priority: Major > > When I try to load the data of pg_database, an exception occurs: > `java.lang.RuntimeException: java.sql.SQLException: Unsupported type 2003` > It happens because the typeName of the column is *aclitem,* and there is no > match case for this type on toCatalystType -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
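For reference, a minimal sketch of the kind of custom-dialect workaround described in the comment above, using the public JdbcDialect/JdbcDialects API to map the Postgres aclitem type to StringType. The dialect name is made up for illustration, and it would need to be registered before the JDBC read.

{code:scala}
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

// Hypothetical dialect: map aclitem (and aclitem[] arrays, reported as type 2003 / ARRAY)
// to StringType and defer every other type to the built-in Postgres dialect.
object AclItemAsStringDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    if (typeName == "aclitem" || (sqlType == Types.ARRAY && typeName == "_aclitem")) Some(StringType)
    else None // fall through to the default mapping
  }
}

// Register before reading pg_database over JDBC.
JdbcDialects.registerDialect(AclItemAsStringDialect)
{code}

As noted in the comment, this only makes the schema resolvable; selecting the aclitem column itself may still fail, so the workaround mainly lets the rest of the table be queried.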
[jira] [Commented] (SPARK-22239) User-defined window functions with pandas udf
[ https://issues.apache.org/jira/browse/SPARK-22239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411465#comment-16411465 ] Li Jin commented on SPARK-22239: Yeah unbounded windows are really just "groupby" in this case. I need to think more about bounded windows. I will send out some doc/ideas before implementing bounded windows. For now I will just do unbounded windows. > User-defined window functions with pandas udf > - > > Key: SPARK-22239 > URL: https://issues.apache.org/jira/browse/SPARK-22239 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 > Environment: >Reporter: Li Jin >Priority: Major > > Window function is another place we can benefit from vectored udf and add > another useful function to the pandas_udf suite. > Example usage (preliminary): > {code:java} > w = Window.partitionBy('id').orderBy('time').rangeBetween(-200, 0) > @pandas_udf(DoubleType()) > def ema(v1): > return v1.ewm(alpha=0.5).mean().iloc[-1] > df.withColumn('v1_ema', ema(df.v1).over(window)) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
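To illustrate the "unbounded windows are really just groupby" observation with the existing Scala API (column names below are made up): over an unbounded frame every row in a partition sees the same aggregate, which is exactly a group-by aggregate joined back onto the input.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.mean

val spark = SparkSession.builder().appName("unbounded-window-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, 1.0), (1, 2.0), (2, 10.0)).toDF("id", "v")

// Unbounded frame: every row in an "id" partition sees the whole partition...
val w = Window.partitionBy("id")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val viaWindow = df.withColumn("v_mean", mean($"v").over(w))

// ...so the same result is a plain group-by aggregate joined back to the input.
val viaGroupBy = df.join(df.groupBy("id").agg(mean($"v").as("v_mean")), "id")

viaWindow.show()
viaGroupBy.show()
{code}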
[jira] [Commented] (SPARK-23779) TaskMemoryManager and UnsafeSorter use MemoryBlock
[ https://issues.apache.org/jira/browse/SPARK-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411425#comment-16411425 ] Apache Spark commented on SPARK-23779: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/20890 > TaskMemoryManager and UnsafeSorter use MemoryBlock > -- > > Key: SPARK-23779 > URL: https://issues.apache.org/jira/browse/SPARK-23779 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > This JIRA entry tries to use {{MemoryBlock}} in {TaskMemoryManager} and > classes related to {{UnsafeSorter}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23779) TaskMemoryManager and UnsafeSorter use MemoryBlock
[ https://issues.apache.org/jira/browse/SPARK-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23779: Assignee: Apache Spark > TaskMemoryManager and UnsafeSorter use MemoryBlock > -- > > Key: SPARK-23779 > URL: https://issues.apache.org/jira/browse/SPARK-23779 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark >Priority: Major > > This JIRA entry tries to use {{MemoryBlock}} in {TaskMemoryManager} and > classes related to {{UnsafeSorter}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23779) TaskMemoryManager and UnsafeSorter use MemoryBlock
[ https://issues.apache.org/jira/browse/SPARK-23779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23779: Assignee: (was: Apache Spark) > TaskMemoryManager and UnsafeSorter use MemoryBlock > -- > > Key: SPARK-23779 > URL: https://issues.apache.org/jira/browse/SPARK-23779 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > This JIRA entry tries to use {{MemoryBlock}} in {TaskMemoryManager} and > classes related to {{UnsafeSorter}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23779) TaskMemoryManager and UnsafeSorter use MemoryBlock
Kazuaki Ishizaki created SPARK-23779: Summary: TaskMemoryManager and UnsafeSorter use MemoryBlock Key: SPARK-23779 URL: https://issues.apache.org/jira/browse/SPARK-23779 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Kazuaki Ishizaki This JIRA entry tries to use {{MemoryBlock}} in {{TaskMemoryManager}} and classes related to {{UnsafeSorter}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
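For context on what the refactoring would standardize on, a small sketch of the MemoryBlock abstraction as it exists in 2.3.0: a block is essentially a base object plus offset and size that Platform reads and writes against, so the same code path covers on-heap and off-heap memory. These classes live in the spark-unsafe module and are internal, not a public API; this is illustration only.

{code:scala}
import org.apache.spark.unsafe.Platform
import org.apache.spark.unsafe.memory.{MemoryAllocator, MemoryBlock}

// Allocate a 64-byte on-heap block; MemoryBlock is just (base object, base offset, size).
val block: MemoryBlock = MemoryAllocator.HEAP.allocate(64)

// Platform addresses memory through the (object, offset) pair, which is what lets
// the same code work against both heap and off-heap blocks.
Platform.putLong(block.getBaseObject, block.getBaseOffset, 42L)
val value = Platform.getLong(block.getBaseObject, block.getBaseOffset)

MemoryAllocator.HEAP.free(block)
{code}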
[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem
[ https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411406#comment-16411406 ] Florencio commented on SPARK-23739: --- Thanks for the information. The kafka version is "org.apache.spark" % "spark-sql-kafka-0-10_2.11" % "2.1.0" > Spark structured streaming long running problem > --- > > Key: SPARK-23739 > URL: https://issues.apache.org/jira/browse/SPARK-23739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Florencio >Priority: Critical > Labels: spark, streaming, structured > > I had a problem with long running spark structured streaming in spark 2.1. > Caused by: java.lang.ClassNotFoundException: > org.apache.kafka.common.requests.LeaveGroupResponse. > The detailed error is the following: > 18/03/16 16:10:57 INFO StreamExecution: Committed offsets for batch 2110. > Metadata OffsetSeqMetadata(0,1521216656590) > 18/03/16 16:10:57 INFO KafkaSource: GetBatch called with start = > Some(\{"TopicName":{"2":5520197,"1":5521045,"3":5522054,"0":5527915}}), end = > \{"TopicName":{"2":5522730,"1":5523577,"3":5524586,"0":5530441}} > 18/03/16 16:10:57 INFO KafkaSource: Partitions added: Map() > 18/03/16 16:10:57 ERROR StreamExecution: Query [id = > a233b9ff-cc39-44d3-b953-a255986c04bf, runId = > 8520e3c0-2455-4ac1-9021-8518fb58b3f8] terminated with error > java.util.zip.ZipException: invalid code lengths set > at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) > at org.apache.spark.util.Utils$.copyStream(Utils.scala:362) > at > org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:45) > at > org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:83) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2101) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.map(RDD.scala:369) > at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:287) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:503) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:499) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > 18/03/16 16:10:57 ERROR ClientUtils: Failed to close coordinator > java.lang.NoClassDefFoundError: > org/apache/kafka/common/requests/LeaveGroupResponse > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendLeaveGroupRequest(AbstractCoordinator.java:575) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.maybeLeaveGroup(AbstractCoordinator.java:566) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:555) > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:377) > at org.apache.kafka.clients.ClientUtils.closeQuietly(ClientUtils.java:66) > at > org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1383) > at > org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.ja
[jira] [Comment Edited] (SPARK-23650) Slow SparkR udf (dapply)
[ https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411246#comment-16411246 ] Deepansh edited comment on SPARK-23650 at 3/23/18 1:43 PM: --- R environment inside the thread for applying UDF is not getting reused(i think cached is not the right word for this context). It is created and destroyed with each query. {code:R} kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = "10.117.172.48:9092", topic = "source") lines<- select(kafka, cast(kafka$value, "string")) schema<-schema(lines) library(caret) df4<-dapply(lines,function(x){ print(system.time(library(caret))) x },schema) q2 <- write.stream(df4,"kafka", checkpointLocation = loc, topic = "sink", kafka.bootstrap.servers = "10.117.172.48:9092") awaitTermination(q2) {code} For the above code, for every new stream my output is, 18/03/23 11:08:10 INFO BufferedStreamThread: Loading required package: lattice 18/03/23 11:08:10 INFO BufferedStreamThread: 18/03/23 11:08:10 INFO BufferedStreamThread: Attaching package: ‘lattice’ 18/03/23 11:08:10 INFO BufferedStreamThread: 18/03/23 11:08:10 INFO BufferedStreamThread: The following object is masked from ‘package:SparkR’: 18/03/23 11:08:10 INFO BufferedStreamThread: 18/03/23 11:08:10 INFO BufferedStreamThread: histogram 18/03/23 11:08:10 INFO BufferedStreamThread: 18/03/23 11:08:10 INFO BufferedStreamThread: Loading required package: ggplot2 18/03/23 11:08:12 INFO BufferedStreamThread:user system elapsed 18/03/23 11:08:12 INFO BufferedStreamThread: 1.937 0.062 1.999 18/03/23 11:08:12 INFO RRunner: Times: boot = 0.009 s, init = 0.017 s, broadcast = 0.001 s, read-input = 0.001 s, compute = 2.064 s, write-output = 0.001 s, total = 2.093 s PFA: rest log file. For every new coming stream, the packages are loaded again inside the thread, which means R environment inside the thread is not getting reused, it is created and destroyed every time. The model(iris model), on which I am testing requires caret package. So, when I use the readRDS method, caret package is also loaded, which adds an overhead of (~2s) every time. The same problem is with the broadcast. Broadcasting the model doesn't take time, but when it deserializes the model it loads caret package which adds 2s overhead. Ideally, the packages shouldn't load again. Is there a way around to this problem? was (Author: litup): R environment inside the thread for applying UDF is not getting cached. It is created and destroyed with each query. 
{code:R} kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = "10.117.172.48:9092", topic = "source") lines<- select(kafka, cast(kafka$value, "string")) schema<-schema(lines) library(caret) df4<-dapply(lines,function(x){ print(system.time(library(caret))) x },schema) q2 <- write.stream(df4,"kafka", checkpointLocation = loc, topic = "sink", kafka.bootstrap.servers = "10.117.172.48:9092") awaitTermination(q2) {code} For the above code, for every new stream my output is, 18/03/23 11:08:10 INFO BufferedStreamThread: Loading required package: lattice 18/03/23 11:08:10 INFO BufferedStreamThread: 18/03/23 11:08:10 INFO BufferedStreamThread: Attaching package: ‘lattice’ 18/03/23 11:08:10 INFO BufferedStreamThread: 18/03/23 11:08:10 INFO BufferedStreamThread: The following object is masked from ‘package:SparkR’: 18/03/23 11:08:10 INFO BufferedStreamThread: 18/03/23 11:08:10 INFO BufferedStreamThread: histogram 18/03/23 11:08:10 INFO BufferedStreamThread: 18/03/23 11:08:10 INFO BufferedStreamThread: Loading required package: ggplot2 18/03/23 11:08:12 INFO BufferedStreamThread:user system elapsed 18/03/23 11:08:12 INFO BufferedStreamThread: 1.937 0.062 1.999 18/03/23 11:08:12 INFO RRunner: Times: boot = 0.009 s, init = 0.017 s, broadcast = 0.001 s, read-input = 0.001 s, compute = 2.064 s, write-output = 0.001 s, total = 2.093 s PFA: rest log file. For every new coming stream, the packages are loaded again inside the thread, which means R environment inside the thread is not getting reused, it is created and destroyed every time. The model(iris model), on which I am testing requires caret package. So, when I use the readRDS method, caret package is also loaded, which adds an overhead of (~2s) every time. The same problem is with the broadcast. Broadcasting the model doesn't take time, but when it deserializes the model it loads caret package which adds 2s overhead. Ideally, the packages shouldn't load again. Is there a way around to this problem? > Slow SparkR udf (dapply) > > > Key: SPARK-23650 > URL: https://issues.apache.org/jira/browse/SPARK-23650 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, SparkR, Structured Streaming >Affects V
[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem
[ https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411396#comment-16411396 ] Cody Koeninger commented on SPARK-23739: What version of the org.apache.kafka artifact is in the assembly? > Spark structured streaming long running problem > --- > > Key: SPARK-23739 > URL: https://issues.apache.org/jira/browse/SPARK-23739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Florencio >Priority: Critical > Labels: spark, streaming, structured > > I had a problem with long running spark structured streaming in spark 2.1. > Caused by: java.lang.ClassNotFoundException: > org.apache.kafka.common.requests.LeaveGroupResponse. > The detailed error is the following: > 18/03/16 16:10:57 INFO StreamExecution: Committed offsets for batch 2110. > Metadata OffsetSeqMetadata(0,1521216656590) > 18/03/16 16:10:57 INFO KafkaSource: GetBatch called with start = > Some(\{"TopicName":{"2":5520197,"1":5521045,"3":5522054,"0":5527915}}), end = > \{"TopicName":{"2":5522730,"1":5523577,"3":5524586,"0":5530441}} > 18/03/16 16:10:57 INFO KafkaSource: Partitions added: Map() > 18/03/16 16:10:57 ERROR StreamExecution: Query [id = > a233b9ff-cc39-44d3-b953-a255986c04bf, runId = > 8520e3c0-2455-4ac1-9021-8518fb58b3f8] terminated with error > java.util.zip.ZipException: invalid code lengths set > at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) > at org.apache.spark.util.Utils$.copyStream(Utils.scala:362) > at > org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:45) > at > org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:83) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2101) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.map(RDD.scala:369) > at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:287) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:503) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:499) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > 18/03/16 16:10:57 ERROR ClientUtils: Failed to close coordinator > java.lang.NoClassDefFoundError: > org/apache/kafka/common/requests/LeaveGroupResponse > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendLeaveGroupRequest(AbstractCoordinator.java:575) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.maybeLeaveGroup(AbstractCoordinator.java:566) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:555) > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:377) > at org.apache.kafka.clients.ClientUtils.closeQuietly(ClientUtils.java:66) > at > org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1383) > at > org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1364) > at org.apache.spark.s
[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem
[ https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411377#comment-16411377 ] Marco Gaido commented on SPARK-23739: - [~zsxwing] [~joseph.torres] [~c...@koeninger.org] I am not very familiar with Structured Streaming, but this seems a bug to me. Do you have any idea? Thanks. > Spark structured streaming long running problem > --- > > Key: SPARK-23739 > URL: https://issues.apache.org/jira/browse/SPARK-23739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Florencio >Priority: Critical > Labels: spark, streaming, structured > > I had a problem with long running spark structured streaming in spark 2.1. > Caused by: java.lang.ClassNotFoundException: > org.apache.kafka.common.requests.LeaveGroupResponse. > The detailed error is the following: > 18/03/16 16:10:57 INFO StreamExecution: Committed offsets for batch 2110. > Metadata OffsetSeqMetadata(0,1521216656590) > 18/03/16 16:10:57 INFO KafkaSource: GetBatch called with start = > Some(\{"TopicName":{"2":5520197,"1":5521045,"3":5522054,"0":5527915}}), end = > \{"TopicName":{"2":5522730,"1":5523577,"3":5524586,"0":5530441}} > 18/03/16 16:10:57 INFO KafkaSource: Partitions added: Map() > 18/03/16 16:10:57 ERROR StreamExecution: Query [id = > a233b9ff-cc39-44d3-b953-a255986c04bf, runId = > 8520e3c0-2455-4ac1-9021-8518fb58b3f8] terminated with error > java.util.zip.ZipException: invalid code lengths set > at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) > at org.apache.spark.util.Utils$.copyStream(Utils.scala:362) > at > org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:45) > at > org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:83) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2101) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.map(RDD.scala:369) > at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:287) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:503) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:499) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > 18/03/16 16:10:57 ERROR ClientUtils: Failed to close coordinator > java.lang.NoClassDefFoundError: > org/apache/kafka/common/requests/LeaveGroupResponse > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendLeaveGroupRequest(AbstractCoordinator.java:575) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.maybeLeaveGroup(AbstractCoordinator.java:566) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:555) > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:377) > at org.apache.kafka.clients.ClientUtils.closeQuietly(ClientUtils.java:66) > at > org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1383) > at > org.apache.kafka
[jira] [Updated] (SPARK-23778) SparkContext.emptyRDD confuses SparkContext.union
[ https://issues.apache.org/jira/browse/SPARK-23778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefano Pettini updated SPARK-23778: Attachment: as_it_should_be.png > SparkContext.emptyRDD confuses SparkContext.union > - > > Key: SPARK-23778 > URL: https://issues.apache.org/jira/browse/SPARK-23778 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 2.3.0 >Reporter: Stefano Pettini >Priority: Minor > Attachments: as_it_should_be.png, > partitioner_lost_and_unneeded_extra_stage.png > > > SparkContext.emptyRDD is an unpartitioned RDD. Clearly it's empty so whether > it's partitioned or not should be just a academic debate. Unfortunately it > doesn't seem to be like this and the issue has side effects. > Namely, it confuses the RDD union. > When there are N classic RDDs partitioned the same way, the union is > implemented with the optimized PartitionerAwareUnionRDD, that retains the > common partitioner in the result. If one of the N RDDs happens to be an > emptyRDD, as it doesn't have a partitioner, the union is implemented by just > appending all the partitions of the N RDDs, dropping the partitioner. But > there's no need for this, as the emptyRDD contains no elements. This results > in further unneeded shuffles once the result of the union is used. > See for example: > {{val p = new HashPartitioner(3)}} > {{val a = context.parallelize(List(10, 20, 30, 40, 50)).keyBy(_ / > 10).partitionBy(p)}} > {{val b1 = a.mapValues(_ + 1)}} > {{val b2 = a.mapValues(_ - 1)}} > {{val e = context.emptyRDD[(Int, Int)]}} > {{val x = context.union(a, b1, b2, e)}} > {{val y = x.reduceByKey(_ + _)}} > {{assert(x.partitioner.contains(p))}} > {{y.collect()}} > The assert fails. Disabling it, it's possible to see that reduceByKey > introduced a shuffles, although all the input RDDs are already partitioned > the same way, but the emptyRDD. > Forcing a partitioner on the emptyRDD: > {{val e = context.emptyRDD[(Int, Int)].partitionBy(p)}} > solves the problem with the assert and doesn't introduce the unneeded extra > stage and shuffle. > Union implementation should be changed to ignore the partitioner of emptyRDDs > and consider those as _partitioned in a way compatible with any partitioner_, > basically ignoring them. > Present since 1.3 at least. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23778) SparkContext.emptyRDD confuses SparkContext.union
Stefano Pettini created SPARK-23778: --- Summary: SparkContext.emptyRDD confuses SparkContext.union Key: SPARK-23778 URL: https://issues.apache.org/jira/browse/SPARK-23778 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0, 1.3.0 Reporter: Stefano Pettini Attachments: partitioner_lost_and_unneeded_extra_stage.png SparkContext.emptyRDD is an unpartitioned RDD. Clearly it's empty, so whether it's partitioned or not should be just an academic debate. Unfortunately this doesn't seem to be the case, and the issue has side effects. Namely, it confuses the RDD union. When there are N classic RDDs partitioned the same way, the union is implemented with the optimized PartitionerAwareUnionRDD, which retains the common partitioner in the result. If one of the N RDDs happens to be an emptyRDD, as it doesn't have a partitioner, the union is implemented by just appending all the partitions of the N RDDs, dropping the partitioner. But there's no need for this, as the emptyRDD contains no elements. This results in further unneeded shuffles once the result of the union is used. See for example: {{val p = new HashPartitioner(3)}} {{val a = context.parallelize(List(10, 20, 30, 40, 50)).keyBy(_ / 10).partitionBy(p)}} {{val b1 = a.mapValues(_ + 1)}} {{val b2 = a.mapValues(_ - 1)}} {{val e = context.emptyRDD[(Int, Int)]}} {{val x = context.union(a, b1, b2, e)}} {{val y = x.reduceByKey(_ + _)}} {{assert(x.partitioner.contains(p))}} {{y.collect()}} The assert fails. Disabling it, it's possible to see that reduceByKey introduced a shuffle, although all the input RDDs except the emptyRDD are already partitioned the same way. Forcing a partitioner on the emptyRDD: {{val e = context.emptyRDD[(Int, Int)].partitionBy(p)}} solves the problem with the assert and doesn't introduce the unneeded extra stage and shuffle. The union implementation should be changed to ignore the partitioner of emptyRDDs and consider those as _partitioned in a way compatible with any partitioner_, basically ignoring them. Present since 1.3 at least. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
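A self-contained version of the repro above with the described workaround applied (app name and master are placeholders): giving the empty RDD an explicit partitioner keeps the union partition-aware, so the assert holds and reduceByKey adds no extra shuffle stage.

{code:scala}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("emptyrdd-union-sketch").setMaster("local[*]"))

val p = new HashPartitioner(3)
val a = sc.parallelize(List(10, 20, 30, 40, 50)).keyBy(_ / 10).partitionBy(p)
val b1 = a.mapValues(_ + 1)
val b2 = a.mapValues(_ - 1)

// Workaround: partition the empty RDD explicitly instead of leaving it unpartitioned.
val e = sc.emptyRDD[(Int, Int)].partitionBy(p)

val x = sc.union(a, b1, b2, e)
assert(x.partitioner.contains(p)) // holds with the workaround

val y = x.reduceByKey(_ + _)      // no extra shuffle, since x already carries partitioner p
y.collect()
{code}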
[jira] [Updated] (SPARK-23778) SparkContext.emptyRDD confuses SparkContext.union
[ https://issues.apache.org/jira/browse/SPARK-23778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefano Pettini updated SPARK-23778: Attachment: partitioner_lost_and_unneeded_extra_stage.png > SparkContext.emptyRDD confuses SparkContext.union > - > > Key: SPARK-23778 > URL: https://issues.apache.org/jira/browse/SPARK-23778 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 2.3.0 >Reporter: Stefano Pettini >Priority: Minor > Attachments: partitioner_lost_and_unneeded_extra_stage.png > > > SparkContext.emptyRDD is an unpartitioned RDD. Clearly it's empty so whether > it's partitioned or not should be just a academic debate. Unfortunately it > doesn't seem to be like this and the issue has side effects. > Namely, it confuses the RDD union. > When there are N classic RDDs partitioned the same way, the union is > implemented with the optimized PartitionerAwareUnionRDD, that retains the > common partitioner in the result. If one of the N RDDs happens to be an > emptyRDD, as it doesn't have a partitioner, the union is implemented by just > appending all the partitions of the N RDDs, dropping the partitioner. But > there's no need for this, as the emptyRDD contains no elements. This results > in further unneeded shuffles once the result of the union is used. > See for example: > {{val p = new HashPartitioner(3)}} > {{val a = context.parallelize(List(10, 20, 30, 40, 50)).keyBy(_ / > 10).partitionBy(p)}} > {{val b1 = a.mapValues(_ + 1)}} > {{val b2 = a.mapValues(_ - 1)}} > {{val e = context.emptyRDD[(Int, Int)]}} > {{val x = context.union(a, b1, b2, e)}} > {{val y = x.reduceByKey(_ + _)}} > {{assert(x.partitioner.contains(p))}} > {{y.collect()}} > The assert fails. Disabling it, it's possible to see that reduceByKey > introduced a shuffles, although all the input RDDs are already partitioned > the same way, but the emptyRDD. > Forcing a partitioner on the emptyRDD: > {{val e = context.emptyRDD[(Int, Int)].partitionBy(p)}} > solves the problem with the assert and doesn't introduce the unneeded extra > stage and shuffle. > Union implementation should be changed to ignore the partitioner of emptyRDDs > and consider those as _partitioned in a way compatible with any partitioner_, > basically ignoring them. > Present since 1.3 at least. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23769) Remove unnecessary scalastyle check disabling
[ https://issues.apache.org/jira/browse/SPARK-23769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23769. -- Resolution: Fixed Assignee: Riaas Mokiem Fix Version/s: 2.4.0 2.3.1 Fixed in https://github.com/apache/spark/pull/20880 > Remove unnecessary scalastyle check disabling > - > > Key: SPARK-23769 > URL: https://issues.apache.org/jira/browse/SPARK-23769 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Riaas Mokiem >Assignee: Riaas Mokiem >Priority: Minor > Fix For: 2.3.1, 2.4.0 > > > In `org/apache/spark/util/CompletionIterator.scala` the Scalastyle checker is > disabled for 1 line of code. However, this line of code doesn't seem to > violate any of the currently active rules for Scalastyle. So the Scalastyle > checker doesn't need to be disabled there. > I've tested this by removing the comments that disable the checker and > running the checker with `build/mv scalastyle:check`. With the comments > removed (so with the checker active for that line) the build still succeeds > and no violations are shown in `core/target/scalastyle-output.xml`. I'll > create a pull request to remove the comments that disable the checker. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23777) Missing DAG arrows between stages
Stefano Pettini created SPARK-23777: --- Summary: Missing DAG arrows between stages Key: SPARK-23777 URL: https://issues.apache.org/jira/browse/SPARK-23777 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.0, 1.3.0 Reporter: Stefano Pettini Attachments: Screenshot-2018-3-23 RDDTestApp - Details for Job 0.png In the Spark UI DAGs, sometimes there are missing arrows between stages. It seems to happen when the same RDD is shuffled twice. For example in this case: {{val a = context.parallelize(List(10, 20, 30, 40, 50)).keyBy(_ / 10)}} {{val b = a join a}} {{b.collect()}} There's a missing arrow from stage 1 to 2. _This is an old one, since 1.3.0 at least, still reproducible in 2.3.0._ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23777) Missing DAG arrows between stages
[ https://issues.apache.org/jira/browse/SPARK-23777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefano Pettini updated SPARK-23777: Attachment: Screenshot-2018-3-23 RDDTestApp - Details for Job 0.png > Missing DAG arrows between stages > - > > Key: SPARK-23777 > URL: https://issues.apache.org/jira/browse/SPARK-23777 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0, 2.3.0 >Reporter: Stefano Pettini >Priority: Trivial > Attachments: Screenshot-2018-3-23 RDDTestApp - Details for Job 0.png > > > In the Spark UI DAGs, sometimes there are missing arrows between stages. It > seems to happen when the same RDD is shuffled twice. > For example in this case: > {{val a = context.parallelize(List(10, 20, 30, 40, 50)).keyBy(_ / 10)}} > {{val b = a join a}} > {{b.collect()}} > There's a missing arrow from stage 1 to 2. > _This is an old one, since 1.3.0 at least, still reproducible in 2.3.0._ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23650) Slow SparkR udf (dapply)
[ https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deepansh updated SPARK-23650: - Attachment: packageReload.txt > Slow SparkR udf (dapply) > > > Key: SPARK-23650 > URL: https://issues.apache.org/jira/browse/SPARK-23650 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, SparkR, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Deepansh >Priority: Major > Attachments: packageReload.txt, read_model_in_udf.txt, > sparkR_log2.txt, sparkRlag.txt > > > For eg, I am getting streams from Kafka and I want to implement a model made > in R for those streams. For this, I am using dapply. > My code is: > iris_model <- readRDS("./iris_model.rds") > randomBr <- SparkR:::broadcast(sc, iris_model) > kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = > "localhost:9092", topic = "source") > lines<- select(kafka, cast(kafka$value, "string")) > schema<-schema(lines) > df1<-dapply(lines,function(x){ > i_model<-SparkR:::value(randomMatBr) > for (row in 1:nrow(x)) > { y<-fromJSON(as.character(x[row,"value"])) y$predict=predict(i_model,y) > y<-toJSON(y) x[row,"value"] = y } > x > },schema) > Every time when Kafka streams are fetched the dapply method creates new > runner thread and ships the variables again, which causes a huge lag(~2s for > shipping model) every time. I even tried without broadcast variables but it > takes same time to ship variables. Can some other techniques be applied to > improve its performance? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23650) Slow SparkR udf (dapply)
[ https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411246#comment-16411246 ] Deepansh commented on SPARK-23650: -- R environment inside the thread for applying UDF is not getting cached. It is created and destroyed with each query. {code:R} kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = "10.117.172.48:9092", topic = "source") lines<- select(kafka, cast(kafka$value, "string")) schema<-schema(lines) library(caret) df4<-dapply(lines,function(x){ print(system.time(library(caret))) x },schema) q2 <- write.stream(df4,"kafka", checkpointLocation = loc, topic = "sink", kafka.bootstrap.servers = "10.117.172.48:9092") awaitTermination(q2) {code} For the above code, for every new stream my output is, 18/03/23 11:08:10 INFO BufferedStreamThread: Loading required package: lattice 18/03/23 11:08:10 INFO BufferedStreamThread: 18/03/23 11:08:10 INFO BufferedStreamThread: Attaching package: ‘lattice’ 18/03/23 11:08:10 INFO BufferedStreamThread: 18/03/23 11:08:10 INFO BufferedStreamThread: The following object is masked from ‘package:SparkR’: 18/03/23 11:08:10 INFO BufferedStreamThread: 18/03/23 11:08:10 INFO BufferedStreamThread: histogram 18/03/23 11:08:10 INFO BufferedStreamThread: 18/03/23 11:08:10 INFO BufferedStreamThread: Loading required package: ggplot2 18/03/23 11:08:12 INFO BufferedStreamThread:user system elapsed 18/03/23 11:08:12 INFO BufferedStreamThread: 1.937 0.062 1.999 18/03/23 11:08:12 INFO RRunner: Times: boot = 0.009 s, init = 0.017 s, broadcast = 0.001 s, read-input = 0.001 s, compute = 2.064 s, write-output = 0.001 s, total = 2.093 s PFA: rest log file. For every new coming stream, the packages are loaded again inside the thread, which means R environment inside the thread is not getting reused, it is created and destroyed every time. The model(iris model), on which I am testing requires caret package. So, when I use the readRDS method, caret package is also loaded, which adds an overhead of (~2s) every time. The same problem is with the broadcast. Broadcasting the model doesn't take time, but when it deserializes the model it loads caret package which adds 2s overhead. Ideally, the packages shouldn't load again. Is there a way around to this problem? > Slow SparkR udf (dapply) > > > Key: SPARK-23650 > URL: https://issues.apache.org/jira/browse/SPARK-23650 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, SparkR, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Deepansh >Priority: Major > Attachments: read_model_in_udf.txt, sparkR_log2.txt, sparkRlag.txt > > > For eg, I am getting streams from Kafka and I want to implement a model made > in R for those streams. For this, I am using dapply. > My code is: > iris_model <- readRDS("./iris_model.rds") > randomBr <- SparkR:::broadcast(sc, iris_model) > kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = > "localhost:9092", topic = "source") > lines<- select(kafka, cast(kafka$value, "string")) > schema<-schema(lines) > df1<-dapply(lines,function(x){ > i_model<-SparkR:::value(randomMatBr) > for (row in 1:nrow(x)) > { y<-fromJSON(as.character(x[row,"value"])) y$predict=predict(i_model,y) > y<-toJSON(y) x[row,"value"] = y } > x > },schema) > Every time when Kafka streams are fetched the dapply method creates new > runner thread and ships the variables again, which causes a huge lag(~2s for > shipping model) every time. 
I even tried without broadcast variables but it > takes same time to ship variables. Can some other techniques be applied to > improve its performance? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410939#comment-16410939 ] Gabor Somogyi commented on SPARK-23685: --- Jira assignment is not required. You can write a comment when you're working on a jira but your PR is not ready. Or a PR makes it also clear that you're dealing with it. > Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive > Offsets (i.e. Log Compaction) > - > > Key: SPARK-23685 > URL: https://issues.apache.org/jira/browse/SPARK-23685 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: sirisha >Priority: Major > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will be frequently not be offset+1. The logic in > KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always > be just an increment of 1 .If not, it throws the below exception: > > "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). > Some data may have been lost because they are not available in Kafka any > more; either the data was aged out by Kafka or the topic may have been > deleted before all the data in the topic was processed. If you don't want > your streaming query to fail on such cases, set the source option > "failOnDataLoss" to "false". " > > FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
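For anyone hitting this on a compacted topic before a proper fix lands, the option named in the error message can at least keep the query alive. A minimal sketch follows (broker address and topic name are placeholders); note that this merely tolerates the offset gaps as "data loss" rather than handling log compaction correctly, which is what this issue asks for.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compacted-topic-sketch").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "compacted-topic")            // placeholder
  .option("failOnDataLoss", "false")                 // don't fail the query on non-consecutive offsets
  .load()
{code}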
[jira] [Commented] (SPARK-23734) InvalidSchemaException While Saving ALSModel
[ https://issues.apache.org/jira/browse/SPARK-23734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410914#comment-16410914 ] Liang-Chi Hsieh commented on SPARK-23734: - I use the latest master branch and can't reproduce the reported issue. > InvalidSchemaException While Saving ALSModel > > > Key: SPARK-23734 > URL: https://issues.apache.org/jira/browse/SPARK-23734 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 > Environment: macOS 10.13.2 > Scala 2.11.8 > Spark 2.3.0 >Reporter: Stanley Poon >Priority: Major > Labels: ALS, parquet, persistence > > After fitting an ALSModel, get following error while saving the model: > Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can > not be empty. Parquet does not support empty group without leaves. Empty > group: spark_schema > Exactly the same code ran ok on 2.2.1. > Same issue also occurs on other ALSModels we have. > h2. *To reproduce* > Get ALSExample: > [https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala] > and add the following line to save the model right before "spark.stop". > {quote} model.write.overwrite().save("SparkExampleALSModel") > {quote} > h2. Stack Trace > Exception in thread "main" java.lang.ExceptionInInitializerError > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:112) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225) > at > org.apache.spark.ml.recommendation.ALSModel$ALSModelWriter.saveImpl(ALS.scala:510) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:103) > at com.vitalmove.model.ALSExample$.main(ALSExample.scala:83) > at com.vitalmove.model.ALSExample.main(ALSExample.scala) > Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can > not be empty. Parquet does not support empty group without leaves. Empty > group: spark_schema > at org.apache.parquet.schema.GroupType.(GroupType.java:92) > at org.apache.parquet.schema.GroupType.(GroupType.java:48) > at org.apache.parquet.schema.MessageType.(MessageType.java:50) > at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.(ParquetSchemaConverter.scala:567) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.(ParquetSchemaConverter.scala) > -- This
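A condensed, self-contained version of the reproduction described above, for checking the save path without the full ALSExample. The tiny in-line ratings and column names stand in for the MovieLens sample data the example uses.

{code:scala}
import org.apache.spark.ml.recommendation.{ALS, ALSModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("als-save-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Tiny made-up ratings instead of the MovieLens sample used by ALSExample.
val ratings = Seq((0, 0, 4.0f), (0, 1, 2.0f), (1, 0, 3.0f), (1, 1, 5.0f))
  .toDF("userId", "movieId", "rating")

val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
val model = als.fit(ratings)

// The line reported to fail on 2.3.0 (and to work on 2.2.1):
model.write.overwrite().save("SparkExampleALSModel")
val restored = ALSModel.load("SparkExampleALSModel")
{code}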