[jira] [Created] (SPARK-26533) Support query auto cancel on thriftserver
zhoukang created SPARK-26533: Summary: Support query auto cancel on thriftserver Key: SPARK-26533 URL: https://issues.apache.org/jira/browse/SPARK-26533 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: zhoukang

Support automatically cancelling queries that run too long on the Thrift server. In some cases we use the Thrift server as a long-running application, and we sometimes want no query to run longer than a given time. In these cases we can enable auto-cancel for time-consuming queries, which lets us release resources for other queries to run.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
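As a sketch of the requested behaviour, a watchdog that cancels a query once it exceeds a time budget could look like the following (Python for brevity; `run_with_timeout`, `query_fn`, and `cancel_fn` are hypothetical names — the real change would hook into the Thrift server's operation manager):

```python
import threading

def run_with_timeout(query_fn, cancel_fn, timeout_s):
    """Run query_fn; if it has not finished within timeout_s seconds,
    invoke cancel_fn. Returns (result, was_cancelled)."""
    done = threading.Event()
    cancelled = []

    def watchdog():
        # Wait until the query signals completion or the budget expires.
        if not done.wait(timeout_s):
            cancelled.append(True)
            cancel_fn()  # ask the query to stop (cooperative cancel)

    w = threading.Thread(target=watchdog)
    w.start()
    try:
        result = query_fn()
    finally:
        done.set()
        w.join()
    return result, bool(cancelled)
```

A query that finishes within the budget returns normally; a long-running one gets its cancel callback invoked, after which the Thrift server could release the query's resources.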
[jira] [Commented] (SPARK-26126) Should put scala-library deps into root pom instead of spark-tags module
[ https://issues.apache.org/jira/browse/SPARK-26126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733848#comment-16733848 ] Hyukjin Kwon commented on SPARK-26126: -- There are sometimes reasons behind this, and it takes a lot of effort to check the history of old code in particular. For instance, some modules may intentionally not need {{spark-tags}} and use module-specific annotations instead. If you can track the history, check why it was added there, and can explain why changing it is correct, please go ahead with a PR. Don't forget to double-check the SBT build as well. Otherwise, let's just leave this resolved.
> Should put scala-library deps into root pom instead of spark-tags module
>
> Key: SPARK-26126
> URL: https://issues.apache.org/jira/browse/SPARK-26126
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.0, 2.4.0
> Reporter: liupengcheng
> Priority: Minor
>
> While doing some backports in our custom Spark, I noticed some strange code in the
> spark-tags module:
> {code:java}
> <dependency>
>   <groupId>org.scala-lang</groupId>
>   <artifactId>scala-library</artifactId>
>   <version>${scala.version}</version>
> </dependency>
> {code}
> As I understand it, shouldn't spark-tags only contain annotation-related classes
> and deps?
> Should we put the scala-library deps in the root pom?
[jira] [Commented] (SPARK-26339) Behavior of reading files that start with underscore is confusing
[ https://issues.apache.org/jira/browse/SPARK-26339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733844#comment-16733844 ] Apache Spark commented on SPARK-26339: -- User 'KeiichiHirobe' has created a pull request for this issue: https://github.com/apache/spark/pull/23446
> Behavior of reading files that start with underscore is confusing
>
> Key: SPARK-26339
> URL: https://issues.apache.org/jira/browse/SPARK-26339
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Keiichi Hirobe
> Assignee: Keiichi Hirobe
> Priority: Minor
>
> The behavior of reading files that start with an underscore is as follows.
> # spark.read (no schema) throws an exception whose message is confusing.
> # spark.read (user-specified schema) successfully reads, but the content is empty.
> Example files are as follows.
> The same behavior occurred when I read JSON files.
> {code:bash}
> $ cat test.csv
> test1,10
> test2,20
> $ cp test.csv _test.csv
> $ ./bin/spark-shell --master local[2]
> {code}
> Spark shell snippet for reproduction:
> {code:java}
> scala> val df=spark.read.csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string]
> scala> df.show()
> +-----+---+
> |  _c0|_c1|
> +-----+---+
> |test1| 10|
> |test2| 20|
> +-----+---+
> scala> val df = spark.read.schema("test STRING, number INT").csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [test: string, number: int]
> scala> df.show()
> +-----+------+
> | test|number|
> +-----+------+
> |test1|    10|
> |test2|    20|
> +-----+------+
> scala> val df=spark.read.csv("_test.csv")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.;
> at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$13(DataSource.scala:185)
> at scala.Option.getOrElse(Option.scala:138)
> at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:185)
> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
> at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:625)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:478)
> ... 49 elided
> scala> val df=spark.read.schema("test STRING, number INT").csv("_test.csv")
> df: org.apache.spark.sql.DataFrame = [test: string, number: int]
> scala> df.show()
> +----+------+
> |test|number|
> +----+------+
> +----+------+
> {code}
> I noticed that Spark cannot read files that start with an underscore after I
> read some of the code. (I could not find any documentation about this file-name limitation.)
> The behavior above is not good, especially in the user-specified-schema case, I think.
> I suggest throwing an exception whose message is "Path does not exist" in both
> cases.
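For context, Spark's file listing (via Hadoop's default path filter) skips names starting with `_` or `.`, which is why `_test.csv` is silently invisible. The sketch below mimics that filter plus the reporter's suggested fail-fast check (Python; `resolve_input_paths` and its error message are hypothetical, not Spark's actual code):

```python
import os

def is_visible(path):
    """Mirror of Hadoop's default hidden-file filter: names starting
    with '_' or '.' are skipped during listing (sketch, not Spark's code)."""
    name = os.path.basename(path.rstrip("/"))
    return not name.startswith(("_", "."))

def resolve_input_paths(paths):
    """Hypothetical helper implementing the suggestion: fail fast with a
    clear 'Path does not exist'-style error instead of silently
    returning an empty result when every matched file is hidden."""
    visible = [p for p in paths if is_visible(p)]
    if not visible:
        raise FileNotFoundError(
            "Path does not exist (all matched files are hidden): %s" % paths)
    return visible
```

With such a check, both the no-schema and user-specified-schema cases would fail with the same clear message instead of diverging.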
[jira] [Comment Edited] (SPARK-26126) Should put scala-library deps into root pom instead of spark-tags module
[ https://issues.apache.org/jira/browse/SPARK-26126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733836#comment-16733836 ] liupengcheng edited comment on SPARK-26126 at 1/4/19 5:41 AM: -- [~hyukjin.kwon] Yes, there is no actual problem happening, but putting the scala-library dependency into the spark-tags module is confusing. If you agree that we should put it into the root pom for better understanding, I can open a PR for this issue, or you can just leave it resolved.
was (Author: liupengcheng): [~hyukjin.kwon] Yes, there is no actual problem happening, but put the dependency of scala-library into spark-tags module is confusing. If you agree that we shall put it into root pom for better understanding, I can put a PR for this issue.
> Should put scala-library deps into root pom instead of spark-tags module
>
> Key: SPARK-26126
> URL: https://issues.apache.org/jira/browse/SPARK-26126
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.0, 2.4.0
> Reporter: liupengcheng
> Priority: Minor
>
> While doing some backports in our custom Spark, I noticed some strange code in the
> spark-tags module:
> {code:java}
> <dependency>
>   <groupId>org.scala-lang</groupId>
>   <artifactId>scala-library</artifactId>
>   <version>${scala.version}</version>
> </dependency>
> {code}
> As I understand it, shouldn't spark-tags only contain annotation-related classes
> and deps?
> Should we put the scala-library deps in the root pom?
[jira] [Commented] (SPARK-26126) Should put scala-library deps into root pom instead of spark-tags module
[ https://issues.apache.org/jira/browse/SPARK-26126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733836#comment-16733836 ] liupengcheng commented on SPARK-26126: -- [~hyukjin.kwon] Yes, there is no actual problem happening, but putting the scala-library dependency into the spark-tags module is confusing. If you agree that we should put it into the root pom for better understanding, I can open a PR for this issue.
> Should put scala-library deps into root pom instead of spark-tags module
>
> Key: SPARK-26126
> URL: https://issues.apache.org/jira/browse/SPARK-26126
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.0, 2.4.0
> Reporter: liupengcheng
> Priority: Minor
>
> While doing some backports in our custom Spark, I noticed some strange code in the
> spark-tags module:
> {code:java}
> <dependency>
>   <groupId>org.scala-lang</groupId>
>   <artifactId>scala-library</artifactId>
>   <version>${scala.version}</version>
> </dependency>
> {code}
> As I understand it, shouldn't spark-tags only contain annotation-related classes
> and deps?
> Should we put the scala-library deps in the root pom?
[jira] [Updated] (SPARK-26532) repartitionByRange reads source files twice
[ https://issues.apache.org/jira/browse/SPARK-26532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Dias updated SPARK-26532: -- Component/s: SQL
> repartitionByRange reads source files twice
>
> Key: SPARK-26532
> URL: https://issues.apache.org/jira/browse/SPARK-26532
> Project: Spark
> Issue Type: Bug
> Components: SQL, Structured Streaming
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Mike Dias
> Priority: Minor
> Attachments: repartition Stages.png, repartitionByRange Stages.png
>
> When using repartitionByRange in the Structured Streaming API to read and then write
> files, the source files are read twice.
> Example:
> {code:java}
> val ds = spark.readStream.
>   format("text").
>   option("path", "data/streaming").
>   load
> val q = ds.
>   repartitionByRange(10, $"value").
>   writeStream.
>   format("parquet").
>   option("path", "/tmp/output").
>   option("checkpointLocation", "/tmp/checkpoint").
>   start()
> {code}
> This execution creates 3 stages: 2 for reading and 1 for writing, reading the
> source twice. This is easy to see on a large dataset, where the reading
> time is doubled.
> {code:java}
> $ curl -s -XGET
> http://localhost:4040/api/v1/applications//stages
> {code}
> This is very different from the repartition strategy, which creates 2 stages:
> 1 for reading and 1 for writing.
> {code:java}
> val ds = spark.readStream.
>   format("text").
>   option("path", "data/streaming").
>   load
> val q = ds.
>   repartition(10, $"value").
>   writeStream.
>   format("parquet").
>   option("path", "/tmp/output").
>   option("checkpointLocation", "/tmp/checkpoint").
>   start()
> {code}
[jira] [Updated] (SPARK-26532) repartitionByRange reads source files twice
[ https://issues.apache.org/jira/browse/SPARK-26532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Dias updated SPARK-26532: -- Summary: repartitionByRange reads source files twice (was: repartitionByRange strategy reads source files twice)
> repartitionByRange reads source files twice
>
> Key: SPARK-26532
> URL: https://issues.apache.org/jira/browse/SPARK-26532
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Mike Dias
> Priority: Minor
> Attachments: repartition Stages.png, repartitionByRange Stages.png
>
> When using repartitionByRange in the Structured Streaming API to read and then write
> files, the source files are read twice.
> Example:
> {code:java}
> val ds = spark.readStream.
>   format("text").
>   option("path", "data/streaming").
>   load
> val q = ds.
>   repartitionByRange(10, $"value").
>   writeStream.
>   format("parquet").
>   option("path", "/tmp/output").
>   option("checkpointLocation", "/tmp/checkpoint").
>   start()
> {code}
> This execution creates 3 stages: 2 for reading and 1 for writing, reading the
> source twice. This is easy to see on a large dataset, where the reading
> time is doubled.
> {code:java}
> $ curl -s -XGET
> http://localhost:4040/api/v1/applications//stages
> {code}
> This is very different from the repartition strategy, which creates 2 stages:
> 1 for reading and 1 for writing.
> {code:java}
> val ds = spark.readStream.
>   format("text").
>   option("path", "data/streaming").
>   load
> val q = ds.
>   repartition(10, $"value").
>   writeStream.
>   format("parquet").
>   option("path", "/tmp/output").
>   option("checkpointLocation", "/tmp/checkpoint").
>   start()
> {code}
[jira] [Created] (SPARK-26532) repartitionByRange strategy reads source files twice
Mike Dias created SPARK-26532: - Summary: repartitionByRange strategy reads source files twice Key: SPARK-26532 URL: https://issues.apache.org/jira/browse/SPARK-26532 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.0, 2.3.2 Reporter: Mike Dias Attachments: repartition Stages.png, repartitionByRange Stages.png

When using repartitionByRange in the Structured Streaming API to read and then write files, the source files are read twice.

Example:
{code:java}
val ds = spark.readStream.
  format("text").
  option("path", "data/streaming").
  load
val q = ds.
  repartitionByRange(10, $"value").
  writeStream.
  format("parquet").
  option("path", "/tmp/output").
  option("checkpointLocation", "/tmp/checkpoint").
  start()
{code}
This execution creates 3 stages: 2 for reading and 1 for writing, reading the source twice. This is easy to see on a large dataset, where the reading time is doubled.
{code:java}
$ curl -s -XGET http://localhost:4040/api/v1/applications//stages
{code}
This is very different from the repartition strategy, which creates 2 stages: 1 for reading and 1 for writing.
{code:java}
val ds = spark.readStream.
  format("text").
  option("path", "data/streaming").
  load
val q = ds.
  repartition(10, $"value").
  writeStream.
  format("parquet").
  option("path", "/tmp/output").
  option("checkpointLocation", "/tmp/checkpoint").
  start()
{code}
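The extra stage exists because range partitioning must first sample the data to choose partition boundaries (in Spark, `RangePartitioner` runs a separate sampling job over the source), while hash-based `repartition` needs no such pass. A minimal Python sketch of the two-pass idea (all names here are hypothetical, not Spark's implementation):

```python
import bisect
import random

def range_partition(records, key, num_partitions, sample_size=100, seed=7):
    """Sketch of why repartitionByRange costs an extra pass over the
    source: pass 1 samples the data to pick sorted partition boundaries,
    pass 2 assigns each record to a partition by binary search."""
    # Pass 1: sample the data and derive (num_partitions - 1) boundaries.
    sample = sorted(key(r) for r in random.Random(seed).sample(
        records, min(sample_size, len(records))))
    step = max(1, len(sample) // num_partitions)
    bounds = sample[step::step][: num_partitions - 1]
    # Pass 2: route every record to its range partition.
    parts = [[] for _ in range(num_partitions)]
    for r in records:
        parts[bisect.bisect_left(bounds, key(r))].append(r)
    return parts
```

A hash repartition, by contrast, can route each record in the same single pass that reads it, which is why it shows only one read stage.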
[jira] [Updated] (SPARK-26532) repartitionByRange strategy reads source files twice
[ https://issues.apache.org/jira/browse/SPARK-26532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Dias updated SPARK-26532: -- Attachment: repartitionByRange Stages.png repartition Stages.png
> repartitionByRange strategy reads source files twice
>
> Key: SPARK-26532
> URL: https://issues.apache.org/jira/browse/SPARK-26532
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Mike Dias
> Priority: Minor
> Attachments: repartition Stages.png, repartitionByRange Stages.png
>
> When using repartitionByRange in the Structured Streaming API to read and then write
> files, the source files are read twice.
> Example:
> {code:java}
> val ds = spark.readStream.
>   format("text").
>   option("path", "data/streaming").
>   load
> val q = ds.
>   repartitionByRange(10, $"value").
>   writeStream.
>   format("parquet").
>   option("path", "/tmp/output").
>   option("checkpointLocation", "/tmp/checkpoint").
>   start()
> {code}
> This execution creates 3 stages: 2 for reading and 1 for writing, reading the
> source twice. This is easy to see on a large dataset, where the reading
> time is doubled.
> {code:java}
> $ curl -s -XGET
> http://localhost:4040/api/v1/applications//stages
> {code}
> This is very different from the repartition strategy, which creates 2 stages:
> 1 for reading and 1 for writing.
> {code:java}
> val ds = spark.readStream.
>   format("text").
>   option("path", "data/streaming").
>   load
> val q = ds.
>   repartition(10, $"value").
>   writeStream.
>   format("parquet").
>   option("path", "/tmp/output").
>   option("checkpointLocation", "/tmp/checkpoint").
>   start()
> {code}
[jira] [Assigned] (SPARK-26530) Validate heartbeat arguments in SparkSubmitArguments
[ https://issues.apache.org/jira/browse/SPARK-26530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26530: Assignee: Apache Spark
> Validate heartbeat arguments in SparkSubmitArguments
>
> Key: SPARK-26530
> URL: https://issues.apache.org/jira/browse/SPARK-26530
> Project: Spark
> Issue Type: Improvement
> Components: Deploy, Spark Core
> Affects Versions: 2.3.2, 2.4.0
> Reporter: liupengcheng
> Assignee: Apache Spark
> Priority: Major
>
> Currently, heartbeat-related arguments are not validated in Spark, so if these
> args are improperly specified, the application may run for a while and not
> fail until the max number of executor failures is reached (especially with
> spark.dynamicAllocation.enabled=true), which may waste resources.
> We should validate them before submitting to the cluster.
[jira] [Assigned] (SPARK-26530) Validate heartbeat arguments in SparkSubmitArguments
[ https://issues.apache.org/jira/browse/SPARK-26530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26530: Assignee: (was: Apache Spark)
> Validate heartbeat arguments in SparkSubmitArguments
>
> Key: SPARK-26530
> URL: https://issues.apache.org/jira/browse/SPARK-26530
> Project: Spark
> Issue Type: Improvement
> Components: Deploy, Spark Core
> Affects Versions: 2.3.2, 2.4.0
> Reporter: liupengcheng
> Priority: Major
>
> Currently, heartbeat-related arguments are not validated in Spark, so if these
> args are improperly specified, the application may run for a while and not
> fail until the max number of executor failures is reached (especially with
> spark.dynamicAllocation.enabled=true), which may waste resources.
> We should validate them before submitting to the cluster.
[jira] [Created] (SPARK-26531) How to setup Apache Spark to use local hard disk when data does not fit in RAM in local mode?
Lê Văn Thanh created SPARK-26531: Summary: How to setup Apache Spark to use local hard disk when data does not fit in RAM in local mode? Key: SPARK-26531 URL: https://issues.apache.org/jira/browse/SPARK-26531 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.2.0 Environment: Ubuntu 16.04 Hive 2.3.0 Spark 2.0.0 Reporter: Lê Văn Thanh

Hello, I have a table with 10 GB of data and 4 GB of free RAM. I tried to select data from this table into another table in ORC format, but I got an error about the memory limit (I am running the SELECT query in the console). How can I set up Apache Spark to use the local hard disk when the data does not fit in RAM in local mode?
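For what it's worth, Spark already spills shuffle and sort data to disk automatically, and cached data can be told to fall back to disk with `persist(StorageLevel.MEMORY_AND_DISK)`; `spark.local.dir` controls where spills land. An illustrative (not prescriptive) config sketch, with placeholder paths and values:

```properties
# spark-defaults.conf — illustrative values, not from the original question
spark.local.dir        /path/to/large/disk/spark-tmp   # where spill files are written
spark.memory.fraction  0.6                             # default; leave headroom for execution
```

Whether this resolves the reporter's specific ORC write error would still need investigation of the actual failure message.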
[jira] [Created] (SPARK-26530) Validate heartbeat arguments in SparkSubmitArguments
liupengcheng created SPARK-26530: Summary: Validate heartbeat arguments in SparkSubmitArguments Key: SPARK-26530 URL: https://issues.apache.org/jira/browse/SPARK-26530 Project: Spark Issue Type: Improvement Components: Deploy, Spark Core Affects Versions: 2.4.0, 2.3.2 Reporter: liupengcheng

Currently, heartbeat-related arguments are not validated in Spark, so if these args are improperly specified, the application may run for a while and not fail until the max number of executor failures is reached (especially with spark.dynamicAllocation.enabled=true), which may waste resources. We should validate them before submitting to the cluster.
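A sketch of the kind of submit-time check being proposed (Python; the config keys are real Spark keys, but `validate_heartbeat_args` and the exact rule are assumptions): the executor heartbeat interval should be positive and shorter than the network timeout, otherwise every heartbeat looks like a timeout.

```python
def parse_time_ms(s):
    """Tiny duration parser for the sketch ('10s', '120s', '500ms')."""
    s = s.strip().lower()
    if s.endswith("ms"):
        return int(s[:-2])
    if s.endswith("s"):
        return int(s[:-1]) * 1000
    return int(s)

def validate_heartbeat_args(conf):
    """Hypothetical submit-time validation: fail fast instead of letting
    the application limp along until max executor failures is reached."""
    heartbeat_ms = parse_time_ms(conf.get("spark.executor.heartbeatInterval", "10s"))
    timeout_ms = parse_time_ms(conf.get("spark.network.timeout", "120s"))
    if heartbeat_ms <= 0:
        raise ValueError("spark.executor.heartbeatInterval must be positive")
    if heartbeat_ms >= timeout_ms:
        raise ValueError(
            "spark.executor.heartbeatInterval (%dms) must be less than "
            "spark.network.timeout (%dms)" % (heartbeat_ms, timeout_ms))
```

Run against the user-supplied configuration before the application is submitted to the cluster, this turns a slow, resource-wasting failure into an immediate error message.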
[jira] [Comment Edited] (SPARK-26389) temp checkpoint folder at executor should be deleted on graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-26389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733815#comment-16733815 ] Fengyu Cao edited comment on SPARK-26389 at 1/4/19 4:03 AM: -- {quote}Temp checkpoint can be used in one-node scenario and deleted only if the query didn't fail.{quote} Yes, and there are no logs or error messages saying that we *must* set a non-temp checkpoint if we run a framework non-locally. And if we do this (run non-locally with a temp checkpoint), the checkpoint dir on the executor consumes lots of space and is not deleted if the query fails, and this checkpoint can't be used for recovery, as I mentioned above. I just think that Spark should either prohibit users from using temp checkpoints when their frameworks are non-local, or be responsible for cleaning up this useless checkpoint directory even if the query fails.
was (Author: camper42): {quote}Temp checkpoint can be used in one-node scenario and deleted only if the query didn't fail.{quote} Yes, and there're no logs or error msgs says that we *must* set a non-temp checkpoint if we run a framework non-local And if we do this(run non-local with temp checkpoint), the checkpoint dir on executor consume lots of space and not be deleted if the query if fail, and this checkpoint can't be used to recover as I mentioned above. I just think that spark either should prohibits users from using temp checkpoints when their frameworks are non-local, or should be responsible for cleaning up this useless checkpoint directory even if the query fails.
> temp checkpoint folder at executor should be deleted on graceful shutdown
>
> Key: SPARK-26389
> URL: https://issues.apache.org/jira/browse/SPARK-26389
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.4.0
> Reporter: Fengyu Cao
> Priority: Major
>
> {{spark-submit --master mesos:// -conf spark.streaming.stopGracefullyOnShutdown=true framework>}}
> CTRL-C, framework shutdown
> {{18/12/18 10:27:36 ERROR MicroBatchExecution: Query [id = f512e17a-df88-4414-a5cd-a23550cf1e7f, runId = 24d99723-8d61-48c0-beab-af432f7a19d3] terminated with error org.apache.spark.SparkException: Writing job aborted.}}
> {{/tmp/temporary- on executor not deleted due to org.apache.spark.SparkException: Writing job aborted., and this temp checkpoint can't be used for recovery.}}
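A minimal sketch of the requested cleanup (Python; `create_temp_checkpoint_dir` is a hypothetical helper, not Spark's actual code path): create the temp checkpoint directory and register its removal at shutdown, so it disappears even when the query aborted.

```python
import atexit
import shutil
import tempfile

def create_temp_checkpoint_dir():
    """Sketch: when no checkpointLocation is configured, create a temp
    dir and register cleanup so it is removed at process shutdown even
    if the query failed, instead of leaking space on the executor."""
    path = tempfile.mkdtemp(prefix="temporary-")
    # Remove the dir on interpreter exit; ignore errors so a failed
    # query never blocks shutdown.
    atexit.register(shutil.rmtree, path, ignore_errors=True)
    return path
```

The alternative raised in the comment — refusing a temp checkpoint for non-local deployments — would instead be a validation at query start; either way the temp directory stops accumulating on executors.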
[jira] [Commented] (SPARK-26389) temp checkpoint folder at executor should be deleted on graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-26389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733815#comment-16733815 ] Fengyu Cao commented on SPARK-26389: {quote}Temp checkpoint can be used in one-node scenario and deleted only if the query didn't fail.{quote} Yes, and there are no logs or error messages saying that we *must* set a non-temp checkpoint if we run a framework non-locally. And if we do this (run non-locally with a temp checkpoint), the checkpoint dir on the executor consumes lots of space and is not deleted if the query fails, and this checkpoint can't be used for recovery, as I mentioned above. I just think that Spark should either prohibit users from using temp checkpoints when their frameworks are non-local, or be responsible for cleaning up this useless checkpoint directory even if the query fails.
> temp checkpoint folder at executor should be deleted on graceful shutdown
>
> Key: SPARK-26389
> URL: https://issues.apache.org/jira/browse/SPARK-26389
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.4.0
> Reporter: Fengyu Cao
> Priority: Major
>
> {{spark-submit --master mesos:// -conf spark.streaming.stopGracefullyOnShutdown=true framework>}}
> CTRL-C, framework shutdown
> {{18/12/18 10:27:36 ERROR MicroBatchExecution: Query [id = f512e17a-df88-4414-a5cd-a23550cf1e7f, runId = 24d99723-8d61-48c0-beab-af432f7a19d3] terminated with error org.apache.spark.SparkException: Writing job aborted.}}
> {{/tmp/temporary- on executor not deleted due to org.apache.spark.SparkException: Writing job aborted., and this temp checkpoint can't be used for recovery.}}
[jira] [Assigned] (SPARK-26529) Add logs for IOException when preparing local resource
[ https://issues.apache.org/jira/browse/SPARK-26529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26529: Assignee: Apache Spark
> Add logs for IOException when preparing local resource
>
> Key: SPARK-26529
> URL: https://issues.apache.org/jira/browse/SPARK-26529
> Project: Spark
> Issue Type: Improvement
> Components: Deploy, Spark Core
> Affects Versions: 2.3.2, 2.4.0
> Reporter: liupengcheng
> Assignee: Apache Spark
> Priority: Major
>
> Currently, {{Client#createConfArchive}} does not handle IOException, and no
> detailed info is provided in the logs. Sometimes this may delay
> locating the root cause of an I/O error.
> A case that happened in our production environment is that a local disk was full, and
> the following exception was thrown, but no path info was provided. We had
> to investigate all the local disks of the machine to find the root cause.
> {code:java}
> Exception in thread "main" java.io.IOException: No space left on device
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:345)
> at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:238)
> at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:343)
> at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:238)
> at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:360)
> at org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:769)
> at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:657)
> at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:895)
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:177)
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1202)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1261)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:767)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:189)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:214)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:128)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> It makes sense for us to catch the IOException and print some useful
> information.
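A sketch of the proposed fix (Python stand-in for the Scala `createConfArchive`; the helper name and message format are assumptions): catch the low-level error and re-raise it with the failing paths attached, so "No space left on device" points at a concrete location.

```python
import os
import zipfile

def create_conf_archive(conf_dir, archive_path):
    """Zip the config directory; on failure, wrap the bare OSError
    (e.g. 'No space left on device') with the paths involved so the
    log identifies the disk at fault. Sketch of the proposed logging,
    not Spark's actual Client#createConfArchive."""
    try:
        with zipfile.ZipFile(archive_path, "w") as zf:
            for name in os.listdir(conf_dir):
                zf.write(os.path.join(conf_dir, name), arcname=name)
    except OSError as e:
        raise OSError(
            "Failed to create conf archive at %r from %r: %s"
            % (archive_path, conf_dir, e)) from e
    return archive_path
```

With the wrapped exception, the operator sees which path the archive was being written to instead of having to inspect every local disk on the machine.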
[jira] [Assigned] (SPARK-26529) Add logs for IOException when preparing local resource
[ https://issues.apache.org/jira/browse/SPARK-26529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26529: Assignee: (was: Apache Spark) > Add logs for IOException when preparing local resource > --- > > Key: SPARK-26529 > URL: https://issues.apache.org/jira/browse/SPARK-26529 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: liupengcheng >Priority: Major > > Currently, `Client#createConfArchive` do not handle IOException, and some > detail info is not provided in logs. Sometimes, this may delay the time of > locating the root cause of io error. > A case happened in our production environment is that local disk is full, and > the following exception is thrown but no detail path info provided. we have > to investigate all the local disk of the machine to find out the root cause. > {code:java} > Exception in thread "main" java.io.IOException: No space left on device > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:345) > at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) > at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:238) > at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:343) > at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:238) > at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:360) > at org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:769) > at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:657) > at > org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:895) > at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:177) > at org.apache.spark.deploy.yarn.Client.run(Client.scala:1202) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1261) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:767) > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:189) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:214) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:128) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > It makes sense to catch the IOException and log some useful > information. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26075) Cannot broadcast the table that is larger than 8GB : Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733799#comment-16733799 ] Hyukjin Kwon edited comment on SPARK-26075 at 1/4/19 3:27 AM: -- [~bhadani.neeraj...@gmail.com] can you post reproducible code that you ran? Let me take a quick look when I'm available. was (Author: hyukjin.kwon): [~bhadani.neeraj...@gmail.com] can you post reproducible code that you ran? > Cannot broadcast the table that is larger than 8GB : Spark 2.3 > -- > > Key: SPARK-26075 > URL: https://issues.apache.org/jira/browse/SPARK-26075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Neeraj Bhadani >Priority: Major > > I am trying to use a broadcast join but I get the error below in Spark 2.3, > while the same code works fine in Spark 2.2. > > Upon checking, the DataFrames are merely 50 MB and I have set the > threshold to 200 MB as well. As mentioned above, the same code works fine > in Spark 2.2. > > {{Error: "Cannot broadcast the table that is larger than 8GB". }} > However, disabling broadcasting works fine: > {{'spark.sql.autoBroadcastJoinThreshold': '-1'}} > > {{Regards,}} > {{Neeraj}}
[jira] [Commented] (SPARK-26075) Cannot broadcast the table that is larger than 8GB : Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733799#comment-16733799 ] Hyukjin Kwon commented on SPARK-26075: -- [~bhadani.neeraj...@gmail.com] can you post reproducible code that you ran? > Cannot broadcast the table that is larger than 8GB : Spark 2.3 > -- > > Key: SPARK-26075 > URL: https://issues.apache.org/jira/browse/SPARK-26075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Neeraj Bhadani >Priority: Major > > I am trying to use a broadcast join but I get the error below in Spark 2.3, > while the same code works fine in Spark 2.2. > > Upon checking, the DataFrames are merely 50 MB and I have set the > threshold to 200 MB as well. As mentioned above, the same code works fine > in Spark 2.2. > > {{Error: "Cannot broadcast the table that is larger than 8GB". }} > However, disabling broadcasting works fine: > {{'spark.sql.autoBroadcastJoinThreshold': '-1'}} > > {{Regards,}} > {{Neeraj}}
[jira] [Created] (SPARK-26529) Add logs for IOException when preparing local resource
liupengcheng created SPARK-26529: Summary: Add logs for IOException when preparing local resource Key: SPARK-26529 URL: https://issues.apache.org/jira/browse/SPARK-26529 Project: Spark Issue Type: Improvement Components: Deploy, Spark Core Affects Versions: 2.4.0, 2.3.2 Reporter: liupengcheng Currently, `Client#createConfArchive` does not handle IOException, and no detailed info is provided in the logs. This can delay locating the root cause of an I/O error. A case from our production environment: the local disk was full, and the following exception was thrown without any path info, so we had to inspect every local disk on the machine to find the root cause. {code:java} Exception in thread "main" java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:345) at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) at java.util.zip.ZipOutputStream.closeEntry(ZipOutputStream.java:238) at java.util.zip.ZipOutputStream.finish(ZipOutputStream.java:343) at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:238) at java.util.zip.ZipOutputStream.close(ZipOutputStream.java:360) at org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:769) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:657) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:895) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:177) at org.apache.spark.deploy.yarn.Client.run(Client.scala:1202) at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1261) at org.apache.spark.deploy.yarn.Client.main(Client.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at 
java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:767) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:214) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:128) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} It makes sense to catch the IOException and log some useful information.
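A minimal sketch of the proposed fix, in plain Java. `writeArchive` is a hypothetical stand-in for the zip-writing step of `Client#createConfArchive` (not Spark's actual code); the only change it illustrates is catching the IOException and rethrowing it with the target path, so a "No space left on device" failure immediately names the file involved.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ConfArchiveSketch {
    // Hypothetical stand-in for Client#createConfArchive's zip-writing step.
    // Proposed change: wrap the write and rethrow the IOException with the
    // target path in the message instead of letting it propagate bare.
    static void writeArchive(File target) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(target))) {
            zip.putNextEntry(new ZipEntry("__spark_conf__.properties"));
            zip.write("demo=true".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new IOException(
                "Failed to write Spark conf archive to " + target.getAbsolutePath(), e);
        }
    }

    public static void main(String[] args) throws IOException {
        File target = File.createTempFile("spark-conf", ".zip");
        writeArchive(target);
        System.out.println("wrote " + target.length() + " bytes to " + target);
        target.delete();
    }
}
```

With this wrapping, the "No space left on device" stack trace above would be prefixed with the archive's path, avoiding the disk-by-disk hunt the reporter describes.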
[jira] [Commented] (SPARK-26126) Should put scala-library deps into root pom instead of spark-tags module
[ https://issues.apache.org/jira/browse/SPARK-26126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733796#comment-16733796 ] Hyukjin Kwon commented on SPARK-26126: -- If there's no actual problem happening, I would leave this resolved. > Should put scala-library deps into root pom instead of spark-tags module > > > Key: SPARK-26126 > URL: https://issues.apache.org/jira/browse/SPARK-26126 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.3.0, 2.4.0 >Reporter: liupengcheng >Priority: Minor > > While doing a backport in our custom Spark, I noticed some strange code in > the spark-tags module: > {code:java} > <dependencies> > <dependency> > <groupId>org.scala-lang</groupId> > <artifactId>scala-library</artifactId> > <version>${scala.version}</version> > </dependency> > </dependencies> > {code} > As I understand it, shouldn't spark-tags contain only annotation-related > classes and deps? > Should we move the scala-library dependency to the root pom?
[jira] [Commented] (SPARK-26126) Should put scala-library deps into root pom instead of spark-tags module
[ https://issues.apache.org/jira/browse/SPARK-26126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733795#comment-16733795 ] Hyukjin Kwon commented on SPARK-26126: -- It matters because JIRA is meant for filing issues; questions should go to the mailing list. What problem does it cause? > Should put scala-library deps into root pom instead of spark-tags module > > > Key: SPARK-26126 > URL: https://issues.apache.org/jira/browse/SPARK-26126 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.3.0, 2.4.0 >Reporter: liupengcheng >Priority: Minor > > While doing a backport in our custom Spark, I noticed some strange code in > the spark-tags module: > {code:java} > <dependencies> > <dependency> > <groupId>org.scala-lang</groupId> > <artifactId>scala-library</artifactId> > <version>${scala.version}</version> > </dependency> > </dependencies> > {code} > As I understand it, shouldn't spark-tags contain only annotation-related > classes and deps? > Should we move the scala-library dependency to the root pom?
[jira] [Updated] (SPARK-26528) FsHistoryProviderSuite failed in IDEA because not exist "spark.testing" property
[ https://issues.apache.org/jira/browse/SPARK-26528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] deshanxiao updated SPARK-26528: --- Priority: Minor (was: Major) > FsHistoryProviderSuite failed in IDEA because not exist "spark.testing" > property > - > > Key: SPARK-26528 > URL: https://issues.apache.org/jira/browse/SPARK-26528 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: deshanxiao >Priority: Minor > > Running FsHistoryProviderSuite in IDEA fails because the property > "spark.testing" does not exist. In this situation, the replay executor may replay a > file twice. > {code:java} > private val replayExecutor: ExecutorService = { > if (!conf.contains("spark.testing")) { > ThreadUtils.newDaemonFixedThreadPool(NUM_PROCESSING_THREADS, > "log-replay-executor") > } else { > MoreExecutors.sameThreadExecutor() > } > } > {code} > {code:java} > "SPARK-3697: ignore files that cannot be read." > 2 was not equal to 1 > ScalaTestFailureLocation: > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12 at > (FsHistoryProviderSuite.scala:179) > Expected :1 > Actual :2 > > org.scalatest.exceptions.TestFailedException: 2 was not equal to 1 > at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:179) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at 
org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite$class.run(Suite.scala:1147) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at 
org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$run(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:258) > at > org.apache.spark.deploy.history.FsHisto
[jira] [Assigned] (SPARK-26526) test case for SPARK-10316 is not valid any more
[ https://issues.apache.org/jira/browse/SPARK-26526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26526: --- Assignee: Liu, Linhong > test case for SPARK-10316 is not valid any more > --- > > Key: SPARK-26526 > URL: https://issues.apache.org/jira/browse/SPARK-26526 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Liu, Linhong >Assignee: Liu, Linhong >Priority: Major > > The test case in [SPARK-10316|https://github.com/apache/spark/pull/8486] is meant > to ensure a non-deterministic `Filter` won't be pushed through a `Project`, > but in the current code base it no longer covers that purpose. > Changing LogicalRDD to HadoopFsRelation fixes this.
[jira] [Resolved] (SPARK-26526) test case for SPARK-10316 is not valid any more
[ https://issues.apache.org/jira/browse/SPARK-26526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26526. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23440 [https://github.com/apache/spark/pull/23440] > test case for SPARK-10316 is not valid any more > --- > > Key: SPARK-26526 > URL: https://issues.apache.org/jira/browse/SPARK-26526 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Liu, Linhong >Assignee: Liu, Linhong >Priority: Major > Fix For: 3.0.0 > > > The test case in [SPARK-10316|https://github.com/apache/spark/pull/8486] is meant > to ensure a non-deterministic `Filter` won't be pushed through a `Project`, > but in the current code base it no longer covers that purpose. > Changing LogicalRDD to HadoopFsRelation fixes this.
[jira] [Assigned] (SPARK-26528) FsHistoryProviderSuite failed in IDEA because not exist "spark.testing" property
[ https://issues.apache.org/jira/browse/SPARK-26528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26528: Assignee: Apache Spark > FsHistoryProviderSuite failed in IDEA because not exist "spark.testing" > property > - > > Key: SPARK-26528 > URL: https://issues.apache.org/jira/browse/SPARK-26528 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: deshanxiao >Assignee: Apache Spark >Priority: Major > > Running FsHistoryProviderSuite in IDEA fails because the property > "spark.testing" does not exist. In this situation, the replay executor may replay a > file twice. > {code:java} > private val replayExecutor: ExecutorService = { > if (!conf.contains("spark.testing")) { > ThreadUtils.newDaemonFixedThreadPool(NUM_PROCESSING_THREADS, > "log-replay-executor") > } else { > MoreExecutors.sameThreadExecutor() > } > } > {code} > {code:java} > "SPARK-3697: ignore files that cannot be read." > 2 was not equal to 1 > ScalaTestFailureLocation: > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12 at > (FsHistoryProviderSuite.scala:179) > Expected :1 > Actual :2 > > org.scalatest.exceptions.TestFailedException: 2 was not equal to 1 > at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:179) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at 
org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite$class.run(Suite.scala:1147) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at 
org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$run(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:258) > at >
[jira] [Assigned] (SPARK-26528) FsHistoryProviderSuite failed in IDEA because not exist "spark.testing" property
[ https://issues.apache.org/jira/browse/SPARK-26528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26528: Assignee: (was: Apache Spark) > FsHistoryProviderSuite failed in IDEA because not exist "spark.testing" > property > - > > Key: SPARK-26528 > URL: https://issues.apache.org/jira/browse/SPARK-26528 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: deshanxiao >Priority: Major > > Running FsHistoryProviderSuite in IDEA fails because the property > "spark.testing" does not exist. In this situation, the replay executor may replay a > file twice. > {code:java} > private val replayExecutor: ExecutorService = { > if (!conf.contains("spark.testing")) { > ThreadUtils.newDaemonFixedThreadPool(NUM_PROCESSING_THREADS, > "log-replay-executor") > } else { > MoreExecutors.sameThreadExecutor() > } > } > {code} > {code:java} > "SPARK-3697: ignore files that cannot be read." > 2 was not equal to 1 > ScalaTestFailureLocation: > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12 at > (FsHistoryProviderSuite.scala:179) > Expected :1 > Actual :2 > > org.scalatest.exceptions.TestFailedException: 2 was not equal to 1 > at > org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at > org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) > at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:179) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at 
org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite$class.run(Suite.scala:1147) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at 
org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at > org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$run(FsHistoryProviderSuite.scala:51) > at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:258) > at > org.apache.spark.deploy.
[jira] [Created] (SPARK-26528) FsHistoryProviderSuite failed in IDEA because not exist "spark.testing" property
deshanxiao created SPARK-26528: -- Summary: FsHistoryProviderSuite failed in IDEA because not exist "spark.testing" property Key: SPARK-26528 URL: https://issues.apache.org/jira/browse/SPARK-26528 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0, 2.3.2 Reporter: deshanxiao Running FsHistoryProviderSuite in IDEA fails because the property "spark.testing" does not exist. In this situation, the replay executor may replay a file twice. {code:java} private val replayExecutor: ExecutorService = { if (!conf.contains("spark.testing")) { ThreadUtils.newDaemonFixedThreadPool(NUM_PROCESSING_THREADS, "log-replay-executor") } else { MoreExecutors.sameThreadExecutor() } } {code} {code:java} "SPARK-3697: ignore files that cannot be read." 2 was not equal to 1 ScalaTestFailureLocation: org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12 at (FsHistoryProviderSuite.scala:179) Expected :1 Actual :2 org.scalatest.exceptions.TestFailedException: 2 was not equal to 1 at org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) at org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668) at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704) at org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:179) at org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) at org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203) at org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) at scala.collection.immutable.List.foreach(List.scala:381) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) at org.scalatest.Suite$class.run(Suite.scala:1147) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) at org.scalatest.SuperEngine.runImpl(Engine.scala:521) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) at 
org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$run(FsHistoryProviderSuite.scala:51) at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:258) at org.apache.spark.deploy.history.FsHistoryProviderSuite.run(FsHistoryProviderSuite.scala:51) at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45) at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$1.apply(Runner.scala:1340) at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$1.apply(Runner.scala:1334) at scala.collection.immut
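The failure is environmental: Spark's SBT/Maven test runs are expected to pass `-Dspark.testing=1` to the test JVM, while a stock IDEA run configuration does not, so the suite takes the multi-threaded replay path. A plain-Java sketch of the selection logic quoted above (the system-property check is an assumed approximation of `conf.contains("spark.testing")`, not Spark's actual code):

```java
public class ReplayExecutorChoice {
    // Mirrors the quoted FsHistoryProvider logic: a pooled executor normally,
    // a same-thread executor when "spark.testing" is set, so tests see
    // deterministic single-threaded replay. An IDEA run configuration needs
    // -Dspark.testing=1 in its VM options to take the test path.
    static String replayMode() {
        return System.getProperty("spark.testing") == null ? "pooled" : "same-thread";
    }

    public static void main(String[] args) {
        System.out.println("replay mode: " + replayMode());
    }
}
```

Adding the VM option to the IDE run configuration, rather than changing the provider, is the least invasive workaround for the reported double-replay.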
[jira] [Assigned] (SPARK-26527) Let acquireUnrollMemory fail fast if required space exceeds memory limit
[ https://issues.apache.org/jira/browse/SPARK-26527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26527: - Assignee: SongYadong > Let acquireUnrollMemory fail fast if required space exceeds memory limit > > > Key: SPARK-26527 > URL: https://issues.apache.org/jira/browse/SPARK-26527 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: SongYadong >Assignee: SongYadong >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > When acquiring unroll memory from {{StaticMemoryManager}}, fail fast if the > required space exceeds the memory limit, just like when acquiring storage memory. > This may avoid some computation and memory-eviction cost, especially > when the required space ({{numBytes}}) is very large.
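A plain-Java sketch of the proposed fail-fast behavior (not the actual `StaticMemoryManager` code; the pool and method names here are illustrative): before doing any eviction work, compare the request against the pool's hard limit and return immediately when it can never be satisfied.

```java
public class UnrollMemoryPoolSketch {
    private final long maxMemory; // hard limit of the pool, in bytes
    private long used;

    UnrollMemoryPoolSketch(long maxMemory) {
        this.maxMemory = maxMemory;
    }

    // Sketch of the proposed change: if numBytes can never fit even after
    // evicting every cached block, fail immediately instead of first
    // computing eviction candidates.
    boolean acquireUnrollMemory(long numBytes) {
        if (numBytes > maxMemory) {
            return false; // fail fast: request exceeds the pool's hard limit
        }
        if (numBytes > maxMemory - used) {
            return false; // real code would try evicting blocks before giving up
        }
        used += numBytes;
        return true;
    }

    public static void main(String[] args) {
        UnrollMemoryPoolSketch pool = new UnrollMemoryPoolSketch(100);
        System.out.println(pool.acquireUnrollMemory(150)); // over the hard limit
        System.out.println(pool.acquireUnrollMemory(60));  // fits
    }
}
```

The first branch is the improvement: a very large `numBytes` is rejected without touching the eviction machinery at all.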
[jira] [Reopened] (SPARK-26494) 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type can't be found,
[ https://issues.apache.org/jira/browse/SPARK-26494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 秦坤 reopened SPARK-26494: > 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type > can't be found, > -- > > Key: SPARK-26494 > URL: https://issues.apache.org/jira/browse/SPARK-26494 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: 秦坤 >Priority: Minor > > Using Spark to read Oracle's TIMESTAMP(6) WITH LOCAL TIME ZONE type fails > because the type cannot be resolved. > When the data type is TIMESTAMP(6) WITH LOCAL TIME ZONE, > the sqlType value seen by the function getCatalystType in the > JdbcUtils class is -102.
[jira] [Updated] (SPARK-26494) 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type can't be found,
[ https://issues.apache.org/jira/browse/SPARK-26494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 秦坤 updated SPARK-26494: --- Description: Using Spark to read Oracle's TIMESTAMP(6) WITH LOCAL TIME ZONE type fails because the type cannot be resolved. When the data type is TIMESTAMP(6) WITH LOCAL TIME ZONE, the sqlType value seen by the function getCatalystType in the JdbcUtils class is -102. was: (Chinese original, translated:) Using Spark to read Oracle's TIMESTAMP(6) WITH LOCAL TIME ZONE type, the type cannot be found; when the data type is TIMESTAMP(6) WITH LOCAL TIME ZONE, the sqlType value of getCatalystType in JdbcUtils is -102. Summary: 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type can't be found, (was: 【spark sql】 Using Spark to read Oracle TIMESTAMP(6) WITH LOCAL TIME ZONE, the type cannot be found) > 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type > can't be found, > -- > > Key: SPARK-26494 > URL: https://issues.apache.org/jira/browse/SPARK-26494 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: 秦坤 >Priority: Minor > > Using Spark to read Oracle's TIMESTAMP(6) WITH LOCAL TIME ZONE type fails > because the type cannot be resolved. > When the data type is TIMESTAMP(6) WITH LOCAL TIME ZONE, > the sqlType value seen by the function getCatalystType in the > JdbcUtils class is -102.
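For context, -102 is Oracle's vendor-specific JDBC type code for TIMESTAMP WITH LOCAL TIME ZONE (`oracle.jdbc.OracleTypes.TIMESTAMPLTZ`); it is not defined in `java.sql.Types`, so a generic mapper never matches it. The sketch below is a plain function, not Spark's actual `JdbcDialects` API, showing the kind of dialect-level mapping a fix could add:

```java
import java.sql.Types;
import java.util.Optional;

public class OracleTypeMappingSketch {
    // Oracle's vendor code for TIMESTAMP WITH LOCAL TIME ZONE
    // (oracle.jdbc.OracleTypes.TIMESTAMPLTZ), outside java.sql.Types.
    static final int ORACLE_TIMESTAMP_LTZ = -102;

    // Illustrative dialect-level mapping: translate both the standard and the
    // Oracle-specific timestamp codes to a Catalyst timestamp; return empty to
    // fall back to the generic mapping for everything else.
    static Optional<String> toCatalystType(int sqlType) {
        switch (sqlType) {
            case Types.TIMESTAMP:
            case ORACLE_TIMESTAMP_LTZ:
                return Optional.of("TimestampType");
            default:
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toCatalystType(-102).orElse("unmapped"));
    }
}
```

In Spark itself this mapping would live in an Oracle JDBC dialect's `getCatalystType` override rather than in a standalone function; the switch above only illustrates the missing case.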
[jira] [Commented] (SPARK-26512) Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?
[ https://issues.apache.org/jira/browse/SPARK-26512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733740#comment-16733740 ] Saisai Shao commented on SPARK-26512: - Please list the problems you saw, along with any logs or exceptions. We can't tell anything from the information above. > Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10? > --- > > Key: SPARK-26512 > URL: https://issues.apache.org/jira/browse/SPARK-26512 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, YARN >Affects Versions: 2.4.0 > Environment: operating system : Windows 10 > Spark Version : 2.4.0 > Hadoop Version : 2.8.3 >Reporter: Anubhav Jain >Priority: Minor > Labels: windows > Attachments: log.png > > > I have installed Hadoop version 2.8.3 in my Windows 10 environment and it is > working fine. Now when I try to install Apache Spark (version 2.4.0) with YARN > as the cluster manager, it does not work. When I try to submit a Spark job > using spark-submit for testing, it appears under the ACCEPTED tab in the YARN > UI and after that it fails -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26126) Should put scala-library deps into root pom instead of spark-tags module
[ https://issues.apache.org/jira/browse/SPARK-26126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733738#comment-16733738 ] liupengcheng commented on SPARK-26126: -- [~hyukjin.kwon] it's a minor issue; it doesn't really matter functionally, but it is confusing. > Should put scala-library deps into root pom instead of spark-tags module > > > Key: SPARK-26126 > URL: https://issues.apache.org/jira/browse/SPARK-26126 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.3.0, 2.4.0 >Reporter: liupengcheng >Priority: Minor > > When doing some backports in our custom Spark, I noticed some strange code in > the spark-tags module:
> {code:java}
> <dependencies>
>   <dependency>
>     <groupId>org.scala-lang</groupId>
>     <artifactId>scala-library</artifactId>
>     <version>${scala.version}</version>
>   </dependency>
> </dependencies>
> {code}
> As I understand it, shouldn't spark-tags contain only annotation-related > classes and dependencies? > Should we move the scala-library dependency to the root pom? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26452) Suppressing exception in finally: Java heap space java.lang.OutOfMemoryError: Java heap space
[ https://issues.apache.org/jira/browse/SPARK-26452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733715#comment-16733715 ] Jungtaek Lim commented on SPARK-26452: -- Maybe it would be better to force the driver to shut down if an uncaught exception occurs in the DAG event loop? I guess that loop is one of the driver's main roles, so the driver appears to be malfunctioning to end users once the DAG event thread is killed. > Suppressing exception in finally: Java heap space java.lang.OutOfMemoryError: > Java heap space > - > > Key: SPARK-26452 > URL: https://issues.apache.org/jira/browse/SPARK-26452 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.2.0 >Reporter: tommy duan >Priority: Major > > In a Spark 2.2.0 Structured Streaming program, the submit command was as > follows: > {code:java} > spark-submit.sh \ > --driver-memory 3g \ > --num-executors 10 \ > --executor-memory 16g \ > {code} > After running for a day the application entered a "fake death" state: the > executors show running status and the driver still exists, but no tasks are > assigned. The error message printed by the program is as follows: > {code:java} > [Stage 1852:===>(896 + 3) / > 900] > [Stage 1852:===>(897 + 3) / > 900] > [Stage 1852:===>(899 + 1) / > 900] > [Stage 1853:> (0 + 0) / 900] > 18/12/27 06:03:45 WARN util.Utils: Suppressing exception in finally: Java > heap space java.lang.OutOfMemoryError: Java heap space > at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57) > at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at > 
net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205) > at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158) > at > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877) > at > java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822) > at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719) > at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740) > at > org.apache.spark.serializer.JavaSerializationStream.close(JavaSerializer.scala:57) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$1.apply$mcV$sp(TorrentBroadcast.scala:278) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1346) > at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:277) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:126) > at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:88) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1488) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1006) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:776) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:775) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:775) > at > 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1259) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1711) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: > Java heap space > at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57) > at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonf
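Jungtaek Lim's suggestion above (forcing the driver down when the DAG event loop dies, rather than limping on in a "fake death" state) can be sketched with a thread-level uncaught-exception hook. This is plain Python illustrating the mechanism, not Spark's Scala EventLoop; all names here are illustrative:

```python
import threading

# Records which critical threads died and why; a real driver would instead
# trigger a forced shutdown (e.g. a shutdown hook or os._exit) at this point.
failed = []

def on_uncaught(args):
    # args.thread is the thread that died, args.exc_value the exception.
    failed.append((args.thread.name, type(args.exc_value).__name__))

threading.excepthook = on_uncaught

def event_loop():
    # Stand-in for the DAG scheduler event loop dying with an OOM.
    raise MemoryError("simulated: Java heap space")

t = threading.Thread(target=event_loop, name="dag-scheduler-event-loop")
t.start()
t.join()
```

The point of the design is that the hook fires for any uncaught exception on the event-loop thread, so the process can fail loudly instead of leaving a half-alive driver behind.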
[jira] [Created] (SPARK-26527) Let acquireUnrollMemory fail fast if required space exceeds memory limit
SongYadong created SPARK-26527: -- Summary: Let acquireUnrollMemory fail fast if required space exceeds memory limit Key: SPARK-26527 URL: https://issues.apache.org/jira/browse/SPARK-26527 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: SongYadong When acquiring unroll memory from {{StaticMemoryManager}}, let it fail fast if the required space exceeds the memory limit, just like acquiring storage memory. I think this may reduce some computation and memory-eviction cost, especially when the required space ({{numBytes}}) is very large. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
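The fail-fast idea amounts to checking the request against the total pool before doing any eviction work. A simplified, hypothetical model of that logic in plain Python; the pool model and names are illustrative, not the actual StaticMemoryManager:

```python
def acquire_unroll_memory(num_bytes, max_memory, used_memory, evictable=0):
    """Toy model of unroll-memory acquisition with a fail-fast guard.

    `evictable` stands in for the bytes that could be freed by evicting
    cached blocks; all names and the pool model are illustrative only.
    """
    # Fail fast: if the request alone exceeds the whole pool, no amount
    # of eviction can satisfy it, so skip the eviction bookkeeping entirely.
    if num_bytes > max_memory:
        return False
    free = max_memory - used_memory
    if num_bytes <= free:
        return True
    # Only now pay the cost of evicting cached blocks to make room.
    return num_bytes <= free + evictable
```

The saving is the first branch: for a huge {{numBytes}} the manager returns immediately instead of walking and evicting cached blocks that could never free enough space anyway.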
[jira] [Commented] (SPARK-26527) Let acquireUnrollMemory fail fast if required space exceeds memory limit
[ https://issues.apache.org/jira/browse/SPARK-26527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733708#comment-16733708 ] Apache Spark commented on SPARK-26527: -- User 'SongYadong' has created a pull request for this issue: https://github.com/apache/spark/pull/23426 > Let acquireUnrollMemory fail fast if required space exceeds memory limit > > > Key: SPARK-26527 > URL: https://issues.apache.org/jira/browse/SPARK-26527 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: SongYadong >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > When acquiring unroll memory from {{StaticMemoryManager}}, let it fail fast > if required space exceeds memory limit, just like acquiring storage memory. > I think this may reduce some computation and memory evicting costs especially > when required space({{numBytes}}) is very big. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26527) Let acquireUnrollMemory fail fast if required space exceeds memory limit
[ https://issues.apache.org/jira/browse/SPARK-26527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26527: Assignee: (was: Apache Spark) > Let acquireUnrollMemory fail fast if required space exceeds memory limit > > > Key: SPARK-26527 > URL: https://issues.apache.org/jira/browse/SPARK-26527 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: SongYadong >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > When acquiring unroll memory from {{StaticMemoryManager}}, let it fail fast > if required space exceeds memory limit, just like acquiring storage memory. > I think this may reduce some computation and memory evicting costs especially > when required space({{numBytes}}) is very big. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26527) Let acquireUnrollMemory fail fast if required space exceeds memory limit
[ https://issues.apache.org/jira/browse/SPARK-26527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26527: Assignee: Apache Spark > Let acquireUnrollMemory fail fast if required space exceeds memory limit > > > Key: SPARK-26527 > URL: https://issues.apache.org/jira/browse/SPARK-26527 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: SongYadong >Assignee: Apache Spark >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > When acquiring unroll memory from {{StaticMemoryManager}}, let it fail fast > if required space exceeds memory limit, just like acquiring storage memory. > I think this may reduce some computation and memory evicting costs especially > when required space({{numBytes}}) is very big. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26489) Use ConfigEntry for hardcoded configs for python/r categories.
[ https://issues.apache.org/jira/browse/SPARK-26489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-26489: -- Assignee: Jungtaek Lim > Use ConfigEntry for hardcoded configs for python/r categories. > -- > > Key: SPARK-26489 > URL: https://issues.apache.org/jira/browse/SPARK-26489 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Assignee: Jungtaek Lim >Priority: Major > > Make the following hardcoded configs to use ConfigEntry. > {code} > spark.python > spark.r > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26489) Use ConfigEntry for hardcoded configs for python/r categories.
[ https://issues.apache.org/jira/browse/SPARK-26489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26489. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23428 [https://github.com/apache/spark/pull/23428] > Use ConfigEntry for hardcoded configs for python/r categories. > -- > > Key: SPARK-26489 > URL: https://issues.apache.org/jira/browse/SPARK-26489 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > Make the following hardcoded configs to use ConfigEntry. > {code} > spark.python > spark.r > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25873) Date corruption when Spark and Hive both are on different timezones
[ https://issues.apache.org/jira/browse/SPARK-25873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733558#comment-16733558 ] Pablo Langa Blanco commented on SPARK-25873: Hello, this looks like a duplicate of SPARK-25919, which has been resolved. Could this one be closed too? Thank you. Regards > Date corruption when Spark and Hive both are on different timezones > --- > > Key: SPARK-25873 > URL: https://issues.apache.org/jira/browse/SPARK-25873 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.2.1 >Reporter: Pawan >Priority: Major > > There is date alteration when loading date from one table to another in hive > through spark. This happens when Hive is on a remote machine with timezone > different than the one on which Spark is running. This happens only when the > Source table format is > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > Below are the steps to produce the issue: > 1. Create two tables as below in hive which has a timezone, say in, EST > {code} > CREATE TABLE t_src( > name varchar(10), > dob timestamp > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > {code} > {code} > INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 > 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0'); > {code} > > {code} > CREATE TABLE t_tgt( > name varchar(10), > dob timestamp > ); > {code} > 2. Copy {{hive-site.xml}} to {{spark-2.2.1-bin-hadoop2.7/conf}} folder, so > that when you create {{sqlContext}} for hive it connects to your remote hive > server. > 3. Start your spark-shell on some other machine whose timezone is different > than that of Hive, say, PDT > 4. 
Execute below code: > {code} > import org.apache.spark.sql.hive.HiveContext > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > val q0 = "TRUNCATE table t_tgt" > val q1 = "SELECT CAST(alias.name AS String) as a0, alias.dob as a1 FROM t_src > alias" > val q2 = "INSERT OVERWRITE TABLE t_tgt SELECT tbl0.a0 as c0, tbl0.a1 as c1 > FROM tbl0" > sqlContext.sql(q0) > sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0") > sqlContext.sql(q2) > {code} > 5. Now navigate to hive and check the contents of the {{TARGET table > (t_tgt)}}. The dob field will have incorrect values. > > Is this a known issue? Is there any work around on this? Can it be fixed? > > Thanks & regards, > Pawan Lawale -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
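The mechanism behind this kind of corruption is that a stored instant gets rendered with the session's local timezone, so the same underlying value shows different wall-clock times on the EST Hive host and the west-coast Spark host; if one host's rendering is written back as a new timestamp, the values drift. A minimal, Spark-free illustration of that mechanism in plain Python:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# One absolute instant, standing in for what a TIMESTAMP cell stores.
instant = datetime(2019, 1, 3, 12, 0, tzinfo=timezone.utc)

# The same instant rendered on a host in the Hive cluster's zone versus a
# host in the Spark shell's zone:
ny_wall = instant.astimezone(ZoneInfo("America/New_York"))      # EST host: 07:00
la_wall = instant.astimezone(ZoneInfo("America/Los_Angeles"))   # West-coast host: 04:00

# Same instant, different wall clocks; re-parsing one host's wall-clock
# string as a fresh timestamp is exactly how the dob values get shifted.
```

This is why the usual remedy is pinning a session timezone (or keeping timestamps as instants end to end) rather than letting each host render with its own local zone.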
[jira] [Issue Comment Deleted] (SPARK-25873) Date corruption when Spark and Hive both are on different timezones
[ https://issues.apache.org/jira/browse/SPARK-25873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Langa Blanco updated SPARK-25873: --- Comment: was deleted (was: Hi [~pawanlawale] Timestamp in hive is saved as a long value without timezone associated to this value. It has no sense that spark access to the timezone of the remote cluster because different timestamps could be generated from different timezone (different from hive cluster timezone) The best option is the application manage it’s own timezone associated to each timestamp if its requiered. ¿don’t you think? Regards Pablo) > Date corruption when Spark and Hive both are on different timezones > --- > > Key: SPARK-25873 > URL: https://issues.apache.org/jira/browse/SPARK-25873 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.2.1 >Reporter: Pawan >Priority: Major > > There is date alteration when loading date from one table to another in hive > through spark. This happens when Hive is on a remote machine with timezone > different than the one on which Spark is running. This happens only when the > Source table format is > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > Below are the steps to produce the issue: > 1. Create two tables as below in hive which has a timezone, say in, EST > {code} > CREATE TABLE t_src( > name varchar(10), > dob timestamp > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > {code} > {code} > INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 > 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0'); > {code} > > {code} > CREATE TABLE t_tgt( > name varchar(10), > dob timestamp > ); > {code} > 2. 
Copy {{hive-site.xml}} to {{spark-2.2.1-bin-hadoop2.7/conf}} folder, so > that when you create {{sqlContext}} for hive it connects to your remote hive > server. > 3. Start your spark-shell on some other machine whose timezone is different > than that of Hive, say, PDT > 4. Execute below code: > {code} > import org.apache.spark.sql.hive.HiveContext > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > val q0 = "TRUNCATE table t_tgt" > val q1 = "SELECT CAST(alias.name AS String) as a0, alias.dob as a1 FROM t_src > alias" > val q2 = "INSERT OVERWRITE TABLE t_tgt SELECT tbl0.a0 as c0, tbl0.a1 as c1 > FROM tbl0" > sqlContext.sql(q0) > sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0") > sqlContext.sql(q2) > {code} > 5. Now navigate to hive and check the contents of the {{TARGET table > (t_tgt)}}. The dob field will have incorrect values. > > Is this a known issue? Is there any work around on this? Can it be fixed? > > Thanks & regards, > Pawan Lawale -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25873) Date corruption when Spark and Hive both are on different timezones
[ https://issues.apache.org/jira/browse/SPARK-25873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733538#comment-16733538 ] Pablo Langa Blanco commented on SPARK-25873: Hi [~pawanlawale] A timestamp in Hive is saved as a long value with no timezone associated with it. It makes no sense for Spark to access the timezone of the remote cluster, because timestamps could be generated from different timezones (different from the Hive cluster's timezone). The best option is for the application to manage its own timezone associated with each timestamp, if required. Don't you think? Regards Pablo > Date corruption when Spark and Hive both are on different timezones > --- > > Key: SPARK-25873 > URL: https://issues.apache.org/jira/browse/SPARK-25873 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.2.1 >Reporter: Pawan >Priority: Major > > There is date alteration when loading date from one table to another in hive > through spark. This happens when Hive is on a remote machine with timezone > different than the one on which Spark is running. This happens only when the > Source table format is > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > Below are the steps to produce the issue: > 1. Create two tables as below in hive which has a timezone, say in, EST > {code} > CREATE TABLE t_src( > name varchar(10), > dob timestamp > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > {code} > {code} > INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 > 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0'); > {code} > > {code} > CREATE TABLE t_tgt( > name varchar(10), > dob timestamp > ); > {code} > 2. 
Copy {{hive-site.xml}} to {{spark-2.2.1-bin-hadoop2.7/conf}} folder, so > that when you create {{sqlContext}} for hive it connects to your remote hive > server. > 3. Start your spark-shell on some other machine whose timezone is different > than that of Hive, say, PDT > 4. Execute below code: > {code} > import org.apache.spark.sql.hive.HiveContext > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > val q0 = "TRUNCATE table t_tgt" > val q1 = "SELECT CAST(alias.name AS String) as a0, alias.dob as a1 FROM t_src > alias" > val q2 = "INSERT OVERWRITE TABLE t_tgt SELECT tbl0.a0 as c0, tbl0.a1 as c1 > FROM tbl0" > sqlContext.sql(q0) > sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0") > sqlContext.sql(q2) > {code} > 5. Now navigate to hive and check the contents of the {{TARGET table > (t_tgt)}}. The dob field will have incorrect values. > > Is this a known issue? Is there any work around on this? Can it be fixed? > > Thanks & regards, > Pawan Lawale -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26349) Pyspark should not accept insecure py4j gateways
[ https://issues.apache.org/jira/browse/SPARK-26349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26349: Assignee: Apache Spark > Pyspark should not accept insecure py4j gateways > > > Key: SPARK-26349 > URL: https://issues.apache.org/jira/browse/SPARK-26349 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Blocker > > Pyspark normally produces a secure py4j connection between python & the jvm, > but it also lets users provide their own connection. Spark should fail fast > if that connection is insecure. > Note this is breaking a public api, which is why this is targeted at 3.0.0. > SPARK-26019 is about adding a warning, but still allowing the old behavior to > work, in 2.x -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26349) Pyspark should not accept insecure py4j gateways
[ https://issues.apache.org/jira/browse/SPARK-26349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26349: Assignee: (was: Apache Spark) > Pyspark should not accept insecure py4j gateways > > > Key: SPARK-26349 > URL: https://issues.apache.org/jira/browse/SPARK-26349 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Blocker > > Pyspark normally produces a secure py4j connection between python & the jvm, > but it also lets users provide their own connection. Spark should fail fast > if that connection is insecure. > Note this is breaking a public api, which is why this is targeted at 3.0.0. > SPARK-26019 is about adding a warning, but still allowing the old behavior to > work, in 2.x -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19261) Support `ALTER TABLE table_name ADD COLUMNS(..)` statement
[ https://issues.apache.org/jira/browse/SPARK-19261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733445#comment-16733445 ] suman gorantla commented on SPARK-19261: Please understand that this command runs successfully in Hive/Impala but not in Spark when using spark.sql. Please let me know why commands supported by Hive are not supported by Spark SQL. > Support `ALTER TABLE table_name ADD COLUMNS(..)` statement > -- > > Key: SPARK-19261 > URL: https://issues.apache.org/jira/browse/SPARK-19261 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: StanZhai >Assignee: Xin Wu >Priority: Major > Fix For: 2.2.0 > > > We should support `ALTER TABLE table_name ADD COLUMNS(..)` statement, which > already be used in version < 2.x. > This is very useful for those who want to upgrade there Spark version to 2.x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25401) Reorder the required ordering to match the table's output ordering for bucket join
[ https://issues.apache.org/jira/browse/SPARK-25401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733431#comment-16733431 ] David Vrba commented on SPARK-25401: [~gwang3] could you please recommend someone who would take a look at this PR? Thank you > Reorder the required ordering to match the table's output ordering for bucket > join > -- > > Key: SPARK-25401 > URL: https://issues.apache.org/jira/browse/SPARK-25401 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wang, Gang >Priority: Major > > Currently, we check if SortExec is needed between a operator and its child > operator in method orderingSatisfies, and method orderingSatisfies require > the order in the SortOrders are all the same. > While, take the following case into consideration. > * Table a is bucketed by (a1, a2), sorted by (a2, a1), and buckets number is > 200. > * Table b is bucketed by (b1, b2), sorted by (b2, b1), and buckets number is > 200. > * Table a join table b on (a1=b1, a2=b2) > In this case, if the join is sort merge join, the query planner won't add > exchange on both sides, while, sort will be added on both sides. Actually, > sort is also unnecessary, since in the same bucket, like bucket 1 of table a, > and bucket 1 of table b, (a1=b1, a2=b2) is equivalent to (a2=b2, a1=b1). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
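The equivalence the report describes (within one bucket, data sorted by (a2, a1) is as usable as data sorted by (a1, a2) when both columns are equi-join keys) can be modeled as a relaxed ordering check. A simplified, hypothetical sketch in plain Python; Spark's real orderingSatisfies compares SortOrder expressions, and the names below are illustrative:

```python
def ordering_satisfies(child_order, required_order, equi_join_keys):
    """Return True if `child_order` can serve in place of `required_order`.

    Toy model: orderings are lists of column names, `equi_join_keys` is the
    set of join keys shared by both sides of the sort-merge join.
    """
    prefix = child_order[:len(required_order)]
    # Strict check (current behavior): the required ordering must be an
    # exact prefix of the child's output ordering.
    if prefix == required_order:
        return True
    # Proposed relaxation: if every required column is an equi-join key,
    # any permutation of those keys sorts each bucket equivalently, so a
    # prefix containing exactly the same columns also satisfies it.
    return (set(required_order) <= set(equi_join_keys)
            and set(prefix) == set(required_order))
```

Under this relaxed check, the (a2, a1)-sorted buckets in the example would satisfy a required (a1, a2) ordering, so the planner could skip the redundant SortExec on both sides.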
[jira] [Updated] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-25253: - Fix Version/s: 2.3.3 > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.3, 2.3.3, 2.4.0 > > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733428#comment-16733428 ] Dongjoon Hyun commented on SPARK-25253: --- Yep. Thanks, [~irashid]. :) > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.3, 2.3.3, 2.4.0 > > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733424#comment-16733424 ] Dongjoon Hyun edited comment on SPARK-25253 at 1/3/19 7:33 PM: --- -I don't see this in `branch-2.2`?- Sorry, I was confused with another JIRA. Yes. This is in `branch-2.2` as I commented yesterday. was (Author: dongjoon): ~I don't see this in `branch-2.2`?~ Sorry, I was confused with another JIRA. Yes. This is in `branch-2.2` as I commented yesterday. > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.3, 2.3.3, 2.4.0 > > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733424#comment-16733424 ] Dongjoon Hyun edited comment on SPARK-25253 at 1/3/19 7:32 PM: --- ~I don't see this in `branch-2.2`?~ Sorry, I was confused with another JIRA. Yes. This is in `branch-2.2` as I commented yesterday. was (Author: dongjoon): I don't see this in `branch-2.2`? > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.3, 2.3.3, 2.4.0 > > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733427#comment-16733427 ] Imran Rashid commented on SPARK-25253: -- in branch-2.2: https://github.com/apache/spark/commit/fc1c4e7d24f7d0afb3b79d66aa9812e7dddc2f38 in branch-2.3: https://github.com/apache/spark/commit/a2a54a5f49364a1825932c9f04eb0ff82dd7d465 > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.3, 2.3.3, 2.4.0 > > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up.
[jira] [Commented] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733424#comment-16733424 ] Dongjoon Hyun commented on SPARK-25253: --- I don't see this in `branch-2.2`? > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.3, 2.3.3, 2.4.0 > > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up.
[jira] [Commented] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733419#comment-16733419 ] Imran Rashid commented on SPARK-25253: -- [~hyukjin.kwon] yes, it looks like I did as part of other changes ... sorry I forgot to update this jira > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.3, 2.3.3, 2.4.0 > > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up.
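The refactor discussed in SPARK-25253 centers on local-socket connection logic with consistent IPv6 and error handling. As an illustrative sketch only (this is not PySpark's actual helper; the function name is invented), a single connection routine covering both address families might look like:

```python
import socket

def connect_to_local_port(port, timeout=15.0):
    """Connect to a server on localhost, trying every resolved address
    (IPv6 and IPv4) in turn instead of hard-coding one family."""
    last_err = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            "localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
            return sock
        except OSError as exc:
            sock.close()
            last_err = exc
    # Re-raise the last connection error if every resolved address failed.
    raise last_err if last_err else OSError("localhost did not resolve")
```

Centralizing the `getaddrinfo` loop like this is what removes the copy-and-paste: a host where `localhost` resolves to `::1` first simply falls through to `127.0.0.1` if the server is IPv4-only.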
[jira] [Commented] (SPARK-26519) spark sql CHANGE COLUMN not working
[ https://issues.apache.org/jira/browse/SPARK-26519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733413#comment-16733413 ] Dongjoon Hyun commented on SPARK-26519: --- +1 for [~hyukjin.kwon]. [~sumanGorantla]. `Operation not allowed` means this is a new feature request for Spark rather than a bug. We usually don't say Hive/Impala has a bug when they don't support Spark commands such as the `CACHE TABLE` syntax. > spark sql CHANGE COLUMN not working > -- > > Key: SPARK-26519 > URL: https://issues.apache.org/jira/browse/SPARK-26519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: !image-2019-01-02-14-25-34-594.png! >Reporter: suman gorantla >Priority: Major > Attachments: sparksql error.PNG > > > Dear Team, > with spark sql I am unable to change the newly added column() position after > an existing column in the table (old_column) of a hive external table please > see the screenshot as in below > scala> sql("ALTER TABLE enterprisedatalakedev.tmptst ADD COLUMNs (new_col > STRING)") > res14: org.apache.spark.sql.DataFrame = [] > sql("ALTER TABLE enterprisedatalakedev.tmptst CHANGE COLUMN new_col new_col > STRING AFTER old_col ") > org.apache.spark.sql.catalyst.parser.ParseException: > Operation not allowed: ALTER TABLE table [PARTITION partition_spec] CHANGE > COLUMN ... 
FIRST | AFTER otherCol(line 1, pos 0) > == SQL == > ALTER TABLE enterprisedatalakedev.tmptst CHANGE COLUMN new_col new_col > STRING AFTER old_col > ^^^ > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.operationNotAllowed(ParserUtils.scala:39) > at > org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitChangeColumn$1.apply(SparkSqlParser.scala:934) > at > org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitChangeColumn$1.apply(SparkSqlParser.scala:928) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99) > at > org.apache.spark.sql.execution.SparkSqlAstBuilder.visitChangeColumn(SparkSqlParser.scala:928) > at > org.apache.spark.sql.execution.SparkSqlAstBuilder.visitChangeColumn(SparkSqlParser.scala:55) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ChangeColumnContext.accept(SqlBaseParser.java:1485) > at > org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42) > at > org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:70) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:69) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:68) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:97) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623) > ... 
48 elided > !image-2019-01-02-14-25-40-980.png!
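For context on the error above: the `AFTER`/`FIRST` column-reordering clause is Hive DDL that Spark 2.1's SQL parser deliberately rejects, so the statement generally has to be issued through Hive itself rather than through Spark SQL. A sketch of the statement in question (table and column names taken from the report):

```sql
-- Rejected by Spark 2.1's SQL parser; Hive itself accepts this form,
-- since CHANGE COLUMN ... AFTER is a metadata-only Hive operation:
ALTER TABLE enterprisedatalakedev.tmptst
  CHANGE COLUMN new_col new_col STRING AFTER old_col;
```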
[jira] [Commented] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733328#comment-16733328 ] Dongjoon Hyun commented on SPARK-22951: --- This is backported to `branch-2.2` via https://github.com/apache/spark/pull/23434 . > count() after dropDuplicates() on emptyDataFrame returns incorrect value > > > Key: SPARK-22951 > URL: https://issues.apache.org/jira/browse/SPARK-22951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.2.0, 2.3.0 >Reporter: Michael Dreibelbis >Assignee: Feng Liu >Priority: Major > Labels: correctness > Fix For: 2.2.3, 2.3.0 > > > here is a minimal Spark Application to reproduce: > {code} > import org.apache.spark.sql.SQLContext > import org.apache.spark.{SparkConf, SparkContext} > object DropDupesApp extends App { > > override def main(args: Array[String]): Unit = { > val conf = new SparkConf() > .setAppName("test") > .setMaster("local") > val sc = new SparkContext(conf) > val sql = SQLContext.getOrCreate(sc) > assert(sql.emptyDataFrame.count == 0) // expected > assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected > } > > } > {code}
[jira] [Updated] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22951: -- Fix Version/s: 2.2.3 > count() after dropDuplicates() on emptyDataFrame returns incorrect value > > > Key: SPARK-22951 > URL: https://issues.apache.org/jira/browse/SPARK-22951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.2.0, 2.3.0 >Reporter: Michael Dreibelbis >Assignee: Feng Liu >Priority: Major > Labels: correctness > Fix For: 2.2.3, 2.3.0 > > > here is a minimal Spark Application to reproduce: > {code} > import org.apache.spark.sql.SQLContext > import org.apache.spark.{SparkConf, SparkContext} > object DropDupesApp extends App { > > override def main(args: Array[String]): Unit = { > val conf = new SparkConf() > .setAppName("test") > .setMaster("local") > val sc = new SparkContext(conf) > val sql = SQLContext.getOrCreate(sc) > assert(sql.emptyDataFrame.count == 0) // expected > assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected > } > > } > {code}
[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache
[ https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733318#comment-16733318 ] Gabor Somogyi commented on SPARK-26385: --- Yeah, and additional logs would also be good, like driver/executor logs. Without them it's hard to tell. {quote}token for REMOVED: HDFS_DELEGATION_TOKEN{quote} looks weird at least. > YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in > cache > --- > > Key: SPARK-26385 > URL: https://issues.apache.org/jira/browse/SPARK-26385 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Hadoop 2.6.0, Spark 2.4.0 >Reporter: T M >Priority: Major > > > Hello, > > I have Spark Structured Streaming job which is runnning on YARN(Hadoop 2.6.0, > Spark 2.4.0). After 25-26 hours, my job stops working with following error: > {code:java} > 2018-12-16 22:35:17 ERROR > org.apache.spark.internal.Logging$class.logError(Logging.scala:91): Query > TestQuery[id = a61ce197-1d1b-4e82-a7af-60162953488b, runId = > a56878cf-dfc7-4f6a-ad48-02cf738ccc2f] terminated with error > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (token for REMOVED: HDFS_DELEGATION_TOKEN owner=REMOVED, renewer=yarn, > realUser=, issueDate=1544903057122, maxDate=1545507857122, > sequenceNumber=10314, masterKeyId=344) can't be found in cache at > org.apache.hadoop.ipc.Client.call(Client.java:1470) at > org.apache.hadoop.ipc.Client.call(Client.java:1401) at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752) > at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
java.lang.reflect.Method.invoke(Method.java:498) at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) at > org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977) at > org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:133) at > org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1120) at > org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116) at > org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at > org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1116) at > org.apache.hadoop.fs.FileContext$Util.exists(FileContext.java:1581) at > org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.exists(CheckpointFileManager.scala:326) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:142) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:544) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:554) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:542) > at > 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming
[jira] [Commented] (SPARK-26257) SPIP: Interop Support for Spark Language Extensions
[ https://issues.apache.org/jira/browse/SPARK-26257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733297#comment-16733297 ] Tyson Condie commented on SPARK-26257: -- [~rxin] Any thoughts regarding this SPIP? I have a draft of a detailed design document describing what code changes could be made in order to facilitate a general purpose language interop API. Should that be shared here? > SPIP: Interop Support for Spark Language Extensions > --- > > Key: SPARK-26257 > URL: https://issues.apache.org/jira/browse/SPARK-26257 > Project: Spark > Issue Type: Improvement > Components: PySpark, R, Spark Core >Affects Versions: 2.4.0 >Reporter: Tyson Condie >Priority: Minor > > h2. ** Background and Motivation: > There is a desire for third party language extensions for Apache Spark. Some > notable examples include: > * C#/F# from project Mobius [https://github.com/Microsoft/Mobius] > * Haskell from project sparkle [https://github.com/tweag/sparkle] > * Julia from project Spark.jl [https://github.com/dfdx/Spark.jl] > Presently, Apache Spark supports Python and R via a tightly integrated > interop layer. It would seem that much of that existing interop layer could > be refactored into a clean surface for general (third party) language > bindings, such as the above mentioned. More specifically, could we generalize > the following modules: > * Deploy runners (e.g., PythonRunner and RRunner) > * DataFrame Executors > * RDD operations? > The last being questionable: integrating third party language extensions at > the RDD level may be too heavy-weight and unnecessary given the preference > towards the Dataframe abstraction. 
> The main goals of this effort would be: > * Provide a clean abstraction for third party language extensions making it > easier to maintain (the language extension) with the evolution of Apache Spark > * Provide guidance to third party language authors on how a language > extension should be implemented > * Provide general reusable libraries that are not specific to any language > extension > * Open the door to developers that prefer alternative languages > * Identify and clean up common code shared between Python and R interops > h2. Target Personas: > Data Scientists, Data Engineers, Library Developers > h2. Goals: > Data scientists and engineers will have the opportunity to work with Spark in > languages other than what’s natively supported. Library developers will be > able to create language extensions for Spark in a clean way. The interop > layer should also provide guidance for developing language extensions. > h2. Non-Goals: > The proposal does not aim to create an actual language extension. Rather, it > aims to provide a stable interop layer for third party language extensions to > dock. > h2. Proposed API Changes: > Much of the work will involve generalizing existing interop APIs for PySpark > and R, specifically for the Dataframe API. For instance, it would be good to > have a general deploy.Runner (similar to PythonRunner) for language extension > efforts. In Spark SQL, it would be good to have a general InteropUDF and > evaluator (similar to BatchEvalPythonExec). > Low-level RDD operations should not be needed in this initial offering; > depending on the success of the interop layer and with proper demand, RDD > interop could be added later. However, one open question is supporting a > subset of low-level functions that are core to ETL e.g., transform. > h2. 
Optional Design Sketch: > The work would be broken down into two top-level phases: > Phase 1: Introduce general interop API for deploying a driver/application, > running an interop UDF along with any other low-level transformations that > aid with ETL. > Phase 2: Port existing Python and R language extensions to the new interop > layer. This port should be contained solely to the Spark core side, and all > protocols specific to Python and R should not change e.g., Python should > continue to use py4j as the protocol between the Python process and core > Spark. The port itself should be contained to a handful of files e.g., some > examples for Python: PythonRunner, BatchEvalPythonExec, +PythonUDFRunner+, > PythonRDD (possibly), and will mostly involve refactoring common logic into > abstract implementations and utilities. > h2. Optional Rejected Designs: > The clear alternative is the status quo; developers that want to provide a > third-party language extension to Spark do so directly; often by extending > existing Python classes and overriding the portions that are relevant to the > new extension. Not only is this not sound code (e.g., a JuliaRDD is not a > PythonRDD, which contains a lot of reusable code), but it runs the great risk > of future revisions making t
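To make the SPIP's API discussion concrete, here is a minimal sketch of what a generalized interop surface might look like. All names below are invented for illustration; none of these types exist in Spark:

```scala
// Hypothetical illustration only; these types are not part of any Spark API.

// Generalized counterpart of the deploy runners (PythonRunner, RRunner):
// launches the guest-language runtime and wires up its channel to the JVM.
trait InteropRunner {
  def launch(userProgram: String, args: Seq[String]): Unit
}

// Generalized counterpart of a Python/R UDF evaluator (cf. BatchEvalPythonExec):
// the engine hands a batch of serialized rows to the guest runtime and reads
// the serialized results back, independent of which guest language is attached.
trait InteropUDF {
  def evaluate(serializedRows: Array[Byte]): Array[Byte]
}
```

The point of such an abstraction is exactly what the SPIP argues: a third-party extension would implement these interfaces rather than subclassing Python-specific classes.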
[jira] [Resolved] (SPARK-26522) Auth secret error in RBackendAuthHandler
[ https://issues.apache.org/jira/browse/SPARK-26522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26522. Resolution: Not A Problem Yes, this is an issue with Livy that is fixed in the master branch. Hopefully we can get to a release sometime soon... > Auth secret error in RBackendAuthHandler > > > Key: SPARK-26522 > URL: https://issues.apache.org/jira/browse/SPARK-26522 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1 >Reporter: Euijun >Priority: Minor > > Hi expert, > I try to use livy to connect sparkR backend. > This is related to > [https://stackoverflow.com/questions/53900995/livy-spark-r-issue] > > Error message is, > {code:java} > Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, : > Auth secret not provided in environment.{code} > > caused by, > spark-2.3.1/R/pkg/R/sparkR.R > {code:java} > sparkR.sparkContext <- function( > ... > authSecret <- Sys.getenv("SPARKR_BACKEND_AUTH_SECRET") > if (nchar(authSecret) == 0) { > stop("Auth secret not provided in environment.") > } > ... > ) > {code} > > Best regard.
[jira] [Resolved] (SPARK-26517) Avoid duplicate test in ParquetSchemaPruningSuite
[ https://issues.apache.org/jira/browse/SPARK-26517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26517. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23427 [https://github.com/apache/spark/pull/23427] > Avoid duplicate test in ParquetSchemaPruningSuite > - > > Key: SPARK-26517 > URL: https://issues.apache.org/jira/browse/SPARK-26517 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 3.0.0 > > > `testExactCaseQueryPruning` and `testMixedCaseQueryPruning` don't need to set > up `PARQUET_VECTORIZED_READER_ENABLED` config. Because `withMixedCaseData` > will run against both Spark vectorized reader and Parquet-mr reader.
[jira] [Assigned] (SPARK-26447) Allow OrcColumnarBatchReader to return less partition columns
[ https://issues.apache.org/jira/browse/SPARK-26447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26447: --- Assignee: Gengliang Wang > Allow OrcColumnarBatchReader to return less partition columns > - > > Key: SPARK-26447 > URL: https://issues.apache.org/jira/browse/SPARK-26447 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Currently OrcColumnarBatchReader returns all the partition column values in > the batch read. > In data source V2, we can improve it by returning the required partition > column values only.
[jira] [Commented] (SPARK-26359) Spark checkpoint restore fails after query restart
[ https://issues.apache.org/jira/browse/SPARK-26359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733209#comment-16733209 ] Gabor Somogyi commented on SPARK-26359: --- [~Tint] did the suggested workaround work? I would close this, since it happened because of S3's lack of read-after-write consistency. [~ste...@apache.org] {quote}With S3 the time to rename is about 6-10MB/s{quote} all I can say wow :) > Spark checkpoint restore fails after query restart > -- > > Key: SPARK-26359 > URL: https://issues.apache.org/jira/browse/SPARK-26359 > Project: Spark > Issue Type: Bug > Components: Spark Submit, Structured Streaming >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 deployed in standalone-client mode > Checkpointing is done to S3 > The Spark application in question is responsible for running 4 different > queries > Queries are written using Structured Streaming > We are using the following algorithm for hopes of better performance: > spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2" # When the > default is 1 >Reporter: Kaspar Tint >Priority: Major > Attachments: driver-redacted, metadata, redacted-offsets, > state-redacted, worker-redacted > > > We had an incident where one of our structured streaming queries could not be > restarted after a usual S3 checkpointing failure. Now to clarify before > everything else - we are aware of the issues with S3 and are working towards > moving to HDFS but this will take time. S3 will cause queries to fail quite > often during peak hours and we have separate logic to handle this that will > attempt to restart the failed queries if possible. > In this particular case we could not restart one of the failed queries. Seems > like between detecting a failure in the query and starting it up again > something went really wrong with Spark and state in checkpoint folder got > corrupted for some reason. 
> The issue starts with the usual *FileNotFoundException* that happens with S3 > {code:java} > 2018-12-10 21:09:25.785 ERROR MicroBatchExecution: Query feedback [id = > c074233a-2563-40fc-8036-b5e38e2e2c42, runId = > e607eb6e-8431-4269-acab-cc2c1f9f09dd] > terminated with error > java.io.FileNotFoundException: No such file or directory: > s3a://some.domain/spark/checkpoints/49/feedback/offsets/.37 > 348.8227943f-a848-4af5-b5bf-1fea81775b24.tmp > at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088) > at > org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2715) > at > org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:131) > at > org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:726) > at > org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:699) > at org.apache.hadoop.fs.FileContext.rename(FileContext.java:965) > at > org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:331) > at > org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:147) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataL > og.scala:126) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110) > at scala.Option.getOrElse(Option.scala:121) > at > 
org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply$mcV$sp(MicroBatchExecution.scala:382) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBatchExecution.scala:381) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcZ$sp$3.apply(MicroBat
[jira] [Resolved] (SPARK-26447) Allow OrcColumnarBatchReader to return less partition columns
[ https://issues.apache.org/jira/browse/SPARK-26447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26447. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23387 [https://github.com/apache/spark/pull/23387] > Allow OrcColumnarBatchReader to return less partition columns > - > > Key: SPARK-26447 > URL: https://issues.apache.org/jira/browse/SPARK-26447 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > Currently OrcColumnarBatchReader returns all the partition column values in > the batch read. > In data source V2, we can improve it by returning the required partition > column values only.
[jira] [Commented] (SPARK-26389) temp checkpoint folder at executor should be deleted on graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-26389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733201#comment-16733201 ] Gabor Somogyi commented on SPARK-26389: --- A temp checkpoint can be used in a single-node scenario, and it is deleted only if the query didn't fail. > temp checkpoint folder at executor should be deleted on graceful shutdown > - > > Key: SPARK-26389 > URL: https://issues.apache.org/jira/browse/SPARK-26389 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Fengyu Cao >Priority: Major > > {{spark-submit --master mesos:// -conf > spark.streaming.stopGracefullyOnShutdown=true framework>}} > CTRL-C, framework shutdown > {{18/12/18 10:27:36 ERROR MicroBatchExecution: Query [id = > f512e17a-df88-4414-a5cd-a23550cf1e7f, runId = > 24d99723-8d61-48c0-beab-af432f7a19d3] terminated with error > org.apache.spark.SparkException: Writing job aborted.}} > {{/tmp/temporary- on executor not deleted due to > org.apache.spark.SparkException: Writing job aborted., and this temp > checkpoint can't used to recovery.}}
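Related note: a query only falls back to a temporary `/tmp/temporary-*` checkpoint when no explicit location is given. A hedged sketch of pinning a durable checkpoint location instead (the streaming DataFrame `df` and the HDFS path here are hypothetical):

```python
# Sketch only: `df` is an existing streaming DataFrame; the path is made up.
query = (df.writeStream
           .format("console")
           .option("checkpointLocation", "hdfs:///checkpoints/my-query")
           .start())
```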
[jira] [Assigned] (SPARK-26517) Avoid duplicate test in ParquetSchemaPruningSuite
[ https://issues.apache.org/jira/browse/SPARK-26517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26517: - Assignee: Liang-Chi Hsieh > Avoid duplicate test in ParquetSchemaPruningSuite > - > > Key: SPARK-26517 > URL: https://issues.apache.org/jira/browse/SPARK-26517 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Minor > > `testExactCaseQueryPruning` and `testMixedCaseQueryPruning` don't need to set > up `PARQUET_VECTORIZED_READER_ENABLED` config. Because `withMixedCaseData` > will run against both Spark vectorized reader and Parquet-mr reader.
[jira] [Resolved] (SPARK-26501) Unexpected overriden of exitFn in SparkSubmitSuite
[ https://issues.apache.org/jira/browse/SPARK-26501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26501. --- Resolution: Fixed Assignee: liupengcheng Fix Version/s: 3.0.0 2.4.1 Resolved by https://github.com/apache/spark/pull/23404 > Unexpected overriden of exitFn in SparkSubmitSuite > -- > > Key: SPARK-26501 > URL: https://issues.apache.org/jira/browse/SPARK-26501 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: liupengcheng >Assignee: liupengcheng >Priority: Minor > Fix For: 2.4.1, 3.0.0 > > > When I run SparkSubmitSuite of spark2.3.2 in the IntelliJ IDE, I found that some > tests cannot pass when I run them one by one, but they passed when the whole > SparkSubmitSuite was run. > Failed tests when run separately: > > {code:java} > test("SPARK_CONF_DIR overrides spark-defaults.conf") { > forConfDir(Map("spark.executor.memory" -> "2.3g")) { path => > val unusedJar = TestUtils.createJarWithClasses(Seq.empty) > val args = Seq( > "--class", SimpleApplicationTest.getClass.getName.stripSuffix("$"), > "--name", "testApp", > "--master", "local", > unusedJar.toString) > val appArgs = new SparkSubmitArguments(args, Map("SPARK_CONF_DIR" -> > path)) > assert(appArgs.defaultPropertiesFile != null) > assert(appArgs.defaultPropertiesFile.startsWith(path)) > assert(appArgs.propertiesFile == null) > appArgs.executorMemory should be ("2.3g") > } > } > {code} > Failure reason: > {code:java} > Error: Executor Memory cores must be a positive number > Run with --help for usage help or --verbose for debug output > {code} > > After carefully checking the code, I found the exitFn of SparkSubmit is > overridden by earlier tests via the testPrematureExit call. > Although the above test was fixed by SPARK-22941, the overriding of exitFn > might cause other problems in the future. 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
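The failure mode described in SPARK-26501 — one test overriding a suite-wide exit hook and leaking that override into later tests — can be sketched outside Spark. The names below are illustrative, not Spark's actual SparkSubmit API:

```scala
// A suite-wide mutable exit hook; by default it "really exits" (here modeled
// as throwing, so the sketch is runnable without killing the JVM).
var exitFn: Int => Unit = code => throw new RuntimeException(s"process exit $code")

def testThatTriggersPrematureExit(): Unit = {
  // The test swaps in a harmless hook so the suite survives the exit path...
  exitFn = _ => ()
  // ...but never restores the original, so the override leaks into any
  // test that runs afterwards and relies on real exit behavior.
}

testThatTriggersPrematureExit()
```

After this test runs, a later test calling `exitFn` silently does nothing instead of exiting, which is exactly the cross-test interference the report warns about.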
[jira] [Commented] (SPARK-26075) Cannot broadcast the table that is larger than 8GB : Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733173#comment-16733173 ] Neeraj Bhadani commented on SPARK-26075: [~hyukjin.kwon] I have verified with Spark 2.4 and getting the same error. > Cannot broadcast the table that is larger than 8GB : Spark 2.3 > -- > > Key: SPARK-26075 > URL: https://issues.apache.org/jira/browse/SPARK-26075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Neeraj Bhadani >Priority: Major > > I am trying to use the broadcast join but getting below error in Spark 2.3. > However, the same code is working fine in Spark 2.2 > > Upon checking the size of the dataframes its merely 50 MB and I have set the > threshold to 200 MB as well. As I mentioned above same code is working fine > in Spark 2.2 > > {{Error: "Cannot broadcast the table that is larger than 8GB". }} > However, Disabling the broadcasting is working fine. > {{'spark.sql.autoBroadcastJoinThreshold': '-1'}} > > {{Regards,}} > {{Neeraj}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
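The behavior the SPARK-26075 report describes can be modeled with a simplified, hypothetical version of the broadcast decision (the real check lives in Spark's planning code and depends on size *estimation*, which is where a 50 MB table can be mis-estimated past the cap). Names and the exact check are assumptions for illustration:

```scala
// Hypothetical simplification: broadcast only when the estimated size fits
// both the configured spark.sql.autoBroadcastJoinThreshold and the hard 8 GB
// cap behind the reported error; a threshold of -1 disables auto-broadcast,
// which is the workaround mentioned above.
val hardCapBytes = 8L * 1024 * 1024 * 1024

def wouldBroadcast(estimatedBytes: Long, thresholdBytes: Long): Boolean =
  thresholdBytes >= 0 && estimatedBytes <= thresholdBytes && estimatedBytes <= hardCapBytes
```

Under this model the reporter's 50 MB table with a 200 MB threshold should broadcast; the error suggests the estimate fed into the check, not the actual data size, exceeded 8 GB.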
[jira] [Resolved] (SPARK-26514) Introduce a new conf to improve the task parallelism
[ https://issues.apache.org/jira/browse/SPARK-26514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26514. --- Resolution: Won't Fix > Introduce a new conf to improve the task parallelism > > > Key: SPARK-26514 > URL: https://issues.apache.org/jira/browse/SPARK-26514 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.4.0 >Reporter: zhoukang >Priority: Major > > Currently, we have a conf below. > {code:java} > private[spark] val CPUS_PER_TASK = > ConfigBuilder("spark.task.cpus").intConf.createWithDefault(1) > {code} > For some applications which are not cpu-intensive,we may want to let one cpu > to run more than one task. > Like: > {code:java} > private[spark] val TASKS_PER_CPU = > ConfigBuilder("spark.cpu.tasks").intConf.createWithDefault(1) > {code} > Which can improve performance for some applications and can also improve > resource utilization -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
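The scheduling effect of the proposed `spark.cpu.tasks` can be sketched with slot arithmetic (the config names come from the issue; the helper functions are illustrative, not Spark code):

```scala
// Today: spark.task.cpus reserves N cores per task, so an executor with
// `cores` cores offers cores / cpusPerTask concurrent task slots.
def slotsToday(cores: Int, cpusPerTask: Int): Int = cores / cpusPerTask

// Proposal: spark.cpu.tasks would let one core run several tasks, so the
// same executor offers cores * tasksPerCpu slots for non-CPU-bound jobs.
def slotsProposed(cores: Int, tasksPerCpu: Int): Int = cores * tasksPerCpu
```

For an 8-core executor, raising tasks-per-CPU from 1 to 2 doubles the slot count, which is the utilization gain the proposal is after.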
[jira] [Updated] (SPARK-26434) disallow ADAPTIVE_EXECUTION_ENABLED&CBO_ENABLED in org.apache.spark.sql.execution.streaming.StreamExecution#runStream, but logWarning in org.apache.spark.sql.streaming.S
[ https://issues.apache.org/jira/browse/SPARK-26434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-26434: -- Affects Version/s: (was: 2.4.0) 3.0.0 > disallow ADAPTIVE_EXECUTION_ENABLED&CBO_ENABLED in > org.apache.spark.sql.execution.streaming.StreamExecution#runStream, but > logWarning in org.apache.spark.sql.streaming.StreamingQueryManager#createQuery > - > > Key: SPARK-26434 > URL: https://issues.apache.org/jira/browse/SPARK-26434 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: iamhumanbeing >Priority: Trivial > > disallow ADAPTIVE_EXECUTION_ENABLED&CBO_ENABLED in > org.apache.spark.sql.execution.streaming.StreamExecution#runStream, but > logWarning in org.apache.spark.sql.streaming.StreamingQueryManager#createQuery > it is better to logWarning in > org.apache.spark.sql.execution.streaming.StreamExecution#runStream -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26526) test case for SPARK-10316 is not valid any more
Liu, Linhong created SPARK-26526: Summary: test case for SPARK-10316 is not valid any more Key: SPARK-26526 URL: https://issues.apache.org/jira/browse/SPARK-26526 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.4.0 Reporter: Liu, Linhong Test case in [SPARK-10316|https://github.com/apache/spark/pull/8486] is used to make sure non-deterministic `Filter` won't be pushed through `Project` But in current code base this test case can't cover this purpose. Change LogicalRDD to HadoopFsRelation can fix this issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
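The invariant that SPARK-10316's test is meant to protect can be shown outside Spark: a non-deterministic predicate's answer depends on evaluation history, not on the row alone, so the optimizer must not move it across a `Project`. A stateful stand-in for `rand()` makes this concrete:

```scala
// A stand-in for a non-deterministic predicate such as rand() < 0.5:
// it keeps only the first two rows it is ever asked about, so moving
// where (or how often) it is evaluated changes the query result.
var calls = 0
def nonDeterministic(row: Int): Boolean = { calls += 1; calls <= 2 }

val rows = Seq(1, 2, 3, 4)
val planA = rows.filter(nonDeterministic) // evaluated first: keeps 1 and 2
val planB = rows.filter(nonDeterministic) // identical rows, evaluated later: keeps nothing
```

Two "plans" over the same rows disagree, which is why pushdown rules must treat such filters as barriers — and why the test needs a plan shape (e.g. HadoopFsRelation) where the pushdown would actually be attempted.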
[jira] [Assigned] (SPARK-26526) test case for SPARK-10316 is not valid any more
[ https://issues.apache.org/jira/browse/SPARK-26526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26526: Assignee: Apache Spark > test case for SPARK-10316 is not valid any more > --- > > Key: SPARK-26526 > URL: https://issues.apache.org/jira/browse/SPARK-26526 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Liu, Linhong >Assignee: Apache Spark >Priority: Major > > Test case in [SPARK-10316|https://github.com/apache/spark/pull/8486] is used > to make sure non-deterministic `Filter` won't be pushed through `Project` > But in current code base this test case can't cover this purpose. > Change LogicalRDD to HadoopFsRelation can fix this issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26526) test case for SPARK-10316 is not valid any more
[ https://issues.apache.org/jira/browse/SPARK-26526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26526: Assignee: (was: Apache Spark) > test case for SPARK-10316 is not valid any more > --- > > Key: SPARK-26526 > URL: https://issues.apache.org/jira/browse/SPARK-26526 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Liu, Linhong >Priority: Major > > Test case in [SPARK-10316|https://github.com/apache/spark/pull/8486] is used > to make sure non-deterministic `Filter` won't be pushed through `Project` > But in current code base this test case can't cover this purpose. > Change LogicalRDD to HadoopFsRelation can fix this issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26454) While creating new UDF with JAR though UDF is created successfully, it throws IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-26454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26454: Assignee: (was: Apache Spark) > While creating new UDF with JAR though UDF is created successfully, it throws > IllegegalArgument Exception > - > > Key: SPARK-26454 > URL: https://issues.apache.org/jira/browse/SPARK-26454 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.2 >Reporter: Udbhav Agrawal >Priority: Trivial > Attachments: create_exception.txt > > > 【Test step】: > 1.launch spark-shell > 2. set role admin; > 3. create new function > CREATE FUNCTION Func AS > 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR > 'hdfs:///tmp/super_udf/two_udfs.jar' > 4. Do select on the function > sql("select Func('2018-03-09')").show() > 5.Create new UDF with same JAR > sql("CREATE FUNCTION newFunc AS > 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR > 'hdfs:///tmp/super_udf/two_udfs.jar'") > 6. Do select on the new function created. > sql("select newFunc ('2018-03-09')").show() > 【Output】: > Function is getting created but illegal argument exception is thrown , select > provides result but with illegal argument exception. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26454) While creating new UDF with JAR though UDF is created successfully, it throws IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-26454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26454: Assignee: Apache Spark > While creating new UDF with JAR though UDF is created successfully, it throws > IllegegalArgument Exception > - > > Key: SPARK-26454 > URL: https://issues.apache.org/jira/browse/SPARK-26454 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.2 >Reporter: Udbhav Agrawal >Assignee: Apache Spark >Priority: Trivial > Attachments: create_exception.txt > > > 【Test step】: > 1.launch spark-shell > 2. set role admin; > 3. create new function > CREATE FUNCTION Func AS > 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR > 'hdfs:///tmp/super_udf/two_udfs.jar' > 4. Do select on the function > sql("select Func('2018-03-09')").show() > 5.Create new UDF with same JAR > sql("CREATE FUNCTION newFunc AS > 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR > 'hdfs:///tmp/super_udf/two_udfs.jar'") > 6. Do select on the new function created. > sql("select newFunc ('2018-03-09')").show() > 【Output】: > Function is getting created but illegal argument exception is thrown , select > provides result but with illegal argument exception. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26525) Fast release memory of ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-26525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liupengcheng updated SPARK-26525: - Description: Currently, spark would not release ShuffleBlockFetcherIterator until the whole task finished. In some conditions, it incurs memory leak. An example is Shuffle -> map -> Coalesce(shuffle = false). Each ShuffleBlockFetcherIterator contains some metas about MapStatus(blocksByAddress) and each ShuffleMapTask will keep n(max to shuffle partitions) shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of TaskContext, in some case, it may take huge memory and the memory will not released until the task finished. Actually, We can release ShuffleBlockFetcherIterator as soon as it's consumed. was: Currently, spark would not release ShuffleBlockFetcherIterator until the whole task finished. In some conditions, it incurs memory leak. An example is Shuffle -> map -> Coalesce(shuffle = false). Each ShuffleMapTask will keep n(max to shuffle partitions) shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of TaskContext, and each ShuffleBlockFetcherIterator contains some metas about MapStatus(blocksByAddress), in some case, it may take huge memory and the memory will not released until the task finished. Actually, We can release ShuffleBlockFetcherIterator as soon as it's consumed. > Fast release memory of ShuffleBlockFetcherIterator > -- > > Key: SPARK-26525 > URL: https://issues.apache.org/jira/browse/SPARK-26525 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.2 >Reporter: liupengcheng >Priority: Major > > Currently, spark would not release ShuffleBlockFetcherIterator until the > whole task finished. > In some conditions, it incurs memory leak. > An example is Shuffle -> map -> Coalesce(shuffle = false). 
Each > ShuffleBlockFetcherIterator contains some metas about > MapStatus(blocksByAddress) and each ShuffleMapTask will keep n(max to shuffle > partitions) shuffleBlockFetcherIterator for they are refered by > onCompleteCallbacks of TaskContext, in some case, it may take huge memory and > the memory will not released until the task finished. > Actually, We can release ShuffleBlockFetcherIterator as soon as it's consumed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
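The fix direction described in SPARK-26525 — dropping the iterator's references as soon as it is drained, rather than holding them until task completion — can be sketched with a wrapper similar in spirit to Spark's `CompletionIterator`. The class below is an illustrative sketch, not the actual patch:

```scala
// Runs `cleanup` exactly once, the first time the wrapped iterator reports
// exhaustion, instead of pinning buffers until the task-completion callbacks.
class EagerCleanupIterator[A](underlying: Iterator[A], cleanup: () => Unit) extends Iterator[A] {
  private var cleaned = false
  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !cleaned) { cleaned = true; cleanup() }
    more
  }
  override def next(): A = underlying.next()
}
```

With such a wrapper, each of the n iterators kept alive by the coalesced task could release its MapStatus metadata the moment its partition is fully consumed.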
[jira] [Commented] (SPARK-26433) Tail method for spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-26433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732895#comment-16732895 ] Hyukjin Kwon commented on SPARK-26433: -- There are few potential workarounds. For instnace, http://www.swi.com/spark-rdd-getting-bottom-records/ or {{df.sort($"ColumnName".desc).show()}}. BTW, usually tail or head are used in Scala as below (IMHO): {code} scala> Seq(1, 2, 3).tail res10: Seq[Int] = List(2, 3) scala> Seq(1, 2, 3).head res11: Int = 1 {code} > Tail method for spark DataFrame > --- > > Key: SPARK-26433 > URL: https://issues.apache.org/jira/browse/SPARK-26433 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Jan Gorecki >Priority: Major > > There is a head method for spark dataframes which work fine but there doesn't > seems to be tail method. > ``` > >>> ans > >>> > DataFrame[v1: bigint] > > >>> ans.head(3) > >>> > [Row(v1=299443), Row(v1=299493), Row(v1=300751)] > >>> ans.tail(3) > Traceback (most recent call last): > File "", line 1, in > File > "/home/jan/git/db-benchmark/spark/py-spark/lib/python3.6/site-packages/py > spark/sql/dataframe.py", line 1300, in __getattr__ > "'%s' object has no attribute '%s'" % (self.__class__.__name__, name)) > AttributeError: 'DataFrame' object has no attribute 'tail' > ``` > I would like to feature request Tail method for spark dataframe -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
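Until a DataFrame `tail` exists, the requested semantics (the last n rows) can be emulated by numbering rows and keeping the final n. A sketch on plain Scala collections, mirroring the zipWithIndex-style workaround (the helper name is hypothetical):

```scala
// Keep only the rows whose position falls within the final n; asking for
// more rows than exist simply returns everything, matching head's leniency.
def tailN[A](rows: Seq[A], n: Int): Seq[A] =
  rows.zipWithIndex.collect { case (row, i) if i >= rows.length - n => row }
```

On a DataFrame the same idea would need a total ordering plus a row index (e.g. via window functions), since distributed rows have no inherent order.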
[jira] [Updated] (SPARK-26525) Fast release memory of ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-26525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liupengcheng updated SPARK-26525: - Description: Currently, spark would not release ShuffleBlockFetcherIterator until the whole task finished. In some conditions, it incurs memory leak. An example is Shuffle -> map -> Coalesce(shuffle = false). Each ShuffleMapTask will keep n(max to shuffle partitions) shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of TaskContext, and each ShuffleBlockFetcherIterator contains some metas about MapStatus(blocksByAddress), in some case, it may take huge memory and the memory will not released until the task finished. Actually, We can release ShuffleBlockFetcherIterator as soon as it's consumed. was: Currently, spark would not release ShuffleBlockFetcherIterator until the whole task finished. In some conditions, it incurs memory leak, because it contains some metas about MapStatus(blocksByAddress), which may take huge memory. An example is Shuffle -> map -> Coalesce(shuffle = false), each ShuffleMapTask will keep n(max to shuffle partitions) shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of TaskContext. We can release ShuffleBlockFetcherIterator as soon as it's consumed. > Fast release memory of ShuffleBlockFetcherIterator > -- > > Key: SPARK-26525 > URL: https://issues.apache.org/jira/browse/SPARK-26525 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.2 >Reporter: liupengcheng >Priority: Major > > Currently, spark would not release ShuffleBlockFetcherIterator until the > whole task finished. > In some conditions, it incurs memory leak. > An example is Shuffle -> map -> Coalesce(shuffle = false). 
Each > ShuffleMapTask will keep n(max to shuffle partitions) > shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of > TaskContext, and each ShuffleBlockFetcherIterator contains some metas about > MapStatus(blocksByAddress), in some case, it may take huge memory and the > memory will not released until the task finished. > Actually, We can release ShuffleBlockFetcherIterator as soon as it's consumed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732882#comment-16732882 ] Hyukjin Kwon commented on SPARK-25253: -- [~irashid], did you manually backport this into branch-2.2? > Refactor pyspark connection & authentication > > > Key: SPARK-25253 > URL: https://issues.apache.org/jira/browse/SPARK-25253 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.3, 2.4.0 > > > We've got a few places in pyspark that connect to local sockets, with varying > levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. > should be pretty easy to clean this up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26525) Fast release memory of ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-26525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26525: Assignee: (was: Apache Spark) > Fast release memory of ShuffleBlockFetcherIterator > -- > > Key: SPARK-26525 > URL: https://issues.apache.org/jira/browse/SPARK-26525 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.2 >Reporter: liupengcheng >Priority: Major > > Currently, spark would not release ShuffleBlockFetcherIterator until the > whole task finished. > In some conditions, it incurs memory leak, because it contains some metas > about MapStatus(blocksByAddress), which may take huge memory. > An example is Shuffle -> map -> Coalesce(shuffle = false), each > ShuffleMapTask will keep n(max to shuffle partitions) > shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of > TaskContext. > We can release ShuffleBlockFetcherIterator as soon as it's consumed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26502) Get rid of hiveResultString() in QueryExecution
[ https://issues.apache.org/jira/browse/SPARK-26502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732874#comment-16732874 ] Hyukjin Kwon commented on SPARK-26502: -- Can we update JIRA as well, [~maxgekk]? > Get rid of hiveResultString() in QueryExecution > --- > > Key: SPARK-26502 > URL: https://issues.apache.org/jira/browse/SPARK-26502 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > The method hiveResultString() of QueryExecution is used in test and > SparkSQLDriver. It should be moved from QueryExecution to more specific class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26502) Get rid of hiveResultString() in QueryExecution
[ https://issues.apache.org/jira/browse/SPARK-26502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732873#comment-16732873 ] Hyukjin Kwon commented on SPARK-26502: -- Fixed in https://github.com/apache/spark/pull/23409 > Get rid of hiveResultString() in QueryExecution > --- > > Key: SPARK-26502 > URL: https://issues.apache.org/jira/browse/SPARK-26502 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > The method hiveResultString() of QueryExecution is used in test and > SparkSQLDriver. It should be moved from QueryExecution to more specific class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26525) Fast release memory of ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-26525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26525: Assignee: Apache Spark > Fast release memory of ShuffleBlockFetcherIterator > -- > > Key: SPARK-26525 > URL: https://issues.apache.org/jira/browse/SPARK-26525 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.2 >Reporter: liupengcheng >Assignee: Apache Spark >Priority: Major > > Currently, spark would not release ShuffleBlockFetcherIterator until the > whole task finished. > In some conditions, it incurs memory leak, because it contains some metas > about MapStatus(blocksByAddress), which may take huge memory. > An example is Shuffle -> map -> Coalesce(shuffle = false), each > ShuffleMapTask will keep n(max to shuffle partitions) > shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of > TaskContext. > We can release ShuffleBlockFetcherIterator as soon as it's consumed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26502) Get rid of hiveResultString() in QueryExecution
[ https://issues.apache.org/jira/browse/SPARK-26502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-26502. --- Resolution: Fixed Assignee: Maxim Gekk Fix Version/s: 3.0.0 > Get rid of hiveResultString() in QueryExecution > --- > > Key: SPARK-26502 > URL: https://issues.apache.org/jira/browse/SPARK-26502 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > The method hiveResultString() of QueryExecution is used in test and > SparkSQLDriver. It should be moved from QueryExecution to more specific class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26524) If the application directory fails to be created on the SPARK_WORKER_DIR on some worker nodes (for example, bad disk or disk has no capacity), the application executor w
[ https://issues.apache.org/jira/browse/SPARK-26524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26524: Assignee: Apache Spark > If the application directory fails to be created on the SPARK_WORKER_DIR on > some woker nodes (for example, bad disk or disk has no capacity), the > application executor will be allocated indefinitely. > -- > > Key: SPARK-26524 > URL: https://issues.apache.org/jira/browse/SPARK-26524 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: hantiantian >Assignee: Apache Spark >Priority: Major > > When the spark worker is started, the workerdir is created successfully. When > the application is submitted, the disks mounted by the workerdir and > worker122 workerdir are damaged. > When a worker allocates an executor, it creates a working directory and a > temporary directory. If the creation fails, the executor allocation fails. > The application directory fails to be created on the SPARK_WORKER_DIR on > woker121 and worker122,the application executor will be allocated > indefinitely. 
> 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Removing > executor app-20190103154954-/5762 because it is FAILED > 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Launching > executor app-20190103154954-/5765 on worker > worker-20190103154858-worker121-37199 > 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Removing > executor app-20190103154954-/5764 because it is FAILED > 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Launching > executor app-20190103154954-/5766 on worker > worker-20190103154920-worker122-41273 > 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Removing > executor app-20190103154954-/5766 because it is FAILED > 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Launching > executor app-20190103154954-/5767 on worker > worker-20190103154920-worker122-41273 > 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Removing > executor app-20190103154954-/5765 because it is FAILED > 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Launching > executor app-20190103154954-/5768 on worker > worker-20190103154858-worker121-37199 > ... > I observed the code and found that spark has some processing for the failure > of the executor allocation. However, it can only handle the case where the > current application does not have an executor that has been successfully > assigned. > if (!normalExit > && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES > && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path > val execs = appInfo.executors.values > if (!execs.exists(_.state == ExecutorState.RUNNING)) { > logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " + > s"${appInfo.retryCount} times; removing it") > removeApplication(appInfo, ApplicationState.FAILED) > } > } > Master will only judge whether the worker is available according to the > resources of the worker. 
> // Filter out workers that don't have enough resources to launch an executor > val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE) > .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB && > worker.coresFree >= coresPerExecutor) > .sortBy(_.coresFree).reverse > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
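The guard quoted in the report can be restated as a tiny predicate (illustrative names) to show why the loop never terminates: as long as any previously launched executor is still RUNNING, the application is never failed, so workers whose directories cannot be created keep receiving and failing executors.

```scala
// Mirrors the quoted condition: kill the app only when retries are exhausted
// AND no executor is currently RUNNING. One healthy executor anywhere keeps
// the app alive, so the broken workers are retried indefinitely.
def shouldKillApp(retryCount: Int, maxRetries: Int, anyRunning: Boolean): Boolean =
  maxRetries >= 0 && retryCount >= maxRetries && !anyRunning
```

A fix would have to consider per-worker failure history in the usable-workers filter, not just free memory and cores.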
[jira] [Assigned] (SPARK-26524) If the application directory fails to be created on the SPARK_WORKER_DIR on some worker nodes (for example, bad disk or disk has no capacity), the application executor w
[ https://issues.apache.org/jira/browse/SPARK-26524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26524: Assignee: (was: Apache Spark) > If the application directory fails to be created on the SPARK_WORKER_DIR on > some woker nodes (for example, bad disk or disk has no capacity), the > application executor will be allocated indefinitely. > -- > > Key: SPARK-26524 > URL: https://issues.apache.org/jira/browse/SPARK-26524 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: hantiantian >Priority: Major > > When the spark worker is started, the workerdir is created successfully. When > the application is submitted, the disks mounted by the workerdir and > worker122 workerdir are damaged. > When a worker allocates an executor, it creates a working directory and a > temporary directory. If the creation fails, the executor allocation fails. > The application directory fails to be created on the SPARK_WORKER_DIR on > woker121 and worker122,the application executor will be allocated > indefinitely. 
> 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Removing > executor app-20190103154954-/5762 because it is FAILED > 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Launching > executor app-20190103154954-/5765 on worker > worker-20190103154858-worker121-37199 > 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Removing > executor app-20190103154954-/5764 because it is FAILED > 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Launching > executor app-20190103154954-/5766 on worker > worker-20190103154920-worker122-41273 > 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Removing > executor app-20190103154954-/5766 because it is FAILED > 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Launching > executor app-20190103154954-/5767 on worker > worker-20190103154920-worker122-41273 > 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Removing > executor app-20190103154954-/5765 because it is FAILED > 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Launching > executor app-20190103154954-/5768 on worker > worker-20190103154858-worker121-37199 > ... > I observed the code and found that spark has some processing for the failure > of the executor allocation. However, it can only handle the case where the > current application does not have an executor that has been successfully > assigned. > if (!normalExit > && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES > && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path > val execs = appInfo.executors.values > if (!execs.exists(_.state == ExecutorState.RUNNING)) { > logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " + > s"${appInfo.retryCount} times; removing it") > removeApplication(appInfo, ApplicationState.FAILED) > } > } > Master will only judge whether the worker is available according to the > resources of the worker. 
> // Filter out workers that don't have enough resources to launch an executor > val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE) > .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB && > worker.coresFree >= coresPerExecutor) > .sortBy(_.coresFree).reverse > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26525) Fast release memory of ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-26525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liupengcheng updated SPARK-26525: - Description: Currently, spark would not release ShuffleBlockFetcherIterator until the whole task finished. In some conditions, it incurs memory leak, because it contains some metas about MapStatus(blocksByAddress), which may take huge memory. An example is Shuffle -> map -> Coalesce(shuffle = false), each ShuffleMapTask will keep n(max to shuffle partitions) shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of TaskContext. We can release ShuffleBlockFetcherIterator as soon as it's consumed. was: Currently, spark would not release ShuffleBlockFetcherIterator until the whole task finished. In some conditions, it incurs memory leak. An example is Shuffle -> map -> Coalesce(shuffle = false), each ShuffleMapTask will keep n(max to shuffle partitions) shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of TaskContext. We can release ShuffleBlockFetcherIterator as soon as it's consumed. > Fast release memory of ShuffleBlockFetcherIterator > -- > > Key: SPARK-26525 > URL: https://issues.apache.org/jira/browse/SPARK-26525 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.2 >Reporter: liupengcheng >Priority: Major > > Currently, spark would not release ShuffleBlockFetcherIterator until the > whole task finished. > In some conditions, it incurs memory leak, because it contains some metas > about MapStatus(blocksByAddress), which may take huge memory. > An example is Shuffle -> map -> Coalesce(shuffle = false), each > ShuffleMapTask will keep n(max to shuffle partitions) > shuffleBlockFetcherIterator for they are refered by onCompleteCallbacks of > TaskContext. > We can release ShuffleBlockFetcherIterator as soon as it's consumed. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26525) Fast release memory of ShuffleBlockFetcherIterator
liupengcheng created SPARK-26525:
Summary: Fast release memory of ShuffleBlockFetcherIterator
Key: SPARK-26525
URL: https://issues.apache.org/jira/browse/SPARK-26525
Project: Spark
Issue Type: Improvement
Components: Shuffle
Affects Versions: 2.3.2
Reporter: liupengcheng

Currently, Spark does not release the ShuffleBlockFetcherIterator until the whole task has finished. In some conditions this causes a memory leak.
An example is Shuffle -> map -> Coalesce(shuffle = false): each ShuffleMapTask keeps up to n (the number of shuffle partitions) ShuffleBlockFetcherIterators, because they are referred to by the onCompleteCallbacks of the TaskContext.
We can release each ShuffleBlockFetcherIterator as soon as it is consumed.
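The eager-release idea above can be sketched as follows. This is an illustrative, simplified sketch only: the names Fetcher, consumedWith, and cleanup are hypothetical stand-ins for ShuffleBlockFetcherIterator and its buffered blocksByAddress metadata, and the pattern mirrors Spark's CompletionIterator, which runs a cleanup callback once an iterator is drained rather than waiting for task completion.

```scala
// Illustrative sketch only: Fetcher and consumedWith are hypothetical names,
// not Spark's actual API; they stand in for ShuffleBlockFetcherIterator.
object EagerRelease {
  // Simplified stand-in for the fetcher and its blocksByAddress metadata.
  final class Fetcher(blocks: Seq[Int]) {
    var released = false
    def iterator: Iterator[Int] = blocks.iterator
    // In Spark this would drop the buffered MapStatus metadata.
    def cleanup(): Unit = { released = true }
  }

  // Wrap the fetcher's iterator so cleanup() runs as soon as the last element
  // is consumed, instead of waiting for the task-completion callback.
  def consumedWith(fetcher: Fetcher): Iterator[Int] = new Iterator[Int] {
    private val it = fetcher.iterator
    def hasNext: Boolean = {
      val more = it.hasNext
      if (!more && !fetcher.released) fetcher.cleanup()
      more
    }
    def next(): Int = it.next()
  }

  def main(args: Array[String]): Unit = {
    val fetcher = new Fetcher(Seq(1, 2, 3))
    val total = consumedWith(fetcher).sum
    println(s"sum=$total released=${fetcher.released}") // prints "sum=6 released=true"
  }
}
```

The key design point is that the wrapper, not the task, owns the release: once hasNext first returns false the metadata is dropped, even if the TaskContext keeps a reference to the wrapper until the task ends.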
[jira] [Created] (SPARK-26524) If the application directory fails to be created in SPARK_WORKER_DIR on some worker nodes (for example, a bad disk or a disk with no free capacity), the application executor will be allocated indefinitely
hantiantian created SPARK-26524:
---
Summary: If the application directory fails to be created in SPARK_WORKER_DIR on some worker nodes (for example, a bad disk or a disk with no free capacity), the application executor will be allocated indefinitely.
Key: SPARK-26524
URL: https://issues.apache.org/jira/browse/SPARK-26524
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 2.4.0
Reporter: hantiantian

When the Spark worker starts, the worker dir is created successfully. After the application is submitted, the disks backing the worker dirs on worker121 and worker122 are damaged. When a worker allocates an executor, it creates a working directory and a temporary directory; if this creation fails, the executor allocation fails. Because the application directory cannot be created in SPARK_WORKER_DIR on worker121 and worker122, application executors are allocated indefinitely:

2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-/5762 because it is FAILED
2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-/5765 on worker worker-20190103154858-worker121-37199
2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-/5764 because it is FAILED
2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-/5766 on worker worker-20190103154920-worker122-41273
2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-/5766 because it is FAILED
2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-/5767 on worker worker-20190103154920-worker122-41273
2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-/5765 because it is FAILED
2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-/5768 on worker worker-20190103154858-worker121-37199
...

I looked at the code and found that Spark does have some handling for executor-allocation failures. However, it only covers the case where the application has no executor that was ever successfully started:

{code:java}
if (!normalExit
    && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
    && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
  val execs = appInfo.executors.values
  if (!execs.exists(_.state == ExecutorState.RUNNING)) {
    logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
      s"${appInfo.retryCount} times; removing it")
    removeApplication(appInfo, ApplicationState.FAILED)
  }
}
{code}

The Master judges whether a worker is usable only by the worker's free resources:

{code:java}
// Filter out workers that don't have enough resources to launch an executor
val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
  .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
    worker.coresFree >= coresPerExecutor)
  .sortBy(_.coresFree).reverse
{code}
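One possible direction, sketched below under stated assumptions: track consecutive executor-launch failures per worker and exclude a worker once it crosses a threshold, extending the resource-only usableWorkers filter quoted above. All names here (WorkerFailureTracking, maxFailuresPerWorker, isUsable) are illustrative and are not Spark's actual API.

```scala
// Hypothetical sketch, not Spark's real scheduler code: exclude a worker from
// scheduling after repeated executor-launch failures (e.g. a bad SPARK_WORKER_DIR),
// so the Master stops re-allocating executors to it indefinitely.
object WorkerFailureTracking {
  val maxFailuresPerWorker = 3
  private val failures =
    scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)

  def recordFailure(workerId: String): Unit = failures(workerId) += 1
  def recordSuccess(workerId: String): Unit = failures(workerId) = 0

  // Mirrors the Master's resource checks, extended with the failure count.
  def isUsable(workerId: String, alive: Boolean, memFree: Int, memNeeded: Int,
               coresFree: Int, coresNeeded: Int): Boolean = {
    alive && memFree >= memNeeded && coresFree >= coresNeeded &&
      failures(workerId) < maxFailuresPerWorker
  }

  def main(args: Array[String]): Unit = {
    (1 to 3).foreach(_ => recordFailure("worker121"))
    println(isUsable("worker121", alive = true, 4096, 1024, 8, 2)) // prints "false"
    println(isUsable("worker122", alive = true, 4096, 1024, 8, 2)) // prints "true"
  }
}
```

A real fix would also need a recovery path (e.g. resetting the count on success or after a timeout) so a transiently bad worker is not excluded forever.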
[jira] [Commented] (SPARK-26522) Auth secret error in RBackendAuthHandler
[ https://issues.apache.org/jira/browse/SPARK-26522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732762#comment-16732762 ] Hyukjin Kwon commented on SPARK-26522:
--
I think this is something Livy should support, rather than having Spark support insecure connections. See also SPARK-26019. I don't think it's going to be fixed in Spark 3.0, at the very least. WDYT [~vanzin] and [~irashid]?

> Auth secret error in RBackendAuthHandler
> --
>
> Key: SPARK-26522
> URL: https://issues.apache.org/jira/browse/SPARK-26522
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.3.1
> Reporter: Euijun
> Priority: Minor
>
> Hi experts,
> I am trying to use Livy to connect to the SparkR backend. This is related to
> [https://stackoverflow.com/questions/53900995/livy-spark-r-issue]
>
> The error message is:
> {code:java}
> Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
>   Auth secret not provided in environment.{code}
>
> It is caused by this check in spark-2.3.1/R/pkg/R/sparkR.R:
> {code:java}
> sparkR.sparkContext <- function(
> ...
>   authSecret <- Sys.getenv("SPARKR_BACKEND_AUTH_SECRET")
>   if (nchar(authSecret) == 0) {
>     stop("Auth secret not provided in environment.")
>   }
> ...
> )
> {code}
>
> Best regards.
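The check quoted above can be summarized with a minimal sketch (not Spark's actual code): the R side aborts when SPARKR_BACKEND_AUTH_SECRET is missing or empty, so any launcher (such as Livy) must inject the secret into the child R process's environment. AuthSecretCheck and checkSecret below are hypothetical names for illustration.

```scala
// Illustrative sketch of the SPARKR_BACKEND_AUTH_SECRET check; AuthSecretCheck
// is a hypothetical name, not part of Spark.
object AuthSecretCheck {
  // Returns Right(secret) when the variable is set and non-empty,
  // otherwise Left with the same error message SparkR raises.
  def checkSecret(env: Map[String, String]): Either[String, String] =
    env.get("SPARKR_BACKEND_AUTH_SECRET").filter(_.nonEmpty)
      .toRight("Auth secret not provided in environment.")

  def main(args: Array[String]): Unit = {
    println(checkSecret(Map.empty))
    println(checkSecret(Map("SPARKR_BACKEND_AUTH_SECRET" -> "secret")))
  }
}
```

This is why the comment above points at Livy: the fix belongs in the process that launches the R backend, by populating the environment, rather than in Spark relaxing the check.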
[jira] [Assigned] (SPARK-26522) Auth secret error in RBackendAuthHandler
[ https://issues.apache.org/jira/browse/SPARK-26522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-26522:
Assignee: (was: Matt Cheah)

> Auth secret error in RBackendAuthHandler
> --
>
> Key: SPARK-26522
> URL: https://issues.apache.org/jira/browse/SPARK-26522
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.3.1
> Reporter: Euijun
> Priority: Minor
>
> Hi experts,
> I am trying to use Livy to connect to the SparkR backend. This is related to
> [https://stackoverflow.com/questions/53900995/livy-spark-r-issue]
>
> The error message is:
> {code:java}
> Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
>   Auth secret not provided in environment.{code}
>
> It is caused by this check in spark-2.3.1/R/pkg/R/sparkR.R:
> {code:java}
> sparkR.sparkContext <- function(
> ...
>   authSecret <- Sys.getenv("SPARKR_BACKEND_AUTH_SECRET")
>   if (nchar(authSecret) == 0) {
>     stop("Auth secret not provided in environment.")
>   }
> ...
> )
> {code}
>
> Best regards.
[jira] [Updated] (SPARK-26522) Auth secret error in RBackendAuthHandler
[ https://issues.apache.org/jira/browse/SPARK-26522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26522:
-
Labels: (was: newbie)

> Auth secret error in RBackendAuthHandler
> --
>
> Key: SPARK-26522
> URL: https://issues.apache.org/jira/browse/SPARK-26522
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.3.1
> Reporter: Euijun
> Assignee: Matt Cheah
> Priority: Minor
>
> Hi experts,
> I am trying to use Livy to connect to the SparkR backend. This is related to
> [https://stackoverflow.com/questions/53900995/livy-spark-r-issue]
>
> The error message is:
> {code:java}
> Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
>   Auth secret not provided in environment.{code}
>
> It is caused by this check in spark-2.3.1/R/pkg/R/sparkR.R:
> {code:java}
> sparkR.sparkContext <- function(
> ...
>   authSecret <- Sys.getenv("SPARKR_BACKEND_AUTH_SECRET")
>   if (nchar(authSecret) == 0) {
>     stop("Auth secret not provided in environment.")
>   }
> ...
> )
> {code}
>
> Best regards.
[jira] [Resolved] (SPARK-26519) spark sql CHANGE COLUMN not working
[ https://issues.apache.org/jira/browse/SPARK-26519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26519.
--
Resolution: Duplicate

> spark sql CHANGE COLUMN not working
> --
>
> Key: SPARK-26519
> URL: https://issues.apache.org/jira/browse/SPARK-26519
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Environment: !image-2019-01-02-14-25-34-594.png!
> Reporter: suman gorantla
> Priority: Major
> Attachments: sparksql error.PNG
>
> Dear Team,
> With Spark SQL I am unable to move a newly added column (new_col) to a position after an existing column (old_col) in a Hive external table; please see the screenshot below.
> {code:java}
> scala> sql("ALTER TABLE enterprisedatalakedev.tmptst ADD COLUMNS (new_col STRING)")
> res14: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE enterprisedatalakedev.tmptst CHANGE COLUMN new_col new_col STRING AFTER old_col")
> org.apache.spark.sql.catalyst.parser.ParseException:
> Operation not allowed: ALTER TABLE table [PARTITION partition_spec] CHANGE COLUMN ... FIRST | AFTER otherCol(line 1, pos 0)
>
> == SQL ==
> ALTER TABLE enterprisedatalakedev.tmptst CHANGE COLUMN new_col new_col STRING AFTER old_col
> ^^^
>
>   at org.apache.spark.sql.catalyst.parser.ParserUtils$.operationNotAllowed(ParserUtils.scala:39)
>   at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitChangeColumn$1.apply(SparkSqlParser.scala:934)
>   at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitChangeColumn$1.apply(SparkSqlParser.scala:928)
>   at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99)
>   at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitChangeColumn(SparkSqlParser.scala:928)
>   at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitChangeColumn(SparkSqlParser.scala:55)
>   at org.apache.spark.sql.catalyst.parser.SqlBaseParser$ChangeColumnContext.accept(SqlBaseParser.java:1485)
>   at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42)
>   at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71)
>   at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71)
>   at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99)
>   at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:70)
>   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:69)
>   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:68)
>   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:97)
>   at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
>   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
>   ... 48 elided
> {code}
> !image-2019-01-02-14-25-40-980.png!
[jira] [Commented] (SPARK-26519) spark sql CHANGE COLUMN not working
[ https://issues.apache.org/jira/browse/SPARK-26519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732758#comment-16732758 ] Hyukjin Kwon commented on SPARK-26519:
--
This is a duplicate of SPARK-23890 or SPARK-24602.

> spark sql CHANGE COLUMN not working
> --
>
> Key: SPARK-26519
> URL: https://issues.apache.org/jira/browse/SPARK-26519
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Environment: !image-2019-01-02-14-25-34-594.png!
> Reporter: suman gorantla
> Priority: Major
> Attachments: sparksql error.PNG
>
> Dear Team,
> With Spark SQL I am unable to move a newly added column (new_col) to a position after an existing column (old_col) in a Hive external table; please see the screenshot below.
> {code:java}
> scala> sql("ALTER TABLE enterprisedatalakedev.tmptst ADD COLUMNS (new_col STRING)")
> res14: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE enterprisedatalakedev.tmptst CHANGE COLUMN new_col new_col STRING AFTER old_col")
> org.apache.spark.sql.catalyst.parser.ParseException:
> Operation not allowed: ALTER TABLE table [PARTITION partition_spec] CHANGE COLUMN ... FIRST | AFTER otherCol(line 1, pos 0)
>
> == SQL ==
> ALTER TABLE enterprisedatalakedev.tmptst CHANGE COLUMN new_col new_col STRING AFTER old_col
> ^^^
>
>   at org.apache.spark.sql.catalyst.parser.ParserUtils$.operationNotAllowed(ParserUtils.scala:39)
>   at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitChangeColumn$1.apply(SparkSqlParser.scala:934)
>   at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitChangeColumn$1.apply(SparkSqlParser.scala:928)
>   at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99)
>   at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitChangeColumn(SparkSqlParser.scala:928)
>   at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitChangeColumn(SparkSqlParser.scala:55)
>   at org.apache.spark.sql.catalyst.parser.SqlBaseParser$ChangeColumnContext.accept(SqlBaseParser.java:1485)
>   at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42)
>   at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71)
>   at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71)
>   at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99)
>   at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:70)
>   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:69)
>   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:68)
>   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:97)
>   at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
>   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
>   ... 48 elided
> {code}
> !image-2019-01-02-14-25-40-980.png!