[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784194#comment-16784194 ]

Shahid K I commented on SPARK-27019:

Could you please share a screenshot of the SQL page for the second scenario? I don't think it will display like that in that case. The issue happens only when new live execution data is overwritten by the existing data.

> Spark UI's SQL tab shows inconsistent values
>
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
> Project: Spark
> Issue Type: Bug
> Components: SQL, Web UI
> Affects Versions: 2.4.0
> Reporter: peay
> Priority: Major
> Attachments: Screenshot from 2019-03-01 21-31-48.png, application_1550040445209_4748, query-1-details.png, query-1-list.png, query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png
>
> Since 2.4.0, I have frequently been seeing broken output in the SQL tab of the Spark UI, where submitted/duration make no sense and the description shows the ID instead of the actual description.
> Clicking the link to open a query, the SQL plan is missing as well.
> I have tried increasing `spark.scheduler.listenerbus.eventqueue.capacity` to very large values like 30k, out of paranoia that we may have too many events, but to no avail. I have not identified anything in particular that leads to this: it doesn't occur in all my jobs, but it still occurs in a lot of them.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
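For reference, the mitigation the reporter describes can be applied when the session is built. This is only a sketch of that attempt (the 30000 value is the report's example, not a recommendation; the config key itself is a real Spark setting, default 10000):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: raise the listener bus event queue capacity before the
// context starts, as the reporter describes trying.
val spark = SparkSession.builder()
  .appName("sql-tab-repro")
  .config("spark.scheduler.listenerbus.eventqueue.capacity", "30000")
  .getOrCreate()
```

As the comment above notes, the capacity was not the cause here: the symptom comes from live execution data being overwritten, not from dropped events.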
[jira] [Assigned] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26602:

Assignee: (was: Apache Spark)

> Insert into table fails after querying the UDF which is loaded with wrong hdfs path
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Haripriya
> Priority: Major
>
> In SQL:
> 1. Query an existing UDF (say myFunc1).
> 2. Create and select a UDF registered with an incorrect path (say myFunc2).
> 3. Query the existing UDF again in the same session - it will throw an exception stating that the resource at myFunc2's path couldn't be read.
> 4. Even basic operations like insert and select will fail with the same error.
> Result:
> java.lang.RuntimeException: Failed to read external resource hdfs:///tmp/hari_notexists1/two_udfs.jar
> at org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
> at org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
> at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
> at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
> at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
> at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
> at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
> at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
> at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
> at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
> at org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
> at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)
[jira] [Comment Edited] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784172#comment-16784172 ]

peay edited comment on SPARK-27019 at 3/5/19 7:28 AM:

Great! -Is that compatible with my second observation above? (I tested without any executors, and even without any task starting, the SQL tab had the wrong output). I can try to get an event log for that as well if that's helpful.-

edit: I tried to reproduce that to export the event log, and could not. Seems like your patch should address the issue.

was (Author: peay): Great! Is that compatible with my second observation above? (I tested without any executors, and even without any task starting, the SQL tab had the wrong output). I can try to get an event log for that as well if that's helpful.

> Spark UI's SQL tab shows inconsistent values
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784179#comment-16784179 ]

Shahid K I commented on SPARK-27019:

Thanks [~peay]. Could you please share the event log for that too, if possible?

> Spark UI's SQL tab shows inconsistent values
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
[jira] [Assigned] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26602:

Assignee: Apache Spark

> Insert into table fails after querying the UDF which is loaded with wrong hdfs path
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784172#comment-16784172 ]

peay commented on SPARK-27019:

Great! Is that compatible with my second observation above? (I tested without any executors, and even without any task starting, the SQL tab had the wrong output.) I can try to get an event log for that as well if that's helpful.

> Spark UI's SQL tab shows inconsistent values
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784164#comment-16784164 ]

Lewin Ma commented on SPARK-18105:

Still hitting the same issue in Spark 2.3.1:
{code:java}
org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
  at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:523)
  at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:439)
  at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_1$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Stream is corrupted
  at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
  at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
  at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:170)
  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:348)
  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:335)
  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:335)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1380)
  at org.apache.spark.util.Utils$.copyStream(Utils.scala:356)
  at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:431)
  ... 21 more
{code}

> LZ4 failed to decompress a stream of shuffled data
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Davies Liu
> Assignee: Davies Liu
> Priority: Major
>
> When lz4 is used to compress the shuffle files, it may fail to decompress them with "stream is corrupt":
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
> at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
> at org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
> at java.io.DataInputStream.read(DataInputStream.java:149)
> at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
> at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
> at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
> at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
> at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
> at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
> at
[jira] [Commented] (SPARK-26850) Make EventLoggingListener LOG_FILE_PERMISSIONS configurable
[ https://issues.apache.org/jira/browse/SPARK-26850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784159#comment-16784159 ]

Jungtaek Lim commented on SPARK-26850:

This looks like a duplicate of SPARK-26912, which already has a pull request.

> Make EventLoggingListener LOG_FILE_PERMISSIONS configurable
>
> Key: SPARK-26850
> URL: https://issues.apache.org/jira/browse/SPARK-26850
> Project: Spark
> Issue Type: Wish
> Components: Scheduler
> Affects Versions: 2.2.3, 2.3.2, 2.4.0
> Reporter: Hua Zhang
> Priority: Minor
>
> private[spark] object EventLoggingListener extends Logging {
>   ...
>   private val LOG_FILE_PERMISSIONS = new FsPermission(Integer.parseInt("770", 8).toShort)
>   ...
> }
>
> Currently the event log files are hard-coded with permission 770.
> It would be helpful if this permission were +configurable+.
> Use case: the Spark application is submitted by user A but the Spark history server is started by user B. Currently user B cannot access the history event files created by user A. With permission 775, this would be possible.
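A minimal sketch of what the wish amounts to, assuming a hypothetical configuration key (`spark.eventLog.permissions` is an invented name here, not an existing Spark setting):

```scala
// Sketch only: parse the event log permission bits from configuration
// instead of the hard-coded octal string "770". The config key below is
// hypothetical; the real change would feed the result into
// new FsPermission(...) as in the snippet quoted above.
def eventLogPermBits(conf: Map[String, String]): Short =
  Integer.parseInt(conf.getOrElse("spark.eventLog.permissions", "770"), 8).toShort

// e.g. eventLogPermBits(Map("spark.eventLog.permissions" -> "775")) == 509 (octal 775)
```

With no key set this reproduces the current behavior (octal 770 = 504), so the change would be backward compatible.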
[jira] [Commented] (SPARK-24602) In Spark SQL, ALTER TABLE--CHANGE column1 column2 datatype is not supported in 2.3.1
[ https://issues.apache.org/jira/browse/SPARK-24602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784155#comment-16784155 ]

Sushanta Sen commented on SPARK-24602:

This issue was logged before the other JIRAs mentioned in the Issue Links.

> In Spark SQL, ALTER TABLE--CHANGE column1 column2 datatype is not supported in 2.3.1
>
> Key: SPARK-24602
> URL: https://issues.apache.org/jira/browse/SPARK-24602
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Environment: OS: SUSE11, Spark Version: 2.3
> Reporter: Sushanta Sen
> Priority: Major
>
> Precondition: Spark cluster 2.3 is up and running.
> Test Steps:
> # Launch spark-sql
> # spark-sql> CREATE TABLE t1(a int,string)
> # 0: jdbc:hive2://ha-cluster/default> alter table t1 change a a1 int;
> Error: org.apache.spark.sql.AnalysisException: ALTER TABLE CHANGE COLUMN is not supported for changing column 'a' with type 'IntegerType' to 'b' with type 'IntegerType'; (state=,code=0)
> # Launch hive beeline
> # Repeat steps 1 & 2
> # 0: jdbc:hive2://10.18.108.126:1/> desc del1;
> +-----------+------------+----------+
> | col_name  | data_type  | comment  |
> +-----------+------------+----------+
> | a1        | int        |          |
> | dob       | int        |          |
> +-----------+------------+----------+
> 2 rows selected (1.572 seconds)
> 0: jdbc:hive2://10.18.108.126:1/> alter table del1 change a1 a bigint;
> No rows affected (0.425 seconds)
> 0: jdbc:hive2://10.18.108.126:1/> desc del1;
> +-----------+------------+----------+
> | col_name  | data_type  | comment  |
> +-----------+------------+----------+
> | a         | bigint     |          |
> | dob       | int        |          |
> +-----------+------------+----------+
> 2 rows selected (0.364 seconds)
>
> Actual Result: In Spark SQL, ALTER TABLE ... CHANGE is not supported, whereas in hive beeline it works fine.
> Expected Result: ALTER TABLE ... CHANGE should be supported in Spark SQL as well.
[jira] [Assigned] (SPARK-26922) Set socket timeout consistently in Arrow optimization
[ https://issues.apache.org/jira/browse/SPARK-26922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26922:

Assignee: (was: Apache Spark)

> Set socket timeout consistently in Arrow optimization
>
> Key: SPARK-26922
> URL: https://issues.apache.org/jira/browse/SPARK-26922
> Project: Spark
> Issue Type: Sub-task
> Components: SparkR
> Affects Versions: 3.0.0
> Reporter: Hyukjin Kwon
> Priority: Trivial
>
> For instance, see https://github.com/apache/spark/blob/e8982ca7ad94e98d907babf2d6f1068b7cd064c6/R/pkg/R/context.R#L184
> It should set the timeout from {{SPARKR_BACKEND_CONNECTION_TIMEOUT}}, or maybe we need another environment variable.
> This could be fixed together when the code around there is touched.
[jira] [Assigned] (SPARK-27054) Remove Calcite dependency
[ https://issues.apache.org/jira/browse/SPARK-27054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-27054:

Assignee: Apache Spark

> Remove Calcite dependency
>
> Key: SPARK-27054
> URL: https://issues.apache.org/jira/browse/SPARK-27054
> Project: Spark
> Issue Type: Improvement
> Components: Build, SQL
> Affects Versions: 3.0.0
> Reporter: Yuming Wang
> Assignee: Apache Spark
> Priority: Major
>
> Calcite is only used for [runSqlHive|https://github.com/apache/spark/blob/02bbe977abaf7006b845a7e99d612b0235aa0025/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L699-L705] when {{hive.cbo.enable=true}} ([SemanticAnalyzer|https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java#L278-L280]). So we can disable {{hive.cbo.enable}} and remove the Calcite dependency.
[jira] [Assigned] (SPARK-26922) Set socket timeout consistently in Arrow optimization
[ https://issues.apache.org/jira/browse/SPARK-26922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26922:

Assignee: Apache Spark

> Set socket timeout consistently in Arrow optimization
> Key: SPARK-26922
> URL: https://issues.apache.org/jira/browse/SPARK-26922
[jira] [Reopened] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chakravarthi reopened SPARK-26602:

> Insert into table fails after querying the UDF which is loaded with wrong hdfs path
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
[jira] [Updated] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chakravarthi updated SPARK-26602:

Summary: Insert into table fails after querying the UDF which is loaded with wrong hdfs path (was: Once creating and quering udf with incorrect path,followed by querying tables or functions registered with correct path gives the runtime exception within the same session)

> Insert into table fails after querying the UDF which is loaded with wrong hdfs path
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
[jira] [Commented] (SPARK-26850) Make EventLoggingListener LOG_FILE_PERMISSIONS configurable
[ https://issues.apache.org/jira/browse/SPARK-26850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784139#comment-16784139 ]

Sandeep Katta commented on SPARK-26850:

[~srowen] I feel this use case should be supported. What's your view on this?

> Make EventLoggingListener LOG_FILE_PERMISSIONS configurable
> Key: SPARK-26850
> URL: https://issues.apache.org/jira/browse/SPARK-26850
[jira] [Commented] (SPARK-26602) Once creating and quering udf with incorrect path,followed by querying tables or functions registered with correct path gives the runtime exception within the same session
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784137#comment-16784137 ] Chakravarthi commented on SPARK-26602: -- And the problem is, though the jar does not exist ,it is added to the addedJars in sparkContext.scala, when performing select on UDF. So, when insert into table happens it is trying to load the jars from the ListJars and as the jar not exist,it gives Exception. The fix is to validate the Jar exist or not before adding to the addedJars. I have fixed it and will raise MR. > Once creating and quering udf with incorrect path,followed by querying tables > or functions registered with correct path gives the runtime exception within > the same session > --- > > Key: SPARK-26602 > URL: https://issues.apache.org/jira/browse/SPARK-26602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Haripriya >Priority: Major > > In sql, > 1.Query the existing udf(say myFunc1) > 2. create and select the udf registered with incorrect path (say myFunc2) > 3.Now again query the existing udf in the same session - Wil throw exception > stating that couldn't read resource of myFunc2's path > 4.Even the basic operations like insert and select will fail giving the same > error > Result: > java.lang.RuntimeException: Failed to read external resource > hdfs:///tmp/hari_notexists1/two_udfs.jar > at > org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288) > at > org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
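The fix described in the comment above — validating that a jar exists before recording it in addedJars — can be illustrated with a small self-contained sketch. This is plain Python standing in for the Scala in SparkContext; the class and method names here are hypothetical, not Spark's actual API:

```python
import os


class SessionResources:
    """Toy stand-in for a session-level jar registry (hypothetical; not Spark's API)."""

    def __init__(self):
        self.added_jars = []

    def add_jar(self, path):
        # Validate up front: rejecting a missing jar here means one bad
        # `CREATE FUNCTION ... USING JAR` cannot poison every later command
        # that replays the registered jar list.
        if not os.path.exists(path):
            raise RuntimeError(f"Failed to read external resource {path}")
        self.added_jars.append(path)
```

With a check like this, the invalid path fails once at registration time instead of resurfacing on every subsequent insert or select in the session.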
[jira] [Commented] (SPARK-26602) Once creating and querying a UDF with an incorrect path, querying tables or functions registered with a correct path gives a runtime exception within the same session
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784134#comment-16784134 ] Chakravarthi commented on SPARK-26602: -- Hi [~srowen], this issue is not a duplicate of SPARK-26560. Here the issue is that an insert into a table fails after querying a UDF that was loaded with a wrong HDFS path. Below are the steps to reproduce this issue: 1) Create a table: sql("create table check_udf(I int)"); 2) Create a UDF using an invalid HDFS path: sql("CREATE FUNCTION before_fix AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 'hdfs:///tmp/notexist.jar'") 3) Do a select on the UDF and you will get an exception: "Failed to read external resource". sql("select before_fix('2018-03-09')") 4) Perform an insert into the table: sql("insert into check_udf values(1)").show Here the insert should work, but it fails. > Once creating and querying a UDF with an incorrect path, querying tables > or functions registered with a correct path gives a runtime exception within > the same session > --- > > Key: SPARK-26602 > URL: https://issues.apache.org/jira/browse/SPARK-26602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Haripriya >Priority: Major > > In SQL: > 1. Query the existing UDF (say myFunc1) > 2. 
create and select the UDF registered with an incorrect path (say myFunc2) > 3. Now again query the existing UDF in the same session - will throw an exception > stating that it couldn't read the resource at myFunc2's path > 4. Even basic operations like insert and select will fail with the same > error > Result: > java.lang.RuntimeException: Failed to read external resource > hdfs:///tmp/hari_notexists1/two_udfs.jar > at > org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288) > at > org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26918) All .md should have ASF license header
[ https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784109#comment-16784109 ] Felix Cheung edited comment on SPARK-26918 at 3/5/19 5:47 AM: -- [~rmsm...@gmail.com] - you don't need to checkout a tag (or a release) - just checkout master into a local branch to test was (Author: felixcheung): [~rmsm...@gmail.com] - you don't need to checkout a tag - just checkout master into a local branch to test > All .md should have ASF license header > -- > > Key: SPARK-26918 > URL: https://issues.apache.org/jira/browse/SPARK-26918 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.4.0, 3.0.0 >Reporter: Felix Cheung >Priority: Minor > > per policy, all md files should have the header, like eg. > [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md] > or > [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md] > > currently it does not > [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
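To find the offending files in a checkout of master, a scan along the following lines works; note the marker string is an assumption here and should be matched against the exact ASF header text used in the Arrow/Hadoop examples:

```python
import os


def files_missing_header(root, marker="Licensed to the Apache Software Foundation"):
    # Walk the docs tree and report .md files whose text lacks the marker.
    missing = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".md"):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8") as f:
                    if marker not in f.read():
                        missing.append(path)
    return sorted(missing)
```

Running this over `docs/` in a local branch would list the files that still need the header added.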
[jira] [Assigned] (SPARK-26920) Deduplicate type checking across Arrow optimization and vectorized APIs in SparkR
[ https://issues.apache.org/jira/browse/SPARK-26920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26920: Assignee: (was: Apache Spark) > Deduplicate type checking across Arrow optimization and vectorized APIs in > SparkR > - > > Key: SPARK-26920 > URL: https://issues.apache.org/jira/browse/SPARK-26920 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > There are duplication about type checking in Arrow <> SparkR code paths. For > instance, > https://github.com/apache/spark/blob/8126d09fb5b969c1e293f1f8c41bec35357f74b5/R/pkg/R/group.R#L229-L253 > struct type and map type should also be restricted. > We should pull it out as a separate function and add deduplicated tests > separately. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26918) All .md should have ASF license header
[ https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784109#comment-16784109 ] Felix Cheung commented on SPARK-26918: -- [~rmsm...@gmail.com] - you don't need to checkout a tag - just checkout master into a local branch to test > All .md should have ASF license header > -- > > Key: SPARK-26918 > URL: https://issues.apache.org/jira/browse/SPARK-26918 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.4.0, 3.0.0 >Reporter: Felix Cheung >Priority: Minor > > per policy, all md files should have the header, like eg. > [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md] > or > [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md] > > currently it does not > [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27054) Remove Calcite dependency
[ https://issues.apache.org/jira/browse/SPARK-27054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27054: Assignee: (was: Apache Spark) > Remove Calcite dependency > - > > Key: SPARK-27054 > URL: https://issues.apache.org/jira/browse/SPARK-27054 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Calcite is only used for > [runSqlHive|https://github.com/apache/spark/blob/02bbe977abaf7006b845a7e99d612b0235aa0025/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L699-L705] > when > {{hive.cbo.enable=true}}([SemanticAnalyzer|https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java#L278-L280]). > So we can disable {{hive.cbo.enable}} and remove Calcite dependency. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26920) Deduplicate type checking across Arrow optimization and vectorized APIs in SparkR
[ https://issues.apache.org/jira/browse/SPARK-26920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26920: Assignee: Apache Spark > Deduplicate type checking across Arrow optimization and vectorized APIs in > SparkR > - > > Key: SPARK-26920 > URL: https://issues.apache.org/jira/browse/SPARK-26920 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > There are duplication about type checking in Arrow <> SparkR code paths. For > instance, > https://github.com/apache/spark/blob/8126d09fb5b969c1e293f1f8c41bec35357f74b5/R/pkg/R/group.R#L229-L253 > struct type and map type should also be restricted. > We should pull it out as a separate function and add deduplicated tests > separately. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26850) Make EventLoggingListener LOG_FILE_PERMISSIONS configurable
[ https://issues.apache.org/jira/browse/SPARK-26850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784107#comment-16784107 ] sandeep katta commented on SPARK-26850: --- [~happyhua] thanks for raising this issue, I will work on this and raise PR soon > Make EventLoggingListener LOG_FILE_PERMISSIONS configurable > --- > > Key: SPARK-26850 > URL: https://issues.apache.org/jira/browse/SPARK-26850 > Project: Spark > Issue Type: Wish > Components: Scheduler >Affects Versions: 2.2.3, 2.3.2, 2.4.0 >Reporter: Hua Zhang >Priority: Minor > > private[spark] object EventLoggingListener extends Logging { > ... > private val LOG_FILE_PERMISSIONS = new FsPermission(Integer.parseInt("770", > 8).toShort) > ... > } > > Currently the event log files are hard-coded with permission 770. > It would be fine if this permission is +configurable+. > User case: The spark application is submitted by user A but the spark history > server is started by user B. Currently user B cannot access the history event > files created by user A. When permission is set to 775, this will be possible. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
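The wish amounts to replacing the hard-coded `Integer.parseInt("770", 8)` with a config lookup. A minimal Python model of that lookup follows; the config key name is hypothetical (the eventual PR would define the real one):

```python
def log_file_permission(conf, default="770"):
    # Fall back to Spark's current hard-coded 770 when the (hypothetical)
    # key is absent; parse the value as octal, mirroring
    # Integer.parseInt(value, 8) in EventLoggingListener.
    value = conf.get("spark.eventLog.filePermissions", default)
    return int(value, 8)
```

For the reported use case, setting the value to "775" would let a history server started by a different user in the same group read the event log files.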
[jira] [Resolved] (SPARK-24278) Create table if not exists is throwing table already exists exception
[ https://issues.apache.org/jira/browse/SPARK-24278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandeep katta resolved SPARK-24278. --- Resolution: Invalid It is the exception thrown by the Hive as mentioned above > Create table if not exists is throwing table already exists exception > - > > Key: SPARK-24278 > URL: https://issues.apache.org/jira/browse/SPARK-24278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: OS: SUSE11 > Spark Version: 2.3 >Reporter: Sushanta Sen >Priority: Major > > # Launch Spark-sql > # create table check(time timestamp, name string, isright boolean, datetoday > date, num binary, height double, score float, decimaler decimal(10,0), id > tinyint, age int, license bigint, length smallint) row format delimited > fields terminated by ',' stored as textfile; > # create table if not exists check (time timestamp, name string, isright > boolean, datetoday date, num binary, height double, score float, decimaler > decimal(10,0), id tinyint, age int, license bigint, length smallint) row > format delimited fields terminated by ','stored as TEXTFILE; *-FAILED* ** > > Exception as below > spark-sql> create table if not exists check (col1 string); > *2018-05-15 14:29:56 ERROR RetryingHMSHandler:159 -* > *AlreadyExistsException(message:Table check already exists)* > *at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1372)* > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1449) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107) > at 
com.sun.proxy.$Proxy8.create_table_with_environment_context(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.create_table_with_environment_context(HiveMetaStoreClient.java:2050) > at > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.create_table_with_environment_context(SessionHiveMetaStoreClient.java:97) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:669) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:657) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156) > at com.sun.proxy.$Proxy9.createTable(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:714) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:468) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:466) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:466) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255) > at > org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:466) > at > 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply$mcV$sp(HiveExternalCatalog.scala:258) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:216) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:216) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.doCreateTable(HiveExternalCatalog.scala:216) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalog.createTable(ExternalCatalog.scala:119) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:304) > at > org.apache.spark.sql.execution.command.CreateTableCommand.run(tables.scala:128) > at >
[jira] [Created] (SPARK-27054) Remove Calcite dependency
Yuming Wang created SPARK-27054: --- Summary: Remove Calcite dependency Key: SPARK-27054 URL: https://issues.apache.org/jira/browse/SPARK-27054 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 3.0.0 Reporter: Yuming Wang Calcite is only used for [runSqlHive|https://github.com/apache/spark/blob/02bbe977abaf7006b845a7e99d612b0235aa0025/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L699-L705] when {{hive.cbo.enable=true}}([SemanticAnalyzer|https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java#L278-L280]). So we can disable {{hive.cbo.enable}} and remove Calcite dependency. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27020) Unable to insert data with partial dynamic partition with Spark & Hive 3
[ https://issues.apache.org/jira/browse/SPARK-27020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784047#comment-16784047 ] Truong Duc Kien commented on SPARK-27020: - Hi, here are the commands to reproduce the issue on my cluster. All commands are executed using spark-sql without any additional parameters. {code:sql} create database test_spark; use test_spark; create external table test_insert(a int) partitioned by (part_a string, part_b string) stored as parquet location '/apps/spark/warehouse/test_spark.db/test_insert'; {code} {code:sql} // OK > insert into table test_insert partition(part_a='a', part_b='b') values(1); {code} {code:sql} // OK > insert into table test_insert partition(part_a, part_b) values(2, 'a' , 'b'); .. 19/03/05 11:17:29 INFO Hive: New loading path = hdfs://datalake/apps/spark/warehouse/test_spark.db/test_insert/.hive-staging_hive_2019-03-05_11-17-29_547_8053153849357088752-1/-ext-1/part_a=a/part_b=b with partSpec {part_a=a, part_b=b} 19/03/05 11:17:30 INFO Hive: Loaded 1 partitions Time taken: 0.71 seconds ... {code} {code:sql} // Not OK > insert into table test_insert partition(part_a='a', part_b) values (3, 'b'); ... 19/03/05 11:19:21 WARN warehouse: Cannot create partition spec from hdfs://datalake/; missing keys [part_a] 19/03/05 11:19:21 WARN FileOperations: Ignoring invalid DP directory hdfs://datalake/apps/spark/warehouse/test_spark.db/test_insert/.hive-staging_hive_2019-03-05_11-19-21_365_800377896579975615-1/-ext-1/part_b=b 19/03/05 11:19:21 INFO Hive: Loaded 0 partitions Time taken: 0.466 seconds ... 
{code} > Unable to insert data with partial dynamic partition with Spark & Hive 3 > > > Key: SPARK-27020 > URL: https://issues.apache.org/jira/browse/SPARK-27020 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: Hortonwork HDP 3.1.0 > Spark 2.3.2 > Hive 3 >Reporter: Truong Duc Kien >Priority: Major > > When inserting data with dynamic partitions, the operation fails if not all > partitions are dynamic. For example: > The query > {code:sql} > insert overwrite table t1 (part_a='a', part_b) select * from t2 > {code} > will fail with the errors > {code:xml} > Cannot create partition spec from hdfs:/// ; missing keys [part_a] > Ignoring invalid DP directory > {code} > On the other hand, if I remove the static value of part_a to make the insert > fully dynamic, the following query will succeed. > {code:sql} > insert overwrite table t1 (part_a, part_b) select * from t2 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
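The warning `Cannot create partition spec from hdfs://datalake/; missing keys [part_a]` suggests the loader tries to recover all partition keys from the staging directory, but with a mixed static/dynamic insert only the dynamic keys appear in the path. A small model of the failure mode and the obvious repair (merging the static spec from the INSERT statement) — an illustration, not Spark's or Hive's actual code:

```python
def partition_spec(staging_path, static_spec=None):
    # Parse k=v path segments (only the *dynamic* keys appear there),
    # then merge in the static keys declared in the INSERT statement.
    spec = dict(static_spec or {})
    for segment in staging_path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            spec[key] = value
    return spec
```

Parsing `-ext-1/part_b=b` alone yields only `part_b` — the reported "missing keys [part_a]" situation — while merging in `{"part_a": "a"}` from the statement gives the complete spec.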
[jira] [Created] (SPARK-27053) How about allowing engineers to use a different ExecutorBackend in StandAlone mode?
Ross Brigoli created SPARK-27053: Summary: How about allowing engineers to use a different ExecutorBackend in StandAlone mode? Key: SPARK-27053 URL: https://issues.apache.org/jira/browse/SPARK-27053 Project: Spark Issue Type: Improvement Components: Deploy, Spark Submit Affects Versions: 2.3.3 Reporter: Ross Brigoli In Standalone mode, the command for starting an Executor JVM is hardcoded to use org.apache.spark.executor.CoarseGrainedExecutorBackend. There seems to be no way to configure the submit operation to use a custom ExecutorBackend (a subclass of CoarseGrainedExecutorBackend). This would be very useful when engineers need to initialize things like opening a JDBC connection and closing it once per executor. At line 103 of StandaloneSchedulerBackend.scala, why not make the fully qualified name of the executor backend class configurable, falling back to this default executor backend class if it's not configured? {{val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend", args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
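In code terms, the proposal is: look the class name up in the conf and fall back to the value that is currently hard-coded. A hedged sketch of that lookup — the config key `spark.executor.backend.class` is invented for illustration, not an existing Spark setting:

```python
DEFAULT_BACKEND = "org.apache.spark.executor.CoarseGrainedExecutorBackend"


def executor_backend_class(conf):
    # Fall back to the class StandaloneSchedulerBackend hard-codes today,
    # so existing deployments see no behavior change.
    return conf.get("spark.executor.backend.class", DEFAULT_BACKEND)
```

The returned name would then be passed into the `Command(...)` constructor in place of the string literal.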
[jira] [Resolved] (SPARK-27051) Bump Jackson version to 2.9.8
[ https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27051. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23965 [https://github.com/apache/spark/pull/23965] > Bump Jackson version to 2.9.8 > - > > Key: SPARK-27051 > URL: https://issues.apache.org/jira/browse/SPARK-27051 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Major > Fix For: 3.0.0 > > > Fasterxml Jackson version before 2.9.8 is affected by multiple CVEs > [[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix > bump the dependent Jackson to 2.9.8. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors
[ https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27039. -- Resolution: Cannot Reproduce > toPandas with Arrow swallows maxResultSize errors > - > > Key: SPARK-27039 > URL: https://issues.apache.org/jira/browse/SPARK-27039 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: peay >Priority: Minor > > I am running the following simple `toPandas` with {{maxResultSize}} set to > 1mb: > {code:java} > import pyspark.sql.functions as F > df = spark.range(1000 * 1000) > df_pd = df.withColumn("test", F.lit("this is a long string that should make > the resulting dataframe too large for maxResult which is 1m")).toPandas() > {code} > > With {{spark.sql.execution.arrow.enabled}} set to {{true}}, this returns an > empty Pandas dataframe without any error: > {code:python} > df_pd.info() > # > # Index: 0 entries > # Data columns (total 2 columns): > # id 0 non-null object > # test0 non-null object > # dtypes: object(2) > # memory usage: 0.0+ bytes > {code} > The driver stderr does have an error, and so does the Spark UI: > {code:java} > ERROR TaskSetManager: Total size of serialized results of 1 tasks (52.8 MB) > is bigger than spark.driver.maxResultSize (1024.0 KB) > ERROR TaskSetManager: Total size of serialized results of 2 tasks (105.7 MB) > is bigger than spark.driver.maxResultSize (1024.0 KB) > Exception in thread "serve-Arrow" org.apache.spark.SparkException: Job > aborted due to stage failure: Total size of serialized results of 1 tasks > (52.8 MB) is bigger than spark.driver.maxResultSize (1024.0 KB) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2039) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2027) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2026) 
> at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2026) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2260) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2209) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2198) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061) > at > org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3313) > at > org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3282) > at > org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply$mcV$sp(PythonRDD.scala:435) > at > org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435) > at > org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at > org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:436) > at > org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:432) > at org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:862) > {code} > With 
{{spark.sql.execution.arrow.enabled}} set to {{false}}, the Python call > to {{toPandas}} does fail as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784014#comment-16784014 ] Hyukjin Kwon commented on SPARK-27025: -- Yes but there might be many variants of implementations. It has a tradeoff as Sean described above. > Speed up toLocalIterator > > > Key: SPARK-27025 > URL: https://issues.apache.org/jira/browse/SPARK-27025 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Erik van Oosten >Priority: Major > > Method {{toLocalIterator}} fetches the partitions to the driver one by one. > However, as far as I can see, any required computation for the > yet-to-be-fetched-partitions is not kicked off until it is fetched. > Effectively only one partition is being computed at the same time. > Desired behavior: immediately start calculation of all partitions while > retaining the download-a-partition at a time behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
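The desired behavior — start computing ahead while still handing the caller one partition at a time — can be sketched generically with a bounded background queue. The `maxsize` bound is exactly the memory trade-off mentioned above: larger values compute further ahead but materialize more at once. This is a plain-Python illustration, not Spark's `toLocalIterator` implementation:

```python
import queue
import threading


def prefetched(iterable, maxsize=1):
    # A producer thread runs ahead of the consumer; the bounded queue
    # caps how many items are materialized in advance.
    q = queue.Queue(maxsize=maxsize)
    sentinel = object()

    def produce():
        for item in iterable:
            q.put(item)  # blocks once the queue is full
        q.put(sentinel)

    threading.Thread(target=produce, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item
```

With `maxsize=1` the next item is being computed while the current one is consumed, which mirrors the "compute ahead, download one at a time" behavior the reporter asks for.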
[jira] [Resolved] (SPARK-27028) PySpark read .dat file. Multiline issue
[ https://issues.apache.org/jira/browse/SPARK-27028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27028. -- Resolution: Not A Problem > PySpark read .dat file. Multiline issue > --- > > Key: SPARK-27028 > URL: https://issues.apache.org/jira/browse/SPARK-27028 > Project: Spark > Issue Type: Question > Components: PySpark >Affects Versions: 2.4.0 > Environment: Pyspark(2.4) in AWS EMR >Reporter: alokchowdary >Priority: Critical > > * I am trying to read a .dat file using the PySpark CSV reader, and it contains > newline characters ("\n") as part of the data. Spark is unable to read this > file as a single column; rather, it treats them as new rows. I tried using the > "multiLine" option while reading, but it's still not working. > * {{spark.read.csv(file_path, schema=schema, sep=delimiter, multiLine=True)}} > * Data is something like this. Every line below is considered a row in the > dataframe. > * Here '\x01' is the actual delimiter (but ',' is used for ease of reading). > {{1. name,test,12345,}} > {{2. x, }} > {{3. desc }} > {{4. name2,test2,12345 }} > {{5. ,y}} > {{6. ,desc2}} > * So PySpark is treating x and desc as new rows in the dataframe, with nulls > for the other columns. > How can such data be read in PySpark? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
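One likely reason this is "Not A Problem": a multi-line option can only keep a newline inside a record when the field containing it is quoted, and the reporter's .dat file has bare newlines. Plain Python's `csv` module shows the same distinction (no Spark needed):

```python
import csv
import io

# Quoted field: the embedded newline stays inside one record.
quoted = 'name,"test\ndesc",12345\n'
assert list(csv.reader(io.StringIO(quoted))) == [["name", "test\ndesc", "12345"]]

# Bare newline: a parser has no way to tell it from a record separator,
# so it must start a new record -- the behavior the reporter observed.
bare = 'name,test\ndesc,12345\n'
assert list(csv.reader(io.StringIO(bare))) == [["name", "test"], ["desc", "12345"]]
```

So without quoting (or some other escaping of the embedded newlines) in the source file, no reader option can reconstruct the intended rows.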
[jira] [Commented] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors
[ https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784012#comment-16784012 ] Hyukjin Kwon commented on SPARK-27039: -- Given the history, it will be roughly between May and July this year. Not so far :). Let me leave this JIRA resolved then per the current status. > toPandas with Arrow swallows maxResultSize errors > - > > Key: SPARK-27039 > URL: https://issues.apache.org/jira/browse/SPARK-27039 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: peay >Priority: Minor > > I am running the following simple `toPandas` with {{maxResultSize}} set to > 1mb: > {code:java} > import pyspark.sql.functions as F > df = spark.range(1000 * 1000) > df_pd = df.withColumn("test", F.lit("this is a long string that should make > the resulting dataframe too large for maxResult which is 1m")).toPandas() > {code} > > With {{spark.sql.execution.arrow.enabled}} set to {{true}}, this returns an > empty Pandas dataframe without any error: > {code:python} > df_pd.info() > # > # Index: 0 entries > # Data columns (total 2 columns): > # id 0 non-null object > # test0 non-null object > # dtypes: object(2) > # memory usage: 0.0+ bytes > {code} > The driver stderr does have an error, and so does the Spark UI: > {code:java} > ERROR TaskSetManager: Total size of serialized results of 1 tasks (52.8 MB) > is bigger than spark.driver.maxResultSize (1024.0 KB) > ERROR TaskSetManager: Total size of serialized results of 2 tasks (105.7 MB) > is bigger than spark.driver.maxResultSize (1024.0 KB) > Exception in thread "serve-Arrow" org.apache.spark.SparkException: Job > aborted due to stage failure: Total size of serialized results of 1 tasks > (52.8 MB) is bigger than spark.driver.maxResultSize (1024.0 KB) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2039) > at > 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2027) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2026) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2026) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2260) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2209) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2198) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061) > at > org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3313) > at > org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3282) > at > org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply$mcV$sp(PythonRDD.scala:435) > at > org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435) > at > org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at > org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:436) > at > 
org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:432) > at org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:862) > {code} > With {{spark.sql.execution.arrow.enabled}} set to {{false}}, the Python call > to {{toPandas}} does fail as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
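The swallowed-error behavior reported above can be illustrated outside Spark. Below is a minimal pure-Python sketch (assuming nothing about Spark internals beyond the "serve-Arrow" thread name in the log): an exception raised on a background serving thread is printed to stderr but never propagates to the caller, which simply observes an empty result.

```python
import threading

rows = []  # stands in for the data the serving thread would deliver

def serve():
    # Fails before delivering anything, like the oversized-result job above.
    raise RuntimeError("Total size of serialized results is bigger than "
                       "spark.driver.maxResultSize")

t = threading.Thread(target=serve, name="serve-Arrow")
t.start()
t.join()  # returns normally; the exception only shows up on stderr

# The consumer sees an empty collection rather than an error,
# mirroring the empty Pandas dataframe in the report.
print(rows)  # []
```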
[jira] [Assigned] (SPARK-27051) Bump Jackson version to 2.9.8
[ https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-27051: Assignee: Yanbo Liang > Bump Jackson version to 2.9.8 > - > > Key: SPARK-27051 > URL: https://issues.apache.org/jira/browse/SPARK-27051 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Major > > FasterXML Jackson versions before 2.9.8 are affected by multiple CVEs > [[https://github.com/FasterXML/jackson-databind/issues/2186]]; we need to > bump the dependent Jackson version to 2.9.8. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27015) spark-submit does not properly escape arguments sent to Mesos dispatcher
[ https://issues.apache.org/jira/browse/SPARK-27015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783944#comment-16783944 ] Martin Loncaric commented on SPARK-27015: - Created a PR: https://github.com/apache/spark/pull/23967 > spark-submit does not properly escape arguments sent to Mesos dispatcher > > > Key: SPARK-27015 > URL: https://issues.apache.org/jira/browse/SPARK-27015 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.3, 2.4.0 >Reporter: Martin Loncaric >Priority: Major > Fix For: 2.5.0, 3.0.0 > > > Arguments sent to the dispatcher must be escaped; for instance, > {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a > b$c"{noformat} > fails, and instead must be submitted as > {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a\\ > b\\$c"{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
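Until the linked PR lands, the manual escaping described in the ticket can be automated. Below is a hypothetical helper (the function name and the exact character set are assumptions based only on the single example in this report, not on Spark or Mesos source):

```python
def escape_for_mesos_dispatcher(arg: str) -> str:
    """Prefix spaces and dollar signs with a literal double backslash,
    reproducing the manual workaround shown in the ticket."""
    out = []
    for ch in arg:
        if ch in (' ', '$'):
            out.append('\\\\')  # two literal backslash characters
        out.append(ch)
    return ''.join(out)

print(escape_for_mesos_dispatcher('a b$c'))  # a\\ b\\$c
```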
[jira] [Assigned] (SPARK-27015) spark-submit does not properly escape arguments sent to Mesos dispatcher
[ https://issues.apache.org/jira/browse/SPARK-27015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27015: Assignee: (was: Apache Spark) > spark-submit does not properly escape arguments sent to Mesos dispatcher > > > Key: SPARK-27015 > URL: https://issues.apache.org/jira/browse/SPARK-27015 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.3, 2.4.0 >Reporter: Martin Loncaric >Priority: Major > Fix For: 2.5.0, 3.0.0 > > > Arguments sent to the dispatcher must be escaped; for instance, > {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a > b$c"{noformat} > fails, and instead must be submitted as > {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a\\ > b\\$c"{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27015) spark-submit does not properly escape arguments sent to Mesos dispatcher
[ https://issues.apache.org/jira/browse/SPARK-27015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27015: Assignee: Apache Spark > spark-submit does not properly escape arguments sent to Mesos dispatcher > > > Key: SPARK-27015 > URL: https://issues.apache.org/jira/browse/SPARK-27015 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.3, 2.4.0 >Reporter: Martin Loncaric >Assignee: Apache Spark >Priority: Major > Fix For: 2.5.0, 3.0.0 > > > Arguments sent to the dispatcher must be escaped; for instance, > {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a > b$c"{noformat} > fails, and instead must be submitted as > {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a\\ > b\\$c"{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27048) A way to execute functions on Executor Startup and Executor Exit in Standalone
[ https://issues.apache.org/jira/browse/SPARK-27048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-27048: -- Target Version/s: (was: 2.4.0) > A way to execute functions on Executor Startup and Executor Exit in Standalone > -- > > Key: SPARK-27048 > URL: https://issues.apache.org/jira/browse/SPARK-27048 > Project: Spark > Issue Type: Wish > Components: Deploy, Spark Submit >Affects Versions: 2.3.1, 2.3.3 >Reporter: Ross Brigoli >Priority: Major > Labels: usability > > *Background* > We have a Spark Standalone ETL workload that is heavily dependent on Apache > Ignite KV store for lookup/reference data. There are hundreds (400+) of > lookup data some are up to 300K records. We formerly used broadcast variables > but later found out that it was not fast enough. > So we decided implement a caching mechanism by retrieving reference data from > JDBC source and put them in-memory through Apache ignite as replicated cache. > Each Spark worker node is also running an Ignite node (JVM). Then we let the > spark executors retrieve the data from Ignite through "shared memory port". > This is very fast but is causing instability in the Ignite cluster. The > reason is that when the Spark executor JVM terminates, the Ignite Data Grid > is terminated abnormally. This makes the Ignite cluster wait for the client > node (which is the spark executor) to reconnect making the Ignite cluster > non-responsive for a while. > *Wish* > We have this need for an ability to close the ignite client node gracefully > just before the Executor process ends. So a feature that makes it possible to > pass an EventHandler for "executor.onStart" and "executor.exitExecutor()" > would be really really useful. 
> It could be a spark-submit argument or an entry in the spark-defaults.conf > that looks something like: > {{spark.executor.startUpClass=com.company.ExecutorInitializer}} > {{spark.executor.shutdownClass=com.company.ExecutorCleaner}} > The class will have to implement an interface provided by Spark. This class > can then be loaded dynamically in the CoarseGrainedExecutorBackend and called > on the onStart() and exitExecutor() methods respectively > This is also useful for opening and closing JDBC connections per executor > instead of per partition. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
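The proposed startUpClass/shutdownClass entries amount to a lifecycle-hook interface. A short Python sketch of that pattern follows (the class and method names are illustrative only — no such API exists in Spark as of this ticket):

```python
class ExecutorLifecycleHook:
    """Sketch of the proposed interface; an implementation would be named
    in spark.executor.startUpClass / spark.executor.shutdownClass."""
    def on_start(self):
        pass

    def on_shutdown(self):
        pass

class IgniteClientHook(ExecutorLifecycleHook):
    def __init__(self):
        self.client_open = False

    def on_start(self):
        self.client_open = True    # e.g. connect the Ignite client node here

    def on_shutdown(self):
        self.client_open = False   # close it gracefully before the JVM exits

hook = IgniteClientHook()
hook.on_start()
hook.on_shutdown()
print(hook.client_open)  # False
```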
[jira] [Resolved] (SPARK-26205) Optimize InSet expression for bytes, shorts, ints, dates
[ https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26205. --- Resolution: Fixed Assignee: Anton Okolnychyi Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/23171 . > Optimize InSet expression for bytes, shorts, ints, dates > > > Key: SPARK-26205 > URL: https://issues.apache.org/jira/browse/SPARK-26205 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Fix For: 3.0.0 > > > {{In}} expressions are compiled into a sequence of if-else statements, which > results in O\(n\) time complexity. {{InSet}} is an optimized version of > {{In}}, which is supposed to improve the performance if the number of > elements is big enough. However, {{InSet}} actually degrades the performance > in many cases due to various reasons (benchmarks were created in SPARK-26203 > and solutions to the boxing problem are discussed in SPARK-26204). > The main idea of this JIRA is to use Java {{switch}} statements to > significantly improve the performance of {{InSet}} expressions for bytes, > shorts, ints, dates. All {{switch}} statements are compiled into > {{tableswitch}} and {{lookupswitch}} bytecode instructions. We will have > O\(1\) time complexity if our case values are compact and {{tableswitch}} can > be used. Otherwise, {{lookupswitch}} will give us O\(log n\). Our local > benchmarks show that this logic is more than two times faster even on 500+ > elements than using primitive collections in {{InSet}} expressions. As Spark > is using Scala {{HashSet}} right now, the performance gain will be is even > bigger. > See > [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10] > and > [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch] > for more information. 
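The complexity claim above can be sanity-checked in plain Python, where a chain of equality tests models the code generated for {{In}} and a hash set models the O(1) effect the switch-based {{InSet}} is after (an analogy only — the actual change is JVM bytecode generation, not Python):

```python
values = list(range(500))

def chain_contains(x):
    # Models `In`: a sequence of if-else equality tests, O(n).
    for v in values:
        if x == v:
            return True
    return False

value_set = frozenset(values)

def set_contains(x):
    # Models the intended `InSet` behavior: constant-time hash lookup.
    return x in value_set

assert chain_contains(499) and set_contains(499)
assert not chain_contains(500) and not set_contains(500)
```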
[jira] [Commented] (SPARK-26016) Encoding not working when using a map / mapPartitions call
[ https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783885#comment-16783885 ] Hyukjin Kwon commented on SPARK-26016: -- BTW, IIRC some codes assume other ascii compatible encodings can be supported since utf8 is ascii compatible but I think it's better to whitelist that utf8 only is supported. > Encoding not working when using a map / mapPartitions call > -- > > Key: SPARK-26016 > URL: https://issues.apache.org/jira/browse/SPARK-26016 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.4.0 >Reporter: Chris Caspanello >Assignee: Sean Owen >Priority: Major > Fix For: 3.0.0 > > Attachments: spark-sandbox.zip > > > Attached you will find a project with unit tests showing the issue at hand. > If I read in a ISO-8859-1 encoded file and simply write out what was read; > the contents in the part file matches what was read. Which is great. > However, the second I use a map / mapPartitions function it looks like the > encoding is not correct. In addition a simple collectAsList and writing that > list of strings to a file does not work either. I don't think I'm doing > anything wrong. Can someone please investigate? I think this is a bug. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
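The mismatch described in this ticket is easy to reproduce in plain Python: bytes written as ISO-8859-1 round-trip cleanly only under that charset, and come out corrupted when read as UTF-8 — which is why whitelisting UTF-8 (or adding an explicit 'encoding' option to the text source) matters:

```python
text = "café"
latin1_bytes = text.encode("iso-8859-1")      # b'caf\xe9'

# Correct round trip under the original charset.
assert latin1_bytes.decode("iso-8859-1") == text

# Reading the same bytes as UTF-8 mangles the non-ASCII character,
# analogous to what happens on the map/mapPartitions path.
mangled = latin1_bytes.decode("utf-8", errors="replace")
print(mangled)  # caf\ufffd (replacement character)
```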
[jira] [Commented] (SPARK-26016) Encoding not working when using a map / mapPartitions call
[ https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783878#comment-16783878 ] Sean Owen commented on SPARK-26016: --- It's "Fixed" in the sense that at least we plugged the documentation hole here that I am pretty certain explains the issue. I want to open a new JIRA to consider supporting 'encoding' for the text source. It looks straightforward, even. > Encoding not working when using a map / mapPartitions call > -- > > Key: SPARK-26016 > URL: https://issues.apache.org/jira/browse/SPARK-26016 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.4.0 >Reporter: Chris Caspanello >Assignee: Sean Owen >Priority: Major > Fix For: 3.0.0 > > Attachments: spark-sandbox.zip > > > Attached you will find a project with unit tests showing the issue at hand. > If I read in a ISO-8859-1 encoded file and simply write out what was read; > the contents in the part file matches what was read. Which is great. > However, the second I use a map / mapPartitions function it looks like the > encoding is not correct. In addition a simple collectAsList and writing that > list of strings to a file does not work either. I don't think I'm doing > anything wrong. Can someone please investigate? I think this is a bug. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26016) Encoding not working when using a map / mapPartitions call
[ https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26016. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23962 [https://github.com/apache/spark/pull/23962] > Encoding not working when using a map / mapPartitions call > -- > > Key: SPARK-26016 > URL: https://issues.apache.org/jira/browse/SPARK-26016 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.4.0 >Reporter: Chris Caspanello >Assignee: Sean Owen >Priority: Major > Fix For: 3.0.0 > > Attachments: spark-sandbox.zip > > > Attached you will find a project with unit tests showing the issue at hand. > If I read in a ISO-8859-1 encoded file and simply write out what was read; > the contents in the part file matches what was read. Which is great. > However, the second I use a map / mapPartitions function it looks like the > encoding is not correct. In addition a simple collectAsList and writing that > list of strings to a file does not work either. I don't think I'm doing > anything wrong. Can someone please investigate? I think this is a bug. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26016) Encoding not working when using a map / mapPartitions call
[ https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-26016: Assignee: Sean Owen > Encoding not working when using a map / mapPartitions call > -- > > Key: SPARK-26016 > URL: https://issues.apache.org/jira/browse/SPARK-26016 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.4.0 >Reporter: Chris Caspanello >Assignee: Sean Owen >Priority: Major > Attachments: spark-sandbox.zip > > > Attached you will find a project with unit tests showing the issue at hand. > If I read in a ISO-8859-1 encoded file and simply write out what was read; > the contents in the part file matches what was read. Which is great. > However, the second I use a map / mapPartitions function it looks like the > encoding is not correct. In addition a simple collectAsList and writing that > list of strings to a file does not work either. I don't think I'm doing > anything wrong. Can someone please investigate? I think this is a bug. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k
[ https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783858#comment-16783858 ] Sean Owen commented on SPARK-26947: --- That doesn't sound "very big" but how big are the vectors you cluster? It looks like you're applying CountVectorizer with no vocabSize, so if your input are many different unique strings, your vectors have hundreds of thousands of dimensions. Ten thousand of them plus all the overhead could really add up to challenge even tens of GB of heap. Here it seems to be running out of memory while transferring a copy to/from the Python process. I'd definitely limit vocabSize or else reconsider how you're clustering. This doesn't look like a particular Spark problem. > Pyspark KMeans Clustering job fails on large values of k > > > Key: SPARK-26947 > URL: https://issues.apache.org/jira/browse/SPARK-26947 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 2.4.0 >Reporter: Parth Gandhi >Priority: Minor > Attachments: clustering_app.py > > > We recently had a case where a user's pyspark job running KMeans clustering > was failing for large values of k. I was able to reproduce the same issue > with dummy dataset. I have attached the code as well as the data in the JIRA. 
> The stack trace is printed below from Java: > > {code:java} > Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space > at java.util.Arrays.copyOf(Arrays.java:3332) > at > java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) > at > java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649) > at java.lang.StringBuilder.append(StringBuilder.java:202) > at py4j.Protocol.getOutputCommand(Protocol.java:328) > at py4j.commands.CallCommand.execute(CallCommand.java:81) > at py4j.GatewayConnection.run(GatewayConnection.java:238) > at java.lang.Thread.run(Thread.java:748) > {code} > Python: > {code:java} > Traceback (most recent call last): > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1159, in send_command > raise Py4JNetworkError("Answer from Java side is empty") > py4j.protocol.Py4JNetworkError: Answer from Java side is empty > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 985, in send_command > response = connection.send_command(command) > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1164, in send_command > "Error while receiving", e, proto.ERROR_ON_RECEIVE) > py4j.protocol.Py4JNetworkError: Error while receiving > Traceback (most recent call last): > File "clustering_app.py", line 154, in > main(args) > File "clustering_app.py", line 145, in main > run_clustering(sc, args.input_path, args.output_path, > args.num_clusters_list) > File "clustering_app.py", line 136, in run_clustering > clustersTable, cluster_Centers = clustering(sc, documents, output_path, > k, max_iter) > File "clustering_app.py", line 68, in clustering > cluster_Centers = km_model.clusterCenters() > File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py", > line 337, in clusterCenters > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py", > line 55, in _call_java > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py", > line 109, in _java2py > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py", > line 336, in get_return_value > py4j.protocol.Py4JError: An error occurred while calling > z:org.apache.spark.ml.python.MLSerDe.dumps > {code} > The command with which the application was launched is given below: > {code:java} > $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf > spark.executor.memory=20g --conf spark.driver.memory=20g --conf > spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf > spark.kryoserializer.buffer.max=2000m --conf
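The point about unbounded vocabularies can be quantified with a back-of-envelope sketch (pure Python; k=10,000 and vocabSize=500,000 are hypothetical magnitudes taken from the comment above, not measurements from this job):

```python
def cluster_center_bytes(k, vocab_size, bytes_per_double=8):
    # Each KMeans cluster center is a dense vector of length vocab_size,
    # so driver-side memory for the centers grows as k * vocab_size * 8.
    return k * vocab_size * bytes_per_double

unbounded = cluster_center_bytes(k=10_000, vocab_size=500_000)
capped = cluster_center_bytes(k=10_000, vocab_size=1_000)

print(unbounded / 2**30)  # dense centers alone reach tens of GiB
print(capped / 2**30)     # well under 1 GiB with vocabSize capped
```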
[jira] [Commented] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k
[ https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783849#comment-16783849 ] Parth Gandhi commented on SPARK-26947: -- [~srowen] for this particular case, k is set to 1. Input data size is 90 MB and memory is set to 20g(both driver and executor). [~mgaido] I will try doing that and let you know. > Pyspark KMeans Clustering job fails on large values of k > > > Key: SPARK-26947 > URL: https://issues.apache.org/jira/browse/SPARK-26947 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 2.4.0 >Reporter: Parth Gandhi >Priority: Minor > Attachments: clustering_app.py > > > We recently had a case where a user's pyspark job running KMeans clustering > was failing for large values of k. I was able to reproduce the same issue > with dummy dataset. I have attached the code as well as the data in the JIRA. > The stack trace is printed below from Java: > > {code:java} > Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space > at java.util.Arrays.copyOf(Arrays.java:3332) > at > java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) > at > java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649) > at java.lang.StringBuilder.append(StringBuilder.java:202) > at py4j.Protocol.getOutputCommand(Protocol.java:328) > at py4j.commands.CallCommand.execute(CallCommand.java:81) > at py4j.GatewayConnection.run(GatewayConnection.java:238) > at java.lang.Thread.run(Thread.java:748) > {code} > Python: > {code:java} > Traceback (most recent call last): > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1159, in send_command > raise Py4JNetworkError("Answer from Java side is empty") > py4j.protocol.Py4JNetworkError: Answer from Java side is empty > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > 
"/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 985, in send_command > response = connection.send_command(command) > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1164, in send_command > "Error while receiving", e, proto.ERROR_ON_RECEIVE) > py4j.protocol.Py4JNetworkError: Error while receiving > Traceback (most recent call last): > File "clustering_app.py", line 154, in > main(args) > File "clustering_app.py", line 145, in main > run_clustering(sc, args.input_path, args.output_path, > args.num_clusters_list) > File "clustering_app.py", line 136, in run_clustering > clustersTable, cluster_Centers = clustering(sc, documents, output_path, > k, max_iter) > File "clustering_app.py", line 68, in clustering > cluster_Centers = km_model.clusterCenters() > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py", > line 337, in clusterCenters > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py", > line 55, in _call_java > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py", > line 109, in _java2py > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py", > line 336, in get_return_value > py4j.protocol.Py4JError: An error occurred while calling > z:org.apache.spark.ml.python.MLSerDe.dumps > {code} > The command with which the application was launched is given below: 
> {code:java} > $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf > spark.executor.memory=20g --conf spark.driver.memory=20g --conf > spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf > spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g > ~/clustering_app.py --input_path hdfs:///user/username/part-v001x > --output_path hdfs:///user/username --num_clusters_list 1 > {code} > The input dataset is approximately 90 MB in size and the assigned heap memory > to both driver and executor is close to 20 GB. This only happens for large > values of k. -- This message was sent by
[jira] [Commented] (SPARK-26972) Issue with CSV import and inferSchema set to true
[ https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783834#comment-16783834 ] Jean Georges Perrin commented on SPARK-26972: - [~srowen], [~hyukjin.kwon] - thanks guy for dealing with a rookie! I'll do my best to give a try against master, however: # the non-case sensitivity becoming case sensitivity, is that scheduled for v3.0 or already in v2.4.x? # I double checked the output, when you specify the schema: in 2.1.3, it crashes: {code:java} 2019-03-04 17:17:41.854 -ERROR --- [rker for task 0] Logging$class.logError(Logging.scala:91): Exception in task 0.0 in stage 0.0 (TID 0) java.lang.NumberFormatException: For input string: "An independent study by Jean Georges Perrin, IIUG Board Member*" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:580) at java.lang.Integer.parseInt(Integer.java:615) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272) at scala.collection.immutable.StringOps.toInt(StringOps.scala:29) at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:252) at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:125) at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:100) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2019-03-04 17:17:41.876 -ERROR --- [result-getter-0] Logging$class.logError(Logging.scala:70): Task 0 in stage 0.0 failed 1 times; aborting job Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NumberFormatException: For input string: "An independent study by Jean Georges Perrin, IIUG Board Member*" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:580) at java.lang.Integer.parseInt(Integer.java:615) at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272) at scala.collection.immutable.StringOps.toInt(StringOps.scala:29) at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:252) at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:125) at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at
[jira] [Assigned] (SPARK-27051) Bump Jackson version to 2.9.8
[ https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-27051: --- Assignee: (was: Yanbo Liang) > Bump Jackson version to 2.9.8 > - > > Key: SPARK-27051 > URL: https://issues.apache.org/jira/browse/SPARK-27051 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yanbo Liang >Priority: Major > > FasterXML Jackson versions before 2.9.8 are affected by multiple > [CVEs|https://github.com/FasterXML/jackson-databind/issues/2186]; we need > to bump the Jackson version to 2.9.8. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf
[ https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783810#comment-16783810 ] Martin Loncaric commented on SPARK-26192: - [~dongjoon] Thanks, I will pay more attention to those fields. However, I believe this is a bug. It violates behavior specified in the https://spark.apache.org/docs/latest/running-on-mesos.html#configuration. Can we merge into 2.4.1 as well? > MesosClusterScheduler reads options from dispatcher conf instead of > submission conf > --- > > Key: SPARK-26192 > URL: https://issues.apache.org/jira/browse/SPARK-26192 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Martin Loncaric >Assignee: Martin Loncaric >Priority: Minor > Fix For: 3.0.0 > > > There is at least one option accessed in MesosClusterScheduler that should > come from the submission's configuration instead of the dispatcher's: > spark.mesos.fetcherCache.enable > Coincidentally, the spark.mesos.fetcherCache.enable option was previously > misnamed, as referenced in the linked JIRA. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27052) Using PySpark udf in transform yields NULL values
hejsgpuom62c created SPARK-27052: Summary: Using PySpark udf in transform yields NULL values Key: SPARK-27052 URL: https://issues.apache.org/jira/browse/SPARK-27052 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.4.0 Reporter: hejsgpuom62c Steps to reproduce {code:java}
from typing import Optional
from pyspark.sql.functions import expr

def f(x: Optional[int]) -> Optional[int]:
    return x + 1 if x is not None else None

spark.udf.register('f', f, "integer")
df = (spark
    .createDataFrame([(1, [1, 2, 3])], ("id", "xs"))
    .withColumn("xsinc", expr("transform(xs, x -> f(x))")))
df.show()
# +---+---------+-----+
# | id|       xs|xsinc|
# +---+---------+-----+
# |  1|[1, 2, 3]| [,,]|
# +---+---------+-----+
{code} Source https://stackoverflow.com/a/53762650 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
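The bug is that the registered UDF yields NULL for every element when invoked inside `transform`. For reference, the element-wise semantics the reporter expected can be modeled in plain Python — `transform` below is a stand-in for the SQL higher-order function, not a Spark API:

```python
from typing import Callable, List, Optional

def f(x: Optional[int]) -> Optional[int]:
    # Same UDF body as in the report.
    return x + 1 if x is not None else None

def transform(xs: List[Optional[int]],
              fn: Callable[[Optional[int]], Optional[int]]) -> List[Optional[int]]:
    # Plain-Python model of SQL's transform(xs, x -> fn(x)).
    return [fn(x) for x in xs]

print(transform([1, 2, 3], f))  # expected xsinc value: [2, 3, 4], not [null, null, null]
```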
[jira] [Comment Edited] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf
[ https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783810#comment-16783810 ] Martin Loncaric edited comment on SPARK-26192 at 3/4/19 9:51 PM: - [~dongjoon] Thanks, I will pay more attention to those fields. However, I believe this is a bug. It violates behavior specified in https://spark.apache.org/docs/latest/running-on-mesos.html#configuration. Can we merge into at least 2.4.1 as well? was (Author: mwlon): [~dongjoon] Thanks, I will pay more attention to those fields. However, I believe this is a bug. It violates behavior specified in https://spark.apache.org/docs/latest/running-on-mesos.html#configuration. Can we merge into 2.4.1 as well? > MesosClusterScheduler reads options from dispatcher conf instead of > submission conf > --- > > Key: SPARK-26192 > URL: https://issues.apache.org/jira/browse/SPARK-26192 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Martin Loncaric >Assignee: Martin Loncaric >Priority: Minor > Fix For: 3.0.0 > > > There is at least one option accessed in MesosClusterScheduler that should > come from the submission's configuration instead of the dispatcher's: > spark.mesos.fetcherCache.enable > Coincidentally, the spark.mesos.fetcherCache.enable option was previously > misnamed, as referenced in the linked JIRA. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27014) Support removal of jars and Spark binaries from Mesos driver and executor sandboxes
[ https://issues.apache.org/jira/browse/SPARK-27014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783811#comment-16783811 ] Martin Loncaric commented on SPARK-27014: - Sure, will keep that in mind. > Support removal of jars and Spark binaries from Mesos driver and executor > sandboxes > --- > > Key: SPARK-27014 > URL: https://issues.apache.org/jira/browse/SPARK-27014 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 3.0.0 >Reporter: Martin Loncaric >Priority: Minor > > Currently, each Spark application run on Mesos leaves behind at least 500MB > of data in sandbox directories, coming from Spark binaries and copied URIs. > These can build up as a disk leak, causing major issues on Mesos clusters > unless their grace period for sandbox directories is very short. > Spark should have a feature to delete these (from both driver and executor > sandboxes) on teardown. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27051) Bump Jackson version to 2.9.8
[ https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27051: Assignee: (was: Apache Spark) > Bump Jackson version to 2.9.8 > - > > Key: SPARK-27051 > URL: https://issues.apache.org/jira/browse/SPARK-27051 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yanbo Liang >Priority: Major > > Fasterxml Jackson version before 2.9.8 is affected by multiple CVEs > [[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix > bump the dependent Jackson to 2.9.8. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27051) Bump Jackson version to 2.9.8
[ https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27051: Assignee: Apache Spark > Bump Jackson version to 2.9.8 > - > > Key: SPARK-27051 > URL: https://issues.apache.org/jira/browse/SPARK-27051 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Major > > Fasterxml Jackson version before 2.9.8 is affected by multiple CVEs > [[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix > bump the dependent Jackson to 2.9.8. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf
[ https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783810#comment-16783810 ] Martin Loncaric edited comment on SPARK-26192 at 3/4/19 9:49 PM: - [~dongjoon] Thanks, I will pay more attention to those fields. However, I believe this is a bug. It violates behavior specified in https://spark.apache.org/docs/latest/running-on-mesos.html#configuration. Can we merge into 2.4.1 as well? was (Author: mwlon): [~dongjoon] Thanks, I will pay more attention to those fields. However, I believe this is a bug. It violates behavior specified in the https://spark.apache.org/docs/latest/running-on-mesos.html#configuration. Can we merge into 2.4.1 as well? > MesosClusterScheduler reads options from dispatcher conf instead of > submission conf > --- > > Key: SPARK-26192 > URL: https://issues.apache.org/jira/browse/SPARK-26192 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Martin Loncaric >Assignee: Martin Loncaric >Priority: Minor > Fix For: 3.0.0 > > > There is at least one option accessed in MesosClusterScheduler that should > come from the submission's configuration instead of the dispatcher's: > spark.mesos.fetcherCache.enable > Coincidentally, the spark.mesos.fetcherCache.enable option was previously > misnamed, as referenced in the linked JIRA. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27050) Bean Encoder serializes data in a wrong order if input schema is not ordered
hejsgpuom62c created SPARK-27050: Summary: Bean Encoder serializes data in a wrong order if input schema is not ordered Key: SPARK-27050 URL: https://issues.apache.org/jira/browse/SPARK-27050 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: hejsgpuom62c Steps to reproduce. Define schema like this {code:java} StructType valid = StructType.fromDDL( "broker_name string, order integer, server_name string, " + "storages array>" );{code} {code:java} package com.example; import java.io.Serializable; import lombok.Data; import lombok.AllArgsConstructor; import lombok.NoArgsConstructor; @Data @NoArgsConstructor @AllArgsConstructor public class Entity implements Serializable { private String broker_name; private String server_name; private Integer order; private Storage[] storages; }{code} {code:java} package com.example; import java.io.Serializable; import lombok.Data; import lombok.AllArgsConstructor; import lombok.NoArgsConstructor; @Data @NoArgsConstructor @AllArgsConstructor public class Storage implements Serializable { private java.sql.Timestamp timestamp; private Double storage; }{code} Create a JSON file with the following content: {code:java} [ { "broker_name": "A1", "server_name": "S1", "order": 1, "storages": [ { "timestamp": "2018-10-29 23:11:44.000", "storage": 12.5 } ] } ]{code} Process data as {code:java} Dataset ds = spark.read().option("multiline", "true").schema(valid).json("/path/to/file") .as(Encoders.bean(Entity.class)); ds .groupByKey((MapFunction) o -> o.getBroker_name(), Encoders.STRING()) .reduceGroups((ReduceFunction)(e1, e2) -> e1) .map((MapFunction, Entity>) tuple -> tuple._2, Encoders.bean(Entity.class)) .show(10, false);{code} The result will be: {code:java} +---+-+---++ |broker_name|order|server_name|storages | +---+-+---++ |A1 |1 |S1 |[[7.612815958429577E-309, 148474-03-19 22:14:3232.5248]]| +---+-+---++ {code} Source https://stackoverflow.com/q/54987724 -- This message was sent by Atlassian JIRA (v7.6.3#76005) 
- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
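The corrupted output above (an impossible timestamp, a denormal double) is consistent with fields being matched by position rather than by name: the bean declares `broker_name, server_name, order` while the DDL schema orders them `broker_name, order, server_name`, so the bytes of one field are decoded as another. A minimal sketch of the two binding strategies, with hypothetical helper names:

```python
schema = ["broker_name", "order", "server_name"]       # DDL order
bean_fields = ["broker_name", "server_name", "order"]  # bean declaration order
row = ["A1", 1, "S1"]                                  # values stored in schema order

def bind_by_position(row, fields):
    # Wrong: silently assumes the bean declares fields in schema order,
    # so "order" and "server_name" swap.
    return dict(zip(fields, row))

def bind_by_name(row, schema, fields):
    # Correct: resolve each bean field through its schema name.
    by_name = dict(zip(schema, row))
    return {f: by_name[f] for f in fields}

print(bind_by_position(row, bean_fields))       # server_name gets 1, order gets 'S1'
print(bind_by_name(row, schema, bean_fields))   # every field gets its own value
```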
[jira] [Updated] (SPARK-27051) Bump Jackson version to 2.9.8
[ https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-27051: Description: Fasterxml Jackson version before 2.9.8 is affected by multiple CVEs [[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix bump the dependent Jackson to 2.9.8. (was: Fasterxml Jackson version before 2.9.8 is affected by multiple CVEs [[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix bump the dependent Jackson version to 2.9.8.) > Bump Jackson version to 2.9.8 > - > > Key: SPARK-27051 > URL: https://issues.apache.org/jira/browse/SPARK-27051 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yanbo Liang >Priority: Major > > Fasterxml Jackson version before 2.9.8 is affected by multiple CVEs > [[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix > bump the dependent Jackson to 2.9.8. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27051) Bump Jackson version to 2.9.8
[ https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-27051: Description: Fasterxml Jackson version before 2.9.8 is affected by multiple CVEs [[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix bump the dependent Jackson version to 2.9.8. (was: Fasterxml Jackson version before 2.9.8 is affected by multiple [CVEs | [https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix bump the dependent Jackson version to 2.9.8.) > Bump Jackson version to 2.9.8 > - > > Key: SPARK-27051 > URL: https://issues.apache.org/jira/browse/SPARK-27051 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yanbo Liang >Priority: Major > > Fasterxml Jackson version before 2.9.8 is affected by multiple CVEs > [[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix > bump the dependent Jackson version to 2.9.8. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27051) Bump Jackson version to 2.9.8
[ https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-27051: Description: Fasterxml Jackson version before 2.9.8 is affected by multiple [CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186],] we need to fix bump the dependent Jackson version to 2.9.8. (was: Fasterxml Jackson version before 2.9.8 is affected by multiple [CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186],] we need to fix bump the Jackson version to 2.9.8.) > Bump Jackson version to 2.9.8 > - > > Key: SPARK-27051 > URL: https://issues.apache.org/jira/browse/SPARK-27051 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yanbo Liang >Priority: Major > > Fasterxml Jackson version before 2.9.8 is affected by multiple > [CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186],] we need > to fix bump the dependent Jackson version to 2.9.8. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27051) Bump Jackson version to 2.9.8
[ https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-27051: Description: Fasterxml Jackson version before 2.9.8 is affected by multiple [CVEs | [https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix bump the dependent Jackson version to 2.9.8. (was: Fasterxml Jackson version before 2.9.8 is affected by multiple [CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix bump the dependent Jackson version to 2.9.8.) > Bump Jackson version to 2.9.8 > - > > Key: SPARK-27051 > URL: https://issues.apache.org/jira/browse/SPARK-27051 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yanbo Liang >Priority: Major > > Fasterxml Jackson version before 2.9.8 is affected by multiple [CVEs | > [https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix > bump the dependent Jackson version to 2.9.8. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27051) Bump Jackson version to 2.9.8
[ https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-27051: Description: Fasterxml Jackson version before 2.9.8 is affected by multiple [CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix bump the dependent Jackson version to 2.9.8. (was: Fasterxml Jackson version before 2.9.8 is affected by multiple [CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186],] we need to fix bump the dependent Jackson version to 2.9.8.) > Bump Jackson version to 2.9.8 > - > > Key: SPARK-27051 > URL: https://issues.apache.org/jira/browse/SPARK-27051 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yanbo Liang >Priority: Major > > Fasterxml Jackson version before 2.9.8 is affected by multiple > [CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186]], we need > to fix bump the dependent Jackson version to 2.9.8. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27051) Bump Jackson version to 2.9.8
[ https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-27051: --- Assignee: Yanbo Liang > Bump Jackson version to 2.9.8 > - > > Key: SPARK-27051 > URL: https://issues.apache.org/jira/browse/SPARK-27051 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Major > > Fasterxml Jackson version before 2.9.8 is affected by multiple > [CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186],] we need > to fix bump the Jackson version to 2.9.8. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27051) Bump Jackson version to 2.9.8
Yanbo Liang created SPARK-27051: --- Summary: Bump Jackson version to 2.9.8 Key: SPARK-27051 URL: https://issues.apache.org/jira/browse/SPARK-27051 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Yanbo Liang Fasterxml Jackson version before 2.9.8 is affected by multiple [CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186],] we need to fix bump the Jackson version to 2.9.8. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25865) Add GC information to ExecutorMetrics
[ https://issues.apache.org/jira/browse/SPARK-25865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-25865. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22874 [https://github.com/apache/spark/pull/22874] > Add GC information to ExecutorMetrics > - > > Key: SPARK-25865 > URL: https://issues.apache.org/jira/browse/SPARK-25865 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Major > Fix For: 3.0.0 > > > Only memory usage without GC information could not help us to determinate the > proper settings of memory. Add basic GC information to ExecutorMetrics > interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25865) Add GC information to ExecutorMetrics
[ https://issues.apache.org/jira/browse/SPARK-25865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-25865: Assignee: Lantao Jin > Add GC information to ExecutorMetrics > - > > Key: SPARK-25865 > URL: https://issues.apache.org/jira/browse/SPARK-25865 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Major > > Only memory usage without GC information could not help us to determinate the > proper settings of memory. Add basic GC information to ExecutorMetrics > interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf
[ https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783745#comment-16783745 ] Dongjoon Hyun edited comment on SPARK-26192 at 3/4/19 8:15 PM: --- [~mwlon]. Thank you for reporting and making a PR. However, please don't set 'Fix Versions`. `Fixed Version` and `Target Version` are used in the different way. Please refer the contribution guide. - https://spark.apache.org/contributing.html For me, this is a minor improvement for Spark 3.0. was (Author: dongjoon): [~mwlon]. Thank you for reporting and making a PR. However, please don't set 'Fix Versions`. `Fixed Version` and `Target Version` are used in the different way. Please refer the contribution guide. - https://spark.apache.org/contributing.html For me, this is a minor improvement for Spark 3.0. > MesosClusterScheduler reads options from dispatcher conf instead of > submission conf > --- > > Key: SPARK-26192 > URL: https://issues.apache.org/jira/browse/SPARK-26192 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Martin Loncaric >Assignee: Martin Loncaric >Priority: Major > Fix For: 3.0.0 > > > There is at least one option accessed in MesosClusterScheduler that should > come from the submission's configuration instead of the dispatcher's: > spark.mesos.fetcherCache.enable > Coincidentally, the spark.mesos.fetcherCache.enable option was previously > misnamed, as referenced in the linked JIRA. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf
[ https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783745#comment-16783745 ] Dongjoon Hyun edited comment on SPARK-26192 at 3/4/19 8:16 PM: --- [~mwlon]. Thank you for reporting and making a PR. However, please don't set 'Fix Versions`. `Fixed Version` and `Target Version` are used in the different way. Please refer the contribution guide. - https://spark.apache.org/contributing.html For me, this is an improvement for Spark 3.0. was (Author: dongjoon): [~mwlon]. Thank you for reporting and making a PR. However, please don't set 'Fix Versions`. `Fixed Version` and `Target Version` are used in the different way. Please refer the contribution guide. - https://spark.apache.org/contributing.html For me, this is a minor improvement for Spark 3.0. > MesosClusterScheduler reads options from dispatcher conf instead of > submission conf > --- > > Key: SPARK-26192 > URL: https://issues.apache.org/jira/browse/SPARK-26192 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Martin Loncaric >Assignee: Martin Loncaric >Priority: Minor > Fix For: 3.0.0 > > > There is at least one option accessed in MesosClusterScheduler that should > come from the submission's configuration instead of the dispatcher's: > spark.mesos.fetcherCache.enable > Coincidentally, the spark.mesos.fetcherCache.enable option was previously > misnamed, as referenced in the linked JIRA. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes
[ https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-26688: Assignee: Attila Zsolt Piros > Provide configuration of initially blacklisted YARN nodes > - > > Key: SPARK-26688 > URL: https://issues.apache.org/jira/browse/SPARK-26688 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > > Introducing new config for initially blacklisted YARN nodes. > This came up in the apache spark user mailing list: > [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf
[ https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26192: -- Priority: Minor (was: Major) > MesosClusterScheduler reads options from dispatcher conf instead of > submission conf > --- > > Key: SPARK-26192 > URL: https://issues.apache.org/jira/browse/SPARK-26192 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Martin Loncaric >Assignee: Martin Loncaric >Priority: Minor > Fix For: 3.0.0 > > > There is at least one option accessed in MesosClusterScheduler that should > come from the submission's configuration instead of the dispatcher's: > spark.mesos.fetcherCache.enable > Coincidentally, the spark.mesos.fetcherCache.enable option was previously > misnamed, as referenced in the linked JIRA. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes
[ https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-26688. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23616 [https://github.com/apache/spark/pull/23616] > Provide configuration of initially blacklisted YARN nodes > - > > Key: SPARK-26688 > URL: https://issues.apache.org/jira/browse/SPARK-26688 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > > Introducing new config for initially blacklisted YARN nodes. > This came up in the apache spark user mailing list: > [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf
[ https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783745#comment-16783745 ] Dongjoon Hyun commented on SPARK-26192: --- [~mwlon]. Thank you for reporting and making a PR. However, please don't set 'Fix Versions`. `Fixed Version` and `Target Version` are used in the different way. Please refer the contribution guide. - https://spark.apache.org/contributing.html For me, this is a minor improvement for Spark 3.0. > MesosClusterScheduler reads options from dispatcher conf instead of > submission conf > --- > > Key: SPARK-26192 > URL: https://issues.apache.org/jira/browse/SPARK-26192 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Martin Loncaric >Assignee: Martin Loncaric >Priority: Major > Fix For: 3.0.0 > > > There is at least one option accessed in MesosClusterScheduler that should > come from the submission's configuration instead of the dispatcher's: > spark.mesos.fetcherCache.enable > Coincidentally, the spark.mesos.fetcherCache.enable option was previously > misnamed, as referenced in the linked JIRA. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf
[ https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26192. --- Resolution: Fixed Assignee: Martin Loncaric Fix Version/s: (was: 2.3.4) (was: 2.4.1) This is resolved via https://github.com/apache/spark/pull/23924 . > MesosClusterScheduler reads options from dispatcher conf instead of > submission conf > --- > > Key: SPARK-26192 > URL: https://issues.apache.org/jira/browse/SPARK-26192 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Martin Loncaric >Assignee: Martin Loncaric >Priority: Major > Fix For: 3.0.0 > > > There is at least one option accessed in MesosClusterScheduler that should > come from the submission's configuration instead of the dispatcher's: > spark.mesos.fetcherCache.enable > Coincidentally, the spark.mesos.fetcherCache.enable option was previously > misnamed, as referenced in the linked JIRA. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf
[ https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26192: -- Issue Type: Improvement (was: Bug) > MesosClusterScheduler reads options from dispatcher conf instead of > submission conf > --- > > Key: SPARK-26192 > URL: https://issues.apache.org/jira/browse/SPARK-26192 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Martin Loncaric >Assignee: Martin Loncaric >Priority: Major > Fix For: 3.0.0 > > > There is at least one option accessed in MesosClusterScheduler that should > come from the submission's configuration instead of the dispatcher's: > spark.mesos.fetcherCache.enable > Coincidentally, the spark.mesos.fetcherCache.enable option was previously > misnamed, as referenced in the linked JIRA. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26961) Found Java-level deadlock in Spark Driver
[ https://issues.apache.org/jira/browse/SPARK-26961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783727#comment-16783727 ] Mi Zi commented on SPARK-26961:
---
Hi Ajith, IMHO ClassLoader.registerAsParallelCapable() only helps to reduce the granularity of the lock. The lock is still shared by loadClass calls with the same "className", so theoretically a deadlock can still be triggered in certain cases.

> Found Java-level deadlock in Spark Driver
> -
>
> Key: SPARK-26961
> URL: https://issues.apache.org/jira/browse/SPARK-26961
> Project: Spark
> Issue Type: Bug
> Components: Spark Submit
> Affects Versions: 2.3.0
> Reporter: Rong Jialei
> Priority: Major
>
> Our Spark jobs usually finish in minutes; recently, however, we found one taking days to run, and we could only kill it when this happened.
> An investigation showed that no worker container could connect to the driver after start, and the driver was hanging. Using jstack, we found a Java-level deadlock.
>
> *Jstack output for the deadlock part is shown below:*
>
> Found one Java-level deadlock:
> =
> "SparkUI-907":
> waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a org.apache.hadoop.conf.Configuration),
> which is held by "ForkJoinPool-1-worker-57"
> "ForkJoinPool-1-worker-57":
> waiting to lock monitor 0x7f3860574298 (object 0x0005b7991168, a org.apache.spark.util.MutableURLClassLoader),
> which is held by "ForkJoinPool-1-worker-7"
> "ForkJoinPool-1-worker-7":
> waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a org.apache.hadoop.conf.Configuration),
> which is held by "ForkJoinPool-1-worker-57"
> Java stack information for the threads listed above:
> ===
> "SparkUI-907":
> at org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1328)
> - waiting to lock <0x0005c0c1e5e0> (a org.apache.hadoop.conf.Configuration)
> at org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:684)
> at org.apache.hadoop.conf.Configuration.get(Configuration.java:1088)
> at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1145)
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2363)
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
> at org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
> at java.net.URL.getURLStreamHandler(URL.java:1142)
> at java.net.URL.<init>(URL.java:599)
> at java.net.URL.<init>(URL.java:490)
> at java.net.URL.<init>(URL.java:439)
> at org.apache.spark.ui.JettyUtils$$anon$4.doRequest(JettyUtils.scala:176)
> at org.apache.spark.ui.JettyUtils$$anon$4.doGet(JettyUtils.scala:161)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
> at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
> at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:171)
> at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
> at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
> at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.spark_project.jetty.server.Server.handle(Server.java:534)
> at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:320)
> at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
> at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
> at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
> at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
> at
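The jstack above shows the classic lock-inversion pattern: one thread holds the Configuration monitor and wants the class-loader monitor, while another holds them in the opposite order. The sketch below is a minimal, hypothetical Python illustration of the standard remedy (every thread acquires the locks in one agreed global order); it is not Spark's code, and the two lock names are just stand-ins for the two monitors in the trace.

```python
import threading

# Hypothetical stand-ins for the two monitors seen in the jstack:
# a Hadoop Configuration object and a MutableURLClassLoader.
config_lock = threading.Lock()
class_loader_lock = threading.Lock()

# The deadlock arises when one thread takes config_lock then
# class_loader_lock while another takes them in the opposite order.
# A global acquisition order that every thread follows removes the cycle.
LOCK_ORDER = [config_lock, class_loader_lock]

def with_both_locks(action):
    """Acquire both locks in the canonical order, run action, release in reverse."""
    for lock in LOCK_ORDER:
        lock.acquire()
    try:
        return action()
    finally:
        for lock in reversed(LOCK_ORDER):
            lock.release()

results = []
threads = [
    threading.Thread(target=lambda: results.append(with_both_locks(lambda: True)))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Both threads finish because they agree on the lock order.
```

registerAsParallelCapable(), as the comment notes, only shrinks the class-loader lock to per-class-name granularity; it does not impose an order between the class-loader lock and other monitors, so the cycle above remains possible.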
[jira] [Assigned] (SPARK-24120) Show `Jobs` page when `jobId` is missing
[ https://issues.apache.org/jira/browse/SPARK-24120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24120:
Assignee: (was: Apache Spark)

> Show `Jobs` page when `jobId` is missing
>
> Key: SPARK-24120
> URL: https://issues.apache.org/jira/browse/SPARK-24120
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 2.3.0
> Reporter: Jongyoul Lee
> Priority: Minor
>
> Currently, when users open the {{job}} page without a {{jobid}}, the Spark UI shows only an error page. That is not incorrect, but it is unhelpful to users; it would be better to redirect to the `jobs` page so they can select the proper job. In practice this happens in YARN mode: because of a YARN bug (YARN-6615), some parameters aren't passed to Spark's driver UI, even with the latest version of YARN. It's also mentioned in SPARK-20772.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
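The fallback the ticket proposes can be sketched in a few lines. This is a hypothetical handler, not Spark's actual JettyUtils code; the function name and routes are illustrative only.

```python
def job_page_target(params):
    """Return the page to render for a single-job request.

    If 'id' is missing or malformed (as happens when YARN's proxy drops
    query parameters, YARN-6615), fall back to the jobs overview page
    instead of rendering an error page.
    """
    job_id = params.get("id")
    if job_id is None or not job_id.isdigit():
        return "/jobs/"                    # redirect to the jobs list
    return f"/jobs/job/?id={job_id}"       # render the single-job page

print(job_page_target({}))                 # '/jobs/'
print(job_page_target({"id": "7"}))        # '/jobs/job/?id=7'
```

The design point is simply that a missing parameter is treated as a navigation problem (redirect) rather than a failure (error page).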
[jira] [Assigned] (SPARK-24120) Show `Jobs` page when `jobId` is missing
[ https://issues.apache.org/jira/browse/SPARK-24120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24120:
Assignee: Apache Spark

> Show `Jobs` page when `jobId` is missing
>
> Key: SPARK-24120
> URL: https://issues.apache.org/jira/browse/SPARK-24120
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 2.3.0
> Reporter: Jongyoul Lee
> Assignee: Apache Spark
> Priority: Minor
>
> Currently, when users open the {{job}} page without a {{jobid}}, the Spark UI shows only an error page. That is not incorrect, but it is unhelpful to users; it would be better to redirect to the `jobs` page so they can select the proper job. In practice this happens in YARN mode: because of a YARN bug (YARN-6615), some parameters aren't passed to Spark's driver UI, even with the latest version of YARN. It's also mentioned in SPARK-20772.
[jira] [Commented] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors
[ https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783711#comment-16783711 ] peay commented on SPARK-27039:
--
Interesting, thanks for checking. Yes, I can definitely live without that until 3.0. Is there already a timeline for 3.0?

> toPandas with Arrow swallows maxResultSize errors
> -
>
> Key: SPARK-27039
> URL: https://issues.apache.org/jira/browse/SPARK-27039
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.0
> Reporter: peay
> Priority: Minor
>
> I am running the following simple `toPandas` with {{maxResultSize}} set to 1mb:
> {code:java}
> import pyspark.sql.functions as F
> df = spark.range(1000 * 1000)
> df_pd = df.withColumn("test", F.lit("this is a long string that should make the resulting dataframe too large for maxResult which is 1m")).toPandas()
> {code}
>
> With {{spark.sql.execution.arrow.enabled}} set to {{true}}, this returns an empty Pandas dataframe without any error:
> {code:python}
> df_pd.info()
> #
> # Index: 0 entries
> # Data columns (total 2 columns):
> # id 0 non-null object
> # test 0 non-null object
> # dtypes: object(2)
> # memory usage: 0.0+ bytes
> {code}
> The driver stderr does have an error, and so does the Spark UI:
> {code:java}
> ERROR TaskSetManager: Total size of serialized results of 1 tasks (52.8 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
> ERROR TaskSetManager: Total size of serialized results of 2 tasks (105.7 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
> Exception in thread "serve-Arrow" org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1 tasks (52.8 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2039)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2027)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2026)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2026)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
> at scala.Option.foreach(Option.scala:257)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2260)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2209)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2198)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
> at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3313)
> at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3282)
> at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply$mcV$sp(PythonRDD.scala:435)
> at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
> at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:436)
> at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:432)
> at org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:862)
> {code}
> With {{spark.sql.execution.arrow.enabled}} set to {{false}}, the Python call to {{toPandas}} does fail as expected.
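The TaskSetManager errors above come from a driver-side check that the total serialized result size stays under spark.driver.maxResultSize; the bug is that the Arrow path does not surface that failure to the Python caller. The following pure-Python sketch mimics that kind of cumulative size check so the behaviour is concrete; the function name is hypothetical and this is not Spark's implementation.

```python
import pickle

def check_result_size(partitions, max_result_bytes):
    """Accumulate serialized partition sizes and raise once a limit is
    exceeded, mimicking the driver's spark.driver.maxResultSize check.

    Raising (instead of silently returning a truncated or empty result)
    is exactly what the Arrow-enabled toPandas path in 2.4.0 fails to do.
    """
    total = 0
    for i, part in enumerate(partitions, start=1):
        total += len(pickle.dumps(part))
        if total > max_result_bytes:
            raise RuntimeError(
                f"Total size of serialized results of {i} tasks "
                f"({total} bytes) is bigger than maxResultSize "
                f"({max_result_bytes} bytes)")
    return total
```

Note that the check is cumulative across tasks, which is why the log shows one ERROR line per task (1 task, then 2 tasks, ...) as partial results keep arriving.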
[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor
[ https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25564:
Assignee: Apache Spark

> Add output bytes metrics for each Executor
> --
>
> Key: SPARK-25564
> URL: https://issues.apache.org/jira/browse/SPARK-25564
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: Lantao Jin
> Assignee: Apache Spark
> Priority: Minor
>
> LiveExecutor only tracks the total input bytes; the total output bytes for each executor is equally important.
[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor
[ https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25564:
Assignee: (was: Apache Spark)

> Add output bytes metrics for each Executor
> --
>
> Key: SPARK-25564
> URL: https://issues.apache.org/jira/browse/SPARK-25564
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: Lantao Jin
> Priority: Minor
>
> LiveExecutor only tracks the total input bytes; the total output bytes for each executor is equally important.
[jira] [Assigned] (SPARK-24120) Show `Jobs` page when `jobId` is missing
[ https://issues.apache.org/jira/browse/SPARK-24120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24120:
--
Assignee: Marcelo Vanzin

> Show `Jobs` page when `jobId` is missing
>
> Key: SPARK-24120
> URL: https://issues.apache.org/jira/browse/SPARK-24120
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 2.3.0
> Reporter: Jongyoul Lee
> Assignee: Marcelo Vanzin
> Priority: Minor
>
> Currently, when users open the {{job}} page without a {{jobid}}, the Spark UI shows only an error page. That is not incorrect, but it is unhelpful to users; it would be better to redirect to the `jobs` page so they can select the proper job. In practice this happens in YARN mode: because of a YARN bug (YARN-6615), some parameters aren't passed to Spark's driver UI, even with the latest version of YARN. It's also mentioned in SPARK-20772.
[jira] [Assigned] (SPARK-24120) Show `Jobs` page when `jobId` is missing
[ https://issues.apache.org/jira/browse/SPARK-24120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24120:
--
Assignee: (was: Marcelo Vanzin)

> Show `Jobs` page when `jobId` is missing
>
> Key: SPARK-24120
> URL: https://issues.apache.org/jira/browse/SPARK-24120
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 2.3.0
> Reporter: Jongyoul Lee
> Priority: Minor
>
> Currently, when users open the {{job}} page without a {{jobid}}, the Spark UI shows only an error page. That is not incorrect, but it is unhelpful to users; it would be better to redirect to the `jobs` page so they can select the proper job. In practice this happens in YARN mode: because of a YARN bug (YARN-6615), some parameters aren't passed to Spark's driver UI, even with the latest version of YARN. It's also mentioned in SPARK-20772.
[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor
[ https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25564:
Assignee: Marcelo Vanzin (was: Apache Spark)

> Add output bytes metrics for each Executor
> --
>
> Key: SPARK-25564
> URL: https://issues.apache.org/jira/browse/SPARK-25564
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: Lantao Jin
> Assignee: Marcelo Vanzin
> Priority: Minor
>
> LiveExecutor only tracks the total input bytes; the total output bytes for each executor is equally important.
[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor
[ https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25564:
--
Assignee: Marcelo Vanzin

> Add output bytes metrics for each Executor
> --
>
> Key: SPARK-25564
> URL: https://issues.apache.org/jira/browse/SPARK-25564
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: Lantao Jin
> Assignee: Marcelo Vanzin
> Priority: Minor
>
> LiveExecutor only tracks the total input bytes; the total output bytes for each executor is equally important.
[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor
[ https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25564:
--
Assignee: (was: Marcelo Vanzin)

> Add output bytes metrics for each Executor
> --
>
> Key: SPARK-25564
> URL: https://issues.apache.org/jira/browse/SPARK-25564
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: Lantao Jin
> Priority: Minor
>
> LiveExecutor only tracks the total input bytes; the total output bytes for each executor is equally important.
[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor
[ https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25564:
Assignee: Apache Spark (was: Marcelo Vanzin)

> Add output bytes metrics for each Executor
> --
>
> Key: SPARK-25564
> URL: https://issues.apache.org/jira/browse/SPARK-25564
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: Lantao Jin
> Assignee: Apache Spark
> Priority: Minor
>
> LiveExecutor only tracks the total input bytes; the total output bytes for each executor is equally important.
[jira] [Commented] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors
[ https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783690#comment-16783690 ] Bryan Cutler commented on SPARK-27039:
--
I was able to reproduce in v2.4.0, but it looks like current master raises an error in the driver and does not return an empty Pandas DataFrame. This is probably due to some of the recent changes in toPandas() with Arrow enabled.
{noformat}
In [4]: spark.conf.set('spark.sql.execution.arrow.enabled', True)

In [5]: import pyspark.sql.functions as F
   ...: df = spark.range(1000 * 1000)
   ...: df_pd = df.withColumn("test", F.lit("this is a long string that should make the resulting dataframe too large for maxRe
   ...: sult which is 1m")).toPandas()
   ...:
19/03/04 10:54:56 ERROR TaskSetManager: Total size of serialized results of 1 tasks (13.2 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
19/03/04 10:54:56 ERROR TaskSetManager: Total size of serialized results of 2 tasks (26.4 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
Exception in thread "serve-Arrow" org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1 tasks (13.2 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1938)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1926)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1925)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1925)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:935)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:935)
at scala.Option.foreach(Option.scala:274)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:935)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2155)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2104)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2093)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:746)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2008)
at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3300)
at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3265)
at org.apache.spark.api.python.PythonRDD$.$anonfun$serveToStream$2(PythonRDD.scala:442)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319)
at org.apache.spark.api.python.PythonRDD$.$anonfun$serveToStream$1(PythonRDD.scala:444)
at org.apache.spark.api.python.PythonRDD$.$anonfun$serveToStream$1$adapted(PythonRDD.scala:439)
at org.apache.spark.api.python.PythonServer$$anon$3.run(PythonRDD.scala:890)
/home/bryan/git/spark/python/pyspark/sql/dataframe.py:2129: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.fallback.enabled' does not have an effect on failures in the middle of computation.
  warnings.warn(msg)
19/03/04 10:54:56 ERROR TaskSetManager: Total size of serialized results of 3 tasks (39.6 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
[Stage 0:==>(1 + 7) / 8][Stage 1:> (0 + 8) / 8]
---
EOFError Traceback (most recent call last)
in ()
      1 import pyspark.sql.functions as F
      2 df = spark.range(1000 * 1000)
> 3 df_pd = df.withColumn("test", F.lit("this is a long string that should make the resulting dataframe too large for maxResult which is 1m")).toPandas()

/home/bryan/git/spark/python/pyspark/sql/dataframe.pyc in toPandas(self)
   2111 _check_dataframe_localize_timestamps
   2112 import pyarrow
-> 2113 batches = self._collectAsArrow()
   2114 if len(batches) > 0:
   2115 table =
[jira] [Assigned] (SPARK-26792) Apply custom log URL to Spark UI
[ https://issues.apache.org/jira/browse/SPARK-26792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-26792:
--
Assignee: Jungtaek Lim

> Apply custom log URL to Spark UI
>
> Key: SPARK-26792
> URL: https://issues.apache.org/jira/browse/SPARK-26792
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 3.0.0
> Reporter: Jungtaek Lim
> Assignee: Jungtaek Lim
> Priority: Major
>
> SPARK-23155 enables the SHS to set up custom log URLs for incomplete / completed apps.
> While getting reviews on SPARK-23155, I received two comments suggesting that applying custom log URLs to the UI would help achieve this. Quoting those comments here:
> https://github.com/apache/spark/pull/23260#issuecomment-456827963
> {quote}
> Sorry I haven't had time to look through all the code so this might be a separate jira, but one thing I thought of here is it would be really nice not to have specifically stderr/stdout. Users can specify any log4j.properties, and some tools like Oozie by default end up using the Hadoop log4j rather than the Spark log4j, so the files aren't necessarily the same. Also users can put in other log files, so it would be nice to have links to those from the UI. It seems simpler if we just had a link to the directory and it read the files within there. Other things in Hadoop do it this way, but I'm not sure if that works well for other resource managers, any thoughts on that? As long as this doesn't prevent the above I can file a separate jira for it.
> {quote}
> https://github.com/apache/spark/pull/23260#issuecomment-456904716
> {quote}
> Hi Tom, +1: singling out stdout and stderr is definitely an annoyance. We typically configure Spark jobs to write the GC log and dump heap on OOM using , and/or we use the rolling file appender to deal with large logs during debugging. So linking the YARN container log overview page would make much more sense for us. We work around it with a custom submit process that logs all important URLs on the submit-side log.
> {quote}
[jira] [Resolved] (SPARK-26792) Apply custom log URL to Spark UI
[ https://issues.apache.org/jira/browse/SPARK-26792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26792.
Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23790
[https://github.com/apache/spark/pull/23790]

> Apply custom log URL to Spark UI
>
> Key: SPARK-26792
> URL: https://issues.apache.org/jira/browse/SPARK-26792
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 3.0.0
> Reporter: Jungtaek Lim
> Assignee: Jungtaek Lim
> Priority: Major
> Fix For: 3.0.0
>
> SPARK-23155 enables the SHS to set up custom log URLs for incomplete / completed apps.
> While getting reviews on SPARK-23155, I received two comments suggesting that applying custom log URLs to the UI would help achieve this. Quoting those comments here:
> https://github.com/apache/spark/pull/23260#issuecomment-456827963
> {quote}
> Sorry I haven't had time to look through all the code so this might be a separate jira, but one thing I thought of here is it would be really nice not to have specifically stderr/stdout. Users can specify any log4j.properties, and some tools like Oozie by default end up using the Hadoop log4j rather than the Spark log4j, so the files aren't necessarily the same. Also users can put in other log files, so it would be nice to have links to those from the UI. It seems simpler if we just had a link to the directory and it read the files within there. Other things in Hadoop do it this way, but I'm not sure if that works well for other resource managers, any thoughts on that? As long as this doesn't prevent the above I can file a separate jira for it.
> {quote}
> https://github.com/apache/spark/pull/23260#issuecomment-456904716
> {quote}
> Hi Tom, +1: singling out stdout and stderr is definitely an annoyance. We typically configure Spark jobs to write the GC log and dump heap on OOM using , and/or we use the rolling file appender to deal with large logs during debugging. So linking the YARN container log overview page would make much more sense for us. We work around it with a custom submit process that logs all important URLs on the submit-side log.
> {quote}
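A custom log URL in the SPARK-23155 sense is a template with placeholders that the history server expands per executor. The sketch below shows how such a pattern can be expanded; the placeholder names ({{NM_HOST}}, {{FILE_NAME}}, etc.) follow the style used for the YARN NodeManager log page, but treat the exact set of supported placeholders as illustrative rather than authoritative.

```python
import re

def expand_log_url(pattern, attrs):
    """Replace every {{KEY}} placeholder in pattern with attrs[KEY].

    Raises KeyError for a placeholder with no corresponding attribute,
    so misconfigured patterns fail loudly instead of producing a
    half-expanded URL.
    """
    def repl(match):
        key = match.group(1)
        if key not in attrs:
            raise KeyError(f"no attribute for placeholder {key}")
        return str(attrs[key])
    return re.sub(r"\{\{(\w+)\}\}", repl, pattern)

# Example pattern pointing at a YARN NodeManager container-log page.
pattern = ("{{HTTP_SCHEME}}{{NM_HOST}}:{{NM_HTTP_PORT}}"
           "/node/containerlogs/{{CONTAINER_ID}}/{{USER}}/{{FILE_NAME}}")
url = expand_log_url(pattern, {
    "HTTP_SCHEME": "http://",
    "NM_HOST": "nm1.example.com",
    "NM_HTTP_PORT": 8042,
    "CONTAINER_ID": "container_1_0001_01_000002",
    "USER": "alice",
    "FILE_NAME": "stderr",
})
print(url)
```

Dropping {{FILE_NAME}} from the pattern is what gives the "link to the log directory" behaviour the first quoted comment asks for, since the URL then points at the container log overview page rather than a specific file.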
[jira] [Assigned] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24135:
Assignee: Apache Spark

> [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.3.0
> Reporter: Matt Cheah
> Assignee: Apache Spark
> Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after having been started or if executors hit the {{ERROR}} or {{DELETED}} states. When executors fail in these ways, they are removed from the pending executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod enters the {{Init:Error}} state. This state comes up when the executor fails to launch because one of its init-containers fails. Spark itself doesn't attach any init-containers to the executors. However, custom webhooks can run on the cluster and attach init-containers to the executor pods. Additionally, pod presets can specify init-containers to run on these pods. Therefore Spark should handle the {{Init:Error}} cases regardless of whether Spark itself is aware of init-containers.
> This class of error is particularly bad because when we hit this state, the failed executor will never start, but it's still seen as pending by the executor allocator. The executor allocator won't request more rounds of executors because its current batch hasn't been resolved to either running or failed. Therefore we end up stuck with the number of executors that successfully started before the faulty one failed to start, potentially creating an artificial resource bottleneck.
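The fix the ticket argues for amounts to widening the set of pod states the allocator treats as terminal failures. The following is a hypothetical Python stand-in for that classification logic, not KubernetesClusterSchedulerBackend itself; the state strings mirror those named in the description.

```python
# Pod states the scheduler backend should treat as terminal failures.
# The bug is that Init:Error (a failed init-container, e.g. one injected
# by a webhook or pod preset) was missing from this set, so the pod was
# counted as pending forever and no replacement was ever requested.
FAILED_STATES = {"Error", "Deleted", "Init:Error"}

def classify_executor_pod(status):
    """Map a pod status to the allocator's view of the executor."""
    if status in FAILED_STATES:
        return "failed"    # free the slot, request a replacement executor
    if status == "Running":
        return "running"
    return "pending"       # still counts against the current batch

print(classify_executor_pod("Init:Error"))  # 'failed'
```

With Init:Error classified as failed, the current batch resolves, so the allocator can request the next round of executors instead of stalling at a reduced pool size.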
[jira] [Assigned] (SPARK-25681) Delegation Tokens fetched twice upon HadoopFSDelegationTokenProvider creation
[ https://issues.apache.org/jira/browse/SPARK-25681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25681:
--
Assignee: (was: Marcelo Vanzin)

> Delegation Tokens fetched twice upon HadoopFSDelegationTokenProvider creation
> -
>
> Key: SPARK-25681
> URL: https://issues.apache.org/jira/browse/SPARK-25681
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes, Mesos, YARN
> Affects Versions: 2.5.0
> Reporter: Ilan Filonenko
> Priority: Major
> Labels: Hadoop, Kerberos
>
> Looking for a refactor to {{HadoopFSDelegationTokenProvider}}. Within the function {{obtainDelegationTokens()}}, this code block:
> {code:java}
> val fetchCreds = fetchDelegationTokens(getTokenRenewer(hadoopConf),...)
> // Get the token renewal interval if it is not set. It will only be called once.
> if (tokenRenewalInterval == null) {
>   tokenRenewalInterval = getTokenRenewalInterval(...)
> }{code}
> calls {{fetchDelegationTokens()}} twice, since {{tokenRenewalInterval}} will always be null upon creation of the {{TokenManager}}, which I think is unnecessary in the case of Kubernetes (as you are creating 2 DTs when only one is needed). Could this possibly be refactored to only call {{fetchDelegationTokens()}} once upon startup, or to have a param to specify {{tokenRenewalInterval}}?
[jira] [Resolved] (SPARK-26995) Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when using snappy
[ https://issues.apache.org/jira/browse/SPARK-26995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26995.
Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23898
[https://github.com/apache/spark/pull/23898]

> Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when using snappy
> -
>
> Key: SPARK-26995
> URL: https://issues.apache.org/jira/browse/SPARK-26995
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.3.0, 2.4.0
> Reporter: Luca Canali
> Assignee: Luca Canali
> Priority: Minor
> Fix For: 3.0.0
>
> Running Spark in a Docker image with Alpine Linux 3.9.0 throws errors when using snappy.
> The issue can be reproduced, for example, as follows:
> `Seq(1,2).toDF("id").write.format("parquet").save("DELETEME1")`
> The key part of the error stack is as follows: `Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so: Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so)`
> The source of the error appears to be that libsnappyjava.so needs ld-linux-x86-64.so.2 and looks for it in /lib, while in Alpine Linux 3.9.0 with libc6-compat version 1.1.20-r3, ld-linux-x86-64.so.2 is located in /lib64.
> Note: this issue is not present with Alpine Linux 3.8 and libc6-compat version 1.1.19-r10.
[jira] [Assigned] (SPARK-25681) Delegation Tokens fetched twice upon HadoopFSDelegationTokenProvider creation
[ https://issues.apache.org/jira/browse/SPARK-25681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25681: -- Assignee: Marcelo Vanzin > Delegation Tokens fetched twice upon HadoopFSDelegationTokenProvider creation > - > > Key: SPARK-25681 > URL: https://issues.apache.org/jira/browse/SPARK-25681 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Mesos, YARN >Affects Versions: 2.5.0 >Reporter: Ilan Filonenko >Assignee: Marcelo Vanzin >Priority: Major > Labels: Hadoop, Kerberos > > Looking for a refactor to {{HadoopFSDelegationTokenProvider.}} Within the > function {{obtainDelegationTokens()}}: > This code-block: > {code:java} > val fetchCreds = fetchDelegationTokens(getTokenRenewer(hadoopConf),...) > // Get the token renewal interval if it is not set. It will only be > called once. > if (tokenRenewalInterval == null) { > tokenRenewalInterval = getTokenRenewalInterval(...) > }{code} > calls {{fetchDelegationTokens()}} twice since the {{tokenRenewalInterval}} > will always be null upon creation of the {{TokenManager}} which I think is > unnecessary in the case of Kubernetes (as you are creating 2 DTs when only > one is needed.) Could this possibly be refactored to only call > {{fetchDelegationTokens()}} once upon startup or to have a param to specify > {{tokenRenewalInterval}}
[jira] [Assigned] (SPARK-25750) Integration Testing for Kerberos Support for Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25750: Assignee: Apache Spark > Integration Testing for Kerberos Support for Spark on Kubernetes > > > Key: SPARK-25750 > URL: https://issues.apache.org/jira/browse/SPARK-25750 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Ilan Filonenko >Assignee: Apache Spark >Priority: Major > > Integration testing for Secure HDFS interaction for Spark on Kubernetes.
[jira] [Assigned] (SPARK-25750) Integration Testing for Kerberos Support for Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25750: Assignee: (was: Apache Spark) > Integration Testing for Kerberos Support for Spark on Kubernetes > > > Key: SPARK-25750 > URL: https://issues.apache.org/jira/browse/SPARK-25750 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Ilan Filonenko >Priority: Major > > Integration testing for Secure HDFS interaction for Spark on Kubernetes.
[jira] [Assigned] (SPARK-25750) Integration Testing for Kerberos Support for Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25750: -- Assignee: Marcelo Vanzin > Integration Testing for Kerberos Support for Spark on Kubernetes > > > Key: SPARK-25750 > URL: https://issues.apache.org/jira/browse/SPARK-25750 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Ilan Filonenko >Assignee: Marcelo Vanzin >Priority: Major > > Integration testing for Secure HDFS interaction for Spark on Kubernetes.
[jira] [Assigned] (SPARK-25750) Integration Testing for Kerberos Support for Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25750: -- Assignee: (was: Marcelo Vanzin) > Integration Testing for Kerberos Support for Spark on Kubernetes > > > Key: SPARK-25750 > URL: https://issues.apache.org/jira/browse/SPARK-25750 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Ilan Filonenko >Priority: Major > > Integration testing for Secure HDFS interaction for Spark on Kubernetes.
[jira] [Assigned] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24135: Assignee: (was: Apache Spark) > [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. This state comes up when the executor fails > to launch because one of its init-containers fails. Spark itself doesn't > attach any init-containers to the executors. However, custom webhooks can > run on the cluster and attach init-containers to the executor pods. > Additionally, pod presets can specify init containers to run on these pods. > Therefore Spark should handle the {{Init:Error}} cases regardless of whether > Spark itself is aware of init-containers or not. > This class of error is particularly bad because when we hit this state, the > failed executor will never start, but it's still seen as pending by the > executor allocator. The executor allocator won't request more rounds of > executors because its current batch hasn't been resolved to either running or > failed. We therefore end up stuck with the number of executors > that successfully started before the faulty one failed to start, potentially > creating a fake resource bottleneck.
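The failure mode described above suggests the scheduler backend must also classify pods whose init containers failed. The following is a hedged sketch over a simplified, hypothetical pod-status model (the real code would read these fields from the Kubernetes API client), not the actual KubernetesClusterSchedulerBackend:

```scala
// Hypothetical, simplified model of the pod fields involved.
sealed trait ContainerState
case object ContainerRunning extends ContainerState
final case class ContainerTerminated(exitCode: Int) extends ContainerState

final case class PodStatus(
    phase: String, // e.g. "Pending", "Running", "Failed"
    initContainerStates: Seq[ContainerState])

object ExecutorPodSketch {
  // Treat the pod as failed either when its phase is terminally failed,
  // or when it is still Pending but an init container exited non-zero,
  // i.e. the Init:Error case the issue says is currently unhandled.
  def shouldRetryExecutor(status: PodStatus): Boolean =
    status.phase == "Failed" ||
      status.initContainerStates.exists {
        case ContainerTerminated(code) => code != 0
        case _                         => false
      }
}
```

Counting such pods as failed lets the allocator resolve its current batch and request replacement executors instead of stalling on a pod that will never start.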
[jira] [Assigned] (SPARK-25681) Delegation Tokens fetched twice upon HadoopFSDelegationTokenProvider creation
[ https://issues.apache.org/jira/browse/SPARK-25681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25681: Assignee: Apache Spark > Delegation Tokens fetched twice upon HadoopFSDelegationTokenProvider creation > - > > Key: SPARK-25681 > URL: https://issues.apache.org/jira/browse/SPARK-25681 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Mesos, YARN >Affects Versions: 2.5.0 >Reporter: Ilan Filonenko >Assignee: Apache Spark >Priority: Major > Labels: Hadoop, Kerberos > > Looking for a refactor to {{HadoopFSDelegationTokenProvider.}} Within the > function {{obtainDelegationTokens()}}: > This code-block: > {code:java} > val fetchCreds = fetchDelegationTokens(getTokenRenewer(hadoopConf),...) > // Get the token renewal interval if it is not set. It will only be > called once. > if (tokenRenewalInterval == null) { > tokenRenewalInterval = getTokenRenewalInterval(...) > }{code} > calls {{fetchDelegationTokens()}} twice since the {{tokenRenewalInterval}} > will always be null upon creation of the {{TokenManager}} which I think is > unnecessary in the case of Kubernetes (as you are creating 2 DTs when only > one is needed.) Could this possibly be refactored to only call > {{fetchDelegationTokens()}} once upon startup or to have a param to specify > {{tokenRenewalInterval}}
[jira] [Assigned] (SPARK-25681) Delegation Tokens fetched twice upon HadoopFSDelegationTokenProvider creation
[ https://issues.apache.org/jira/browse/SPARK-25681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25681: Assignee: (was: Apache Spark) > Delegation Tokens fetched twice upon HadoopFSDelegationTokenProvider creation > - > > Key: SPARK-25681 > URL: https://issues.apache.org/jira/browse/SPARK-25681 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Mesos, YARN >Affects Versions: 2.5.0 >Reporter: Ilan Filonenko >Priority: Major > Labels: Hadoop, Kerberos > > Looking for a refactor to {{HadoopFSDelegationTokenProvider.}} Within the > function {{obtainDelegationTokens()}}: > This code-block: > {code:java} > val fetchCreds = fetchDelegationTokens(getTokenRenewer(hadoopConf),...) > // Get the token renewal interval if it is not set. It will only be > called once. > if (tokenRenewalInterval == null) { > tokenRenewalInterval = getTokenRenewalInterval(...) > }{code} > calls {{fetchDelegationTokens()}} twice since the {{tokenRenewalInterval}} > will always be null upon creation of the {{TokenManager}} which I think is > unnecessary in the case of Kubernetes (as you are creating 2 DTs when only > one is needed.) Could this possibly be refactored to only call > {{fetchDelegationTokens()}} once upon startup or to have a param to specify > {{tokenRenewalInterval}}
[jira] [Assigned] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24135: -- Assignee: Marcelo Vanzin > [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Assignee: Marcelo Vanzin >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. This state comes up when the executor fails > to launch because one of its init-containers fails. Spark itself doesn't > attach any init-containers to the executors. However, custom webhooks can > run on the cluster and attach init-containers to the executor pods. > Additionally, pod presets can specify init containers to run on these pods. > Therefore Spark should handle the {{Init:Error}} cases regardless of whether > Spark itself is aware of init-containers or not. > This class of error is particularly bad because when we hit this state, the > failed executor will never start, but it's still seen as pending by the > executor allocator. The executor allocator won't request more rounds of > executors because its current batch hasn't been resolved to either running or > failed. We therefore end up stuck with the number of executors > that successfully started before the faulty one failed to start, potentially > creating a fake resource bottleneck.