[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-04 Thread Shahid K I (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784194#comment-16784194
 ] 

Shahid K I commented on SPARK-27019:


Could you please show a screenshot of the SQL page for the second scenario? I 
don't think it will display like that in that case. The issue happens only when 
the new live execution data is overwritten by the existing data.

> Spark UI's SQL tab shows inconsistent values
> 
>
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Major
> Attachments: Screenshot from 2019-03-01 21-31-48.png, 
> application_1550040445209_4748, query-1-details.png, query-1-list.png, 
> query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png
>
>
> Since 2.4.0, I am frequently seeing broken output in the SQL tab of the 
> Spark UI, where the submitted/duration values make no sense and the 
> description shows the ID instead of the actual description.
> When clicking the link to open a query, the SQL plan is missing as well.
> I have tried increasing `spark.scheduler.listenerbus.eventqueue.capacity` to 
> very large values like 30k, out of paranoia that we may have too many events, 
> but to no avail. I have not identified anything in particular that leads to 
> this: it doesn't occur in all my jobs, but it still occurs in a lot of them.
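
For reference, a minimal sketch (not from the original report) of how that 
queue capacity can be raised when building the session; the 30000 mirrors the 
"30k" value mentioned above and, per the report, did not resolve the issue.

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Raise the listener bus event queue capacity before the session is created.
// 30000 mirrors the value tried in the report; it did not help in this case.
val conf = new SparkConf()
  .set("spark.scheduler.listenerbus.eventqueue.capacity", "30000")

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}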



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26602:


Assignee: (was: Apache Spark)

> Insert into table fails after querying the UDF which is loaded with wrong 
> hdfs path
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
>
> In SQL:
> 1. Query an existing UDF (say myFunc1).
> 2. Create and select a UDF registered with an incorrect path (say myFunc2).
> 3. Now query the existing UDF again in the same session - it will throw an 
> exception stating that the resource at myFunc2's path couldn't be read.
> 4. Even basic operations like insert and select will then fail with the same 
> error.
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-04 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784172#comment-16784172
 ] 

peay edited comment on SPARK-27019 at 3/5/19 7:28 AM:
--

Great! 

-Is that compatible with my second observation above? (I tested without any 
executors, and even without any task starting, the SQL tab had the wrong 
output). I can try to get an event log for that as well if that's helpful.- 
edit: I tried to reproduce that to export the event log, and could not. Seems 
like your patch should address the issue.


was (Author: peay):
Great! Is that compatible with my second observation above? (I tested without 
any executors, and even without any task starting, the SQL tab had the wrong 
output). I can try to get an event log for that as well if that's helpful.

> Spark UI's SQL tab shows inconsistent values
> 
>
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Major
> Attachments: Screenshot from 2019-03-01 21-31-48.png, 
> application_1550040445209_4748, query-1-details.png, query-1-list.png, 
> query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png
>
>
> Since 2.4.0, I am frequently seeing broken output in the SQL tab of the 
> Spark UI, where the submitted/duration values make no sense and the 
> description shows the ID instead of the actual description.
> When clicking the link to open a query, the SQL plan is missing as well.
> I have tried increasing `spark.scheduler.listenerbus.eventqueue.capacity` to 
> very large values like 30k, out of paranoia that we may have too many events, 
> but to no avail. I have not identified anything in particular that leads to 
> this: it doesn't occur in all my jobs, but it still occurs in a lot of them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-04 Thread Shahid K I (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784179#comment-16784179
 ] 

Shahid K I commented on SPARK-27019:


Thanks [~peay]. Could you please share the event log for that too, if possible?

> Spark UI's SQL tab shows inconsistent values
> 
>
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Major
> Attachments: Screenshot from 2019-03-01 21-31-48.png, 
> application_1550040445209_4748, query-1-details.png, query-1-list.png, 
> query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png
>
>
> Since 2.4.0, I am frequently seeing broken output in the SQL tab of the 
> Spark UI, where the submitted/duration values make no sense and the 
> description shows the ID instead of the actual description.
> When clicking the link to open a query, the SQL plan is missing as well.
> I have tried increasing `spark.scheduler.listenerbus.eventqueue.capacity` to 
> very large values like 30k, out of paranoia that we may have too many events, 
> but to no avail. I have not identified anything in particular that leads to 
> this: it doesn't occur in all my jobs, but it still occurs in a lot of them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26602:


Assignee: Apache Spark

> Insert into table fails after querying the UDF which is loaded with wrong 
> hdfs path
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Assignee: Apache Spark
>Priority: Major
>
> In SQL:
> 1. Query an existing UDF (say myFunc1).
> 2. Create and select a UDF registered with an incorrect path (say myFunc2).
> 3. Now query the existing UDF again in the same session - it will throw an 
> exception stating that the resource at myFunc2's path couldn't be read.
> 4. Even basic operations like insert and select will then fail with the same 
> error.
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-04 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784172#comment-16784172
 ] 

peay commented on SPARK-27019:
--

Great! Is that compatible with my second observation above? (I tested without 
any executors, and even without any task starting, the SQL tab had the wrong 
output). I can try to get an event log for that as well if that's helpful.

> Spark UI's SQL tab shows inconsistent values
> 
>
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Major
> Attachments: Screenshot from 2019-03-01 21-31-48.png, 
> application_1550040445209_4748, query-1-details.png, query-1-list.png, 
> query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png
>
>
> Since 2.4.0, I am frequently seeing broken output in the SQL tab of the 
> Spark UI, where the submitted/duration values make no sense and the 
> description shows the ID instead of the actual description.
> When clicking the link to open a query, the SQL plan is missing as well.
> I have tried increasing `spark.scheduler.listenerbus.eventqueue.capacity` to 
> very large values like 30k, out of paranoia that we may have too many events, 
> but to no avail. I have not identified anything in particular that leads to 
> this: it doesn't occur in all my jobs, but it still occurs in a lot of them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2019-03-04 Thread Lewin Ma (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784164#comment-16784164
 ] 

Lewin Ma commented on SPARK-18105:
--

Still hit the same issue in Spark 2.3.1:

 
{code:java}
 

org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
  at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:523)
  at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:439)
  at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_1$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Stream is corrupted
  at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
  at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
  at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:170)
  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:348)
  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:335)
  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:335)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1380)
  at org.apache.spark.util.Utils$.copyStream(Utils.scala:356)
  at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:431)
  ... 21 more
{code}
 
 

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Major
>
> When lz4 is used to compress the shuffle files, it may fail to decompress it 
> as "stream is corrupt"
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>   at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at 

[jira] [Commented] (SPARK-26850) Make EventLoggingListener LOG_FILE_PERMISSIONS configurable

2019-03-04 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784159#comment-16784159
 ] 

Jungtaek Lim commented on SPARK-26850:
--

Looks like a duplicate of SPARK-26912, which already has a pull request.

> Make EventLoggingListener LOG_FILE_PERMISSIONS configurable
> ---
>
> Key: SPARK-26850
> URL: https://issues.apache.org/jira/browse/SPARK-26850
> Project: Spark
>  Issue Type: Wish
>  Components: Scheduler
>Affects Versions: 2.2.3, 2.3.2, 2.4.0
>Reporter: Hua Zhang
>Priority: Minor
>
> private[spark] object EventLoggingListener extends Logging {
> ...
> private val LOG_FILE_PERMISSIONS = new FsPermission(Integer.parseInt("770", 
> 8).toShort)
> ...
> }
>  
> Currently the event log files are hard-coded with permission 770.
> It would be helpful if this permission were +configurable+.
> Use case: the Spark application is submitted by user A, but the Spark history 
> server is started by user B. Currently user B cannot access the history event 
> files created by user A. With the permission set to 775, this would be possible.
>  
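
A rough sketch of what a configurable permission could look like; the config 
key name "spark.eventLog.permissions" below is an assumption for illustration, 
not an existing Spark setting.

{code:scala}
import org.apache.hadoop.fs.permission.FsPermission
import org.apache.spark.SparkConf

// Hypothetical helper: read the octal permission string from a config key
// instead of hard-coding "770". The key name is assumed, not an existing one.
def logFilePermissions(conf: SparkConf): FsPermission = {
  val octal = conf.get("spark.eventLog.permissions", "770")
  new FsPermission(Integer.parseInt(octal, 8).toShort)
}
{code}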



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24602) In Spark SQL, ALTER TABLE--CHANGE column1 column2 datatype is not supported in 2.3.1

2019-03-04 Thread Sushanta Sen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784155#comment-16784155
 ] 

Sushanta Sen commented on SPARK-24602:
--

This issue was logged prior to the other JIRAs mentioned in the Issue Links.

> In Spark SQL, ALTER TABLE--CHANGE column1 column2 datatype is not supported 
> in 2.3.1
> 
>
> Key: SPARK-24602
> URL: https://issues.apache.org/jira/browse/SPARK-24602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: OS: SUSE11
> Spark Version: 2.3
>Reporter: Sushanta Sen
>Priority: Major
>
> Precondition:
> Spark cluster 2.3 is up and running
> Test Steps:
>  # Launch Spark-sql
>  # spark-sql> CREATE TABLE t1(a int,string)   
>   0: jdbc:hive2://ha-cluster/default> *alter 
> table t1 change a a1 int;*
> Error: org.apache.spark.sql.AnalysisException: ALTER TABLE CHANGE COLUMN is 
> not supported for changing column 'a' with type 'IntegerType' to 'b' with 
> type 'IntegerType'; (state=,code=0)
>  # Launch hive beeline
>  # Repeat steps 1 & 2
>  # 0: jdbc:hive2://10.18.108.126:1/> desc del1;
> +-----------+------------+----------+
> | col_name  | data_type  | comment  |
> +-----------+------------+----------+
> | *a1*      | *int*      |          |
> | dob       | int        |          |
> +-----------+------------+----------+
> 2 rows selected (1.572 seconds)
> 0: jdbc:hive2://10.18.108.126:1/>{color:#205081} alter table del1 change 
> a1 a bigint;{color}
> No rows affected (0.425 seconds)
> 0: jdbc:hive2://10.18.108.126:1/> desc del1;
> +-----------+------------+----------+
> | col_name  | data_type  | comment  |
> +-----------+------------+----------+
> | *a*       | *bigint*   |          |
> | dob       | int        |          |
> +-----------+------------+----------+
> 2 rows selected (0.364 seconds)
>  
> Actual Result: In Spark SQL, ALTER TABLE ... CHANGE is not supported, whereas 
> in Hive beeline it works fine.
> Expected Result: ALTER TABLE ... CHANGE should be supported in Spark SQL as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26922) Set socket timeout consistently in Arrow optimization

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26922:


Assignee: (was: Apache Spark)

> Set socket timeout consistently in Arrow optimization
> -
>
> Key: SPARK-26922
> URL: https://issues.apache.org/jira/browse/SPARK-26922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> For instance, see 
> https://github.com/apache/spark/blob/e8982ca7ad94e98d907babf2d6f1068b7cd064c6/R/pkg/R/context.R#L184
> it should set the timeout from {{SPARKR_BACKEND_CONNECTION_TIMEOUT}}. Or 
> maybe we need another environment variable.
> This could be fixed together with other changes whenever the code around 
> there is touched.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27054) Remove Calcite dependency

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27054:


Assignee: Apache Spark

> Remove Calcite dependency
> -
>
> Key: SPARK-27054
> URL: https://issues.apache.org/jira/browse/SPARK-27054
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Calcite is only used for 
> [runSqlHive|https://github.com/apache/spark/blob/02bbe977abaf7006b845a7e99d612b0235aa0025/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L699-L705]
>  when 
> {{hive.cbo.enable=true}}([SemanticAnalyzer|https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java#L278-L280]).
> So we can disable {{hive.cbo.enable}} and remove Calcite dependency.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26922) Set socket timeout consistently in Arrow optimization

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26922:


Assignee: Apache Spark

> Set socket timeout consistently in Arrow optimization
> -
>
> Key: SPARK-26922
> URL: https://issues.apache.org/jira/browse/SPARK-26922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Trivial
>
> For instance, see 
> https://github.com/apache/spark/blob/e8982ca7ad94e98d907babf2d6f1068b7cd064c6/R/pkg/R/context.R#L184
> it should set the timeout from {{SPARKR_BACKEND_CONNECTION_TIMEOUT}}. Or 
> maybe we need another environment variable.
> This could be fixed together with other changes whenever the code around 
> there is touched.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path

2019-03-04 Thread Chakravarthi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chakravarthi reopened SPARK-26602:
--

> Insert into table fails after querying the UDF which is loaded with wrong 
> hdfs path
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
>
> In SQL:
> 1. Query an existing UDF (say myFunc1).
> 2. Create and select a UDF registered with an incorrect path (say myFunc2).
> 3. Now query the existing UDF again in the same session - it will throw an 
> exception stating that the resource at myFunc2's path couldn't be read.
> 4. Even basic operations like insert and select will then fail with the same 
> error.
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path

2019-03-04 Thread Chakravarthi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chakravarthi updated SPARK-26602:
-
Summary: Insert into table fails after querying the UDF which is loaded 
with wrong hdfs path  (was: Once creating and quering udf with incorrect 
path,followed by querying tables or functions registered with correct path 
gives the runtime exception within the same session)

> Insert into table fails after querying the UDF which is loaded with wrong 
> hdfs path
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
>
> In SQL:
> 1. Query an existing UDF (say myFunc1).
> 2. Create and select a UDF registered with an incorrect path (say myFunc2).
> 3. Now query the existing UDF again in the same session - it will throw an 
> exception stating that the resource at myFunc2's path couldn't be read.
> 4. Even basic operations like insert and select will then fail with the same 
> error.
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26850) Make EventLoggingListener LOG_FILE_PERMISSIONS configurable

2019-03-04 Thread Sandeep Katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784139#comment-16784139
 ] 

Sandeep Katta commented on SPARK-26850:
---

[~srowen] I feel this use case should be supported. What's your view on this?

> Make EventLoggingListener LOG_FILE_PERMISSIONS configurable
> ---
>
> Key: SPARK-26850
> URL: https://issues.apache.org/jira/browse/SPARK-26850
> Project: Spark
>  Issue Type: Wish
>  Components: Scheduler
>Affects Versions: 2.2.3, 2.3.2, 2.4.0
>Reporter: Hua Zhang
>Priority: Minor
>
> private[spark] object EventLoggingListener extends Logging {
> ...
> private val LOG_FILE_PERMISSIONS = new FsPermission(Integer.parseInt("770", 
> 8).toShort)
> ...
> }
>  
> Currently the event log files are hard-coded with permission 770.
> It would be helpful if this permission were +configurable+.
> Use case: the Spark application is submitted by user A, but the Spark history 
> server is started by user B. Currently user B cannot access the history event 
> files created by user A. With the permission set to 775, this would be possible.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26602) Once creating and quering udf with incorrect path,followed by querying tables or functions registered with correct path gives the runtime exception within the same ses

2019-03-04 Thread Chakravarthi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784137#comment-16784137
 ] 

Chakravarthi commented on SPARK-26602:
--

The problem is that even though the jar does not exist, it is added to 
addedJars in SparkContext.scala when a select on the UDF is performed.
So when the insert into the table happens, Spark tries to load the jars from 
listJars, and since the jar does not exist, it throws an exception.

The fix is to validate that the jar exists before adding it to addedJars. I 
have fixed it and will raise an MR.
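
A hypothetical sketch of that validation (not the actual patch): check that 
the jar is reachable before it is recorded.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: verify the jar path resolves to an existing file
// before it is added to addedJars, so a bad ADD JAR fails fast instead of
// breaking later statements in the same session.
def jarExists(jarPath: String, hadoopConf: Configuration): Boolean = {
  val path = new Path(jarPath)
  val fs = path.getFileSystem(hadoopConf)
  fs.exists(path)
}
{code}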


> Once creating and quering udf with incorrect path,followed by querying tables 
> or functions registered with correct path gives the runtime exception within 
> the same session
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
>
> In SQL:
> 1. Query an existing UDF (say myFunc1).
> 2. Create and select a UDF registered with an incorrect path (say myFunc2).
> 3. Now query the existing UDF again in the same session - it will throw an 
> exception stating that the resource at myFunc2's path couldn't be read.
> 4. Even basic operations like insert and select will then fail with the same 
> error.
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26602) Once creating and quering udf with incorrect path,followed by querying tables or functions registered with correct path gives the runtime exception within the same ses

2019-03-04 Thread Chakravarthi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784134#comment-16784134
 ] 

Chakravarthi commented on SPARK-26602:
--

Hi [~srowen], this issue is not a duplicate of SPARK-26560. Here the issue is 
that an insert into a table fails after querying a UDF that was loaded with a 
wrong HDFS path.

Below are the steps to reproduce this issue:

1) Create a table.
sql("create table check_udf(I int)");

2) Create a UDF using an invalid HDFS path.
sql("CREATE FUNCTION before_fix  AS 
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 
'hdfs:///tmp/notexist.jar'")

3) Do a select on the UDF and you will get an exception: "Failed to read external 
resource".
sql("select before_fix('2018-03-09')")

4) Perform an insert into the table.
sql("insert into check_udf values(1)").show

Here, the insert should work, but it fails.










> Once creating and quering udf with incorrect path,followed by querying tables 
> or functions registered with correct path gives the runtime exception within 
> the same session
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
>
> In SQL:
> 1. Query an existing UDF (say myFunc1).
> 2. Create and select a UDF registered with an incorrect path (say myFunc2).
> 3. Now query the existing UDF again in the same session - it will throw an 
> exception stating that the resource at myFunc2's path couldn't be read.
> 4. Even basic operations like insert and select will then fail with the same 
> error.
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26918) All .md should have ASF license header

2019-03-04 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784109#comment-16784109
 ] 

Felix Cheung edited comment on SPARK-26918 at 3/5/19 5:47 AM:
--

[~rmsm...@gmail.com] - you don't need to checkout a tag (or a release) - just 
checkout master into a local branch to test 


was (Author: felixcheung):
[~rmsm...@gmail.com] - you don't need to checkout a tag - just checkout master 
into a local branch to test 

> All .md should have ASF license header
> --
>
> Key: SPARK-26918
> URL: https://issues.apache.org/jira/browse/SPARK-26918
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Priority: Minor
>
> Per policy, all .md files should have the ASF license header, e.g. 
> [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]
>  or
> [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]
>  
> Currently this one does not:
> [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26920) Deduplicate type checking across Arrow optimization and vectorized APIs in SparkR

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26920:


Assignee: (was: Apache Spark)

> Deduplicate type checking across Arrow optimization and vectorized APIs in 
> SparkR
> -
>
> Key: SPARK-26920
> URL: https://issues.apache.org/jira/browse/SPARK-26920
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> There is duplication in the type checking across the Arrow <> SparkR code 
> paths. For instance,
> https://github.com/apache/spark/blob/8126d09fb5b969c1e293f1f8c41bec35357f74b5/R/pkg/R/group.R#L229-L253
> struct type and map type should also be restricted.
> We should pull it out as a separate function and add deduplicated tests 
> separately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26918) All .md should have ASF license header

2019-03-04 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784109#comment-16784109
 ] 

Felix Cheung commented on SPARK-26918:
--

[~rmsm...@gmail.com] - you don't need to checkout a tag - just checkout master 
into a local branch to test 

> All .md should have ASF license header
> --
>
> Key: SPARK-26918
> URL: https://issues.apache.org/jira/browse/SPARK-26918
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Priority: Minor
>
> Per policy, all .md files should have the ASF license header, e.g. 
> [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]
>  or
> [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]
>  
> Currently this one does not:
> [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27054) Remove Calcite dependency

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27054:


Assignee: (was: Apache Spark)

> Remove Calcite dependency
> -
>
> Key: SPARK-27054
> URL: https://issues.apache.org/jira/browse/SPARK-27054
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Calcite is only used for 
> [runSqlHive|https://github.com/apache/spark/blob/02bbe977abaf7006b845a7e99d612b0235aa0025/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L699-L705]
>  when 
> {{hive.cbo.enable=true}}([SemanticAnalyzer|https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java#L278-L280]).
> So we can disable {{hive.cbo.enable}} and remove Calcite dependency.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26920) Deduplicate type checking across Arrow optimization and vectorized APIs in SparkR

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26920:


Assignee: Apache Spark

> Deduplicate type checking across Arrow optimization and vectorized APIs in 
> SparkR
> -
>
> Key: SPARK-26920
> URL: https://issues.apache.org/jira/browse/SPARK-26920
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> There is duplication in the type checking across the Arrow <> SparkR code 
> paths. For instance,
> https://github.com/apache/spark/blob/8126d09fb5b969c1e293f1f8c41bec35357f74b5/R/pkg/R/group.R#L229-L253
> struct type and map type should also be restricted.
> We should pull it out as a separate function and add deduplicated tests 
> separately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26850) Make EventLoggingListener LOG_FILE_PERMISSIONS configurable

2019-03-04 Thread sandeep katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784107#comment-16784107
 ] 

sandeep katta commented on SPARK-26850:
---

[~happyhua] thanks for raising this issue, I will work on this and raise PR soon

> Make EventLoggingListener LOG_FILE_PERMISSIONS configurable
> ---
>
> Key: SPARK-26850
> URL: https://issues.apache.org/jira/browse/SPARK-26850
> Project: Spark
>  Issue Type: Wish
>  Components: Scheduler
>Affects Versions: 2.2.3, 2.3.2, 2.4.0
>Reporter: Hua Zhang
>Priority: Minor
>
> private[spark] object EventLoggingListener extends Logging {
> ...
> private val LOG_FILE_PERMISSIONS = new FsPermission(Integer.parseInt("770", 
> 8).toShort)
> ...
> }
>  
> Currently the event log files are hard-coded with permission 770.
> It would be helpful if this permission were +configurable+.
> Use case: the Spark application is submitted by user A, but the Spark history 
> server is started by user B. Currently user B cannot access the history event 
> files created by user A. With the permission set to 775, this would be possible.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24278) Create table if not exists is throwing table already exists exception

2019-03-04 Thread sandeep katta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeep katta resolved SPARK-24278.
---
Resolution: Invalid

It is an exception thrown by Hive, as mentioned above.

> Create table if not exists is throwing table already exists exception
> -
>
> Key: SPARK-24278
> URL: https://issues.apache.org/jira/browse/SPARK-24278
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: OS: SUSE11
> Spark Version: 2.3
>Reporter: Sushanta Sen
>Priority: Major
>
> # Launch Spark-sql
>  # create table check(time timestamp, name string, isright boolean, datetoday 
> date, num binary, height double, score float, decimaler decimal(10,0), id 
> tinyint, age int, license bigint, length smallint) row format delimited 
> fields terminated by ',' stored as textfile;
>  # create table if not exists check (time timestamp, name string, isright 
> boolean, datetoday date, num binary, height double, score float, decimaler 
> decimal(10,0), id tinyint, age int, license bigint, length smallint) row 
> format delimited fields terminated by ',' stored as TEXTFILE; - FAILED
>  
> Exception as below:
> spark-sql> create table if not exists check (col1 string);
> *2018-05-15 14:29:56 ERROR RetryingHMSHandler:159 -* 
> *AlreadyExistsException(message:Table check already exists)*
> *at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1372)*
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1449)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
> at com.sun.proxy.$Proxy8.create_table_with_environment_context(Unknown Source)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.create_table_with_environment_context(HiveMetaStoreClient.java:2050)
> at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.create_table_with_environment_context(SessionHiveMetaStoreClient.java:97)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:669)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:657)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
> at com.sun.proxy.$Proxy9.createTable(Unknown Source)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:714)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:468)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:466)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:466)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:466)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply$mcV$sp(HiveExternalCatalog.scala:258)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:216)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:216)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.doCreateTable(HiveExternalCatalog.scala:216)
> at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalog.createTable(ExternalCatalog.scala:119)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:304)
> at 
> org.apache.spark.sql.execution.command.CreateTableCommand.run(tables.scala:128)
> at 
> 

[jira] [Created] (SPARK-27054) Remove Calcite dependency

2019-03-04 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-27054:
---

 Summary: Remove Calcite dependency
 Key: SPARK-27054
 URL: https://issues.apache.org/jira/browse/SPARK-27054
 Project: Spark
  Issue Type: Improvement
  Components: Build, SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


Calcite is only used for 
[runSqlHive|https://github.com/apache/spark/blob/02bbe977abaf7006b845a7e99d612b0235aa0025/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L699-L705]
 when 
{{hive.cbo.enable=true}}([SemanticAnalyzer|https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java#L278-L280]).
So we can disable {{hive.cbo.enable}} and remove Calcite dependency.
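
A minimal sketch of the idea, assuming the flag is forced off in the HiveConf 
that Spark passes to Hive; where exactly this would be applied inside 
HiveClientImpl is an assumption.

{code:scala}
import org.apache.hadoop.hive.conf.HiveConf

// Force the Calcite-backed CBO path off in the HiveConf handed to Hive.
// HiveConf extends Hadoop's Configuration, so the plain string key works.
def disableHiveCbo(hiveConf: HiveConf): Unit = {
  hiveConf.setBoolean("hive.cbo.enable", false)
}
{code}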



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27020) Unable to insert data with partial dynamic partition with Spark & Hive 3

2019-03-04 Thread Truong Duc Kien (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784047#comment-16784047
 ] 

Truong Duc Kien commented on SPARK-27020:
-

Hi, here are the commands to reproduce the issue on my cluster. All commands are 
executed using spark-sql without any additional parameters.
{code:sql}
create database test_spark;
use test_spark;
create external table test_insert(a int) partitioned by (part_a string, part_b 
string) stored as parquet location 
'/apps/spark/warehouse/test_spark.db/test_insert';
{code}
{code:sql}
// OK
> insert into table test_insert partition(part_a='a', part_b='b') values(1); 
{code}
{code:sql}
// OK
> insert into table test_insert partition(part_a, part_b) values(2, 'a' , 'b'); 
..
19/03/05 11:17:29 INFO Hive: New loading path = 
hdfs://datalake/apps/spark/warehouse/test_spark.db/test_insert/.hive-staging_hive_2019-03-05_11-17-29_547_8053153849357088752-1/-ext-1/part_a=a/part_b=b
 with partSpec {part_a=a, part_b=b}
19/03/05 11:17:30 INFO Hive: Loaded 1 partitions
Time taken: 0.71 seconds
...
{code}
{code:sql}
// Not OK
> insert into table test_insert partition(part_a='a', part_b) values (3, 'b'); 
...
19/03/05 11:19:21 WARN warehouse: Cannot create partition spec from 
hdfs://datalake/; missing keys [part_a]
19/03/05 11:19:21 WARN FileOperations: Ignoring invalid DP directory 
hdfs://datalake/apps/spark/warehouse/test_spark.db/test_insert/.hive-staging_hive_2019-03-05_11-19-21_365_800377896579975615-1/-ext-1/part_b=b
19/03/05 11:19:21 INFO Hive: Loaded 0 partitions
Time taken: 0.466 seconds
...
{code}

> Unable to insert data with partial dynamic partition with Spark & Hive 3
> 
>
> Key: SPARK-27020
> URL: https://issues.apache.org/jira/browse/SPARK-27020
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Hortonwork HDP 3.1.0
> Spark 2.3.2
> Hive 3
>Reporter: Truong Duc Kien
>Priority: Major
>
> When inserting data with dynamic partitions, the operation fails if only some 
> of the partitions are dynamic. For example, the query
> {code:sql}
> insert overwrite table t1 (part_a='a', part_b) select * from t2
> {code}
> fails with errors
> {code:xml}
> Cannot create partition spec from hdfs:/// ; missing keys [part_a]
> Ignoring invalid DP directory 
> {code}
> On the other hand, if I remove the static value of part_a to make the insert 
> fully dynamic, the following query succeeds.
> {code:sql}
> insert overwrite table t1 (part_a, part_b) select * from t2
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27053) How about allowing engineers to use a different ExecutorBackend in StandAlone mode?

2019-03-04 Thread Ross Brigoli (JIRA)
Ross Brigoli created SPARK-27053:


 Summary: How about allowing engineers to use a different 
ExecutorBackend in StandAlone mode?
 Key: SPARK-27053
 URL: https://issues.apache.org/jira/browse/SPARK-27053
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Spark Submit
Affects Versions: 2.3.3
Reporter: Ross Brigoli


In Standalone mode, the command for starting an Executor JVM in is hardcoded to 
use org.apache.spark.executor.CoarseGrainedExecutorBackend. There seems to be 
no way to configure the submit operation to use a custom ExecutorBackend (a 
subclass of CoarseGrainedExecutorBackend).

This is very useful when engineers need to initialize things like starting a 
JDBC connection and Closing JDBC connection once per Executor.

At line 103 of StandaloneSchedulerBackend.scala, why not make the fully 
qualified name of the executor backend class configurable? And then fall back 
to this default executor backend class if it's not configured.

{code:scala}
val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
  args, sc.executorEnvs, classPathEntries ++ testingClassPath,
  libraryPathEntries, javaOpts)
{code}
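A minimal sketch of the proposal (the configuration key used below is hypothetical, invented for illustration; it is not an existing Spark setting): read the backend class name from the application configuration and fall back to the current default when it is not set.
{code:scala}
// Hypothetical change inside StandaloneSchedulerBackend.start(); the key
// "spark.executor.backend.class" is an assumption made up for this sketch.
val backendClass = conf.get(
  "spark.executor.backend.class",
  "org.apache.spark.executor.CoarseGrainedExecutorBackend")

val command = Command(backendClass,
  args, sc.executorEnvs, classPathEntries ++ testingClassPath,
  libraryPathEntries, javaOpts)
{code}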

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27051.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23965
[https://github.com/apache/spark/pull/23965]

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Major
> Fix For: 3.0.0
>
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple CVEs 
> ([https://github.com/FasterXML/jackson-databind/issues/2186]), so we need to 
> bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors

2019-03-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27039.
--
Resolution: Cannot Reproduce

> toPandas with Arrow swallows maxResultSize errors
> -
>
> Key: SPARK-27039
> URL: https://issues.apache.org/jira/browse/SPARK-27039
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Minor
>
> I am running the following simple `toPandas` with {{maxResultSize}} set to 
> 1mb:
> {code:java}
> import pyspark.sql.functions as F
> df = spark.range(1000 * 1000)
> df_pd = df.withColumn("test", F.lit("this is a long string that should make 
> the resulting dataframe too large for maxResult which is 1m")).toPandas()
> {code}
>  
> With {{spark.sql.execution.arrow.enabled}} set to {{true}}, this returns an 
> empty Pandas dataframe without any error:
> {code:python}
> df_pd.info()
> # 
> # Index: 0 entries
> # Data columns (total 2 columns):
> # id  0 non-null object
> # test0 non-null object
> # dtypes: object(2)
> # memory usage: 0.0+ bytes
> {code}
> The driver stderr does have an error, and so does the Spark UI:
> {code:java}
> ERROR TaskSetManager: Total size of serialized results of 1 tasks (52.8 MB) 
> is bigger than spark.driver.maxResultSize (1024.0 KB)
> ERROR TaskSetManager: Total size of serialized results of 2 tasks (105.7 MB) 
> is bigger than spark.driver.maxResultSize (1024.0 KB)
> Exception in thread "serve-Arrow" org.apache.spark.SparkException: Job 
> aborted due to stage failure: Total size of serialized results of 1 tasks 
> (52.8 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
>  at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2039)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2027)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2026)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2026)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
>  at scala.Option.foreach(Option.scala:257)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2260)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2209)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2198)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>  at 
> org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3313)
>  at 
> org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3282)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply$mcV$sp(PythonRDD.scala:435)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:436)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:432)
>  at org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:862)
> {code}
> With {{spark.sql.execution.arrow.enabled}} set to {{false}}, the Python call 
> to {{toPandas}} does fail as expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27025) Speed up toLocalIterator

2019-03-04 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784014#comment-16784014
 ] 

Hyukjin Kwon commented on SPARK-27025:
--

Yes, but there might be many variants of the implementation, and each has a tradeoff, as 
Sean described above.
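One such variant, as a minimal sketch assuming the dataset fits in the cluster's cache: force computation of all partitions up front, then stream them to the driver one at a time.
{code:scala}
import org.apache.spark.storage.StorageLevel

// User-side sketch: count() triggers computation of every partition in parallel,
// after which toLocalIterator only has to download each cached partition in turn.
// `rdd` and `process` are placeholders for the user's data and per-record logic.
val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()
cached.toLocalIterator.foreach(process)
cached.unpersist()
{code}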

> Speed up toLocalIterator
> 
>
> Key: SPARK-27025
> URL: https://issues.apache.org/jira/browse/SPARK-27025
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.3.3
>Reporter: Erik van Oosten
>Priority: Major
>
> Method {{toLocalIterator}} fetches the partitions to the driver one by one. 
> However, as far as I can see, the computation required for a yet-to-be-fetched 
> partition is not kicked off until that partition is fetched, so effectively 
> only one partition is being computed at a time. 
> Desired behavior: immediately start computation of all partitions while 
> retaining the download-one-partition-at-a-time behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27028) PySpark read .dat file. Multiline issue

2019-03-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27028.
--
Resolution: Not A Problem

> PySpark read .dat file. Multiline issue
> ---
>
> Key: SPARK-27028
> URL: https://issues.apache.org/jira/browse/SPARK-27028
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Pyspark(2.4) in AWS EMR
>Reporter: alokchowdary
>Priority: Critical
>
> * I am trying to read a .dat file using the PySpark CSV reader; the data 
> contains newline characters ("\n") inside fields. Spark is unable to read such 
> a record as a single row and instead treats the embedded newline as a new row. 
> I tried using the "multiLine" option while reading, but it still does not work.
>  * {{spark.read.csv(file_path, schema=schema, sep=delimiter, multiLine=True)}}
>  * The data looks something like this; every line below ends up as a separate 
> row in the dataframe.
>  * Here '\x01' is the actual delimiter (',' is used here for ease of reading).
> {{1. name,test,12345,}}
> {{2. x, }}
> {{3. desc }}
> {{4. name2,test2,12345 }}
> {{5. ,y}}
> {{6. ,desc2}}
>  * So PySpark treats x and desc as new rows in the dataframe, with nulls for 
> the other columns.
> How can such data be read in PySpark?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors

2019-03-04 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784012#comment-16784012
 ] 

Hyukjin Kwon commented on SPARK-27039:
--

Given the history, it will be roughly between May and July this year. Not so 
far :). Let me leave this JIRA resolved then per the current status.

> toPandas with Arrow swallows maxResultSize errors
> -
>
> Key: SPARK-27039
> URL: https://issues.apache.org/jira/browse/SPARK-27039
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Minor
>
> I am running the following simple `toPandas` with {{maxResultSize}} set to 
> 1mb:
> {code:java}
> import pyspark.sql.functions as F
> df = spark.range(1000 * 1000)
> df_pd = df.withColumn("test", F.lit("this is a long string that should make 
> the resulting dataframe too large for maxResult which is 1m")).toPandas()
> {code}
>  
> With {{spark.sql.execution.arrow.enabled}} set to {{true}}, this returns an 
> empty Pandas dataframe without any error:
> {code:python}
> df_pd.info()
> # 
> # Index: 0 entries
> # Data columns (total 2 columns):
> # id  0 non-null object
> # test0 non-null object
> # dtypes: object(2)
> # memory usage: 0.0+ bytes
> {code}
> The driver stderr does have an error, and so does the Spark UI:
> {code:java}
> ERROR TaskSetManager: Total size of serialized results of 1 tasks (52.8 MB) 
> is bigger than spark.driver.maxResultSize (1024.0 KB)
> ERROR TaskSetManager: Total size of serialized results of 2 tasks (105.7 MB) 
> is bigger than spark.driver.maxResultSize (1024.0 KB)
> Exception in thread "serve-Arrow" org.apache.spark.SparkException: Job 
> aborted due to stage failure: Total size of serialized results of 1 tasks 
> (52.8 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
>  at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2039)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2027)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2026)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2026)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
>  at scala.Option.foreach(Option.scala:257)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2260)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2209)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2198)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>  at 
> org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3313)
>  at 
> org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3282)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply$mcV$sp(PythonRDD.scala:435)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:436)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:432)
>  at org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:862)
> {code}
> With {{spark.sql.execution.arrow.enabled}} set to {{false}}, the Python call 
> to {{toPandas}} does fail as expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27051:


Assignee: Yanbo Liang

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Major
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple CVEs 
> ([https://github.com/FasterXML/jackson-databind/issues/2186]), so we need to 
> bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27015) spark-submit does not properly escape arguments sent to Mesos dispatcher

2019-03-04 Thread Martin Loncaric (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783944#comment-16783944
 ] 

Martin Loncaric commented on SPARK-27015:
-

Created a PR: https://github.com/apache/spark/pull/23967

> spark-submit does not properly escape arguments sent to Mesos dispatcher
> 
>
> Key: SPARK-27015
> URL: https://issues.apache.org/jira/browse/SPARK-27015
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
> Fix For: 2.5.0, 3.0.0
>
>
> Arguments sent to the dispatcher must be escaped; for instance,
> {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a 
> b$c"{noformat}
> fails, and instead must be submitted as
> {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a\\ 
> b\\$c"{noformat}
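As an illustration only (this is not the fix in the PR above), the effect the extra escaping has to achieve is the same as standard POSIX single-quote escaping of each argument:
{code:scala}
// Hypothetical client-side helper for illustration; the actual fix belongs in the
// dispatcher so that users do not have to pre-escape arguments at all.
def shellEscape(arg: String): String =
  "'" + arg.replace("'", "'\\''") + "'"

// shellEscape("a b$c") yields 'a b$c', which the dispatcher's shell passes through unchanged.
{code}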



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27015) spark-submit does not properly escape arguments sent to Mesos dispatcher

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27015:


Assignee: (was: Apache Spark)

> spark-submit does not properly escape arguments sent to Mesos dispatcher
> 
>
> Key: SPARK-27015
> URL: https://issues.apache.org/jira/browse/SPARK-27015
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
> Fix For: 2.5.0, 3.0.0
>
>
> Arguments sent to the dispatcher must be escaped; for instance,
> {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a 
> b$c"{noformat}
> fails, and instead must be submitted as
> {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a\\ 
> b\\$c"{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27015) spark-submit does not properly escape arguments sent to Mesos dispatcher

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27015:


Assignee: Apache Spark

> spark-submit does not properly escape arguments sent to Mesos dispatcher
> 
>
> Key: SPARK-27015
> URL: https://issues.apache.org/jira/browse/SPARK-27015
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Martin Loncaric
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.5.0, 3.0.0
>
>
> Arguments sent to the dispatcher must be escaped; for instance,
> {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a 
> b$c"{noformat}
> fails, and instead must be submitted as
> {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a\\ 
> b\\$c"{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27048) A way to execute functions on Executor Startup and Executor Exit in Standalone

2019-03-04 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-27048:
--
Target Version/s:   (was: 2.4.0)

> A way to execute functions on Executor Startup and Executor Exit in Standalone
> --
>
> Key: SPARK-27048
> URL: https://issues.apache.org/jira/browse/SPARK-27048
> Project: Spark
>  Issue Type: Wish
>  Components: Deploy, Spark Submit
>Affects Versions: 2.3.1, 2.3.3
>Reporter: Ross Brigoli
>Priority: Major
>  Labels: usability
>
> *Background*
> We have a Spark Standalone ETL workload that is heavily dependent on the Apache 
> Ignite KV store for lookup/reference data. There are hundreds (400+) of lookup 
> datasets, some with up to 300K records. We formerly used broadcast variables 
> but later found that they were not fast enough.
> So we decided to implement a caching mechanism that retrieves reference data from 
> a JDBC source and puts it in memory through Apache Ignite as a replicated cache. 
> Each Spark worker node also runs an Ignite node (JVM), and the Spark executors 
> retrieve the data from Ignite through a "shared memory port". This is very fast 
> but causes instability in the Ignite cluster: when a Spark executor JVM 
> terminates, its Ignite client node is terminated abnormally, which makes the 
> Ignite cluster wait for that client node to reconnect and leaves the cluster 
> non-responsive for a while.
> *Wish*
> We need a way to close the Ignite client node gracefully just before the 
> executor process ends, so a feature that makes it possible to pass an event 
> handler for "executor.onStart" and "executor.exitExecutor()" 
> would be really useful.
> It could be a spark-submit argument or an entry in the spark-defaults.conf 
> that looks something like:
> {{spark.executor.startUpClass=com.company.ExecutorInitializer}}
> {{spark.executor.shutdownClass=com.company.ExecutorCleaner}}
> Each class would implement an interface provided by Spark and be loaded 
> dynamically by the CoarseGrainedExecutorBackend, which would invoke it from its 
> onStart() and exitExecutor() methods respectively.
> This is also useful for opening and closing JDBC connections per executor 
> instead of per partition.
>  
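A minimal sketch of what the proposed hook could look like (the trait, its method names, and the configuration keys quoted above are the reporter's proposal, not an existing Spark API):
{code:scala}
// Hypothetical interface matching the proposal above; none of these names exist in Spark today.
trait ExecutorLifecycleHandler extends Serializable {
  def onExecutorStart(): Unit
  def onExecutorExit(): Unit
}

// Example user implementation: open a JDBC connection once per executor, close it on exit.
class JdbcLifecycleHandler extends ExecutorLifecycleHandler {
  @transient private var connection: java.sql.Connection = _

  override def onExecutorStart(): Unit = {
    connection = java.sql.DriverManager.getConnection("jdbc:postgresql://host/db") // assumed URL
  }

  override def onExecutorExit(): Unit = {
    if (connection != null) connection.close()
  }
}
{code}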



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26205) Optimize InSet expression for bytes, shorts, ints, dates

2019-03-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26205.
---
   Resolution: Fixed
 Assignee: Anton Okolnychyi
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23171 .

> Optimize InSet expression for bytes, shorts, ints, dates
> 
>
> Key: SPARK-26205
> URL: https://issues.apache.org/jira/browse/SPARK-26205
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 3.0.0
>
>
> {{In}} expressions are compiled into a sequence of if-else statements, which 
> results in O\(n\) time complexity. {{InSet}} is an optimized version of 
> {{In}}, which is supposed to improve the performance if the number of 
> elements is big enough. However, {{InSet}} actually degrades the performance 
> in many cases due to various reasons (benchmarks were created in SPARK-26203 
> and solutions to the boxing problem are discussed in SPARK-26204).
> The main idea of this JIRA is to use Java {{switch}} statements to 
> significantly improve the performance of {{InSet}} expressions for bytes, 
> shorts, ints, dates. All {{switch}} statements are compiled into 
> {{tableswitch}} and {{lookupswitch}} bytecode instructions. We will have 
> O\(1\) time complexity if our case values are compact and {{tableswitch}} can 
> be used. Otherwise, {{lookupswitch}} will give us O\(log n\). Our local 
> benchmarks show that this logic is more than two times faster even on 500+ 
> elements than using primitive collections in {{InSet}} expressions. As Spark 
> is using Scala {{HashSet}} right now, the performance gain will be even 
> bigger.
> See 
> [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10]
>  and 
> [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch]
>  for more information.
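As a minimal illustration of the idea (this is not Spark's generated code), a match over a set of integer literals compiles to a {{tableswitch}} or {{lookupswitch}} instead of a chain of comparisons:
{code:scala}
import scala.annotation.switch

// Membership test compiled to a JVM switch: compact case values yield a tableswitch
// (O(1) dispatch); sparse values fall back to a lookupswitch (O(log n)).
def inSet(x: Int): Boolean = (x: @switch) match {
  case 1 | 2 | 3 | 5 | 8 => true
  case _                 => false
}
{code}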



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26016) Encoding not working when using a map / mapPartitions call

2019-03-04 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783885#comment-16783885
 ] 

Hyukjin Kwon commented on SPARK-26016:
--

BTW, IIRC some code paths assume other ASCII-compatible encodings can be supported 
since UTF-8 is ASCII-compatible, but I think it's better to whitelist UTF-8 as the 
only supported encoding.

> Encoding not working when using a map / mapPartitions call
> --
>
> Key: SPARK-26016
> URL: https://issues.apache.org/jira/browse/SPARK-26016
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: Chris Caspanello
>Assignee: Sean Owen
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: spark-sandbox.zip
>
>
> Attached you will find a project with unit tests showing the issue at hand.
> If I read in an ISO-8859-1 encoded file and simply write out what was read, 
> the contents of the part file match what was read, which is great.
> However, the second I use a map / mapPartitions function, it looks like the 
> encoding is not correct.  In addition, a simple collectAsList and writing that 
> list of strings to a file does not work either.  I don't think I'm doing 
> anything wrong.  Can someone please investigate?  I think this is a bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26016) Encoding not working when using a map / mapPartitions call

2019-03-04 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783878#comment-16783878
 ] 

Sean Owen commented on SPARK-26016:
---

It's "Fixed" in the sense that at least we plugged the documentation hole here 
that I am pretty certain explains the issue.
I want to open a new JIRA to consider supporting 'encoding' for the text 
source. It looks straightforward, even.
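Until the text source supports an encoding option, one possible user-side workaround (a sketch, assuming each file is small enough to decode in memory) is to read the raw bytes and decode them explicitly:
{code:scala}
import java.nio.charset.StandardCharsets

// Sketch of a workaround, not the proposed fix: decode ISO-8859-1 bytes explicitly
// instead of going through the text source, which assumes UTF-8.
val lines = spark.sparkContext
  .binaryFiles("/path/to/latin1/files")   // assumed input path
  .flatMap { case (_, stream) =>
    new String(stream.toArray, StandardCharsets.ISO_8859_1).split("\n")
  }
{code}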

> Encoding not working when using a map / mapPartitions call
> --
>
> Key: SPARK-26016
> URL: https://issues.apache.org/jira/browse/SPARK-26016
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: Chris Caspanello
>Assignee: Sean Owen
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: spark-sandbox.zip
>
>
> Attached you will find a project with unit tests showing the issue at hand.
> If I read in an ISO-8859-1 encoded file and simply write out what was read, 
> the contents of the part file match what was read, which is great.
> However, the second I use a map / mapPartitions function, it looks like the 
> encoding is not correct.  In addition, a simple collectAsList and writing that 
> list of strings to a file does not work either.  I don't think I'm doing 
> anything wrong.  Can someone please investigate?  I think this is a bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26016) Encoding not working when using a map / mapPartitions call

2019-03-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26016.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23962
[https://github.com/apache/spark/pull/23962]

> Encoding not working when using a map / mapPartitions call
> --
>
> Key: SPARK-26016
> URL: https://issues.apache.org/jira/browse/SPARK-26016
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: Chris Caspanello
>Assignee: Sean Owen
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: spark-sandbox.zip
>
>
> Attached you will find a project with unit tests showing the issue at hand.
> If I read in an ISO-8859-1 encoded file and simply write out what was read, 
> the contents of the part file match what was read, which is great.
> However, the second I use a map / mapPartitions function, it looks like the 
> encoding is not correct.  In addition, a simple collectAsList and writing that 
> list of strings to a file does not work either.  I don't think I'm doing 
> anything wrong.  Can someone please investigate?  I think this is a bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26016) Encoding not working when using a map / mapPartitions call

2019-03-04 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26016:


Assignee: Sean Owen

> Encoding not working when using a map / mapPartitions call
> --
>
> Key: SPARK-26016
> URL: https://issues.apache.org/jira/browse/SPARK-26016
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: Chris Caspanello
>Assignee: Sean Owen
>Priority: Major
> Attachments: spark-sandbox.zip
>
>
> Attached you will find a project with unit tests showing the issue at hand.
> If I read in an ISO-8859-1 encoded file and simply write out what was read, 
> the contents of the part file match what was read, which is great.
> However, the second I use a map / mapPartitions function, it looks like the 
> encoding is not correct.  In addition, a simple collectAsList and writing that 
> list of strings to a file does not work either.  I don't think I'm doing 
> anything wrong.  Can someone please investigate?  I think this is a bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k

2019-03-04 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783858#comment-16783858
 ] 

Sean Owen commented on SPARK-26947:
---

That doesn't sound "very big" but how big are the vectors you cluster? It looks 
like you're applying CountVectorizer with no vocabSize, so if your input contains 
many different unique strings, your vectors have hundreds of thousands of 
dimensions. Ten thousand of them plus all the overhead could really add up to 
challenge even tens of GB of heap. Here it seems to be running out of memory 
while transferring a copy to/from the Python process.

I'd definitely limit vocabSize or else reconsider how you're clustering. This 
doesn't look like a particular Spark problem.
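For example, capping the vocabulary keeps the vectors (and therefore the k cluster centers) bounded in size; a minimal sketch in the Scala API, with column names and the cap chosen only for illustration (the PySpark CountVectorizer takes the same vocabSize parameter):
{code:scala}
import org.apache.spark.ml.feature.CountVectorizer

// Sketch: limit the vocabulary so each vector has at most 2^17 dimensions.
val vectorizer = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("features")
  .setVocabSize(1 << 17)   // keep only the 131072 most frequent terms
  .setMinDF(2)             // optionally drop terms appearing in fewer than 2 documents
{code}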

> Pyspark KMeans Clustering job fails on large values of k
> 
>
> Key: SPARK-26947
> URL: https://issues.apache.org/jira/browse/SPARK-26947
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
> Attachments: clustering_app.py
>
>
> We recently had a case where a user's pyspark job running KMeans clustering 
> was failing for large values of k. I was able to reproduce the same issue 
> with dummy dataset. I have attached the code as well as the data in the JIRA. 
> The stack trace is printed below from Java:
>  
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3332)
>   at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
>   at java.lang.StringBuilder.append(StringBuilder.java:202)
>   at py4j.Protocol.getOutputCommand(Protocol.java:328)
>   at py4j.commands.CallCommand.execute(CallCommand.java:81)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> Python:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 985, in send_command
> response = connection.send_command(command)
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
>   File "clustering_app.py", line 154, in 
> main(args)
>   File "clustering_app.py", line 145, in main
> run_clustering(sc, args.input_path, args.output_path, 
> args.num_clusters_list)
>   File "clustering_app.py", line 136, in run_clustering
> clustersTable, cluster_Centers = clustering(sc, documents, output_path, 
> k, max_iter)
>   File "clustering_app.py", line 68, in clustering
> cluster_Centers = km_model.clusterCenters()
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py",
>  line 337, in clusterCenters
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py",
>  line 55, in _call_java
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py",
>  line 109, in _java2py
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling 
> z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf 
> spark.executor.memory=20g --conf spark.driver.memory=20g --conf 
> spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf 
> spark.kryoserializer.buffer.max=2000m --conf 

[jira] [Commented] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k

2019-03-04 Thread Parth Gandhi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783849#comment-16783849
 ] 

Parth Gandhi commented on SPARK-26947:
--

[~srowen] for this particular case, k is set to 1. The input data size is 90 MB 
and memory is set to 20g (both driver and executor). [~mgaido] I will try doing 
that and let you know.

> Pyspark KMeans Clustering job fails on large values of k
> 
>
> Key: SPARK-26947
> URL: https://issues.apache.org/jira/browse/SPARK-26947
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
> Attachments: clustering_app.py
>
>
> We recently had a case where a user's pyspark job running KMeans clustering 
> was failing for large values of k. I was able to reproduce the same issue 
> with dummy dataset. I have attached the code as well as the data in the JIRA. 
> The stack trace is printed below from Java:
>  
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3332)
>   at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
>   at java.lang.StringBuilder.append(StringBuilder.java:202)
>   at py4j.Protocol.getOutputCommand(Protocol.java:328)
>   at py4j.commands.CallCommand.execute(CallCommand.java:81)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> Python:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 985, in send_command
> response = connection.send_command(command)
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
>   File "clustering_app.py", line 154, in 
> main(args)
>   File "clustering_app.py", line 145, in main
> run_clustering(sc, args.input_path, args.output_path, 
> args.num_clusters_list)
>   File "clustering_app.py", line 136, in run_clustering
> clustersTable, cluster_Centers = clustering(sc, documents, output_path, 
> k, max_iter)
>   File "clustering_app.py", line 68, in clustering
> cluster_Centers = km_model.clusterCenters()
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py",
>  line 337, in clusterCenters
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py",
>  line 55, in _call_java
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py",
>  line 109, in _java2py
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling 
> z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf 
> spark.executor.memory=20g --conf spark.driver.memory=20g --conf 
> spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf 
> spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g 
> ~/clustering_app.py --input_path hdfs:///user/username/part-v001x 
> --output_path hdfs:///user/username --num_clusters_list 1
> {code}
> The input dataset is approximately 90 MB in size and the assigned heap memory 
> to both driver and executor is close to 20 GB. This only happens for large 
> values of k.



--
This message was sent by 

[jira] [Commented] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-03-04 Thread Jean Georges Perrin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783834#comment-16783834
 ] 

Jean Georges Perrin commented on SPARK-26972:
-

[~srowen], [~hyukjin.kwon] - thanks guys for dealing with a rookie! 

I'll do my best to give it a try against master; however:
 # the non-case sensitivity becoming case sensitivity: is that scheduled for 
v3.0 or already in v2.4.x?
 # I double-checked the output when you specify the schema: 

in 2.1.3, it crashes:
{code:java}
2019-03-04 17:17:41.854 -ERROR --- [rker for task 0] 
Logging$class.logError(Logging.scala:91): Exception in task 0.0 in stage 0.0 
(TID 0)
java.lang.NumberFormatException: For input string: "An independent study by 
Jean Georges Perrin, IIUG Board Member*"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at 
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:252)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:125)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-03-04 17:17:41.876 -ERROR --- [result-getter-0] 
Logging$class.logError(Logging.scala:70): Task 0 in stage 0.0 failed 1 times; 
aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost 
task 0.0 in stage 0.0 (TID 0, localhost, executor driver): 
java.lang.NumberFormatException: For input string: "An independent study by 
Jean Georges Perrin, IIUG Board Member*"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at 
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:252)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:125)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 

[jira] [Assigned] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Yanbo Liang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-27051:
---

Assignee: (was: Yanbo Liang)

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Priority: Major
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple CVEs 
> ([https://github.com/FasterXML/jackson-databind/issues/2186]), so we need 
> to bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf

2019-03-04 Thread Martin Loncaric (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783810#comment-16783810
 ] 

Martin Loncaric commented on SPARK-26192:
-

[~dongjoon] Thanks, I will pay more attention to those fields.

However, I believe this is a bug. It violates behavior specified in the 
https://spark.apache.org/docs/latest/running-on-mesos.html#configuration. Can 
we merge into 2.4.1 as well?

> MesosClusterScheduler reads options from dispatcher conf instead of 
> submission conf
> ---
>
> Key: SPARK-26192
> URL: https://issues.apache.org/jira/browse/SPARK-26192
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Martin Loncaric
>Assignee: Martin Loncaric
>Priority: Minor
> Fix For: 3.0.0
>
>
> There is at least one option accessed in MesosClusterScheduler that should 
> come from the submission's configuration instead of the dispatcher's:
> spark.mesos.fetcherCache.enable
> Coincidentally, the spark.mesos.fetcherCache.enable option was previously 
> misnamed, as referenced in the linked JIRA.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27052) Using PySpark udf in transform yields NULL values

2019-03-04 Thread hejsgpuom62c (JIRA)
hejsgpuom62c created SPARK-27052:


 Summary: Using PySpark udf in transform yields NULL values
 Key: SPARK-27052
 URL: https://issues.apache.org/jira/browse/SPARK-27052
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.4.0
Reporter: hejsgpuom62c


Steps to reproduce
{code:java}

from typing import Optional
from pyspark.sql.functions import expr

def f(x: Optional[int]) -> Optional[int]:
    return x + 1 if x is not None else None

spark.udf.register('f', f, "integer")

df = (spark
    .createDataFrame([(1, [1, 2, 3])], ("id", "xs"))
    .withColumn("xsinc", expr("transform(xs, x -> f(x))")))

df.show()

# +---+-+-+
# | id|   xs|xsinc|
# +---+-+-+
# |  1|[1, 2, 3]| [,,]|
# +---+-+-+


{code}
 

Source https://stackoverflow.com/a/53762650



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf

2019-03-04 Thread Martin Loncaric (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783810#comment-16783810
 ] 

Martin Loncaric edited comment on SPARK-26192 at 3/4/19 9:51 PM:
-

[~dongjoon] Thanks, I will pay more attention to those fields.

However, I believe this is a bug. It violates behavior specified in 
https://spark.apache.org/docs/latest/running-on-mesos.html#configuration. Can 
we merge into at least 2.4.1 as well?


was (Author: mwlon):
[~dongjoon] Thanks, I will pay more attention to those fields.

However, I believe this is a bug. It violates behavior specified in 
https://spark.apache.org/docs/latest/running-on-mesos.html#configuration. Can 
we merge into 2.4.1 as well?

> MesosClusterScheduler reads options from dispatcher conf instead of 
> submission conf
> ---
>
> Key: SPARK-26192
> URL: https://issues.apache.org/jira/browse/SPARK-26192
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Martin Loncaric
>Assignee: Martin Loncaric
>Priority: Minor
> Fix For: 3.0.0
>
>
> There is at least one option accessed in MesosClusterScheduler that should 
> come from the submission's configuration instead of the dispatcher's:
> spark.mesos.fetcherCache.enable
> Coincidentally, the spark.mesos.fetcherCache.enable option was previously 
> misnamed, as referenced in the linked JIRA.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27014) Support removal of jars and Spark binaries from Mesos driver and executor sandboxes

2019-03-04 Thread Martin Loncaric (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783811#comment-16783811
 ] 

Martin Loncaric commented on SPARK-27014:
-

Sure, will keep that in mind.

> Support removal of jars and Spark binaries from Mesos driver and executor 
> sandboxes
> ---
>
> Key: SPARK-27014
> URL: https://issues.apache.org/jira/browse/SPARK-27014
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Martin Loncaric
>Priority: Minor
>
> Currently, each Spark application run on Mesos leaves behind at least 500MB 
> of data in sandbox directories, coming from Spark binaries and copied URIs. 
> These can build up as a disk leak, causing major issues on Mesos clusters 
> unless their grace period for sandbox directories is very short.
> Spark should have a feature to delete these (from both driver and executor 
> sandboxes) on teardown.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27051:


Assignee: (was: Apache Spark)

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Priority: Major
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple CVEs 
> ([https://github.com/FasterXML/jackson-databind/issues/2186]), so we need to 
> bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27051:


Assignee: Apache Spark

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Major
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple CVEs 
> ([https://github.com/FasterXML/jackson-databind/issues/2186]), so we need to 
> bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf

2019-03-04 Thread Martin Loncaric (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783810#comment-16783810
 ] 

Martin Loncaric edited comment on SPARK-26192 at 3/4/19 9:49 PM:
-

[~dongjoon] Thanks, I will pay more attention to those fields.

However, I believe this is a bug. It violates behavior specified in 
https://spark.apache.org/docs/latest/running-on-mesos.html#configuration. Can 
we merge into 2.4.1 as well?


was (Author: mwlon):
[~dongjoon] Thanks, I will pay more attention to those fields.

However, I believe this is a bug. It violates behavior specified in the 
https://spark.apache.org/docs/latest/running-on-mesos.html#configuration. Can 
we merge into 2.4.1 as well?

> MesosClusterScheduler reads options from dispatcher conf instead of 
> submission conf
> ---
>
> Key: SPARK-26192
> URL: https://issues.apache.org/jira/browse/SPARK-26192
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Martin Loncaric
>Assignee: Martin Loncaric
>Priority: Minor
> Fix For: 3.0.0
>
>
> There is at least one option accessed in MesosClusterScheduler that should 
> come from the submission's configuration instead of the dispatcher's:
> spark.mesos.fetcherCache.enable
> Coincidentally, the spark.mesos.fetcherCache.enable option was previously 
> misnamed, as referenced in the linked JIRA.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27050) Bean Encoder serializes data in a wrong order if input schema is not ordered

2019-03-04 Thread hejsgpuom62c (JIRA)
hejsgpuom62c created SPARK-27050:


 Summary: Bean Encoder serializes data in a wrong order if input 
schema is not ordered
 Key: SPARK-27050
 URL: https://issues.apache.org/jira/browse/SPARK-27050
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: hejsgpuom62c


Steps to reproduce. Define schema like this

 
{code:java}

StructType valid = StructType.fromDDL(
  "broker_name string, order integer, server_name string, " + 
  "storages array<struct<timestamp: timestamp, storage: double>>" 
);{code}
{code:java}
package com.example;

import java.io.Serializable;
import lombok.Data;
import lombok.AllArgsConstructor;
import lombok.NoArgsConstructor;

@Data
@NoArgsConstructor
@AllArgsConstructor
public class Entity implements Serializable {
    private String broker_name;
    private String server_name;
    private Integer order;
    private Storage[] storages;
}{code}
{code:java}
package com.example;

import java.io.Serializable;
import lombok.Data;
import lombok.AllArgsConstructor;
import lombok.NoArgsConstructor;

@Data
@NoArgsConstructor
@AllArgsConstructor
public class Storage implements Serializable {
    private java.sql.Timestamp timestamp;
    private Double storage;
}{code}


Create a JSON file with the following content:


{code:java}
[
  {
"broker_name": "A1",
"server_name": "S1",
"order": 1,
"storages": [
  {
"timestamp": "2018-10-29 23:11:44.000",
"storage": 12.5
  }
]
  }
]{code}
 

Process data as


{code:java}
Dataset<Entity> ds = spark.read().option("multiline", "true")
    .schema(valid).json("/path/to/file")
    .as(Encoders.bean(Entity.class));

ds
    .groupByKey((MapFunction<Entity, String>) o -> o.getBroker_name(), 
        Encoders.STRING())
    .reduceGroups((ReduceFunction<Entity>) (e1, e2) -> e1)
    .map((MapFunction<Tuple2<String, Entity>, Entity>) tuple -> tuple._2, 
        Encoders.bean(Entity.class))
    .show(10, false);{code}

The result will be:
{code:java}
+---+-+---++
|broker_name|order|server_name|storages 
   |
+---+-+---++
|A1 |1    |S1 |[[7.612815958429577E-309, 148474-03-19 
22:14:3232.5248]]|
+---+-+---++
{code}

Source https://stackoverflow.com/q/54987724



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Yanbo Liang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-27051:

Description: Fasterxml Jackson version before 2.9.8 is affected by multiple 
CVEs [[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to 
fix bump the dependent Jackson to 2.9.8.  (was: Fasterxml Jackson version 
before 2.9.8 is affected by multiple CVEs 
[[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix 
bump the dependent Jackson version to 2.9.8.)

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Priority: Major
>
> Fasterxml Jackson versions before 2.9.8 are affected by multiple 
> [CVEs|https://github.com/FasterXML/jackson-databind/issues/2186]; we need to 
> bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Yanbo Liang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-27051:

Description: Fasterxml Jackson version before 2.9.8 is affected by multiple 
CVEs [[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to 
fix bump the dependent Jackson version to 2.9.8.  (was: Fasterxml Jackson 
version before 2.9.8 is affected by multiple [CVEs | 
[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix 
bump the dependent Jackson version to 2.9.8.)

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Priority: Major
>
> Fasterxml Jackson versions before 2.9.8 are affected by multiple 
> [CVEs|https://github.com/FasterXML/jackson-databind/issues/2186]; we need to 
> bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Yanbo Liang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-27051:

Description: Fasterxml Jackson version before 2.9.8 is affected by multiple 
[CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186],] we need to 
fix bump the dependent Jackson version to 2.9.8.  (was: Fasterxml Jackson 
version before 2.9.8 is affected by multiple 
[CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186],] we need to 
fix bump the Jackson version to 2.9.8.)

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Priority: Major
>
> Fasterxml Jackson versions before 2.9.8 are affected by multiple 
> [CVEs|https://github.com/FasterXML/jackson-databind/issues/2186]; we need to 
> bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Yanbo Liang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-27051:

Description: Fasterxml Jackson version before 2.9.8 is affected by multiple 
[CVEs | [https://github.com/FasterXML/jackson-databind/issues/2186]], we need 
to fix bump the dependent Jackson version to 2.9.8.  (was: Fasterxml Jackson 
version before 2.9.8 is affected by multiple 
[CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to 
fix bump the dependent Jackson version to 2.9.8.)

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Priority: Major
>
> Fasterxml Jackson versions before 2.9.8 are affected by multiple 
> [CVEs|https://github.com/FasterXML/jackson-databind/issues/2186]; we need to 
> bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Yanbo Liang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-27051:

Description: Fasterxml Jackson version before 2.9.8 is affected by multiple 
[CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to 
fix bump the dependent Jackson version to 2.9.8.  (was: Fasterxml Jackson 
version before 2.9.8 is affected by multiple 
[CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186],] we need to 
fix bump the dependent Jackson version to 2.9.8.)

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Priority: Major
>
> Fasterxml Jackson versions before 2.9.8 are affected by multiple 
> [CVEs|https://github.com/FasterXML/jackson-databind/issues/2186]; we need to 
> bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Yanbo Liang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-27051:
---

Assignee: Yanbo Liang

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Major
>
> Fasterxml Jackson versions before 2.9.8 are affected by multiple 
> [CVEs|https://github.com/FasterXML/jackson-databind/issues/2186]; we need to 
> bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-27051:
---

 Summary: Bump Jackson version to 2.9.8
 Key: SPARK-27051
 URL: https://issues.apache.org/jira/browse/SPARK-27051
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Yanbo Liang


Fasterxml Jackson versions before 2.9.8 are affected by multiple 
[CVEs|https://github.com/FasterXML/jackson-databind/issues/2186]; we need to 
bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25865) Add GC information to ExecutorMetrics

2019-03-04 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-25865.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22874
[https://github.com/apache/spark/pull/22874]

> Add GC information to ExecutorMetrics
> -
>
> Key: SPARK-25865
> URL: https://issues.apache.org/jira/browse/SPARK-25865
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
> Fix For: 3.0.0
>
>
> Memory usage alone, without GC information, is not enough to determine the 
> proper memory settings. Add basic GC information to the ExecutorMetrics 
> interface.
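
For context, the kind of GC numbers mentioned here are available from the JVM's standard MXBeans; the spark-shell sketch below only illustrates that data source and is not necessarily how the merged PR wires it into ExecutorMetrics.

{code:scala}
// Illustrative only: read per-collector GC counts and times from the JVM MXBeans.
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

val gcBeans = ManagementFactory.getGarbageCollectorMXBeans.asScala
val totalGcCount = gcBeans.map(_.getCollectionCount).sum
val totalGcTimeMs = gcBeans.map(_.getCollectionTime).sum
println(s"GC count=$totalGcCount, total GC time=${totalGcTimeMs}ms")
{code}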



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25865) Add GC information to ExecutorMetrics

2019-03-04 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-25865:


Assignee: Lantao Jin

> Add GC information to ExecutorMetrics
> -
>
> Key: SPARK-25865
> URL: https://issues.apache.org/jira/browse/SPARK-25865
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
>
> Memory usage alone, without GC information, is not enough to determine the 
> proper memory settings. Add basic GC information to the ExecutorMetrics 
> interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf

2019-03-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783745#comment-16783745
 ] 

Dongjoon Hyun edited comment on SPARK-26192 at 3/4/19 8:15 PM:
---

[~mwlon]. Thank you for reporting and making a PR. However, please don't set 
`Fix Version`; `Fix Version` and `Target Version` are used in different ways. 
Please refer to the contribution guide.

- https://spark.apache.org/contributing.html

For me, this is a minor improvement for Spark 3.0.


was (Author: dongjoon):
[~mwlon]. Thank you for reporting and making a PR. However, please don't set 
'Fix Versions`. `Fixed Version` and `Target Version` are used in the different 
way. Please refer the contribution guide.

- https://spark.apache.org/contributing.html

For me, this is a minor improvement for Spark 3.0.

> MesosClusterScheduler reads options from dispatcher conf instead of 
> submission conf
> ---
>
> Key: SPARK-26192
> URL: https://issues.apache.org/jira/browse/SPARK-26192
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Martin Loncaric
>Assignee: Martin Loncaric
>Priority: Major
> Fix For: 3.0.0
>
>
> There is at least one option accessed in MesosClusterScheduler that should 
> come from the submission's configuration instead of the dispatcher's:
> spark.mesos.fetcherCache.enable
> Coincidentally, the spark.mesos.fetcherCache.enable option was previously 
> misnamed, as referenced in the linked JIRA.
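
For context, a self-contained sketch of the behaviour described above, using placeholder types rather than the real MesosClusterScheduler API: the option is read from the submitted driver's own Spark properties first, falling back to the dispatcher's configuration.

{code:scala}
// Placeholder types; only illustrates "prefer the submission conf over the
// dispatcher conf" for spark.mesos.fetcherCache.enable.
final case class Submission(sparkProperties: Map[String, String])
final case class DispatcherConf(settings: Map[String, String]) {
  def getBoolean(key: String, default: Boolean): Boolean =
    settings.get(key).map(_.toBoolean).getOrElse(default)
}

def fetcherCacheEnabled(submission: Submission, dispatcher: DispatcherConf): Boolean =
  submission.sparkProperties.get("spark.mesos.fetcherCache.enable")
    .map(_.toBoolean)
    .getOrElse(dispatcher.getBoolean("spark.mesos.fetcherCache.enable", default = false))
{code}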



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf

2019-03-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783745#comment-16783745
 ] 

Dongjoon Hyun edited comment on SPARK-26192 at 3/4/19 8:16 PM:
---

[~mwlon]. Thank you for reporting and making a PR. However, please don't set 
`Fix Version`; `Fix Version` and `Target Version` are used in different ways. 
Please refer to the contribution guide.

- https://spark.apache.org/contributing.html

For me, this is an improvement for Spark 3.0.


was (Author: dongjoon):
[~mwlon]. Thank you for reporting and making a PR. However, please don't set 
'Fix Versions`. `Fixed Version` and `Target Version` are used in the different 
way. Please refer the contribution guide.

- https://spark.apache.org/contributing.html

For me, this is a minor improvement for Spark 3.0.

> MesosClusterScheduler reads options from dispatcher conf instead of 
> submission conf
> ---
>
> Key: SPARK-26192
> URL: https://issues.apache.org/jira/browse/SPARK-26192
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Martin Loncaric
>Assignee: Martin Loncaric
>Priority: Minor
> Fix For: 3.0.0
>
>
> There is at least one option accessed in MesosClusterScheduler that should 
> come from the submission's configuration instead of the dispatcher's:
> spark.mesos.fetcherCache.enable
> Coincidentally, the spark.mesos.fetcherCache.enable option was previously 
> misnamed, as referenced in the linked JIRA.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes

2019-03-04 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-26688:


Assignee: Attila Zsolt Piros

> Provide configuration of initially blacklisted YARN nodes
> -
>
> Key: SPARK-26688
> URL: https://issues.apache.org/jira/browse/SPARK-26688
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
>
> This introduces a new config for initially blacklisted YARN nodes.
> This came up in the apache spark user mailing list: 
> [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html]
>  
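
For illustration only, a sketch of how such a setting would typically be supplied; the property name below is a placeholder, since the real key is defined in the linked PR and is not quoted in this thread.

{code:scala}
// "spark.yarn.initiallyBlacklistedNodes" is a HYPOTHETICAL key used only for
// illustration; check the merged PR for the real configuration name.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("yarn-initial-blacklist-demo")
  .set("spark.yarn.initiallyBlacklistedNodes", "bad-node-1.example.com,bad-node-2.example.com")
{code}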



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf

2019-03-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26192:
--
Priority: Minor  (was: Major)

> MesosClusterScheduler reads options from dispatcher conf instead of 
> submission conf
> ---
>
> Key: SPARK-26192
> URL: https://issues.apache.org/jira/browse/SPARK-26192
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Martin Loncaric
>Assignee: Martin Loncaric
>Priority: Minor
> Fix For: 3.0.0
>
>
> There is at least one option accessed in MesosClusterScheduler that should 
> come from the submission's configuration instead of the dispatcher's:
> spark.mesos.fetcherCache.enable
> Coincidentally, the spark.mesos.fetcherCache.enable option was previously 
> misnamed, as referenced in the linked JIRA.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes

2019-03-04 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-26688.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23616
[https://github.com/apache/spark/pull/23616]

> Provide configuration of initially blacklisted YARN nodes
> -
>
> Key: SPARK-26688
> URL: https://issues.apache.org/jira/browse/SPARK-26688
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.0
>
>
> This introduces a new config for initially blacklisted YARN nodes.
> This came up in the apache spark user mailing list: 
> [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf

2019-03-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783745#comment-16783745
 ] 

Dongjoon Hyun commented on SPARK-26192:
---

[~mwlon]. Thank you for reporting and making a PR. However, please don't set 
`Fix Version`; `Fix Version` and `Target Version` are used in different ways. 
Please refer to the contribution guide.

- https://spark.apache.org/contributing.html

For me, this is a minor improvement for Spark 3.0.

> MesosClusterScheduler reads options from dispatcher conf instead of 
> submission conf
> ---
>
> Key: SPARK-26192
> URL: https://issues.apache.org/jira/browse/SPARK-26192
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Martin Loncaric
>Assignee: Martin Loncaric
>Priority: Major
> Fix For: 3.0.0
>
>
> There is at least one option accessed in MesosClusterScheduler that should 
> come from the submission's configuration instead of the dispatcher's:
> spark.mesos.fetcherCache.enable
> Coincidentally, the spark.mesos.fetcherCache.enable option was previously 
> misnamed, as referenced in the linked JIRA.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf

2019-03-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26192.
---
   Resolution: Fixed
 Assignee: Martin Loncaric
Fix Version/s: (was: 2.3.4)
   (was: 2.4.1)

This is resolved via https://github.com/apache/spark/pull/23924 .

> MesosClusterScheduler reads options from dispatcher conf instead of 
> submission conf
> ---
>
> Key: SPARK-26192
> URL: https://issues.apache.org/jira/browse/SPARK-26192
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Martin Loncaric
>Assignee: Martin Loncaric
>Priority: Major
> Fix For: 3.0.0
>
>
> There is at least one option accessed in MesosClusterScheduler that should 
> come from the submission's configuration instead of the dispatcher's:
> spark.mesos.fetcherCache.enable
> Coincidentally, the spark.mesos.fetcherCache.enable option was previously 
> misnamed, as referenced in the linked JIRA.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf

2019-03-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26192:
--
Issue Type: Improvement  (was: Bug)

> MesosClusterScheduler reads options from dispatcher conf instead of 
> submission conf
> ---
>
> Key: SPARK-26192
> URL: https://issues.apache.org/jira/browse/SPARK-26192
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Martin Loncaric
>Assignee: Martin Loncaric
>Priority: Major
> Fix For: 3.0.0
>
>
> There is at least one option accessed in MesosClusterScheduler that should 
> come from the submission's configuration instead of the dispatcher's:
> spark.mesos.fetcherCache.enable
> Coincidentally, the spark.mesos.fetcherCache.enable option was previously 
> misnamed, as referenced in the linked JIRA.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26961) Found Java-level deadlock in Spark Driver

2019-03-04 Thread Mi Zi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783727#comment-16783727
 ] 

Mi Zi commented on SPARK-26961:
---

Hi Ajith,

IMHO ClassLoader.registerAsParallelCapable() only helps to reduce the 
granularity of the lock. The lock will still be shared by loadClass calls with 
the same "className", so theoretically a deadlock can still be triggered in 
certain cases.

> Found Java-level deadlock in Spark Driver
> -
>
> Key: SPARK-26961
> URL: https://issues.apache.org/jira/browse/SPARK-26961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Rong Jialei
>Priority: Major
>
> Our Spark job usually finishes in minutes; recently, however, we found it 
> taking days to run, and we could only kill it when this happened.
> An investigation showed that no worker container could connect to the driver 
> after starting, and the driver was hanging; using jstack, we found a Java-level deadlock.
>  
> *Jstack output for deadlock part is showing below:*
>  
> Found one Java-level deadlock:
> =
> "SparkUI-907":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> "ForkJoinPool-1-worker-57":
>  waiting to lock monitor 0x7f3860574298 (object 0x0005b7991168, a 
> org.apache.spark.util.MutableURLClassLoader),
>  which is held by "ForkJoinPool-1-worker-7"
> "ForkJoinPool-1-worker-7":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> Java stack information for the threads listed above:
> ===
> "SparkUI-907":
>  at org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1328)
>  - waiting to lock <0x0005c0c1e5e0> (a 
> org.apache.hadoop.conf.Configuration)
>  at 
> org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:684)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1088)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1145)
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2363)
>  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
>  at 
> org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
>  at java.net.URL.getURLStreamHandler(URL.java:1142)
>  at java.net.URL.(URL.java:599)
>  at java.net.URL.(URL.java:490)
>  at java.net.URL.(URL.java:439)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doRequest(JettyUtils.scala:176)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doGet(JettyUtils.scala:161)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:171)
>  at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>  at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>  at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>  at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>  at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>  at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>  at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>  at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>  at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>  at org.spark_project.jetty.server.Server.handle(Server.java:534)
>  at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:320)
>  at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>  at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>  at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
>  at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>  at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>  at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>  at 

[jira] [Assigned] (SPARK-24120) Show `Jobs` page when `jobId` is missing

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24120:


Assignee: (was: Apache Spark)

> Show `Jobs` page when `jobId` is missing
> 
>
> Key: SPARK-24120
> URL: https://issues.apache.org/jira/browse/SPARK-24120
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Jongyoul Lee
>Priority: Minor
>
> Currently, when users open the {{job}} page without a {{jobid}}, the Spark UI 
> shows only an error page. That is not incorrect, but it is unhelpful; it would 
> be better to redirect to the `jobs` page so users can select the proper job. 
> This actually happens when users run in YARN mode: because of a YARN bug 
> (YARN-6615), some parameters aren't passed to Spark's driver UI with the 
> latest version of YARN. It's also mentioned in SPARK-20772.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24120) Show `Jobs` page when `jobId` is missing

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24120:


Assignee: Apache Spark

> Show `Jobs` page when `jobId` is missing
> 
>
> Key: SPARK-24120
> URL: https://issues.apache.org/jira/browse/SPARK-24120
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Jongyoul Lee
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, when users open the {{job}} page without a {{jobid}}, the Spark UI 
> shows only an error page. That is not incorrect, but it is unhelpful; it would 
> be better to redirect to the `jobs` page so users can select the proper job. 
> This actually happens when users run in YARN mode: because of a YARN bug 
> (YARN-6615), some parameters aren't passed to Spark's driver UI with the 
> latest version of YARN. It's also mentioned in SPARK-20772.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors

2019-03-04 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783711#comment-16783711
 ] 

peay commented on SPARK-27039:
--

Interesting, thanks for checking. Yes, I can definitely live without that until 
3.0. Is there already a timeline for 3.0?

> toPandas with Arrow swallows maxResultSize errors
> -
>
> Key: SPARK-27039
> URL: https://issues.apache.org/jira/browse/SPARK-27039
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Minor
>
> I am running the following simple `toPandas` with {{maxResultSize}} set to 
> 1mb:
> {code:java}
> import pyspark.sql.functions as F
> df = spark.range(1000 * 1000)
> df_pd = df.withColumn("test", F.lit("this is a long string that should make 
> the resulting dataframe too large for maxResult which is 1m")).toPandas()
> {code}
>  
> With {{spark.sql.execution.arrow.enabled}} set to {{true}}, this returns an 
> empty Pandas dataframe without any error:
> {code:python}
> df_pd.info()
> # 
> # Index: 0 entries
> # Data columns (total 2 columns):
> # id  0 non-null object
> # test0 non-null object
> # dtypes: object(2)
> # memory usage: 0.0+ bytes
> {code}
> The driver stderr does have an error, and so does the Spark UI:
> {code:java}
> ERROR TaskSetManager: Total size of serialized results of 1 tasks (52.8 MB) 
> is bigger than spark.driver.maxResultSize (1024.0 KB)
> ERROR TaskSetManager: Total size of serialized results of 2 tasks (105.7 MB) 
> is bigger than spark.driver.maxResultSize (1024.0 KB)
> Exception in thread "serve-Arrow" org.apache.spark.SparkException: Job 
> aborted due to stage failure: Total size of serialized results of 1 tasks 
> (52.8 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
>  at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2039)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2027)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2026)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2026)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
>  at scala.Option.foreach(Option.scala:257)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2260)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2209)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2198)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>  at 
> org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3313)
>  at 
> org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3282)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply$mcV$sp(PythonRDD.scala:435)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:436)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:432)
>  at org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:862)
> {code}
> With {{spark.sql.execution.arrow.enabled}} set to {{false}}, the Python call 
> to {{toPandas}} does fail as expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25564:


Assignee: Apache Spark

> Add output bytes metrics for each Executor
> --
>
> Key: SPARK-25564
> URL: https://issues.apache.org/jira/browse/SPARK-25564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Minor
>
> LiveExecutor only tracks the total input bytes. The total output bytes for 
> each executor are equally important.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25564:


Assignee: (was: Apache Spark)

> Add output bytes metrics for each Executor
> --
>
> Key: SPARK-25564
> URL: https://issues.apache.org/jira/browse/SPARK-25564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Lantao Jin
>Priority: Minor
>
> LiveExecutor only tracks the total input bytes. The total output bytes for 
> each executor are equally important.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24120) Show `Jobs` page when `jobId` is missing

2019-03-04 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24120:
--

Assignee: Marcelo Vanzin

> Show `Jobs` page when `jobId` is missing
> 
>
> Key: SPARK-24120
> URL: https://issues.apache.org/jira/browse/SPARK-24120
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Jongyoul Lee
>Assignee: Marcelo Vanzin
>Priority: Minor
>
> Currently, when users open the {{job}} page without a {{jobid}}, the Spark UI 
> shows only an error page. That is not incorrect, but it is unhelpful; it would 
> be better to redirect to the `jobs` page so users can select the proper job. 
> This actually happens when users run in YARN mode: because of a YARN bug 
> (YARN-6615), some parameters aren't passed to Spark's driver UI with the 
> latest version of YARN. It's also mentioned in SPARK-20772.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24120) Show `Jobs` page when `jobId` is missing

2019-03-04 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24120:
--

Assignee: (was: Marcelo Vanzin)

> Show `Jobs` page when `jobId` is missing
> 
>
> Key: SPARK-24120
> URL: https://issues.apache.org/jira/browse/SPARK-24120
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Jongyoul Lee
>Priority: Minor
>
> Currently, when users open the {{job}} page without a {{jobid}}, the Spark UI 
> shows only an error page. That is not incorrect, but it is unhelpful; it would 
> be better to redirect to the `jobs` page so users can select the proper job. 
> This actually happens when users run in YARN mode: because of a YARN bug 
> (YARN-6615), some parameters aren't passed to Spark's driver UI with the 
> latest version of YARN. It's also mentioned in SPARK-20772.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25564:


Assignee: Marcelo Vanzin  (was: Apache Spark)

> Add output bytes metrics for each Executor
> --
>
> Key: SPARK-25564
> URL: https://issues.apache.org/jira/browse/SPARK-25564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Lantao Jin
>Assignee: Marcelo Vanzin
>Priority: Minor
>
> LiveExecutor only tracks the total input bytes. The total output bytes for 
> each executor are equally important.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor

2019-03-04 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25564:
--

Assignee: Marcelo Vanzin

> Add output bytes metrics for each Executor
> --
>
> Key: SPARK-25564
> URL: https://issues.apache.org/jira/browse/SPARK-25564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Lantao Jin
>Assignee: Marcelo Vanzin
>Priority: Minor
>
> LiveExecutor only tracks the total input bytes. The total output bytes for 
> each executor are equally important.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor

2019-03-04 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25564:
--

Assignee: Marcelo Vanzin

> Add output bytes metrics for each Executor
> --
>
> Key: SPARK-25564
> URL: https://issues.apache.org/jira/browse/SPARK-25564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Lantao Jin
>Assignee: Marcelo Vanzin
>Priority: Minor
>
> LiveExecutor only tracks the total input bytes. The total output bytes for 
> each executor are equally important.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor

2019-03-04 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25564:
--

Assignee: (was: Marcelo Vanzin)

> Add output bytes metrics for each Executor
> --
>
> Key: SPARK-25564
> URL: https://issues.apache.org/jira/browse/SPARK-25564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Lantao Jin
>Priority: Minor
>
> LiveExecutor only tracks the total input bytes. The total output bytes for 
> each executor are equally important.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25564:


Assignee: Apache Spark  (was: Marcelo Vanzin)

> Add output bytes metrics for each Executor
> --
>
> Key: SPARK-25564
> URL: https://issues.apache.org/jira/browse/SPARK-25564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Minor
>
> LiveExecutor only tracks the total input bytes. The total output bytes for 
> each executor are equally important.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor

2019-03-04 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25564:
--

Assignee: (was: Marcelo Vanzin)

> Add output bytes metrics for each Executor
> --
>
> Key: SPARK-25564
> URL: https://issues.apache.org/jira/browse/SPARK-25564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Lantao Jin
>Priority: Minor
>
> LiveExecutor only tracks the total input bytes. The total output bytes for 
> each executor are equally important.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors

2019-03-04 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783690#comment-16783690
 ] 

Bryan Cutler commented on SPARK-27039:
--

I was able to reproduce in v2.4.0, but it looks like current master raises an 
error in the driver and does not return an empty Pandas DataFrame. This is 
probably due to some of the recent changes in toPandas() with Arrow enabled.

{noformat}
In [4]: spark.conf.set('spark.sql.execution.arrow.enabled', True)

In [5]: import pyspark.sql.functions as F
   ...: df = spark.range(1000 * 1000)
   ...: df_pd = df.withColumn("test", F.lit("this is a long string that should 
make the resulting dataframe too large for maxRe
   ...: sult which is 1m")).toPandas()
   ...: 
19/03/04 10:54:56 ERROR TaskSetManager: Total size of serialized results of 1 
tasks (13.2 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
19/03/04 10:54:56 ERROR TaskSetManager: Total size of serialized results of 2 
tasks (26.4 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
Exception in thread "serve-Arrow" org.apache.spark.SparkException: Job aborted 
due to stage failure: Total size of serialized results of 1 tasks (13.2 MiB) is 
bigger than spark.driver.maxResultSize (1024.0 KiB)
at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1938)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1926)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1925)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1925)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:935)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:935)
at scala.Option.foreach(Option.scala:274)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:935)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2155)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2104)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2093)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:746)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2008)
at 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3300)
at 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3265)
at 
org.apache.spark.api.python.PythonRDD$.$anonfun$serveToStream$2(PythonRDD.scala:442)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319)
at 
org.apache.spark.api.python.PythonRDD$.$anonfun$serveToStream$1(PythonRDD.scala:444)
at 
org.apache.spark.api.python.PythonRDD$.$anonfun$serveToStream$1$adapted(PythonRDD.scala:439)
at 
org.apache.spark.api.python.PythonServer$$anon$3.run(PythonRDD.scala:890)
/home/bryan/git/spark/python/pyspark/sql/dataframe.py:2129: UserWarning: 
toPandas attempted Arrow optimization because 
'spark.sql.execution.arrow.enabled' is set to true, but has reached the error 
below and can not continue. Note that 
'spark.sql.execution.arrow.fallback.enabled' does not have an effect on 
failures in the middle of computation.
  
  warnings.warn(msg)
19/03/04 10:54:56 ERROR TaskSetManager: Total size of serialized results of 3 
tasks (39.6 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
[Stage 0:==>(1 + 7) / 8][Stage 1:>  (0 + 8) / 
8]---
EOFError  Traceback (most recent call last)
 in ()
  1 import pyspark.sql.functions as F
  2 df = spark.range(1000 * 1000)
> 3 df_pd = df.withColumn("test", F.lit("this is a long string that should 
make the resulting dataframe too large for maxResult which is 1m")).toPandas()

/home/bryan/git/spark/python/pyspark/sql/dataframe.pyc in toPandas(self)
   2111 _check_dataframe_localize_timestamps
   2112 import pyarrow
-> 2113 batches = self._collectAsArrow()
   2114 if len(batches) > 0:
   2115 table = 

[jira] [Assigned] (SPARK-26792) Apply custom log URL to Spark UI

2019-03-04 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26792:
--

Assignee: Jungtaek Lim

> Apply custom log URL to Spark UI
> 
>
> Key: SPARK-26792
> URL: https://issues.apache.org/jira/browse/SPARK-26792
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> SPARK-23155 enables SHS to set up custom log URLs for incompleted / completed 
> apps.
> While getting reviews from SPARK-23155, I've got two comments which applying 
> custom log URLs to UI would help achieving it. Quoting these comments here:
> https://github.com/apache/spark/pull/23260#issuecomment-456827963
> {quote}
> Sorry I haven't had time to look through all the code so this might be a 
> separate jira, but one thing I thought of here is it would be really nice not 
> to have specifically stderr/stdout. users can specify any log4j.properties 
> and some tools like oozie by default end up using hadoop log4j rather then 
> spark log4j, so files aren't necessarily the same. Also users can put in 
> other logs files so it would be nice to have links to those from the UI. It 
> seems simpler if we just had a link to the directory and it read the files 
> within there. Other things in Hadoop do it this way, but I'm not sure if that 
> works well for other resource managers, any thoughts on that? As long as this 
> doesn't prevent the above I can file a separate jira for it.
> {quote}
> https://github.com/apache/spark/pull/23260#issuecomment-456904716
> {quote}
> Hi Tom, +1: singling out stdout and stderr is definitely an annoyance. We
> typically configure Spark jobs to write the GC log and dump heap on OOM
> using ,  and/or we use the rolling file appender to deal with
> large logs during debugging. So linking the YARN container log overview
> page would make much more sense for us. We work it around with a custom
> submit process that logs all important URLs on the submit side log.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26792) Apply custom log URL to Spark UI

2019-03-04 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26792.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23790
[https://github.com/apache/spark/pull/23790]

> Apply custom log URL to Spark UI
> 
>
> Key: SPARK-26792
> URL: https://issues.apache.org/jira/browse/SPARK-26792
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-23155 enables SHS to set up custom log URLs for incompleted / completed 
> apps.
> While getting reviews from SPARK-23155, I've got two comments which applying 
> custom log URLs to UI would help achieving it. Quoting these comments here:
> https://github.com/apache/spark/pull/23260#issuecomment-456827963
> {quote}
> Sorry I haven't had time to look through all the code so this might be a 
> separate jira, but one thing I thought of here is it would be really nice not 
> to have specifically stderr/stdout. users can specify any log4j.properties 
> and some tools like oozie by default end up using hadoop log4j rather then 
> spark log4j, so files aren't necessarily the same. Also users can put in 
> other logs files so it would be nice to have links to those from the UI. It 
> seems simpler if we just had a link to the directory and it read the files 
> within there. Other things in Hadoop do it this way, but I'm not sure if that 
> works well for other resource managers, any thoughts on that? As long as this 
> doesn't prevent the above I can file a separate jira for it.
> {quote}
> https://github.com/apache/spark/pull/23260#issuecomment-456904716
> {quote}
> Hi Tom, +1: singling out stdout and stderr is definitely an annoyance. We
> typically configure Spark jobs to write the GC log and dump heap on OOM
> using ,  and/or we use the rolling file appender to deal with
> large logs during debugging. So linking the YARN container log overview
> page would make much more sense for us. We work it around with a custom
> submit process that logs all important URLs on the submit side log.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24135:


Assignee: Apache Spark

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Assignee: Apache Spark
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should be handling the {{Init:Error}} cases regardless if 
> Spark itself is aware of init-containers or not.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up being stuck with the number of executors 
> that successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.
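
For context, a self-contained sketch of the handling described above, using placeholder types rather than the real Kubernetes client or scheduler backend API: pods whose init-containers failed are treated like ERROR/DELETED pods, dropped from the pending pool, and counted towards a new request.

{code:scala}
// Placeholder model; only illustrates "count init-container failures as startup
// failures so the allocator retries them", not the actual Spark/K8s code paths.
final case class PodStatus(phase: String, waitingReason: Option[String])

def isStartupFailure(s: PodStatus): Boolean =
  s.phase == "Failed" || s.phase == "Error" || s.phase == "Deleted" ||
    s.waitingReason.contains("Init:Error")

def reconcile(pendingPods: Set[String],
              statuses: Map[String, PodStatus]): (Set[String], Int) = {
  val failed = pendingPods.filter(id => statuses.get(id).exists(isStartupFailure))
  // Drop failed pods from the pending pool; the second element is how many
  // replacement executors should be requested.
  (pendingPods -- failed, failed.size)
}
{code}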



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25681) Delegation Tokens fetched twice upon HadoopFSDelegationTokenProvider creation

2019-03-04 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25681:
--

Assignee: (was: Marcelo Vanzin)

> Delegation Tokens fetched twice upon HadoopFSDelegationTokenProvider creation
> -
>
> Key: SPARK-25681
> URL: https://issues.apache.org/jira/browse/SPARK-25681
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Mesos, YARN
>Affects Versions: 2.5.0
>Reporter: Ilan Filonenko
>Priority: Major
>  Labels: Hadoop, Kerberos
>
> Looking for a refactor to {{HadoopFSDelegationTokenProvider.}} Within the 
> function {{obtainDelegationTokens()}}:
> This code-block:
> {code:java}
> val fetchCreds = fetchDelegationTokens(getTokenRenewer(hadoopConf),...)
> // Get the token renewal interval if it is not set. It will only be 
> called once.
> if (tokenRenewalInterval == null) {
>   tokenRenewalInterval = getTokenRenewalInterval(...)
> }{code}
>  calls {{fetchDelegationTokens()}} twice, since {{tokenRenewalInterval}} 
> will always be null upon creation of the {{TokenManager}}, which I think is 
> unnecessary in the case of Kubernetes (as you are creating 2 DTs when only 
> one is needed). Could this possibly be refactored to only call 
> {{fetchDelegationTokens()}} once upon startup, or to take a param to specify 
> {{tokenRenewalInterval}}?
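
For context, a minimal self-contained sketch of the "fetch once, reuse" shape being asked for; the names are stand-ins for the real HadoopFSDelegationTokenProvider internals, not Spark's actual API.

{code:scala}
// Stand-in names only; illustrates deriving the renewal interval from the
// credentials that were already fetched instead of fetching a second time.
object FetchOnceSketch {
  final case class Credentials(tokens: Seq[String])

  private var tokenRenewalInterval: Option[Long] = None

  private def fetchDelegationTokens(renewer: String): Credentials =
    Credentials(Seq(s"token-for-$renewer"))            // placeholder for the real RPC

  private def renewalIntervalOf(creds: Credentials): Long = 24L * 60 * 60 * 1000

  def obtainDelegationTokens(renewer: String): Credentials = {
    val creds = fetchDelegationTokens(renewer)         // single fetch
    if (tokenRenewalInterval.isEmpty) {
      // Reuse the credentials we already have rather than fetching again.
      tokenRenewalInterval = Some(renewalIntervalOf(creds))
    }
    creds
  }
}
{code}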



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26995) Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when using snappy

2019-03-04 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26995.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23898
[https://github.com/apache/spark/pull/23898]

> Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when 
> using snappy
> -
>
> Key: SPARK-26995
> URL: https://issues.apache.org/jira/browse/SPARK-26995
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.0.0
>
>
> Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when 
> using snappy.  
> The issue can be reproduced for example as follows: 
> `Seq(1,2).toDF("id").write.format("parquet").save("DELETEME1")`  
> The key part of the error stack is as follows: `Caused by: 
> java.lang.UnsatisfiedLinkError: 
> /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so: 
> Error loading shared library ld-linux-x86-64.so.2: No such file or directory 
> (needed by /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so)`  
> The source of the error appears to be due to the fact that libsnappyjava.so 
> needs ld-linux-x86-64.so.2 and looks for it in /lib, while in Alpine Linux 
> 3.9.0 with libc6-compat version 1.1.20-r3 ld-linux-x86-64.so.2 is located in 
> /lib64.
> Note: this issue is not present with Alpine Linux 3.8 and libc6-compat 
> version 1.1.19-r10 
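
A quick, hedged way to confirm the loader mismatch described above from inside the 
container (a diagnostic sketch run in spark-shell, not part of Spark; the paths are 
the ones mentioned in this report):

{code:scala}
import java.nio.file.{Files, Paths}

// On Alpine 3.9.0 with libc6-compat 1.1.20-r3 the glibc loader is only under
// /lib64, while the bundled libsnappyjava.so looks for it under /lib, which is
// what produces the UnsatisfiedLinkError above.
val inLib   = Files.exists(Paths.get("/lib/ld-linux-x86-64.so.2"))
val inLib64 = Files.exists(Paths.get("/lib64/ld-linux-x86-64.so.2"))
println(s"/lib:   $inLib")
println(s"/lib64: $inLib64")
if (!inLib && inLib64) {
  println("loader only in /lib64 -> snappy JNI load is expected to fail on this image")
}
{code}

The check only confirms the mismatch; the actual fix is the image change in the 
pull request referenced above.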



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25681) Delegation Tokens fetched twice upon HadoopFSDelegationTokenProvider creation

2019-03-04 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25681:
--

Assignee: Marcelo Vanzin

> Delegation Tokens fetched twice upon HadoopFSDelegationTokenProvider creation
> -
>
> Key: SPARK-25681
> URL: https://issues.apache.org/jira/browse/SPARK-25681
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Mesos, YARN
>Affects Versions: 2.5.0
>Reporter: Ilan Filonenko
>Assignee: Marcelo Vanzin
>Priority: Major
>  Labels: Hadoop, Kerberos
>
> Looking for a refactor to {{HadoopFSDelegationTokenProvider.}} Within the 
> function {{obtainDelegationTokens()}}:
> This code-block:
> {code:java}
> val fetchCreds = fetchDelegationTokens(getTokenRenewer(hadoopConf),...)
> // Get the token renewal interval if it is not set.
> // It will only be called once.
> if (tokenRenewalInterval == null) {
>   tokenRenewalInterval = getTokenRenewalInterval(...)
> }{code}
>  calls {{fetchDelegationTokens()}} twice since the {{tokenRenewalInterval}} 
> will always be null upon creation of the {{TokenManager}} which I think is 
> unnecessary in the case of Kubernetes (as you are creating 2 DTs when only 
> one is needed.) Could this possibly be refactored to only call 
> {{fetchDelegationTokens()}} once upon startup or to have a param to specify 
> {{tokenRenewalInterval}}?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25750) Integration Testing for Kerberos Support for Spark on Kubernetes

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25750:


Assignee: Apache Spark

> Integration Testing for Kerberos Support for Spark on Kubernetes
> 
>
> Key: SPARK-25750
> URL: https://issues.apache.org/jira/browse/SPARK-25750
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Assignee: Apache Spark
>Priority: Major
>
> Integration testing for Secure HDFS interaction for Spark on Kubernetes. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25750) Integration Testing for Kerberos Support for Spark on Kubernetes

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25750:


Assignee: (was: Apache Spark)

> Integration Testing for Kerberos Support for Spark on Kubernetes
> 
>
> Key: SPARK-25750
> URL: https://issues.apache.org/jira/browse/SPARK-25750
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> Integration testing for Secure HDFS interaction for Spark on Kubernetes. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25750) Integration Testing for Kerberos Support for Spark on Kubernetes

2019-03-04 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25750:
--

Assignee: Marcelo Vanzin

> Integration Testing for Kerberos Support for Spark on Kubernetes
> 
>
> Key: SPARK-25750
> URL: https://issues.apache.org/jira/browse/SPARK-25750
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Assignee: Marcelo Vanzin
>Priority: Major
>
> Integration testing for Secure HDFS interaction for Spark on Kubernetes. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25750) Integration Testing for Kerberos Support for Spark on Kubernetes

2019-03-04 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25750:
--

Assignee: (was: Marcelo Vanzin)

> Integration Testing for Kerberos Support for Spark on Kubernetes
> 
>
> Key: SPARK-25750
> URL: https://issues.apache.org/jira/browse/SPARK-25750
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> Integration testing for Secure HDFS interaction for Spark on Kubernetes. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24135:


Assignee: (was: Apache Spark)

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should handle the {{Init:Error}} cases regardless of whether 
> Spark itself is aware of init-containers or not.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up stuck with the number of executors 
> that successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.
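
A hedged sketch of how such pods could be recognized (illustrative only, not the 
actual KubernetesClusterSchedulerBackend change; it assumes the fabric8 Kubernetes 
client model that Spark uses and a hypothetical helper name):

{code:scala}
import io.fabric8.kubernetes.api.model.Pod
import scala.collection.JavaConverters._

// Hypothetical helper: a pod with an init container that terminated with a
// non-zero exit code (what kubectl shows as Init:Error) should be treated as a
// failed executor so the allocator can request a replacement instead of leaving
// it pending forever.
def hasFailedInitContainer(pod: Pod): Boolean = {
  val initStatuses = Option(pod.getStatus)
    .flatMap(s => Option(s.getInitContainerStatuses))
    .map(_.asScala.toSeq)
    .getOrElse(Seq.empty)
  initStatuses.exists { status =>
    Option(status.getState)
      .flatMap(s => Option(s.getTerminated))
      .exists(t => Option(t.getExitCode).exists(_.intValue != 0))
  }
}
{code}

Pods matching this check could then go through the same removal-and-retry path 
that already handles the {{ERROR}} and {{DELETED}} states.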



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25681) Delegation Tokens fetched twice upon HadoopFSDelegationTokenProvider creation

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25681:


Assignee: Apache Spark

> Delegation Tokens fetched twice upon HadoopFSDelegationTokenProvider creation
> -
>
> Key: SPARK-25681
> URL: https://issues.apache.org/jira/browse/SPARK-25681
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Mesos, YARN
>Affects Versions: 2.5.0
>Reporter: Ilan Filonenko
>Assignee: Apache Spark
>Priority: Major
>  Labels: Hadoop, Kerberos
>
> Looking for a refactor to {{HadoopFSDelegationTokenProvider.}} Within the 
> function {{obtainDelegationTokens()}}:
> This code-block:
> {code:java}
> val fetchCreds = fetchDelegationTokens(getTokenRenewer(hadoopConf),...)
> // Get the token renewal interval if it is not set.
> // It will only be called once.
> if (tokenRenewalInterval == null) {
>   tokenRenewalInterval = getTokenRenewalInterval(...)
> }{code}
>  calls {{fetchDelegationTokens()}} twice since the {{tokenRenewalInterval}} 
> will always be null upon creation of the {{TokenManager}} which I think is 
> unnecessary in the case of Kubernetes (as you are creating 2 DTs when only 
> one is needed.) Could this possibly be refactored to only call 
> {{fetchDelegationTokens()}} once upon startup or to have a param to specify 
> {{tokenRenewalInterval}}?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25681) Delegation Tokens fetched twice upon HadoopFSDelegationTokenProvider creation

2019-03-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25681:


Assignee: (was: Apache Spark)

> Delegation Tokens fetched twice upon HadoopFSDelegationTokenProvider creation
> -
>
> Key: SPARK-25681
> URL: https://issues.apache.org/jira/browse/SPARK-25681
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Mesos, YARN
>Affects Versions: 2.5.0
>Reporter: Ilan Filonenko
>Priority: Major
>  Labels: Hadoop, Kerberos
>
> Looking for a refactor to {{HadoopFSDelegationTokenProvider.}} Within the 
> function {{obtainDelegationTokens()}}:
> This code-block:
> {code:java}
> val fetchCreds = fetchDelegationTokens(getTokenRenewer(hadoopConf),...)
> // Get the token renewal interval if it is not set.
> // It will only be called once.
> if (tokenRenewalInterval == null) {
>   tokenRenewalInterval = getTokenRenewalInterval(...)
> }{code}
>  calls {{fetchDelegationTokens()}} twice since the {{tokenRenewalInterval}} 
> will always be null upon creation of the {{TokenManager}} which I think is 
> unnecessary in the case of Kubernetes (as you are creating 2 DTs when only 
> one is needed.) Could this possibly be refactored to only call 
> {{fetchDelegationTokens()}} once upon startup or to have a param to specify 
> {{tokenRenewalInterval}}?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2019-03-04 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24135:
--

Assignee: Marcelo Vanzin

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Assignee: Marcelo Vanzin
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should handle the {{Init:Error}} cases regardless of whether 
> Spark itself is aware of init-containers or not.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up stuck with the number of executors 
> that successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


