[jira] [Commented] (SPARK-34684) Hadoop config could not be successfully serialized from driver pods to executor pods

2021-03-23 Thread shanyu zhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307375#comment-17307375
 ] 

shanyu zhao commented on SPARK-34684:
-

[~attilapiros] What if we want to connect to HDFS HA with something like 
hdfs://mycluster/...? We need to deliver hdfs-site.xml to the driver or 
executor pods, right? Also, we'd like to control the storage client's (Hadoop 
file system) behavior with the configuration file.
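For example, resolving an HA nameservice such as hdfs://mycluster/ requires 
client-side properties along these lines in hdfs-site.xml (a minimal sketch with 
hypothetical host names), and the executor pods need to see them too:
{code:xml}
<!-- Minimal HA client config sketch; nameservice and hosts are hypothetical -->
<configuration>
  <property><name>dfs.nameservices</name><value>mycluster</value></property>
  <property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>namenode1.example.com:8020</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>namenode2.example.com:8020</value></property>
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>
{code}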

> Hadoop config could not be successfully serialized from driver pods to 
> executor pods
> ---
>
> Key: SPARK-34684
> URL: https://issues.apache.org/jira/browse/SPARK-34684
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Yue Peng
>Priority: Major
>
> I have set HADOOP_CONF_DIR correctly. And I have verified that the hadoop 
> configs have been stored into a configmap and mounted to the driver. However, 
> the Spark Pi example job keeps failing because the executor does not know how 
> to talk to HDFS. I highly suspect that a bug is causing it, because manually 
> creating a configmap storing the hadoop configs and mounting it to the executor 
> in the template file fixes the error. 
>  
> Spark submit command:
> /opt/spark-3.0/bin/spark-submit --class org.apache.spark.examples.SparkPi 
> --deploy-mode cluster --master k8s://https://10.***.18.96:6443 
> --num-executors 1 --conf spark.kubernetes.namespace=test --conf 
> spark.kubernetes.container.image= --conf 
> spark.kubernetes.driver.podTemplateFile=/opt/spark-3.0/conf/spark-driver.template
>  --conf 
> spark.kubernetes.executor.podTemplateFile=/opt/spark-3.0/conf/spark-executor.template
>   --conf spark.kubernetes.file.upload.path=/opt/spark-3.0/examples/jars 
> hdfs:///tmp/spark-examples_2.12-3.0.125067.jar 1000
>  
>  
> Error log:
>  
> 21/03/10 06:59:58 INFO TransportClientFactory: Successfully created 
> connection to 
> org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078
>  after 608 ms (392 ms spent in bootstraps)
> 21/03/10 06:59:58 INFO SecurityManager: Changing view acls to: root
> 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls to: root
> 21/03/10 06:59:58 INFO SecurityManager: Changing view acls groups to:
> 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls groups to:
> 21/03/10 06:59:58 INFO SecurityManager: SecurityManager: authentication 
> enabled; ui acls disabled; users with view permissions: Set(root); groups 
> with view permissions: Set(); users with modify permissions: Set(root); 
> groups with modify permissions: Set()
> 21/03/10 06:59:59 INFO TransportClientFactory: Successfully created 
> connection to 
> org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078
>  after 130 ms (104 ms spent in bootstraps)
> 21/03/10 06:59:59 INFO DiskBlockManager: Created local directory at 
> /var/data/spark-0f541e3d-994f-4c7a-843f-f7dac57dfc13/blockmgr-981cfb62-5b27-4d1a-8fbd-eddb466faf1d
> 21/03/10 06:59:59 INFO MemoryStore: MemoryStore started with capacity 2047.2 
> MiB
> 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Connecting to driver: 
> spark://coarsegrainedschedu...@org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc:7078
> 21/03/10 06:59:59 INFO ResourceUtils: 
> ==
> 21/03/10 06:59:59 INFO ResourceUtils: Resources for spark.executor:
> 21/03/10 06:59:59 INFO ResourceUtils: 
> ==
> 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Successfully registered 
> with driver
> 21/03/10 06:59:59 INFO Executor: Starting executor ID 1 on host 100.64.0.192
> 21/03/10 07:00:00 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37956.
> 21/03/10 07:00:00 INFO NettyBlockTransferService: Server created on 
> 100.64.0.192:37956
> 21/03/10 07:00:00 INFO BlockManager: Using 
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
> policy
> 21/03/10 07:00:00 INFO BlockManagerMaster: Registering BlockManager 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:00 INFO BlockManagerMaster: Registered BlockManager 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:00 INFO BlockManager: Initialized BlockManager: 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 0
> 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 1
> 21/03/10 07:00:01 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 21/03/10 07:00:01 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 21/03/10 07:00:01 INFO Executor: Fetching 
> 

[jira] [Commented] (SPARK-30536) Sort-merge join operator spilling performance improvements

2020-03-08 Thread shanyu zhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054555#comment-17054555
 ] 

shanyu zhao commented on SPARK-30536:
-

Uploaded two slides to explain the optimization idea of this PR.
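For readers without the slides, here is a minimal sketch of the 
lazy-initialization idea (illustrative names only, not the actual PR code): the 
spill reader, and therefore its read buffer and file handle, is only created 
when the first row is actually requested.
{code:scala}
// Illustrative sketch only: defer creating the spill reader until first use.
class SpillFile(path: String) {
  // Opening eagerly pays for a large read buffer and file I/O even if the
  // iterator is never consumed.
  def openReader(): Iterator[Array[Byte]] = {
    println(s"opening $path")
    Iterator.empty
  }
}

class LazySpillIterator(spill: SpillFile) extends Iterator[Array[Byte]] {
  // The underlying reader is created lazily, on the first hasNext/next call.
  private lazy val underlying = spill.openReader()
  override def hasNext: Boolean = underlying.hasNext
  override def next(): Array[Byte] = underlying.next()
}
{code}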

> Sort-merge join operator spilling performance improvements
> --
>
> Key: SPARK-30536
> URL: https://issues.apache.org/jira/browse/SPARK-30536
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sinisa Knezevic
>Priority: Major
> Attachments: spark-30536-explained.pdf
>
>
> Testing with the TPC-DS 100 TB benchmark data set showed that some of the SQLs 
> (e.g. query 14) are not able to run even with extremely large Spark executor 
> memory. The Spark spilling feature has to be enabled in order to process these 
> SQLs, and processing becomes extremely slow when spilling is enabled. The 
> spilling feature is controlled via two parameters: 
> “spark.sql.sortMergeJoinExec.buffer.in.memory.threshold” and 
> “spark.sql.sortMergeJoinExec.buffer.spill.threshold”.
> “spark.sql.sortMergeJoinExec.buffer.in.memory.threshold” – when this threshold 
> is reached, the data is moved from the ExternalAppendOnlyUnsafeRowArray object 
> into the UnsafeExternalSorter object.
> “spark.sql.sortMergeJoinExec.buffer.spill.threshold” – when this threshold is 
> reached, the data is spilled from the UnsafeExternalSorter object onto disk.
>  
> During execution of a sort-merge join (Left Semi Join), for each left join row 
> the “right matches” are found and stored into an 
> ExternalAppendOnlyUnsafeRowArray object. In the case of query 14 there are 
> millions of rows of “right matches”. To run this query spilling is enabled, and 
> data is moved from ExternalAppendOnlyUnsafeRowArray into UnsafeExternalSorter 
> and then spilled onto disk. When millions of rows are processed on the left 
> side of the join, an iterator over the spilled “right matches” rows is created 
> each time. This means the iterator over the right matches (spilled on disk) is 
> created millions of times. The current Spark implementation creates the 
> iterator over the spilled rows and produces I/O, which results in millions of 
> I/Os when millions of rows are processed.
>  
> To avoid this performance bottleneck, this JIRA introduces the following 
> solution:
> 1. Implement lazy initialization of UnsafeSorterSpillReader - the iterator over 
> spilled rows:
>     … During SortMergeJoin (Left Semi Join) execution, the iterator over the 
> spilled data is created but no iteration over the data is done.
>     … Lazy initialization of UnsafeSorterSpillReader enables efficient 
> processing of SortMergeJoin even if data is spilled onto disk. Unnecessary I/O 
> is avoided.
> 2. Decrease the initial memory read buffer size in UnsafeSorterSpillReader from 
> 1MB to 1KB:
>     … The UnsafeSorterSpillReader constructor takes a lot of time due to the 
> default 1MB memory read buffer.
>     … The code already has logic to grow the memory read buffer if it cannot 
> fit the data, so decreasing the size to 1KB is safe and has a positive 
> performance impact.
> 3. Improve memory utilization when spilling is enabled in 
> ExternalAppendOnlyUnsafeRowArray:
>     … In the current implementation, when spilling is enabled, an 
> UnsafeExternalSorter object is created, the data is moved from the 
> ExternalAppendOnlyUnsafeRowArray object into the UnsafeExternalSorter, and then 
> the ExternalAppendOnlyUnsafeRowArray object is emptied. Just before it is 
> emptied, both objects hold the same data in memory. That requires double the 
> memory and duplicates the data. This can be avoided.
>     … In the proposed solution, when 
> spark.sql.sortMergeJoinExec.buffer.in.memory.threshold is reached, adding new 
> rows into the ExternalAppendOnlyUnsafeRowArray object stops. An 
> UnsafeExternalSorter object is created and new rows are added to it, while the 
> ExternalAppendOnlyUnsafeRowArray object retains all rows already added. This 
> approach enables better memory utilization and avoids unnecessary movement of 
> data from one object into another.
>  
> Testing this solution with query 14 and spilling enabled showed a 500X 
> performance improvement and did not degrade the performance of the other SQLs 
> in the TPC-DS benchmark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30536) Sort-merge join operator spilling performance improvements

2020-03-08 Thread shanyu zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated SPARK-30536:

Attachment: spark-30536-explained.pdf

> Sort-merge join operator spilling performance improvements
> --
>
> Key: SPARK-30536
> URL: https://issues.apache.org/jira/browse/SPARK-30536
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sinisa Knezevic
>Priority: Major
> Attachments: spark-30536-explained.pdf
>
>
> Testing with the TPC-DS 100 TB benchmark data set showed that some of the SQLs 
> (e.g. query 14) are not able to run even with extremely large Spark executor 
> memory. The Spark spilling feature has to be enabled in order to process these 
> SQLs, and processing becomes extremely slow when spilling is enabled. The 
> spilling feature is controlled via two parameters: 
> “spark.sql.sortMergeJoinExec.buffer.in.memory.threshold” and 
> “spark.sql.sortMergeJoinExec.buffer.spill.threshold”.
> “spark.sql.sortMergeJoinExec.buffer.in.memory.threshold” – when this threshold 
> is reached, the data is moved from the ExternalAppendOnlyUnsafeRowArray object 
> into the UnsafeExternalSorter object.
> “spark.sql.sortMergeJoinExec.buffer.spill.threshold” – when this threshold is 
> reached, the data is spilled from the UnsafeExternalSorter object onto disk.
>  
> During execution of a sort-merge join (Left Semi Join), for each left join row 
> the “right matches” are found and stored into an 
> ExternalAppendOnlyUnsafeRowArray object. In the case of query 14 there are 
> millions of rows of “right matches”. To run this query spilling is enabled, and 
> data is moved from ExternalAppendOnlyUnsafeRowArray into UnsafeExternalSorter 
> and then spilled onto disk. When millions of rows are processed on the left 
> side of the join, an iterator over the spilled “right matches” rows is created 
> each time. This means the iterator over the right matches (spilled on disk) is 
> created millions of times. The current Spark implementation creates the 
> iterator over the spilled rows and produces I/O, which results in millions of 
> I/Os when millions of rows are processed.
>  
> To avoid this performance bottleneck, this JIRA introduces the following 
> solution:
> 1. Implement lazy initialization of UnsafeSorterSpillReader - the iterator over 
> spilled rows:
>     … During SortMergeJoin (Left Semi Join) execution, the iterator over the 
> spilled data is created but no iteration over the data is done.
>     … Lazy initialization of UnsafeSorterSpillReader enables efficient 
> processing of SortMergeJoin even if data is spilled onto disk. Unnecessary I/O 
> is avoided.
> 2. Decrease the initial memory read buffer size in UnsafeSorterSpillReader from 
> 1MB to 1KB:
>     … The UnsafeSorterSpillReader constructor takes a lot of time due to the 
> default 1MB memory read buffer.
>     … The code already has logic to grow the memory read buffer if it cannot 
> fit the data, so decreasing the size to 1KB is safe and has a positive 
> performance impact.
> 3. Improve memory utilization when spilling is enabled in 
> ExternalAppendOnlyUnsafeRowArray:
>     … In the current implementation, when spilling is enabled, an 
> UnsafeExternalSorter object is created, the data is moved from the 
> ExternalAppendOnlyUnsafeRowArray object into the UnsafeExternalSorter, and then 
> the ExternalAppendOnlyUnsafeRowArray object is emptied. Just before it is 
> emptied, both objects hold the same data in memory. That requires double the 
> memory and duplicates the data. This can be avoided.
>     … In the proposed solution, when 
> spark.sql.sortMergeJoinExec.buffer.in.memory.threshold is reached, adding new 
> rows into the ExternalAppendOnlyUnsafeRowArray object stops. An 
> UnsafeExternalSorter object is created and new rows are added to it, while the 
> ExternalAppendOnlyUnsafeRowArray object retains all rows already added. This 
> approach enables better memory utilization and avoids unnecessary movement of 
> data from one object into another.
>  
> Testing this solution with query 14 and spilling enabled showed a 500X 
> performance improvement and did not degrade the performance of the other SQLs 
> in the TPC-DS benchmark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31029) Occasional class not found error in user's Future code using global ExecutionContext

2020-03-04 Thread shanyu zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated SPARK-31029:

Description: 
*Problem:*
When running the TPC-DS test (https://github.com/databricks/spark-sql-perf), 
occasionally we see errors related to class not found:

2020-02-04 20:00:26,673 ERROR yarn.ApplicationMaster: User class threw 
exception: scala.ScalaReflectionException: class 
com.databricks.spark.sql.perf.ExperimentRun in JavaMirror with 
sun.misc.Launcher$AppClassLoader@28ba21f3 of type class 
sun.misc.Launcher$AppClassLoader with classpath [...] 
and parent being sun.misc.Launcher$ExtClassLoader@3ff5d147 of type class 
sun.misc.Launcher$ExtClassLoader with classpath [...] 
and parent being primordial classloader with boot classpath [...] not found.

*Root cause:*
The Spark driver starts the ApplicationMaster in the main thread, which starts a 
user thread and sets MutableURLClassLoader as that thread's ContextClassLoader:
userClassThread = startUserApplication()

The main thread then sets up the YarnSchedulerBackend RPC endpoints, which 
handle these calls using Scala Futures on the default global ExecutionContext:
- doRequestTotalExecutors
- doKillExecutors

If the main thread starts a future (to handle doKillExecutors()) before the user 
thread does, the pool thread's ContextClassLoader will be the default 
AppClassLoader. If the user thread starts a future first, the pool thread will 
have the MutableURLClassLoader.

So if the user's code uses a future that references a user-provided class (which 
only the MutableURLClassLoader can load), and an executor is lost before that 
future runs, you will see class-not-found errors.

*Proposed Solution:*
We can potentially solve this problem in one of two ways:
1) Set the same class loader (userClassLoader) on both the main thread and the 
user thread in ApplicationMaster.scala

2) Do not use "ExecutionContext.Implicits.global" in YarnSchedulerBackend

  was:
*Problem:*
When running the TPC-DS test (https://github.com/databricks/spark-sql-perf), 
occasionally we see errors related to class not found:

2020-02-04 20:00:26,673 ERROR yarn.ApplicationMaster: User class threw 
exception: scala.ScalaReflectionException: class 
com.databricks.spark.sql.perf.ExperimentRun in JavaMirror with 
sun.misc.Launcher$AppClassLoader@28ba21f3 of type class 
sun.misc.Launcher$AppClassLoader with classpath [...] 
and parent being sun.misc.Launcher$ExtClassLoader@3ff5d147 of type class 
sun.misc.Launcher$ExtClassLoader with classpath [...] 
and parent being primordial classloader with boot classpath [...] not found.

*Root cause:*
The Spark driver starts the ApplicationMaster in the main thread, which starts a 
user thread and sets MutableURLClassLoader as that thread's ContextClassLoader:
userClassThread = startUserApplication()

The main thread then sets up the YarnSchedulerBackend RPC endpoints, which 
handle these calls using Scala Futures on the default global ExecutionContext:
- doRequestTotalExecutors
- doKillExecutors

If the main thread starts a future (to handle doKillExecutors()) before the user 
thread does, the pool thread's ContextClassLoader will be the default 
AppClassLoader. If the user thread starts a future first, the pool thread will 
have the MutableURLClassLoader.

So if the user's code uses a future that references a user-provided class (which 
only the MutableURLClassLoader can load), and an executor is lost before that 
future runs, you will see class-not-found errors.

*Proposed Solution:*
Set the same class loader (userClassLoader) on both the main thread and the user 
thread in ApplicationMaster.scala


> Occasional class not found error in user's Future code using global 
> ExecutionContext
> 
>
> Key: SPARK-31029
> URL: https://issues.apache.org/jira/browse/SPARK-31029
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.5
>Reporter: shanyu zhao
>Priority: Major
>
> *Problem:*
> When running the TPC-DS test (https://github.com/databricks/spark-sql-perf), 
> occasionally we see errors related to class not found:
> 2020-02-04 20:00:26,673 ERROR yarn.ApplicationMaster: User class threw 
> exception: scala.ScalaReflectionException: class 
> com.databricks.spark.sql.perf.ExperimentRun in JavaMirror with 
> sun.misc.Launcher$AppClassLoader@28ba21f3 of type class 
> sun.misc.Launcher$AppClassLoader with classpath [...] 
> and parent being sun.misc.Launcher$ExtClassLoader@3ff5d147 of type class 
> sun.misc.Launcher$ExtClassLoader with classpath [...] 
> and parent being primordial classloader with boot classpath [...] not found.
> *Root cause:*
> Spark driver starts ApplicationMaster in the main thread, which starts a user 
> thread and set 

[jira] [Created] (SPARK-31029) Occasional class not found error in user's Future code using global ExecutionContext

2020-03-03 Thread shanyu zhao (Jira)
shanyu zhao created SPARK-31029:
---

 Summary: Occasional class not found error in user's Future code 
using global ExecutionContext
 Key: SPARK-31029
 URL: https://issues.apache.org/jira/browse/SPARK-31029
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.4.5
Reporter: shanyu zhao


*Problem:*
When running the TPC-DS test (https://github.com/databricks/spark-sql-perf), 
occasionally we see errors related to class not found:

2020-02-04 20:00:26,673 ERROR yarn.ApplicationMaster: User class threw 
exception: scala.ScalaReflectionException: class 
com.databricks.spark.sql.perf.ExperimentRun in JavaMirror with 
sun.misc.Launcher$AppClassLoader@28ba21f3 of type class 
sun.misc.Launcher$AppClassLoader with classpath [...] 
and parent being sun.misc.Launcher$ExtClassLoader@3ff5d147 of type class 
sun.misc.Launcher$ExtClassLoader with classpath [...] 
and parent being primordial classloader with boot classpath [...] not found.

*Root cause:*
The Spark driver starts the ApplicationMaster in the main thread, which starts a 
user thread and sets MutableURLClassLoader as that thread's ContextClassLoader:
userClassThread = startUserApplication()

The main thread then sets up the YarnSchedulerBackend RPC endpoints, which 
handle these calls using Scala Futures on the default global ExecutionContext:
- doRequestTotalExecutors
- doKillExecutors

If the main thread starts a future (to handle doKillExecutors()) before the user 
thread does, the pool thread's ContextClassLoader will be the default 
AppClassLoader. If the user thread starts a future first, the pool thread will 
have the MutableURLClassLoader.

So if the user's code uses a future that references a user-provided class (which 
only the MutableURLClassLoader can load), and an executor is lost before that 
future runs, you will see class-not-found errors.

*Proposed Solution:*
Set the same class loader (userClassLoader) on both the main thread and the user 
thread in ApplicationMaster.scala
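A minimal standalone sketch (hypothetical class and object names, not 
ApplicationMaster code) that shows how the first thread to submit a Future 
decides which ContextClassLoader the global pool's worker thread ends up with:
{code:scala}
// Sketch only: demonstrates the race described above on the global ExecutionContext.
import java.net.URLClassLoader
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object ContextClassLoaderRace {
  def main(args: Array[String]): Unit = {
    // Stand-in for the user class thread's MutableURLClassLoader.
    val userLoader = new URLClassLoader(Array.empty[java.net.URL], getClass.getClassLoader)
    Thread.currentThread().setContextClassLoader(userLoader)

    // The worker thread is created lazily when the first Future is submitted,
    // so its context classloader depends on which thread got here first.
    val poolLoader =
      Await.result(Future(Thread.currentThread().getContextClassLoader), 10.seconds)
    println(s"pool thread loader: $poolLoader, user loader: $userLoader")
  }
}
{code}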



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31028) Add "-XX:ActiveProcessorCount" to Spark driver and executor in Yarn mode

2020-03-03 Thread shanyu zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated SPARK-31028:

Description: 
When starting Spark drivers and executors on a YARN cluster, the JVM process can 
discover all CPU cores on the system and size its thread pools and GC threads 
based on that value. We should limit the number of cores the JVM sees to the 
value set by the user (spark.driver.cores or spark.executor.cores) via 
"-XX:ActiveProcessorCount", which was introduced in Java 8u191.

Especially when running Spark on YARN inside a Kubernetes container, the number 
of CPU cores discovered is sometimes 1, which means the JVM always uses 1 thread 
in the default thread pool and for GC.
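Until Spark sets this automatically, the flag can be passed explicitly; a hedged 
example (core counts are placeholders):
{code:bash}
# Sketch: align the JVM's visible processor count with the requested cores.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.driver.cores=2 \
  --conf spark.executor.cores=4 \
  --conf spark.driver.extraJavaOptions="-XX:ActiveProcessorCount=2" \
  --conf spark.executor.extraJavaOptions="-XX:ActiveProcessorCount=4" \
  --class org.apache.spark.examples.SparkPi \
  /path/to/spark-examples.jar 1000
{code}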

  was:When starting Spark drivers and executors on a YARN cluster, the JVM 
process can discover all CPU cores on the system and size its thread pools and 
GC threads based on that value. We should limit the number of cores the JVM sees 
to the value set by the user (spark.driver.cores or spark.executor.cores) via 
"-XX:ActiveProcessorCount", which was introduced in Java 8u191.


> Add "-XX:ActiveProcessorCount" to Spark driver and executor in Yarn mode
> 
>
> Key: SPARK-31028
> URL: https://issues.apache.org/jira/browse/SPARK-31028
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.5
>Reporter: shanyu zhao
>Priority: Major
>
> When starting Spark drivers and executors on a YARN cluster, the JVM process 
> can discover all CPU cores on the system and size its thread pools and GC 
> threads based on that value. We should limit the number of cores the JVM sees 
> to the value set by the user (spark.driver.cores or spark.executor.cores) via 
> "-XX:ActiveProcessorCount", which was introduced in Java 8u191.
> Especially when running Spark on YARN inside a Kubernetes container, the number 
> of CPU cores discovered is sometimes 1, which means the JVM always uses 1 
> thread in the default thread pool and for GC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31028) Add "-XX:ActiveProcessorCount" to Spark driver and executor in Yarn mode

2020-03-03 Thread shanyu zhao (Jira)
shanyu zhao created SPARK-31028:
---

 Summary: Add "-XX:ActiveProcessorCount" to Spark driver and 
executor in Yarn mode
 Key: SPARK-31028
 URL: https://issues.apache.org/jira/browse/SPARK-31028
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.4.5
Reporter: shanyu zhao


When starting Spark drivers and executors on a YARN cluster, the JVM process can 
discover all CPU cores on the system and size its thread pools and GC threads 
based on that value. We should limit the number of cores the JVM sees to the 
value set by the user (spark.driver.cores or spark.executor.cores) via 
"-XX:ActiveProcessorCount", which was introduced in Java 8u191.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30845) spark-submit pyspark app on yarn uploads local pyspark archives

2020-02-16 Thread shanyu zhao (Jira)
shanyu zhao created SPARK-30845:
---

 Summary: spark-submit pyspark app on yarn uploads local pyspark 
archives
 Key: SPARK-30845
 URL: https://issues.apache.org/jira/browse/SPARK-30845
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0
Reporter: shanyu zhao


Use spark-submit to submit a pyspark app on Yarn, and set this in spark-env.sh:

{code:bash}
export 
PYSPARK_ARCHIVES_PATH=local:/opt/spark/python/lib/pyspark.zip,local:/opt/spark/python/lib/py4j-0.10.7-src.zip
{code}

You can see that these local archives are still uploaded to the YARN distributed 
cache:

yarn.Client: Uploading resource file:/opt/spark/python/lib/pyspark.zip -> 
hdfs://myhdfs/user/test1/.sparkStaging/application_1581024490249_0001/pyspark.zip




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2020-02-04 Thread shanyu zhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030114#comment-17030114
 ] 

shanyu zhao commented on SPARK-30602:
-

Thanks for the effort, Min! Riffle seems to only do map-side, worker-level merge 
and does not do push-based shuffle, and it seems simpler to implement. I wonder 
what benefit "push-based shuffle" brings on top of Riffle's merge approach in 
terms of performance and scalability.

I can imagine "push-based shuffle" is more "responsive" because it streamlines 
mappers and reducers; could that be a separate effort then?

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When doing Spark on YARN for a large-scale deployment, people 
> usually enable Spark external shuffle service and store the intermediate 
> shuffle files on HDD. Because the number of blocks generated for a particular 
> shuffle grows quadratically compared to the size of shuffled data (# mappers 
> and reducers grows linearly with the size of shuffled data, but # blocks is # 
> mappers * # reducers), one general trend we have observed is that the more 
> data a Spark application processes, the smaller the block size becomes. In a 
> few production clusters we have seen, the average shuffle block size is only 
> 10s of KBs. Because of the inefficiency of performing random reads on HDD for 
> small amount of data, the overall efficiency of the Spark external shuffle 
> services serving the shuffle blocks degrades as we see an increasing # of 
> Spark applications processing an increasing amount of data. In addition, 
> because Spark external shuffle service is a shared service in a multi-tenancy 
> cluster, the inefficiency with one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> above mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers and blocks get pre-merged 
> and move towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark’s existing 
> shuffle netty protocol, and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle in 
> Spark without incurring the dependency or overhead of either specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29003) Spark history server startup hang due to deadlock

2019-09-06 Thread shanyu zhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924567#comment-16924567
 ] 

shanyu zhao commented on SPARK-29003:
-

Please see the full jstack attached.

> Spark history server startup hang due to deadlock
> -
>
> Key: SPARK-29003
> URL: https://issues.apache.org/jira/browse/SPARK-29003
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: shanyu zhao
>Priority: Major
> Attachments: sparkhistory-jstack.log
>
>
> Occasionally when starting Spark History Server, the service process will 
> hang before binding to the port so Spark History Server is not usable. One 
> has to kill the process and start again. You can write a simple bash program 
> to stop and start Spark History Server and you can reproduce this problem 
> approximately 10% of the time.
> The problem is that java.nio.file.FileSystems.getDefault() causes a deadlock. 
> This is what I collected with jstack:
> {code:java}
> "log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x7fca90028800 
> nid=0x6e8 in Object.wait() [0x7fcaa9471000]
> java.lang.Thread.State: RUNNABLE 
> at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
> ... 
> at java.lang.Runtime.loadLibrary0(Runtime.java:870) - locked 
> <0xaaac1d40> (a java.lang.Runtime) 
> ... 
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698)
> "main" #1 prio=5 os_prio=0 tid=0x7fcad8016800 nid=0x6d8 waiting for 
> monitor entry [0x7fcae146c000]
>     java.lang.Thread.State: BLOCKED (on object monitor) 
> at java.lang.Runtime.loadLibrary0(Runtime.java:862) - waiting to lock 
> <0xaaac1d40> (a java.lang.Runtime) 
> ... 
> at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
> at java.io.File.toPath(File.java:2234) - locked <0x8699bb68> (a 
> java.io.File) 
> ... 
> at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365){code}
> Basically the "main" thread and the "log-replay-executor-0" thread simultaneously 
> call java.nio.file.FileSystems.getDefault() and deadlock. 
> This is similar to the reported JDK bug:
> [https://bugs.openjdk.java.net/browse/JDK-8037567]
> The problem is that during Spark History Server startup, there are two things 
> happening simultaneously that call into 
> java.nio.file.FileSystems.getDefault():
> 1) start jetty server
>  2) start ApplicationHistoryProvider (which reads files from HDFS)
> We should do these two things sequentially instead of in parallel.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29003) Spark history server startup hang due to deadlock

2019-09-06 Thread shanyu zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated SPARK-29003:

Attachment: sparkhistory-jstack.log

> Spark history server startup hang due to deadlock
> -
>
> Key: SPARK-29003
> URL: https://issues.apache.org/jira/browse/SPARK-29003
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: shanyu zhao
>Priority: Major
> Attachments: sparkhistory-jstack.log
>
>
> Occasionally when starting Spark History Server, the service process will 
> hang before binding to the port so Spark History Server is not usable. One 
> has to kill the process and start again. You can write a simple bash program 
> to stop and start Spark History Server and you can reproduce this problem 
> approximately 10% of the time.
> The problem is that java.nio.file.FileSystems.getDefault() causes a deadlock. 
> This is what I collected with jstack:
> {code:java}
> "log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x7fca90028800 
> nid=0x6e8 in Object.wait() [0x7fcaa9471000]
> java.lang.Thread.State: RUNNABLE 
> at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
> ... 
> at java.lang.Runtime.loadLibrary0(Runtime.java:870) - locked 
> <0xaaac1d40> (a java.lang.Runtime) 
> ... 
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698)
> "main" #1 prio=5 os_prio=0 tid=0x7fcad8016800 nid=0x6d8 waiting for 
> monitor entry [0x7fcae146c000]
>     java.lang.Thread.State: BLOCKED (on object monitor) 
> at java.lang.Runtime.loadLibrary0(Runtime.java:862) - waiting to lock 
> <0xaaac1d40> (a java.lang.Runtime) 
> ... 
> at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
> at java.io.File.toPath(File.java:2234) - locked <0x8699bb68> (a 
> java.io.File) 
> ... 
> at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365){code}
> Basically the "main" thread and the "log-replay-executor-0" thread simultaneously 
> call java.nio.file.FileSystems.getDefault() and deadlock. 
> This is similar to the reported JDK bug:
> [https://bugs.openjdk.java.net/browse/JDK-8037567]
> The problem is that during Spark History Server startup, there are two things 
> happening simultaneously that call into 
> java.nio.file.FileSystems.getDefault():
> 1) start jetty server
>  2) start ApplicationHistoryProvider (which reads files from HDFS)
> We should do these two things sequentially instead of in parallel.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29003) Spark history server startup hang due to deadlock

2019-09-05 Thread shanyu zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated SPARK-29003:

Description: 
Occasionally when starting Spark History Server, the service process will hang 
before binding to the port so Spark History Server is not usable. One has to 
kill the process and start again. You can write a simple bash program to stop 
and start Spark History Server and you can reproduce this problem approximately 
10% of the time.

The problem is that java.nio.file.FileSystems.getDefault() causes a deadlock. 
This is what I collected with jstack:
{code:java}
"log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x7fca90028800 
nid=0x6e8 in Object.wait() [0x7fcaa9471000]
java.lang.Thread.State: RUNNABLE 
at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
... 
at java.lang.Runtime.loadLibrary0(Runtime.java:870) - locked 
<0xaaac1d40> (a java.lang.Runtime) 
... 
at 
org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698)

"main" #1 prio=5 os_prio=0 tid=0x7fcad8016800 nid=0x6d8 waiting for monitor 
entry [0x7fcae146c000]
    java.lang.Thread.State: BLOCKED (on object monitor) 
at java.lang.Runtime.loadLibrary0(Runtime.java:862) - waiting to lock 
<0xaaac1d40> (a java.lang.Runtime) 
... 
at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
at java.io.File.toPath(File.java:2234) - locked <0x8699bb68> (a 
java.io.File) 
... 
at 
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365){code}
Basically the "main" thread and the "log-replay-executor-0" thread simultaneously 
call java.nio.file.FileSystems.getDefault() and deadlock. 

This is similar to the reported JDK bug:

[https://bugs.openjdk.java.net/browse/JDK-8037567]

The problem is that during Spark History Server startup, there are two things 
happening simultaneously that call into java.nio.file.FileSystems.getDefault():

1) start jetty server
 2) start ApplicationHistoryProvider (which reads files from HDFS)

We should do these two things sequentially instead of in parallel.

 

  was:
Occasionally when starting Spark History Server, the service process will hang 
before binding to the port so Spark History Server is not usable. One has to 
kill the process and start again. You can write a simple bash program to stop 
and start Spark History Server and you can reproduce this problem approximately 
10% of the time.

The problem is that java.nio.file.FileSystems.getDefault() causes a deadlock. 
This is what I collected with jstack:
{code:java}
"log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x7fca90028800 
nid=0x6e8 in Object.wait() [0x7fcaa9471000]"log-replay-executor-0" #17 
daemon prio=5 os_prio=0 tid=0x7fca90028800 nid=0x6e8 in Object.wait() 
[0x7fcaa9471000]   java.lang.Thread.State: RUNNABLE at 
java.nio.file.FileSystems.getDefault(FileSystems.java:176) ... at 
java.lang.Runtime.loadLibrary0(Runtime.java:870) - locked <0xaaac1d40> 
(a java.lang.Runtime) ... at 
org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698)
"main" #1 prio=5 os_prio=0 tid=0x7fcad8016800 nid=0x6d8 waiting for monitor 
entry [0x7fcae146c000]   java.lang.Thread.State: BLOCKED (on object 
monitor) at java.lang.Runtime.loadLibrary0(Runtime.java:862) - waiting to lock 
<0xaaac1d40> (a java.lang.Runtime) ... at 
java.nio.file.FileSystems.getDefault(FileSystems.java:176) at 
java.io.File.toPath(File.java:2234) - locked <0x8699bb68> (a 
java.io.File) ...    at 
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365){code}
Basically the "main" thread and the "log-replay-executor-0" thread simultaneously 
call java.nio.file.FileSystems.getDefault() and deadlock. 

This is similar to the reported JDK bug:

[https://bugs.openjdk.java.net/browse/JDK-8037567]

The problem is that during Spark History Server startup, there are two things 
happening simultaneously that call into java.nio.file.FileSystems.getDefault():

1) start jetty server
2) start ApplicationHistoryProvider (which reads files from HDFS)

We should do these two things sequentially instead of in parallel.

 


> Spark history server startup hang due to deadlock
> -
>
> Key: SPARK-29003
> URL: https://issues.apache.org/jira/browse/SPARK-29003
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: shanyu zhao
>Priority: Major
>
> Occasionally when starting Spark History Server, the service process will 
> hang before binding to the port so Spark History Server is not usable. One 
> has to kill the process and start again. You can write a simple bash program 
> to stop and 

[jira] [Created] (SPARK-29003) Spark history server startup hang due to deadlock

2019-09-05 Thread shanyu zhao (Jira)
shanyu zhao created SPARK-29003:
---

 Summary: Spark history server startup hang due to deadlock
 Key: SPARK-29003
 URL: https://issues.apache.org/jira/browse/SPARK-29003
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: shanyu zhao


Occasionally when starting Spark History Server, the service process will hang 
before binding to the port so Spark History Server is not usable. One has to 
kill the process and start again. You can write a simple bash program to stop 
and start Spark History Server and you can reproduce this problem approximately 
10% of the time.

The problem is that java.nio.file.FileSystems.getDefault() causes a deadlock. 
This is what I collected with jstack:
{code:java}
"log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x7fca90028800 
nid=0x6e8 in Object.wait() [0x7fcaa9471000]
java.lang.Thread.State: RUNNABLE 
at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
... 
at java.lang.Runtime.loadLibrary0(Runtime.java:870) - locked 
<0xaaac1d40> (a java.lang.Runtime) 
... 
at 
org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698)

"main" #1 prio=5 os_prio=0 tid=0x7fcad8016800 nid=0x6d8 waiting for monitor 
entry [0x7fcae146c000]
    java.lang.Thread.State: BLOCKED (on object monitor) 
at java.lang.Runtime.loadLibrary0(Runtime.java:862) - waiting to lock 
<0xaaac1d40> (a java.lang.Runtime) 
... 
at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
at java.io.File.toPath(File.java:2234) - locked <0x8699bb68> (a 
java.io.File) 
... 
at 
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365){code}
Basically the "main" thread and the "log-replay-executor-0" thread simultaneously 
call java.nio.file.FileSystems.getDefault() and deadlock. 

This is similar to the reported JDK bug:

[https://bugs.openjdk.java.net/browse/JDK-8037567]

The problem is that during Spark History Server startup, there are two things 
happening simultaneously that call into java.nio.file.FileSystems.getDefault():

1) start jetty server
2) start ApplicationHistoryProvider (which reads files from HDFS)

We should do these two things sequentially instead of in parallel.
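A generic workaround sketch for this class of JDK race (illustrative only, not 
the actual Spark change) is to force the default filesystem provider to 
initialize on the main thread before any background threads are started:
{code:scala}
// Sketch: single-threaded eager initialization avoids the getDefault()/loadLibrary race.
object HistoryServerStartupSketch {
  def main(args: Array[String]): Unit = {
    // Initialize the default FileSystem provider while still single-threaded.
    java.nio.file.FileSystems.getDefault()

    // ... only afterwards start the Jetty server and the log-replay threads,
    // or start them one after the other instead of in parallel.
  }
}
{code}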

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2019-05-03 Thread shanyu zhao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated SPARK-12312:

Affects Version/s: 2.4.2

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to lack of Kerberos ticket or ability to generate it. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environments where exposing simple 
> authentication access is not an option due to IT policy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2019-04-22 Thread shanyu zhao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823458#comment-16823458
 ] 

shanyu zhao commented on SPARK-18673:
-

Ping. What is the verdict here for users who want to use Spark 2.4 and Hadoop 3.1?

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26011) pyspark app with "spark.jars.packages" config does not work

2018-11-11 Thread shanyu zhao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated SPARK-26011:

Description: 
Command "pyspark --packages" works as expected, but if we submit a Livy 
pyspark job with the "spark.jars.packages" config, the downloaded packages are 
not added to Python's sys.path, therefore the package is not available to use.

For example, this command works:

pyspark --packages Azure:mmlspark:0.14

However, using a Jupyter notebook with the sparkmagic kernel to open a pyspark 
session fails:

%%configure -f \{"conf": {"spark.jars.packages": "Azure:mmlspark:0.14"}}
 import mmlspark

The root cause is that SparkSubmit determines whether an app is a pyspark app by 
the suffix of the primary resource, but Livy uses "spark-internal" as the 
primary resource when calling spark-submit, therefore args.isPython is set to 
false in SparkSubmit.scala.

  was:
Command "pyspark --packages" works as expected, but if submitting a livy 
pyspark job with "spark.jars.packages" config, the downloaded packages are not 
added to python's sys.path therefore the package is not available to use.

For example, this command works:

pyspark --packages Azure:mmlspark:0.14

However, using Jupyter notebook with sparkmagic kernel to open a pyspark 
session failed:

%%configure -f \{"conf": {spark.jars.packages": "Azure:mmlspark:0.14"}}
import mmlspark

The root cause is that SparkSubmit determines pyspark app by the suffix of 
primary resource but Livy uses "spark-internal" as the primary resource when 
calling spark-submit, therefore args.isPython is fails in SparkSubmit.scala.


> pyspark app with "spark.jars.packages" config does not work
> ---
>
> Key: SPARK-26011
> URL: https://issues.apache.org/jira/browse/SPARK-26011
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shanyu zhao
>Priority: Major
>
> Command "pyspark --packages" works as expected, but if we submit a Livy 
> pyspark job with the "spark.jars.packages" config, the downloaded packages are 
> not added to Python's sys.path, therefore the package is not available to use.
> For example, this command works:
> pyspark --packages Azure:mmlspark:0.14
> However, using a Jupyter notebook with the sparkmagic kernel to open a pyspark 
> session fails:
> %%configure -f \{"conf": {"spark.jars.packages": "Azure:mmlspark:0.14"}}
>  import mmlspark
> The root cause is that SparkSubmit determines whether an app is a pyspark app 
> by the suffix of the primary resource, but Livy uses "spark-internal" as the 
> primary resource when calling spark-submit, therefore args.isPython is set to 
> false in SparkSubmit.scala.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26011) pyspark app with "spark.jars.packages" config does not work

2018-11-11 Thread shanyu zhao (JIRA)
shanyu zhao created SPARK-26011:
---

 Summary: pyspark app with "spark.jars.packages" config does not 
work
 Key: SPARK-26011
 URL: https://issues.apache.org/jira/browse/SPARK-26011
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.4.0, 2.3.2
Reporter: shanyu zhao


Command "pyspark --packages" works as expected, but if we submit a Livy 
pyspark job with the "spark.jars.packages" config, the downloaded packages are 
not added to Python's sys.path, therefore the package is not available to use.

For example, this command works:

pyspark --packages Azure:mmlspark:0.14

However, using a Jupyter notebook with the sparkmagic kernel to open a pyspark 
session fails:

%%configure -f \{"conf": {"spark.jars.packages": "Azure:mmlspark:0.14"}}
import mmlspark

The root cause is that SparkSubmit determines whether an app is a pyspark app by 
the suffix of the primary resource, but Livy uses "spark-internal" as the 
primary resource when calling spark-submit, therefore args.isPython is set to 
false in SparkSubmit.scala.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided

2018-11-09 Thread shanyu zhao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682089#comment-16682089
 ] 

shanyu zhao commented on SPARK-25999:
-

Patch attached. Basically it creates an optional project that brings all 
dependencies into the R/rjarsdep/target folder, and copies the missing jars to 
the assembly/target folder before building R.
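A rough sketch of the idea (module name taken from the patch description; the 
exact Maven invocation and paths are illustrative):
{code:bash}
# Copy the full dependency tree of the helper module, then backfill whatever the
# hadoop-provided assembly is missing before the R docs/vignettes are built.
mvn -pl R/rjarsdep dependency:copy-dependencies -DoutputDirectory=R/rjarsdep/target
cp -n R/rjarsdep/target/*.jar assembly/target/scala-*/jars/
{code}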

> make-distribution.sh failure with --r and -Phadoop-provided
> ---
>
> Key: SPARK-25999
> URL: https://issues.apache.org/jira/browse/SPARK-25999
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shanyu zhao
>Priority: Major
> Attachments: SPARK-25999.patch
>
>
> It is not possible to build a distribution that doesn't contain hadoop 
> dependencies but includes SparkR. This is because R/check_cran.sh builds the R 
> documentation, which depends on the hadoop dependencies in the 
> assembly/target/scala-xxx/jars folder.
> To reproduce:
> MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive 
> -Psparkr -Phadoop-provided"
> ./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS
>  
> Error:
> * creating vignettes ... ERROR
> ...
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided

2018-11-09 Thread shanyu zhao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated SPARK-25999:

Attachment: SPARK-25999.patch

> make-distribution.sh failure with --r and -Phadoop-provided
> ---
>
> Key: SPARK-25999
> URL: https://issues.apache.org/jira/browse/SPARK-25999
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shanyu zhao
>Priority: Major
> Attachments: SPARK-25999.patch
>
>
> It is not possible to build a distribution that doesn't contain hadoop 
> dependencies but includes SparkR. This is because R/check_cran.sh builds the R 
> documentation, which depends on the hadoop dependencies in the 
> assembly/target/scala-xxx/jars folder.
> To reproduce:
> MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive 
> -Psparkr -Phadoop-provided"
> ./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS
>  
> Error:
> * creating vignettes ... ERROR
> ...
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided

2018-11-09 Thread shanyu zhao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated SPARK-25999:

Summary: make-distribution.sh failure with --r and -Phadoop-provided  (was: 
Spark make-distribution failure with --r and -Phadoop-provided)

> make-distribution.sh failure with --r and -Phadoop-provided
> ---
>
> Key: SPARK-25999
> URL: https://issues.apache.org/jira/browse/SPARK-25999
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shanyu zhao
>Priority: Major
>
> It is not possible to build a distribution that doesn't contain hadoop 
> dependencies but includes SparkR. This is because R/check_cran.sh builds the R 
> documentation, which depends on the hadoop dependencies in the 
> assembly/target/scala-xxx/jars folder.
> To reproduce:
> MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive 
> -Psparkr -Phadoop-provided"
> ./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS
>  
> Error:
> * creating vignettes ... ERROR
> ...
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25999) Spark make-distribution failure with --r and -Phadoop-provided

2018-11-09 Thread shanyu zhao (JIRA)
shanyu zhao created SPARK-25999:
---

 Summary: Spark make-distribution failure with --r and 
-Phadoop-provided
 Key: SPARK-25999
 URL: https://issues.apache.org/jira/browse/SPARK-25999
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.0, 2.3.2
Reporter: shanyu zhao


It is not possible to build a distribution that doesn't contain hadoop 
dependencies but includes SparkR. This is because R/check_cran.sh builds the R 
documentation, which depends on the hadoop dependencies in the 
assembly/target/scala-xxx/jars folder.

To reproduce:

MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive -Psparkr 
-Phadoop-provided"

./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS

 

Error:
* creating vignettes ... ERROR
...
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24975) Spark history server REST API /api/v1/version returns error 404

2018-07-30 Thread shanyu zhao (JIRA)
shanyu zhao created SPARK-24975:
---

 Summary: Spark history server REST API /api/v1/version returns 
error 404
 Key: SPARK-24975
 URL: https://issues.apache.org/jira/browse/SPARK-24975
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1, 2.3.0
Reporter: shanyu zhao


Spark history server REST API provides /api/v1/version, according to doc:

[https://spark.apache.org/docs/latest/monitoring.html]

However, for Spark 2.3, we see:
{code:java}
curl http://localhost:18080/api/v1/version



Error 404 Not Found

HTTP ERROR 404
Problem accessing /api/v1/version. Reason:
    Not Found
Powered by Jetty:// 9.3.z-SNAPSHOT


{code}
On a Spark 2.2 cluster, we see:
{code:java}
curl http://localhost:18080/api/v1/version
{
"spark" : "2.2.0"
}{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9514) Add EventHubsReceiver to support Spark Streaming using Azure EventHubs

2015-08-03 Thread shanyu zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652883#comment-14652883
 ] 

shanyu zhao commented on SPARK-9514:


Thanks [~CodingCat], I've created the pull request here:
https://github.com/apache/spark/pull/7914


 Add EventHubsReceiver to support Spark Streaming using Azure EventHubs
 --

 Key: SPARK-9514
 URL: https://issues.apache.org/jira/browse/SPARK-9514
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.1
Reporter: shanyu zhao
 Fix For: 1.5.0

 Attachments: SPARK-9514.patch


 We need to add EventHubsReceiver implementation to support Spark Streaming 
 applications that receive data from Azure EventHubs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9514) Add EventHubsReceiver to support Spark Streaming using Azure EventHubs

2015-08-01 Thread shanyu zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated SPARK-9514:
---
Attachment: SPARK-9514.patch

Patch attached.

I put EventHubsReceiver in the external folder and added an example to the 
examples project.
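For context, a custom Spark Streaming receiver follows the Receiver API; a 
minimal skeleton (illustrative only, not the attached patch) looks like this:
{code:scala}
// Minimal custom-receiver skeleton; the EventHubs client calls are omitted.
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class EventHubsReceiverSketch(connectionInfo: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Receive on a separate thread so onStart() returns immediately.
    new Thread("EventHubs Receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          // Here the real receiver would read events from EventHubs;
          // store() hands each record to Spark Streaming.
          store("event payload placeholder")
          Thread.sleep(100)
        }
      }
    }.start()
  }

  override def onStop(): Unit = {
    // The receiving thread checks isStopped() and exits on its own.
  }
}
{code}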

 Add EventHubsReceiver to support Spark Streaming using Azure EventHubs
 --

 Key: SPARK-9514
 URL: https://issues.apache.org/jira/browse/SPARK-9514
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.1
Reporter: shanyu zhao
 Fix For: 1.5.0

 Attachments: SPARK-9514.patch


 We need to add EventHubsReceiver implementation to support Spark Streaming 
 applications that receive data from Azure EventHubs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9514) Add EventHubsReceiver to support Spark Streaming using Azure EventHubs

2015-07-31 Thread shanyu zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated SPARK-9514:
---
Shepherd: shanyu zhao

 Add EventHubsReceiver to support Spark Streaming using Azure EventHubs
 --

 Key: SPARK-9514
 URL: https://issues.apache.org/jira/browse/SPARK-9514
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.1
Reporter: shanyu zhao
 Fix For: 1.5.0


 We need to add EventHubsReceiver implementation to support Spark Streaming 
 applications that receive data from Azure EventHubs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9514) Add EventHubsReceiver to support Spark Streaming using Azure EventHubs

2015-07-31 Thread shanyu zhao (JIRA)
shanyu zhao created SPARK-9514:
--

 Summary: Add EventHubsReceiver to support Spark Streaming using 
Azure EventHubs
 Key: SPARK-9514
 URL: https://issues.apache.org/jira/browse/SPARK-9514
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.1
Reporter: shanyu zhao
 Fix For: 1.5.0


We need to add EventHubsReceiver implementation to support Spark Streaming 
applications that receive data from Azure EventHubs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org