[jira] [Commented] (SPARK-34684) Hadoop config could not be successfully serialized from driver pods to executor pods
[ https://issues.apache.org/jira/browse/SPARK-34684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307375#comment-17307375 ] shanyu zhao commented on SPARK-34684: - [~attilapiros] What if we want to connect to HDFS HA with something like hdfs://mycluster/...? We need to deliver hdfs-site.xml to the driver or executor pods, right? Also we'd like to control the storage client's (Hadoop file system) behavior with the configuration file. > Hadoop config could not be successfully serialized from driver pods to > executor pods > --- > > Key: SPARK-34684 > URL: https://issues.apache.org/jira/browse/SPARK-34684 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.1, 3.0.2 >Reporter: Yue Peng >Priority: Major > > I have set HADOOP_CONF_DIR correctly, and I have verified that the Hadoop configs > have been stored into a configmap and mounted to the driver. However, the Spark Pi > example job keeps failing because the executors do not know how to talk to HDFS. I > highly suspect that there is a bug causing it, as manually creating a configmap > storing the Hadoop configs and mounting it to the executor in the template file > fixes the error. > > Spark submit command: > /opt/spark-3.0/bin/spark-submit --class org.apache.spark.examples.SparkPi > --deploy-mode cluster --master k8s://https://10.***.18.96:6443 > --num-executors 1 --conf spark.kubernetes.namespace=test --conf > spark.kubernetes.container.image= --conf > spark.kubernetes.driver.podTemplateFile=/opt/spark-3.0/conf/spark-driver.template > --conf > spark.kubernetes.executor.podTemplateFile=/opt/spark-3.0/conf/spark-executor.template > --conf spark.kubernetes.file.upload.path=/opt/spark-3.0/examples/jars > hdfs:///tmp/spark-examples_2.12-3.0.125067.jar 1000 > > > Error log: > > 21/03/10 06:59:58 INFO TransportClientFactory: Successfully created > connection to > org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078 > after 608 ms (392 ms spent in bootstraps) > 21/03/10 06:59:58 INFO SecurityManager: Changing view acls to: root > 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls to: root > 21/03/10 06:59:58 INFO SecurityManager: Changing view acls groups to: > 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls groups to: > 21/03/10 06:59:58 INFO SecurityManager: SecurityManager: authentication > enabled; ui acls disabled; users with view permissions: Set(root); groups > with view permissions: Set(); users with modify permissions: Set(root); > groups with modify permissions: Set() > 21/03/10 06:59:59 INFO TransportClientFactory: Successfully created > connection to > org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078 > after 130 ms (104 ms spent in bootstraps) > 21/03/10 06:59:59 INFO DiskBlockManager: Created local directory at > /var/data/spark-0f541e3d-994f-4c7a-843f-f7dac57dfc13/blockmgr-981cfb62-5b27-4d1a-8fbd-eddb466faf1d > 21/03/10 06:59:59 INFO MemoryStore: MemoryStore started with capacity 2047.2 > MiB > 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Connecting to driver: > spark://coarsegrainedschedu...@org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc:7078 > 21/03/10 06:59:59 INFO ResourceUtils: > == > 21/03/10 06:59:59 INFO ResourceUtils: Resources for spark.executor: > 21/03/10 06:59:59 INFO ResourceUtils: > == > 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Successfully registered > with driver > 21/03/10 06:59:59 INFO Executor: Starting executor ID 1 on host 100.64.0.192
> 21/03/10 07:00:00 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37956. > 21/03/10 07:00:00 INFO NettyBlockTransferService: Server created on > 100.64.0.192:37956 > 21/03/10 07:00:00 INFO BlockManager: Using > org.apache.spark.storage.RandomBlockReplicationPolicy for block replication > policy > 21/03/10 07:00:00 INFO BlockManagerMaster: Registering BlockManager > BlockManagerId(1, 100.64.0.192, 37956, None) > 21/03/10 07:00:00 INFO BlockManagerMaster: Registered BlockManager > BlockManagerId(1, 100.64.0.192, 37956, None) > 21/03/10 07:00:00 INFO BlockManager: Initialized BlockManager: > BlockManagerId(1, 100.64.0.192, 37956, None) > 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 0 > 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 1 > 21/03/10 07:00:01 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) > 21/03/10 07:00:01 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) > 21/03/10 07:00:01 INFO Executor: Fetching >
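As a hedged illustration of the HDFS HA scenario raised in the comment above: Spark copies any property prefixed with spark.hadoop.* into the Hadoop Configuration used on both the driver and the executors, so the relevant hdfs-site.xml entries can be passed as configs instead of a mounted file. A minimal sketch; the nameservice and host names below are hypothetical, not from this issue:

{code:scala}
import org.apache.spark.sql.SparkSession

// Each spark.hadoop.* entry below ends up in the executors' Hadoop
// Configuration; names and hosts are placeholders.
val spark = SparkSession.builder()
  .appName("hdfs-ha-config-sketch")
  .config("spark.hadoop.dfs.nameservices", "mycluster")
  .config("spark.hadoop.dfs.ha.namenodes.mycluster", "nn1,nn2")
  .config("spark.hadoop.dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020")
  .config("spark.hadoop.dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020")
  .config("spark.hadoop.dfs.client.failover.proxy.provider.mycluster",
    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
  .getOrCreate()

// hdfs://mycluster/... paths now resolve through the HA client settings.
val df = spark.read.text("hdfs://mycluster/tmp/input")
{code}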
[jira] [Commented] (SPARK-30536) Sort-merge join operator spilling performance improvements
[ https://issues.apache.org/jira/browse/SPARK-30536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054555#comment-17054555 ] shanyu zhao commented on SPARK-30536: - Uploaded two slides to explain the optimization idea of this PR. > Sort-merge join operator spilling performance improvements > -- > > Key: SPARK-30536 > URL: https://issues.apache.org/jira/browse/SPARK-30536 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sinisa Knezevic >Priority: Major > Attachments: spark-30536-explained.pdf > > > Testing with the TPC-DS 100 TB benchmark data set showed that some of the SQLs > (for example, query 14) are not able to run even with extremely large Spark > executor memory. The Spark spilling feature has to be enabled in order to process > these SQLs, and processing becomes extremely slow when spilling is enabled. The > Spark spilling feature is enabled via two parameters: > "spark.sql.sortMergeJoinExec.buffer.in.memory.threshold" and > "spark.sql.sortMergeJoinExec.buffer.spill.threshold". > "spark.sql.sortMergeJoinExec.buffer.in.memory.threshold" – when this > threshold is reached, the data will be moved from the > ExternalAppendOnlyUnsafeRowArray object into an UnsafeExternalSorter object. > "spark.sql.sortMergeJoinExec.buffer.spill.threshold" – when this threshold is > reached, the data will be spilled from the UnsafeExternalSorter object onto the > disk. > > During execution of a sort-merge join (Left Semi Join), for each left join row > the "right matches" are found and stored into an ExternalAppendOnlyUnsafeRowArray > object. In the case of query 14 there are millions of rows of "right matches". > To run this query, spilling is enabled and data is moved from > ExternalAppendOnlyUnsafeRowArray into UnsafeExternalSorter and then spilled > onto the disk. When a million rows are processed on the left side of the join, an > iterator on top of the spilled "right matches" rows is created each time. This > means the iterator over the right matches (spilled on the disk) is created > millions of times. The current Spark implementation creates the iterator on top > of the spilled rows and performs I/O, which results in millions of I/Os when a > million rows are processed. > > To avoid this performance bottleneck, this JIRA introduces the following solution: > 1. Implement lazy initialization of UnsafeSorterSpillReader - the iterator on top > of spilled rows: > … During SortMergeJoin (Left Semi Join) execution, the iterator on the > spill data is created but no iteration over the data is done. > ... Having lazy initialization of UnsafeSorterSpillReader enables > efficient processing of SortMergeJoin even if data is spilled onto disk. > Unnecessary I/O is avoided. > 2. Decrease the initial memory read buffer size in UnsafeSorterSpillReader from > 1MB to 1KB: > … The UnsafeSorterSpillReader constructor takes a lot of time due to the > default 1MB memory read buffer. > … The code already has logic to increase the memory read buffer if it > cannot fit the data, so decreasing the size to 1KB is safe and has a positive > performance impact. > 3. Improve memory utilization when spilling is enabled in > ExternalAppendOnlyUnsafeRowArray: > … In the current implementation, when spilling is enabled, an > UnsafeExternalSorter object is created, the data is moved from the > ExternalAppendOnlyUnsafeRowArray object into the UnsafeExternalSorter, and then > the ExternalAppendOnlyUnsafeRowArray object is emptied. Just before the > ExternalAppendOnlyUnsafeRowArray object is emptied, both objects are in the > memory with the same data. That requires double the memory and duplicates the > data. This can be avoided. > … In the proposed solution, when > spark.sql.sortMergeJoinExec.buffer.in.memory.threshold is reached, adding new > rows into the ExternalAppendOnlyUnsafeRowArray object stops. An UnsafeExternalSorter > object is created and new rows are added into this object. The > ExternalAppendOnlyUnsafeRowArray object retains all rows already added into > this object. This approach enables better memory utilization and avoids > unnecessary movement of data from one object into another. > > Testing this solution with query 14 and spilling to disk enabled showed a 500X > performance improvement and did not degrade the performance of the other SQLs > from the TPC-DS benchmark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
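A minimal sketch of optimization 1 above (lazy initialization of the spill reader), not the actual UnsafeSorterSpillReader code: the spill file is opened only on first access, so a left-side row whose spilled "right matches" are never iterated costs no I/O. The file path in the usage line is a placeholder.

{code:scala}
// Wraps an expensive-to-open iterator (e.g. one backed by a spill file) so
// that opening is deferred until the first hasNext()/next() call.
final class LazySpillIterator[T](open: () => Iterator[T]) extends Iterator[T] {
  private lazy val underlying: Iterator[T] = open() // runs once, on first access
  override def hasNext: Boolean = underlying.hasNext
  override def next(): T = underlying.next()
}

// Usage sketch: constructing the iterator does no file I/O by itself.
val it = new LazySpillIterator(() => scala.io.Source.fromFile("/tmp/spill.bin").getLines())
{code}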
[jira] [Updated] (SPARK-30536) Sort-merge join operator spilling performance improvements
[ https://issues.apache.org/jira/browse/SPARK-30536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shanyu zhao updated SPARK-30536: Attachment: spark-30536-explained.pdf > Sort-merge join operator spilling performance improvements > -- > > Key: SPARK-30536 > URL: https://issues.apache.org/jira/browse/SPARK-30536 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sinisa Knezevic >Priority: Major > Attachments: spark-30536-explained.pdf > > > Testing with the TPC-DS 100 TB benchmark data set showed that some of the SQLs > (for example, query 14) are not able to run even with extremely large Spark > executor memory. The Spark spilling feature has to be enabled in order to process > these SQLs, and processing becomes extremely slow when spilling is enabled. The > Spark spilling feature is enabled via two parameters: > "spark.sql.sortMergeJoinExec.buffer.in.memory.threshold" and > "spark.sql.sortMergeJoinExec.buffer.spill.threshold". > "spark.sql.sortMergeJoinExec.buffer.in.memory.threshold" – when this > threshold is reached, the data will be moved from the > ExternalAppendOnlyUnsafeRowArray object into an UnsafeExternalSorter object. > "spark.sql.sortMergeJoinExec.buffer.spill.threshold" – when this threshold is > reached, the data will be spilled from the UnsafeExternalSorter object onto the > disk. > > During execution of a sort-merge join (Left Semi Join), for each left join row > the "right matches" are found and stored into an ExternalAppendOnlyUnsafeRowArray > object. In the case of query 14 there are millions of rows of "right matches". > To run this query, spilling is enabled and data is moved from > ExternalAppendOnlyUnsafeRowArray into UnsafeExternalSorter and then spilled > onto the disk. When a million rows are processed on the left side of the join, an > iterator on top of the spilled "right matches" rows is created each time. This > means the iterator over the right matches (spilled on the disk) is created > millions of times. The current Spark implementation creates the iterator on top > of the spilled rows and performs I/O, which results in millions of I/Os when a > million rows are processed. > > To avoid this performance bottleneck, this JIRA introduces the following solution: > 1. Implement lazy initialization of UnsafeSorterSpillReader - the iterator on top > of spilled rows: > … During SortMergeJoin (Left Semi Join) execution, the iterator on the > spill data is created but no iteration over the data is done. > ... Having lazy initialization of UnsafeSorterSpillReader enables > efficient processing of SortMergeJoin even if data is spilled onto disk. > Unnecessary I/O is avoided. > 2. Decrease the initial memory read buffer size in UnsafeSorterSpillReader from > 1MB to 1KB: > … The UnsafeSorterSpillReader constructor takes a lot of time due to the > default 1MB memory read buffer. > … The code already has logic to increase the memory read buffer if it > cannot fit the data, so decreasing the size to 1KB is safe and has a positive > performance impact. > 3. Improve memory utilization when spilling is enabled in > ExternalAppendOnlyUnsafeRowArray: > … In the current implementation, when spilling is enabled, an > UnsafeExternalSorter object is created, the data is moved from the > ExternalAppendOnlyUnsafeRowArray object into the UnsafeExternalSorter, and then > the ExternalAppendOnlyUnsafeRowArray object is emptied. Just before the > ExternalAppendOnlyUnsafeRowArray object is emptied, both objects are in the > memory with the same data. That requires double the memory and duplicates the > data. This can be avoided. > … In the proposed solution, when > spark.sql.sortMergeJoinExec.buffer.in.memory.threshold is reached, adding new > rows into the ExternalAppendOnlyUnsafeRowArray object stops. An UnsafeExternalSorter > object is created and new rows are added into this object. The > ExternalAppendOnlyUnsafeRowArray object retains all rows already added into > this object. This approach enables better memory utilization and avoids > unnecessary movement of data from one object into another. > > Testing this solution with query 14 and spilling to disk enabled showed a 500X > performance improvement and did not degrade the performance of the other SQLs > from the TPC-DS benchmark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31029) Occasional class not found error in user's Future code using global ExecutionContext
[ https://issues.apache.org/jira/browse/SPARK-31029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shanyu zhao updated SPARK-31029: Description: *Problem:* When running the TPC-DS test (https://github.com/databricks/spark-sql-perf), occasionally we see an error related to class not found: 2020-02-04 20:00:26,673 ERROR yarn.ApplicationMaster: User class threw exception: scala.ScalaReflectionException: class com.databricks.spark.sql.perf.ExperimentRun in JavaMirror with sun.misc.Launcher$AppClassLoader@28ba21f3 of type class sun.misc.Launcher$AppClassLoader with classpath [...] and parent being sun.misc.Launcher$ExtClassLoader@3ff5d147 of type class sun.misc.Launcher$ExtClassLoader with classpath [...] and parent being primordial classloader with boot classpath [...] not found. *Root cause:* The Spark driver starts the ApplicationMaster in the main thread, which starts a user thread and sets a MutableURLClassLoader as that thread's ContextClassLoader. userClassThread = startUserApplication() The main thread then sets up the YarnSchedulerBackend RPC endpoints, which handle these calls using Scala Futures with the default global ExecutionContext: - doRequestTotalExecutors - doKillExecutors If the main thread starts a future to handle doKillExecutors() before the user thread does, then the default thread pool thread's ContextClassLoader will be the default (AppClassLoader). If the user thread starts a future first, then the thread pool thread will have the MutableURLClassLoader. So if the user's code uses a future that references a user-provided class (which only the MutableURLClassLoader can load), and executors are lost before that future runs, you will see errors related to class not found. *Proposed Solution:* We can potentially solve this problem in one of two ways: 1) Set the same class loader (userClassLoader) on both the main thread and the user thread in ApplicationMaster.scala 2) Do not use "ExecutionContext.Implicits.global" in YarnSchedulerBackend was: *Problem:* When running the TPC-DS test (https://github.com/databricks/spark-sql-perf), occasionally we see an error related to class not found: 2020-02-04 20:00:26,673 ERROR yarn.ApplicationMaster: User class threw exception: scala.ScalaReflectionException: class com.databricks.spark.sql.perf.ExperimentRun in JavaMirror with sun.misc.Launcher$AppClassLoader@28ba21f3 of type class sun.misc.Launcher$AppClassLoader with classpath [...] and parent being sun.misc.Launcher$ExtClassLoader@3ff5d147 of type class sun.misc.Launcher$ExtClassLoader with classpath [...] and parent being primordial classloader with boot classpath [...] not found. *Root cause:* The Spark driver starts the ApplicationMaster in the main thread, which starts a user thread and sets a MutableURLClassLoader as that thread's ContextClassLoader. userClassThread = startUserApplication() The main thread then sets up the YarnSchedulerBackend RPC endpoints, which handle these calls using Scala Futures with the default global ExecutionContext: - doRequestTotalExecutors - doKillExecutors If the main thread starts a future to handle doKillExecutors() before the user thread does, then the default thread pool thread's ContextClassLoader will be the default (AppClassLoader). If the user thread starts a future first, then the thread pool thread will have the MutableURLClassLoader. So if the user's code uses a future that references a user-provided class (which only the MutableURLClassLoader can load), and executors are lost before that future runs, you will see errors related to class not found. *Proposed Solution:* Set the same class loader (userClassLoader) on both the main thread and the user thread in ApplicationMaster.scala > Occasional class not found error in user's Future code using global > ExecutionContext > > > Key: SPARK-31029 > URL: https://issues.apache.org/jira/browse/SPARK-31029 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.4.5 >Reporter: shanyu zhao >Priority: Major > > *Problem:* > When running the TPC-DS test (https://github.com/databricks/spark-sql-perf), > occasionally we see an error related to class not found: > 2020-02-04 20:00:26,673 ERROR yarn.ApplicationMaster: User class threw > exception: scala.ScalaReflectionException: class > com.databricks.spark.sql.perf.ExperimentRun in JavaMirror with > sun.misc.Launcher$AppClassLoader@28ba21f3 of type class > sun.misc.Launcher$AppClassLoader with classpath [...] > and parent being sun.misc.Launcher$ExtClassLoader@3ff5d147 of type class > sun.misc.Launcher$ExtClassLoader with classpath [...] > and parent being primordial classloader with boot classpath [...] not found. > *Root cause:* > The Spark driver starts the ApplicationMaster in the main thread, which starts a user > thread and sets
[jira] [Created] (SPARK-31029) Occasional class not found error in user's Future code using global ExecutionContext
shanyu zhao created SPARK-31029: --- Summary: Occasional class not found error in user's Future code using global ExecutionContext Key: SPARK-31029 URL: https://issues.apache.org/jira/browse/SPARK-31029 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.4.5 Reporter: shanyu zhao *Problem:* When running the TPC-DS test (https://github.com/databricks/spark-sql-perf), occasionally we see an error related to class not found: 2020-02-04 20:00:26,673 ERROR yarn.ApplicationMaster: User class threw exception: scala.ScalaReflectionException: class com.databricks.spark.sql.perf.ExperimentRun in JavaMirror with sun.misc.Launcher$AppClassLoader@28ba21f3 of type class sun.misc.Launcher$AppClassLoader with classpath [...] and parent being sun.misc.Launcher$ExtClassLoader@3ff5d147 of type class sun.misc.Launcher$ExtClassLoader with classpath [...] and parent being primordial classloader with boot classpath [...] not found. *Root cause:* The Spark driver starts the ApplicationMaster in the main thread, which starts a user thread and sets a MutableURLClassLoader as that thread's ContextClassLoader. userClassThread = startUserApplication() The main thread then sets up the YarnSchedulerBackend RPC endpoints, which handle these calls using Scala Futures with the default global ExecutionContext: - doRequestTotalExecutors - doKillExecutors If the main thread starts a future to handle doKillExecutors() before the user thread does, then the default thread pool thread's ContextClassLoader will be the default (AppClassLoader). If the user thread starts a future first, then the thread pool thread will have the MutableURLClassLoader. So if the user's code uses a future that references a user-provided class (which only the MutableURLClassLoader can load), and executors are lost before that future runs, you will see errors related to class not found. *Proposed Solution:* Set the same class loader (userClassLoader) on both the main thread and the user thread in ApplicationMaster.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
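A hedged sketch of the second proposed direction (avoiding ExecutionContext.Implicits.global): give the backend a dedicated pool whose threads are created with the desired context classloader. The thread name and helper are made up for illustration:

{code:scala}
import java.util.concurrent.{Executors, ThreadFactory}
import scala.concurrent.ExecutionContext

// Capture the classloader of the thread that builds the pool (e.g. the user
// class thread holding the MutableURLClassLoader) and stamp it on every
// worker thread, so Futures never run under the bare AppClassLoader.
def classLoaderAwareContext(): ExecutionContext = {
  val loader = Thread.currentThread().getContextClassLoader
  val factory = new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "yarn-scheduler-ask")
      t.setDaemon(true)
      t.setContextClassLoader(loader)
      t
    }
  }
  ExecutionContext.fromExecutorService(Executors.newCachedThreadPool(factory))
}
{code}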
[jira] [Updated] (SPARK-31028) Add "-XX:ActiveProcessorCount" to Spark driver and executor in Yarn mode
[ https://issues.apache.org/jira/browse/SPARK-31028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shanyu zhao updated SPARK-31028: Description: When starting Spark driver and executors on a Yarn cluster, the JVM process can discover all CPU cores on the system and size its thread pools or GC threads based on that value. We should limit what the JVM sees to the number of cores set by the user (spark.driver.cores or spark.executor.cores) via "-XX:ActiveProcessorCount", which was introduced in Java 8u191. Especially when running Spark on Yarn inside a Kubernetes container, the number of CPU cores discovered is sometimes 1, which means the JVM always uses 1 thread in the default thread pool or for GC. was: When starting Spark driver and executors on a Yarn cluster, the JVM process can discover all CPU cores on the system and size its thread pools or GC threads based on that value. We should limit what the JVM sees to the number of cores set by the user (spark.driver.cores or spark.executor.cores) via "-XX:ActiveProcessorCount", which was introduced in Java 8u191. > Add "-XX:ActiveProcessorCount" to Spark driver and executor in Yarn mode > > > Key: SPARK-31028 > URL: https://issues.apache.org/jira/browse/SPARK-31028 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.4.5 >Reporter: shanyu zhao >Priority: Major > > When starting Spark driver and executors on a Yarn cluster, the JVM process can > discover all CPU cores on the system and size its thread pools or GC threads based > on that value. We should limit what the JVM sees to the number of cores set > by the user (spark.driver.cores or spark.executor.cores) via > "-XX:ActiveProcessorCount", which was introduced in Java 8u191. > Especially when running Spark on Yarn inside a Kubernetes container, the number > of CPU cores discovered is sometimes 1, which means the JVM always uses 1 thread in > the default thread pool or for GC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31028) Add "-XX:ActiveProcessorCount" to Spark driver and executor in Yarn mode
shanyu zhao created SPARK-31028: --- Summary: Add "-XX:ActiveProcessorCount" to Spark driver and executor in Yarn mode Key: SPARK-31028 URL: https://issues.apache.org/jira/browse/SPARK-31028 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.4.5 Reporter: shanyu zhao When starting Spark driver and executors on a Yarn cluster, the JVM process can discover all CPU cores on the system and size its thread pools or GC threads based on that value. We should limit what the JVM sees to the number of cores set by the user (spark.driver.cores or spark.executor.cores) via "-XX:ActiveProcessorCount", which was introduced in Java 8u191. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
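A hedged sketch of what the proposal would automate, written as it could be done manually today (assuming a JVM of 8u191 or later, so the flag is recognized); the core count is an illustrative value:

{code:scala}
import org.apache.spark.SparkConf

// Align the JVM's visible processor count with spark.executor.cores so
// default thread pools and GC threads are sized consistently.
val executorCores = 4
val conf = new SparkConf()
  .set("spark.executor.cores", executorCores.toString)
  .set("spark.executor.extraJavaOptions", s"-XX:ActiveProcessorCount=$executorCores")

// Inside such an executor, Runtime.getRuntime.availableProcessors() reports 4
// regardless of how many cores the host (or container) actually exposes.
{code}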
[jira] [Created] (SPARK-30845) spark-submit pyspark app on yarn uploads local pyspark archives
shanyu zhao created SPARK-30845: --- Summary: spark-submit pyspark app on yarn uploads local pyspark archives Key: SPARK-30845 URL: https://issues.apache.org/jira/browse/SPARK-30845 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0 Reporter: shanyu zhao Use spark-submit to submit a pyspark app on Yarn, and set this in spark-env.sh: {code:bash} export PYSPARK_ARCHIVES_PATH=local:/opt/spark/python/lib/pyspark.zip,local:/opt/spark/python/lib/py4j-0.10.7-src.zip {code} You can see that these local archives are still uploaded to the Yarn distributed cache: yarn.Client: Uploading resource file:/opt/spark/python/lib/pyspark.zip -> hdfs://myhdfs/user/test1/.sparkStaging/application_1581024490249_0001/pyspark.zip -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
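For reference, a minimal sketch (with a hypothetical helper name, not the real Client.scala code) of the scheme check a fix would need in the YARN client's upload path: resources whose URI uses the local: scheme are already present on every node and should not be shipped to the distributed cache.

{code:scala}
import java.net.URI

// "local:" marks a path that exists on each node's local filesystem, so
// uploading it to HDFS is redundant.
def needsUpload(pathStr: String): Boolean = {
  val scheme = new URI(pathStr).getScheme
  scheme == null || scheme != "local"
}

assert(!needsUpload("local:/opt/spark/python/lib/pyspark.zip")) // skip upload
assert(needsUpload("file:/opt/spark/python/lib/pyspark.zip"))   // must upload
{code}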
[jira] [Commented] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency
[ https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030114#comment-17030114 ] shanyu zhao commented on SPARK-30602: - Thanks for the effort, Min! Riffle seems to only do map-side, worker-level merges and doesn't do push-based shuffle, and it seems simpler to implement. I wonder what benefit "push based shuffle" brings on top of Riffle's merge approach in terms of perf and scalability. I can imagine "push based shuffle" is more "responsive" by streamlining mappers and reducers; could this be a separate effort then? > SPIP: Support push-based shuffle to improve shuffle efficiency > -- > > Key: SPARK-30602 > URL: https://issues.apache.org/jira/browse/SPARK-30602 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: Min Shen >Priority: Major > > In a large deployment of a Spark compute infrastructure, Spark shuffle is > becoming a potential scaling bottleneck and a source of inefficiency in the > cluster. When doing Spark on YARN for a large-scale deployment, people > usually enable Spark external shuffle service and store the intermediate > shuffle files on HDD. Because the number of blocks generated for a particular > shuffle grows quadratically compared to the size of shuffled data (# mappers > and reducers grows linearly with the size of shuffled data, but # blocks is # > mappers * # reducers), one general trend we have observed is that the more > data a Spark application processes, the smaller the block size becomes. In a > few production clusters we have seen, the average shuffle block size is only > 10s of KBs. Because of the inefficiency of performing random reads on HDD for > small amounts of data, the overall efficiency of the Spark external shuffle > services serving the shuffle blocks degrades as we see an increasing # of > Spark applications processing an increasing amount of data. In addition, > because Spark external shuffle service is a shared service in a multi-tenancy > cluster, the inefficiency with one Spark application could propagate to other > applications as well. > In this ticket, we propose a solution to improve Spark shuffle efficiency in > the above-mentioned environments with push-based shuffle. With push-based > shuffle, shuffle is performed at the end of mappers and blocks get pre-merged > and move towards reducers. In our prototype implementation, we have seen > significant efficiency improvements when performing large shuffles. We take a > Spark-native approach to achieve this, i.e., extending Spark’s existing > shuffle netty protocol, and the behaviors of Spark mappers, reducers and > drivers. This way, we can bring the benefits of more efficient shuffle in > Spark without incurring the dependency or overhead of either specialized > storage layer or external infrastructure pieces. > > Link to dev mailing list discussion: > http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
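To make the quadratic block-growth claim in the SPIP concrete, a back-of-envelope calculation (the figures are hypothetical, not from the ticket):

{code:scala}
// Blocks grow as mappers * reducers while data grows linearly, so the
// average block shrinks as a job scales up.
val mappers      = 10000
val reducers     = 10000
val shuffledData = 10L * 1024 * 1024 * 1024 * 1024 // 10 TB shuffled
val blocks       = mappers.toLong * reducers       // 100 million blocks
val avgBlock     = shuffledData / blocks           // ~110 KB per block
// Each such block becomes one small random read on the shuffle service's HDDs.
{code}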
[jira] [Commented] (SPARK-29003) Spark history server startup hang due to deadlock
[ https://issues.apache.org/jira/browse/SPARK-29003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16924567#comment-16924567 ] shanyu zhao commented on SPARK-29003: - Please see the full jstack attached. > Spark history server startup hang due to deadlock > - > > Key: SPARK-29003 > URL: https://issues.apache.org/jira/browse/SPARK-29003 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: shanyu zhao >Priority: Major > Attachments: sparkhistory-jstack.log > > > Occasionally when starting the Spark History Server, the service process will > hang before binding to the port, so the Spark History Server is not usable. One > has to kill the process and start it again. You can write a simple bash program > to stop and start the Spark History Server and reproduce this problem > approximately 10% of the time. > The problem is that java.nio.file.FileSystems.getDefault() causes a deadlock. > This is what I collected with jstack: > {code:java} > "log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x7fca90028800 > nid=0x6e8 in Object.wait() [0x7fcaa9471000] > java.lang.Thread.State: RUNNABLE > at java.nio.file.FileSystems.getDefault(FileSystems.java:176) > ... > at java.lang.Runtime.loadLibrary0(Runtime.java:870) - locked > <0xaaac1d40> (a java.lang.Runtime) > ... > at > org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698) > "main" #1 prio=5 os_prio=0 tid=0x7fcad8016800 nid=0x6d8 waiting for > monitor entry [0x7fcae146c000] > java.lang.Thread.State: BLOCKED (on object monitor) > at java.lang.Runtime.loadLibrary0(Runtime.java:862) - waiting to lock > <0xaaac1d40> (a java.lang.Runtime) > ... > at java.nio.file.FileSystems.getDefault(FileSystems.java:176) > at java.io.File.toPath(File.java:2234) - locked <0x8699bb68> (a > java.io.File) > ... > at > org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365){code} > Basically the "main" thread and the "log-replay-executor-0" thread simultaneously > call java.nio.file.FileSystems.getDefault() and deadlock. > This is similar to the reported JDK bug: > [https://bugs.openjdk.java.net/browse/JDK-8037567] > The problem is that during Spark History Server startup, two things > happen simultaneously that both call into > java.nio.file.FileSystems.getDefault(): > 1) start the jetty server > 2) start the ApplicationHistoryProvider (which reads files from HDFS) > We should do these two things sequentially instead of in parallel. > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29003) Spark history server startup hang due to deadlock
[ https://issues.apache.org/jira/browse/SPARK-29003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shanyu zhao updated SPARK-29003: Attachment: sparkhistory-jstack.log > Spark history server startup hang due to deadlock > - > > Key: SPARK-29003 > URL: https://issues.apache.org/jira/browse/SPARK-29003 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: shanyu zhao >Priority: Major > Attachments: sparkhistory-jstack.log > > > Occasionally when starting the Spark History Server, the service process will > hang before binding to the port, so the Spark History Server is not usable. One > has to kill the process and start it again. You can write a simple bash program > to stop and start the Spark History Server and reproduce this problem > approximately 10% of the time. > The problem is that java.nio.file.FileSystems.getDefault() causes a deadlock. > This is what I collected with jstack: > {code:java} > "log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x7fca90028800 > nid=0x6e8 in Object.wait() [0x7fcaa9471000] > java.lang.Thread.State: RUNNABLE > at java.nio.file.FileSystems.getDefault(FileSystems.java:176) > ... > at java.lang.Runtime.loadLibrary0(Runtime.java:870) - locked > <0xaaac1d40> (a java.lang.Runtime) > ... > at > org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698) > "main" #1 prio=5 os_prio=0 tid=0x7fcad8016800 nid=0x6d8 waiting for > monitor entry [0x7fcae146c000] > java.lang.Thread.State: BLOCKED (on object monitor) > at java.lang.Runtime.loadLibrary0(Runtime.java:862) - waiting to lock > <0xaaac1d40> (a java.lang.Runtime) > ... > at java.nio.file.FileSystems.getDefault(FileSystems.java:176) > at java.io.File.toPath(File.java:2234) - locked <0x8699bb68> (a > java.io.File) > ... > at > org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365){code} > Basically the "main" thread and the "log-replay-executor-0" thread simultaneously > call java.nio.file.FileSystems.getDefault() and deadlock. > This is similar to the reported JDK bug: > [https://bugs.openjdk.java.net/browse/JDK-8037567] > The problem is that during Spark History Server startup, two things > happen simultaneously that both call into > java.nio.file.FileSystems.getDefault(): > 1) start the jetty server > 2) start the ApplicationHistoryProvider (which reads files from HDFS) > We should do these two things sequentially instead of in parallel. > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29003) Spark history server startup hang due to deadlock
[ https://issues.apache.org/jira/browse/SPARK-29003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shanyu zhao updated SPARK-29003: Description: Occasionally when starting the Spark History Server, the service process will hang before binding to the port, so the Spark History Server is not usable. One has to kill the process and start it again. You can write a simple bash program to stop and start the Spark History Server and reproduce this problem approximately 10% of the time. The problem is that java.nio.file.FileSystems.getDefault() causes a deadlock. This is what I collected with jstack: {code:java} "log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x7fca90028800 nid=0x6e8 in Object.wait() [0x7fcaa9471000] java.lang.Thread.State: RUNNABLE at java.nio.file.FileSystems.getDefault(FileSystems.java:176) ... at java.lang.Runtime.loadLibrary0(Runtime.java:870) - locked <0xaaac1d40> (a java.lang.Runtime) ... at org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698) "main" #1 prio=5 os_prio=0 tid=0x7fcad8016800 nid=0x6d8 waiting for monitor entry [0x7fcae146c000] java.lang.Thread.State: BLOCKED (on object monitor) at java.lang.Runtime.loadLibrary0(Runtime.java:862) - waiting to lock <0xaaac1d40> (a java.lang.Runtime) ... at java.nio.file.FileSystems.getDefault(FileSystems.java:176) at java.io.File.toPath(File.java:2234) - locked <0x8699bb68> (a java.io.File) ... at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365){code} Basically the "main" thread and the "log-replay-executor-0" thread simultaneously call java.nio.file.FileSystems.getDefault() and deadlock. This is similar to the reported JDK bug: [https://bugs.openjdk.java.net/browse/JDK-8037567] The problem is that during Spark History Server startup, two things happen simultaneously that both call into java.nio.file.FileSystems.getDefault(): 1) start the jetty server 2) start the ApplicationHistoryProvider (which reads files from HDFS) We should do these two things sequentially instead of in parallel. was: Occasionally when starting the Spark History Server, the service process will hang before binding to the port, so the Spark History Server is not usable. One has to kill the process and start it again. You can write a simple bash program to stop and start the Spark History Server and reproduce this problem approximately 10% of the time. The problem is that java.nio.file.FileSystems.getDefault() causes a deadlock. This is what I collected with jstack: {code:java} "log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x7fca90028800 nid=0x6e8 in Object.wait() [0x7fcaa9471000]"log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x7fca90028800 nid=0x6e8 in Object.wait() [0x7fcaa9471000] java.lang.Thread.State: RUNNABLE at java.nio.file.FileSystems.getDefault(FileSystems.java:176) ... at java.lang.Runtime.loadLibrary0(Runtime.java:870) - locked <0xaaac1d40> (a java.lang.Runtime) ... at org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698) "main" #1 prio=5 os_prio=0 tid=0x7fcad8016800 nid=0x6d8 waiting for monitor entry [0x7fcae146c000] java.lang.Thread.State: BLOCKED (on object monitor) at java.lang.Runtime.loadLibrary0(Runtime.java:862) - waiting to lock <0xaaac1d40> (a java.lang.Runtime) ... at java.nio.file.FileSystems.getDefault(FileSystems.java:176) at java.io.File.toPath(File.java:2234) - locked <0x8699bb68> (a java.io.File) ... at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365){code} Basically the "main" thread and the "log-replay-executor-0" thread simultaneously call java.nio.file.FileSystems.getDefault() and deadlock. This is similar to the reported JDK bug: [https://bugs.openjdk.java.net/browse/JDK-8037567] The problem is that during Spark History Server startup, two things happen simultaneously that both call into java.nio.file.FileSystems.getDefault(): 1) start the jetty server 2) start the ApplicationHistoryProvider (which reads files from HDFS) We should do these two things sequentially instead of in parallel. > Spark history server startup hang due to deadlock > - > > Key: SPARK-29003 > URL: https://issues.apache.org/jira/browse/SPARK-29003 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: shanyu zhao >Priority: Major > > Occasionally when starting the Spark History Server, the service process will > hang before binding to the port, so the Spark History Server is not usable. One > has to kill the process and start it again. You can write a simple bash program > to stop and
[jira] [Created] (SPARK-29003) Spark history server startup hang due to deadlock
shanyu zhao created SPARK-29003: --- Summary: Spark history server startup hang due to deadlock Key: SPARK-29003 URL: https://issues.apache.org/jira/browse/SPARK-29003 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4 Reporter: shanyu zhao Occasionally when starting the Spark History Server, the service process will hang before binding to the port, so the Spark History Server is not usable. One has to kill the process and start it again. You can write a simple bash program to stop and start the Spark History Server and reproduce this problem approximately 10% of the time. The problem is that java.nio.file.FileSystems.getDefault() causes a deadlock. This is what I collected with jstack: {code:java} "log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x7fca90028800 nid=0x6e8 in Object.wait() [0x7fcaa9471000] java.lang.Thread.State: RUNNABLE at java.nio.file.FileSystems.getDefault(FileSystems.java:176) ... at java.lang.Runtime.loadLibrary0(Runtime.java:870) - locked <0xaaac1d40> (a java.lang.Runtime) ... at org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698) "main" #1 prio=5 os_prio=0 tid=0x7fcad8016800 nid=0x6d8 waiting for monitor entry [0x7fcae146c000] java.lang.Thread.State: BLOCKED (on object monitor) at java.lang.Runtime.loadLibrary0(Runtime.java:862) - waiting to lock <0xaaac1d40> (a java.lang.Runtime) ... at java.nio.file.FileSystems.getDefault(FileSystems.java:176) at java.io.File.toPath(File.java:2234) - locked <0x8699bb68> (a java.io.File) ... at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365){code} Basically the "main" thread and the "log-replay-executor-0" thread simultaneously call java.nio.file.FileSystems.getDefault() and deadlock. This is similar to the reported JDK bug: [https://bugs.openjdk.java.net/browse/JDK-8037567] The problem is that during Spark History Server startup, two things happen simultaneously that both call into java.nio.file.FileSystems.getDefault(): 1) start the jetty server 2) start the ApplicationHistoryProvider (which reads files from HDFS) We should do these two things sequentially instead of in parallel. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
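A minimal sketch of the sequential-initialization idea: touch java.nio.file.FileSystems.getDefault() once on the main thread before the Jetty server and the history provider start, so the racy one-time initialization can no longer happen on two threads at once. The startup method names are illustrative, not the real HistoryServer code:

{code:scala}
object HistoryServerStartupSketch {
  def main(args: Array[String]): Unit = {
    // Force the one-time default-filesystem (and library-loading) init here,
    // while only the main thread is running.
    java.nio.file.FileSystems.getDefault()

    startJettyServer()       // 1) bind the web UI
    startHistoryProvider()   // 2) replay event logs from HDFS
  }

  private def startJettyServer(): Unit = ()     // placeholder
  private def startHistoryProvider(): Unit = () // placeholder
}
{code}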
[jira] [Updated] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors
[ https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shanyu zhao updated SPARK-12312: Affects Version/s: 2.4.2 > JDBC connection to Kerberos secured databases fails on remote executors > --- > > Key: SPARK-12312 > URL: https://issues.apache.org/jira/browse/SPARK-12312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 2.4.2 >Reporter: nabacg >Priority: Minor > > When loading DataFrames from a JDBC datasource with Kerberos authentication, > remote executors (yarn-client/cluster etc. modes) fail to establish a > connection due to the lack of a Kerberos ticket or the ability to generate it. > This is a real issue when trying to ingest data from kerberized data sources > (SQL Server, Oracle) in enterprise environments where exposing simple > authentication access is not an option due to IT policy issues. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
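A common hedged workaround sketch for this limitation (the principal, keytab path, and JDBC URL below are placeholders): perform the Kerberos login inside each executor partition before opening the connection, rather than relying on a ticket that only the driver holds.

{code:scala}
import java.sql.DriverManager
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.rdd.RDD

// The keytab must be distributed to every executor (e.g. via --files).
def writeViaJdbc(rows: RDD[String]): Unit = rows.foreachPartition { part =>
  UserGroupInformation.loginUserFromKeytab("app@EXAMPLE.COM", "/etc/security/app.keytab")
  val conn = DriverManager.getConnection("jdbc:oracle:thin:@//db.example.com:1521/SVC")
  try part.foreach { _ => () } // placeholder: insert each row via conn
  finally conn.close()
}
{code}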
[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823458#comment-16823458 ] shanyu zhao commented on SPARK-18673: - Ping. What is the verdict here for users who want to use Spark 2.4 and Hadoop 3.1? > Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version > -- > > Key: SPARK-18673 > URL: https://issues.apache.org/jira/browse/SPARK-18673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT >Reporter: Steve Loughran >Priority: Major > > Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader > considers 3.x to be an unknown Hadoop version. > Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it > will need to be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26011) pyspark app with "spark.jars.packages" config does not work
[ https://issues.apache.org/jira/browse/SPARK-26011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shanyu zhao updated SPARK-26011: Description: Command "pyspark --packages" works as expected, but when submitting a Livy pyspark job with the "spark.jars.packages" config, the downloaded packages are not added to Python's sys.path, so the package is not available to use. For example, this command works: pyspark --packages Azure:mmlspark:0.14 However, using a Jupyter notebook with the sparkmagic kernel to open a pyspark session failed: %%configure -f \{"conf": {"spark.jars.packages": "Azure:mmlspark:0.14"}} import mmlspark The root cause is that SparkSubmit determines a pyspark app by the suffix of the primary resource, but Livy uses "spark-internal" as the primary resource when calling spark-submit, therefore args.isPython is set to false in SparkSubmit.scala. was: Command "pyspark --packages" works as expected, but when submitting a Livy pyspark job with the "spark.jars.packages" config, the downloaded packages are not added to Python's sys.path, so the package is not available to use. For example, this command works: pyspark --packages Azure:mmlspark:0.14 However, using a Jupyter notebook with the sparkmagic kernel to open a pyspark session failed: %%configure -f \{"conf": {"spark.jars.packages": "Azure:mmlspark:0.14"}} import mmlspark The root cause is that SparkSubmit determines a pyspark app by the suffix of the primary resource, but Livy uses "spark-internal" as the primary resource when calling spark-submit, therefore args.isPython is fails in SparkSubmit.scala. > pyspark app with "spark.jars.packages" config does not work > --- > > Key: SPARK-26011 > URL: https://issues.apache.org/jira/browse/SPARK-26011 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.2, 2.4.0 >Reporter: shanyu zhao >Priority: Major > > Command "pyspark --packages" works as expected, but when submitting a Livy > pyspark job with the "spark.jars.packages" config, the downloaded packages are > not added to Python's sys.path, so the package is not available to use. > For example, this command works: > pyspark --packages Azure:mmlspark:0.14 > However, using a Jupyter notebook with the sparkmagic kernel to open a pyspark > session failed: > %%configure -f \{"conf": {"spark.jars.packages": "Azure:mmlspark:0.14"}} > import mmlspark > The root cause is that SparkSubmit determines a pyspark app by the suffix of > the primary resource, but Livy uses "spark-internal" as the primary resource when > calling spark-submit, therefore args.isPython is set to false in > SparkSubmit.scala. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26011) pyspark app with "spark.jars.packages" config does not work
shanyu zhao created SPARK-26011: --- Summary: pyspark app with "spark.jars.packages" config does not work Key: SPARK-26011 URL: https://issues.apache.org/jira/browse/SPARK-26011 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.4.0, 2.3.2 Reporter: shanyu zhao Command "pyspark --packages" works as expected, but when submitting a Livy pyspark job with the "spark.jars.packages" config, the downloaded packages are not added to Python's sys.path, so the package is not available to use. For example, this command works: pyspark --packages Azure:mmlspark:0.14 However, using a Jupyter notebook with the sparkmagic kernel to open a pyspark session failed: %%configure -f \{"conf": {"spark.jars.packages": "Azure:mmlspark:0.14"}} import mmlspark The root cause is that SparkSubmit determines a pyspark app by the suffix of the primary resource, but Livy uses "spark-internal" as the primary resource when calling spark-submit, therefore args.isPython is set to false in SparkSubmit.scala. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
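A hedged sketch of the check implied by the root cause above (constant and logic simplified for illustration, not the merged fix): the primary resource "spark-internal" carries no .py suffix, so Python detection has to treat it specially.

{code:scala}
// Suffix-only detection misses Livy jobs, which always submit the internal
// marker resource instead of a user .py file.
val SparkInternal = "spark-internal"

def isPythonApp(primaryResource: String, explicitlyPython: Boolean): Boolean =
  primaryResource.endsWith(".py") ||
    (primaryResource == SparkInternal && explicitlyPython)

assert(!isPythonApp(SparkInternal, explicitlyPython = false)) // today's behavior
assert(isPythonApp(SparkInternal, explicitlyPython = true))   // desired for Livy
{code}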
[jira] [Commented] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682089#comment-16682089 ] shanyu zhao commented on SPARK-25999: - Patch attached. Basically it creates an optional project that brings all dependencies to the R/rjarsdep/target folder, and copies the missing jars to the assembly/target folder before building R. > make-distribution.sh failure with --r and -Phadoop-provided > --- > > Key: SPARK-25999 > URL: https://issues.apache.org/jira/browse/SPARK-25999 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.2, 2.4.0 >Reporter: shanyu zhao >Priority: Major > Attachments: SPARK-25999.patch > > > It is not possible to build a distribution that doesn't contain hadoop > dependencies but includes SparkR. This is because R/check_cran.sh builds the R > documentation, which depends on the hadoop dependencies in the > assembly/target/scala-xxx/jars folder. > To reproduce: > MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive > -Psparkr -Phadoop-provided" > ./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS > > Error: > * creating vignettes ... ERROR > ... > Error: A JNI error has occurred, please check your installation and try again > Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shanyu zhao updated SPARK-25999: Attachment: SPARK-25999.patch > make-distribution.sh failure with --r and -Phadoop-provided > --- > > Key: SPARK-25999 > URL: https://issues.apache.org/jira/browse/SPARK-25999 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.2, 2.4.0 >Reporter: shanyu zhao >Priority: Major > Attachments: SPARK-25999.patch > > > It is not possible to build a distribution that doesn't contain hadoop > dependencies but includes SparkR. This is because R/check_cran.sh builds the R > documentation, which depends on the hadoop dependencies in the > assembly/target/scala-xxx/jars folder. > To reproduce: > MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive > -Psparkr -Phadoop-provided" > ./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS > > Error: > * creating vignettes ... ERROR > ... > Error: A JNI error has occurred, please check your installation and try again > Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25999) make-distribution.sh failure with --r and -Phadoop-provided
[ https://issues.apache.org/jira/browse/SPARK-25999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shanyu zhao updated SPARK-25999: Summary: make-distribution.sh failure with --r and -Phadoop-provided (was: Spark make-distribution failure with --r and -Phadoop-provided) > make-distribution.sh failure with --r and -Phadoop-provided > --- > > Key: SPARK-25999 > URL: https://issues.apache.org/jira/browse/SPARK-25999 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.2, 2.4.0 >Reporter: shanyu zhao >Priority: Major > > It is not possible to build a distribution that doesn't contain hadoop > dependencies but includes SparkR. This is because R/check_cran.sh builds the R > documentation, which depends on the hadoop dependencies in the > assembly/target/scala-xxx/jars folder. > To reproduce: > MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive > -Psparkr -Phadoop-provided" > ./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS > > Error: > * creating vignettes ... ERROR > ... > Error: A JNI error has occurred, please check your installation and try again > Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25999) Spark make-distribution failure with --r and -Phadoop-provided
shanyu zhao created SPARK-25999: --- Summary: Spark make-distribution failure with --r and -Phadoop-provided Key: SPARK-25999 URL: https://issues.apache.org/jira/browse/SPARK-25999 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.0, 2.3.2 Reporter: shanyu zhao It is not possible to build a distribution that doesn't contain hadoop dependencies but includes SparkR. This is because R/check_cran.sh builds the R documentation, which depends on the hadoop dependencies in the assembly/target/scala-xxx/jars folder. To reproduce: MAVEN_BUILD_OPTS="-Dmaven.javadoc.skip=true -Pyarn -Phadoop-2.7 -Phive -Psparkr -Phadoop-provided" ./dev/make-distribution.sh --tgz --r $MAVEN_BUILD_OPTS Error: * creating vignettes ... ERROR ... Error: A JNI error has occurred, please check your installation and try again Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24975) Spark history server REST API /api/v1/version returns error 404
shanyu zhao created SPARK-24975: --- Summary: Spark history server REST API /api/v1/version returns error 404 Key: SPARK-24975 URL: https://issues.apache.org/jira/browse/SPARK-24975 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.1, 2.3.0 Reporter: shanyu zhao Spark history server REST API provides /api/v1/version, according to the doc: [https://spark.apache.org/docs/latest/monitoring.html] However, for Spark 2.3, we see: {code:java} curl http://localhost:18080/api/v1/version Error 404 Not Found HTTP ERROR 404 Problem accessing /api/v1/version. Reason: Not Found. Powered by Jetty 9.3.z-SNAPSHOT {code} On a Spark 2.2 cluster, we see: {code:java} curl http://localhost:18080/api/v1/version { "spark" : "2.2.0" }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9514) Add EventHubsReceiver to support Spark Streaming using Azure EventHubs
[ https://issues.apache.org/jira/browse/SPARK-9514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652883#comment-14652883 ] shanyu zhao commented on SPARK-9514: Thanks [~CodingCat], I've created the pull request here: https://github.com/apache/spark/pull/7914 Add EventHubsReceiver to support Spark Streaming using Azure EventHubs -- Key: SPARK-9514 URL: https://issues.apache.org/jira/browse/SPARK-9514 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.1 Reporter: shanyu zhao Fix For: 1.5.0 Attachments: SPARK-9514.patch We need to add EventHubsReceiver implementation to support Spark Streaming applications that receive data from Azure EventHubs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9514) Add EventHubsReceiver to support Spark Streaming using Azure EventHubs
[ https://issues.apache.org/jira/browse/SPARK-9514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shanyu zhao updated SPARK-9514: --- Attachment: SPARK-9514.patch Patch attached. I put EventHubsReceiver in the external folder and added an example in the examples project. Add EventHubsReceiver to support Spark Streaming using Azure EventHubs -- Key: SPARK-9514 URL: https://issues.apache.org/jira/browse/SPARK-9514 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.1 Reporter: shanyu zhao Fix For: 1.5.0 Attachments: SPARK-9514.patch We need to add EventHubsReceiver implementation to support Spark Streaming applications that receive data from Azure EventHubs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
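For context, the public Spark Streaming API such a receiver plugs into looks roughly like the skeleton below; the receive loop body is a placeholder, not the actual EventHubs client code from the patch.

{code:scala}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Minimal custom-receiver shape: onStart() spawns a thread that pulls events
// and hands them to Spark via store(); the loop ends once isStopped() is true.
class EventHubsReceiverSketch extends Receiver[Array[Byte]](StorageLevel.MEMORY_AND_DISK) {
  override def onStart(): Unit = {
    new Thread("eventhubs-receive-loop") {
      override def run(): Unit = {
        while (!isStopped()) {
          val event: Array[Byte] = Array.emptyByteArray // placeholder: read from EventHubs
          store(event)
        }
      }
    }.start()
  }

  override def onStop(): Unit = () // receive thread exits via isStopped()
}
{code}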
[jira] [Updated] (SPARK-9514) Add EventHubsReceiver to support Spark Streaming using Azure EventHubs
[ https://issues.apache.org/jira/browse/SPARK-9514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shanyu zhao updated SPARK-9514: --- Shepherd: shanyu zhao Add EventHubsReceiver to support Spark Streaming using Azure EventHubs -- Key: SPARK-9514 URL: https://issues.apache.org/jira/browse/SPARK-9514 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.1 Reporter: shanyu zhao Fix For: 1.5.0 We need to add EventHubsReceiver implementation to support Spark Streaming applications that receive data from Azure EventHubs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9514) Add EventHubsReceiver to support Spark Streaming using Azure EventHubs
shanyu zhao created SPARK-9514: -- Summary: Add EventHubsReceiver to support Spark Streaming using Azure EventHubs Key: SPARK-9514 URL: https://issues.apache.org/jira/browse/SPARK-9514 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.1 Reporter: shanyu zhao Fix For: 1.5.0 We need to add EventHubsReceiver implementation to support Spark Streaming applications that receive data from Azure EventHubs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org