Re: [PR] [GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode [incubator-gluten]
marin-ma merged PR #5075: URL: https://github.com/apache/incubator-gluten/pull/5075

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gluten.apache.org For additional commands, e-mail: commits-h...@gluten.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode [incubator-gluten]
marin-ma commented on PR #5075: URL: https://github.com/apache/incubator-gluten/pull/5075#issuecomment-2017121035

Could you also update the examples in the document to use either a relative path or a URI?
Re: [PR] [GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode [incubator-gluten]
marin-ma commented on code in PR #5075: URL: https://github.com/apache/incubator-gluten/pull/5075#discussion_r1537003020

## backends-velox/src/test/scala/io/glutenproject/expression/VeloxUdfSuite.scala:

```diff
@@ -82,7 +82,8 @@ class VeloxUdfSuiteLocal extends VeloxUdfSuite {
   override val master: String = "local[2]"
   override protected def sparkConf: SparkConf = {
     super.sparkConf
-      .set("spark.gluten.sql.columnar.backend.velox.udfLibraryPaths", udfLibPath)
+      .set("spark.files", udfLibPath)
+      .set("spark.gluten.sql.columnar.backend.velox.udfLibraryPaths", "libmyudf.so")
```

Review Comment: Please do not hard-code "libmyudf.so" here. Use a helper function to extract the filename from `udfLibPath`. Note that `udfLibPath` can also be a comma-separated string with multiple paths.
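Such a helper might look like the following sketch (the name `libFileNames` is hypothetical, not from the PR); it covers the comma-separated case the reviewer mentions:

```scala
import java.io.File

// Hypothetical helper (not from the PR): reduce a comma-separated list of
// UDF library paths to their bare filenames, so the test does not need to
// hard-code "libmyudf.so".
def libFileNames(udfLibPaths: String): String =
  udfLibPaths
    .split(",")
    .map(p => new File(p.trim).getName) // keep only the last path segment
    .mkString(",")
```

For example, `libFileNames("/path/a/libmyudf.so,/path/b/libother.so")` yields `"libmyudf.so,libother.so"`.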
Re: [PR] [GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode [incubator-gluten]
kecookier commented on code in PR #5075: URL: https://github.com/apache/incubator-gluten/pull/5075#discussion_r1536655002

## backends-velox/src/main/scala/org/apache/spark/sql/expression/UDFResolver.scala:

```diff
@@ -152,34 +143,39 @@ object UDFResolver extends Logging {
   // Get the full paths of all libraries.
   // If it's a directory, get all files ends with ".so" recursively.
-  def getAllLibraries(files: String, sparkConf: SparkConf, canAccessSparkFiles: Boolean): String = {
+  def getAllLibraries(files: String, sparkConf: SparkConf): String = {
     val hadoopConf = SparkHadoopUtil.newConfiguration(sparkConf)
     files
       .split(",")
       .map { f =>
-        val file = new File(f)
-        // Relative paths should be uploaded via --files or --archives
-        // Use SparkFiles.get to download and unpack
-        if (!file.isAbsolute) {
-          if (!canAccessSparkFiles) {
-            throw new IllegalArgumentException(
-              "On yarn-client mode, driver only accepts absolute paths, but got " + f)
+        val uri = Utils.resolveURI(f)
```

Review Comment: @marin-ma I have already fixed the code according to the previous discussion in the comments. Could you help review it again?
Re: [PR] [GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode [incubator-gluten]
marin-ma commented on code in PR #5075: URL: https://github.com/apache/incubator-gluten/pull/5075#discussion_r1535293176

## backends-velox/src/main/scala/org/apache/spark/sql/expression/UDFResolver.scala (quoting the same `getAllLibraries` hunk as above)

Review Comment: I would suggest preserving two kinds of configuration for `spark.gluten.sql.columnar.backend.velox.udfLibraryPaths`:
- relative path
- URI

And make `spark.gluten.sql.columnar.backend.velox.udfLibraryPaths=/path/to/...` invalid.
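A minimal sketch of that validation rule (hypothetical code, not the PR's implementation, assuming `java.net.URI` semantics): accept a relative filename or a URI with an explicit scheme, and reject a bare absolute path.

```scala
import java.io.File
import java.net.URI

// Hypothetical validation (not the PR's code): a path is acceptable if it is
// relative (to be distributed via --files) or carries an explicit scheme
// (file://, hdfs://, ...); a bare absolute path like /path/to/libmyudf.so
// is rejected.
def validateUdfLibraryPath(path: String): Unit = {
  val hasScheme = new URI(path).getScheme != null
  if (!hasScheme && new File(path).isAbsolute) {
    throw new IllegalArgumentException(
      s"Plain absolute paths are not allowed: $path; " +
        "use a relative filename uploaded via --files, or a URI")
  }
}
```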
Re: [PR] [GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode [incubator-gluten]
kecookier commented on code in PR #5075: URL: https://github.com/apache/incubator-gluten/pull/5075#discussion_r1535237040

## backends-velox/src/main/scala/org/apache/spark/sql/expression/UDFResolver.scala (quoting the same `getAllLibraries` hunk as above)

Review Comment: Thank you for your patient explanation, Rong! I previously thought that under yarn-client mode, `spark.yarn.dist.files` would also copy the files to the driver. I got the AM and driver mixed up. Let's fix this issue.
Re: [PR] [GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode [incubator-gluten]
kecookier commented on code in PR #5075: URL: https://github.com/apache/incubator-gluten/pull/5075#discussion_r1535181373

## backends-velox/src/main/scala/org/apache/spark/sql/expression/UDFResolver.scala (quoting the same `getAllLibraries` hunk as above)

Review Comment:
> It would be better to check whether it's a relative path first. Although it's resolved as a local file and passes the `if` condition, the path doesn't even exist.
Re: [PR] [GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode [incubator-gluten]
marin-ma commented on code in PR #5075: URL: https://github.com/apache/incubator-gluten/pull/5075#discussion_r1535176984

## backends-velox/src/main/scala/org/apache/spark/sql/expression/UDFResolver.scala (quoting the same `getAllLibraries` hunk as above)

Review Comment:
> In your example, libmyudf.so is a local file

That doesn't make sense to me. `libmyudf.so` here should refer to the file uploaded by `--files`, but here it's resolved as a relative path to the runtime directory.

> The call to Utils.resolveURI() is used to determine whether the file is local or remote.

It would be better to check whether it's a relative path first. Although it's resolved as a local file and passes the `if` condition, the path doesn't even exist.

> When --master=yarn is specified, the file is copied to the working directory on all nodes (both the driver and executors).

Based on my previous experience, in yarn-client mode the files are copied to all executor containers plus the AM container, so they won't be copied to the driver node. In this case, if we only use the two configurations below

```
--files /path/to/gluten/cpp/build/velox/udf/examples/libmyudf.so
--conf spark.gluten.sql.columnar.backend.velox.udfLibraryPaths=libmyudf.so
```

the driver will fail to get the actual path for `libmyudf.so`, and that's the reason for adding `spark.gluten.sql.columnar.backend.velox.driver.udfLibraryPaths`.
Re: [PR] [GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode [incubator-gluten]
kecookier commented on code in PR #5075: URL: https://github.com/apache/incubator-gluten/pull/5075#discussion_r1535162020

## backends-velox/src/main/scala/org/apache/spark/sql/expression/UDFResolver.scala (quoting the same `getAllLibraries` hunk as above)

Review Comment: The call to `Utils.resolveURI()` is used to determine whether the file is local or remote. In your example, `libmyudf.so` is a local file, and we will not use `uri.path` directly. The `--files` argument copies this file to a different destination directory. When `--master=yarn` is specified, the file is copied to the working directory on all nodes (both the driver and executors). In local mode, the files are added using `SparkContext.addFile`, and they can then be accessed using `SparkFiles.get(f)`.
Re: [PR] [GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode [incubator-gluten]
marin-ma commented on code in PR #5075: URL: https://github.com/apache/incubator-gluten/pull/5075#discussion_r1534994137

## backends-velox/src/main/scala/org/apache/spark/sql/expression/UDFResolver.scala (quoting the same `getAllLibraries` hunk as above)

Review Comment: But `file://${PWD}/libmyudf.so` is not the expected path, right? It should be `/path/to/gluten/cpp/build/velox/udf/examples/libmyudf.so`.
Re: [PR] [GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode [incubator-gluten]
kecookier commented on code in PR #5075: URL: https://github.com/apache/incubator-gluten/pull/5075#discussion_r1534992522

## backends-velox/src/main/scala/org/apache/spark/sql/expression/UDFResolver.scala (quoting the same `getAllLibraries` hunk as above)

Review Comment: Yes. If `f` is `libmyudf.so`, `Utils.resolveURI()` will return `file://${PWD}/libmyudf.so`.
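The behavior described here can be sketched with a simplified stand-in for Spark's `Utils.resolveURI` (an assumption for illustration only; Spark's real implementation also handles Windows paths and other corner cases): a path with an explicit scheme is kept as-is, and anything else is resolved against the current working directory as a `file://` URI.

```scala
import java.io.File
import java.net.URI

// Simplified stand-in for Spark's Utils.resolveURI (illustrative only):
// keep an explicit scheme if present, otherwise resolve the path against
// the current working directory, yielding file://${PWD}/... for a bare
// filename such as "libmyudf.so".
def resolveURISketch(path: String): URI = {
  val uri = new URI(path)
  if (uri.getScheme != null) uri
  else new File(path).getAbsoluteFile.toURI
}
```

This is why a bare filename "passes" as a local file even when no such file exists in the working directory, which is the concern raised in the review.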
Re: [PR] [GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode [incubator-gluten]
kecookier commented on code in PR #5075: URL: https://github.com/apache/incubator-gluten/pull/5075#discussion_r1534992690

## backends-velox/src/main/scala/org/apache/spark/sql/expression/UDFResolver.scala (quoting the same `getAllLibraries` hunk as above, ending with):

```diff
+        val uri = Utils.resolveURI(f)
+        var file: File = null
```

Review Comment: ok, I'll fix it.
Re: [PR] [GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode [incubator-gluten]
marin-ma commented on code in PR #5075: URL: https://github.com/apache/incubator-gluten/pull/5075#discussion_r1534990803

## backends-velox/src/main/scala/org/apache/spark/sql/expression/UDFResolver.scala (quoting the same `getAllLibraries` hunk as above, ending with):

```diff
+        val uri = Utils.resolveURI(f)
+        var file: File = null
```

Review Comment: Let's avoid using `var` here.
Re: [PR] [GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode [incubator-gluten]
marin-ma commented on code in PR #5075: URL: https://github.com/apache/incubator-gluten/pull/5075#discussion_r1534990188

## backends-velox/src/main/scala/org/apache/spark/sql/expression/UDFResolver.scala (quoting the same `getAllLibraries` hunk as above)

Review Comment: Looks like `f` can also be a relative filename or file tag here, without a scheme. Would this logic be able to handle such a case? e.g.

```
--files /path/to/gluten/cpp/build/velox/udf/examples/libmyudf.so
--conf spark.gluten.sql.columnar.backend.velox.udfLibraryPaths=libmyudf.so
```
Re: [PR] [GLUTEN-5074][VL] fix: UDF load error in yarn-cluster mode [incubator-gluten]
github-actions[bot] commented on PR #5075: URL: https://github.com/apache/incubator-gluten/pull/5075#issuecomment-2013017698

https://github.com/apache/incubator-gluten/issues/5074