Re: Spark 3.3 + parquet 1.10
OSS Spark 3.3 is built with Parquet 1.12. Simply compiling it against Parquet 1.10 results in a build failure, so I am wondering whether anyone has built and compiled Spark 3.3 with Parquet 1.10. Regards Pralabh Kumar On Mon, Jul 24, 2023 at 3:04 PM Mich Talebzadeh wrote: > Hi, > > Where is this limitation coming from (using 1.1.0)? That is 2018 build > > Have you tried Spark 3.3 with parquet writes as is? Just a small PoC will > prove it. > > HTH > Mich Talebzadeh, > Solutions Architect/Engineering Lead > Palantir Technologies Limited > London > United Kingdom > > >view my Linkedin profile > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > > > https://en.everybodywiki.com/Mich_Talebzadeh > > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Mon, 24 Jul 2023 at 10:15, Pralabh Kumar > wrote: > >> Hi Dev community. >> >> I have a quick question with respect to Spark 3.3. Currently Spark 3.3 is >> built with parquet 1.12. >> However, anyone tried Spark 3.3 with parquet 1.10 . >> >> We are at Uber , planning to migrate Spark 3.3 but we have limitations of >> using parquet 1.10 . Has anyone tried building Spark 3.3 with parquet 1.10 >> ? What are the dos/ don't for it ? >> >> >> Regards >> Pralabh Kumar >> >
Spark 3.3 + parquet 1.10
Hi Dev community. I have a quick question with respect to Spark 3.3. Currently Spark 3.3 is built with Parquet 1.12. However, has anyone tried Spark 3.3 with Parquet 1.10? We at Uber are planning to migrate to Spark 3.3, but we are limited to using Parquet 1.10. Has anyone tried building Spark 3.3 with Parquet 1.10? What are the dos and don'ts for it? Regards Pralabh Kumar
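For reference, a minimal sketch of the kind of build override being discussed, assuming the parquet.version property in Spark's parent pom is the right knob and that 1.10.1 is the target release; given the build failure reported above, expect hand-patching of the Parquet data source code on top of this:

build/mvn clean install -DskipTests \
  -Phive -Phive-thriftserver -Pkubernetes \
  -Dparquet.version=1.10.1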
Spark 3.0.0 EOL
Hi Dev Team, If possible, can you please provide the Spark 3.0.0 EOL timeline? Regards Pralabh Kumar
SPARK-43235
Hi Dev, Please find some time to review the JIRA. Regards Pralabh Kumar
Re: Setting spark.kubernetes.driver.connectionTimeout, spark.kubernetes.submission.connectionTimeout to default spark.network.timeout
allows K8s API server-side caching via SPARK-36334. > You may want to try the following configuration if you have very > limited control plan resources. > > spark.kubernetes.executor.enablePollingWithResourceVersion=true > > Dongjoon. > > On Mon, Aug 1, 2022 at 7:52 AM Pralabh Kumar > wrote: > > > > Hi Dev team > > > > > > > > Since spark.network.timeout is default for all the network transactions > . Shouldn’t spark.kubernetes.driver.connectionTimeout, > spark.kubernetes.submission.connectionTimeout by default to be set > spark.network.timeout . > > > > Users migrating from Yarn to K8s are familiar with spark.network.timeout > and if time out occurs on K8s , they need to explicitly set the above two > properties. If the above properties are default set to > spark.network.timeout then user don’t need to explicitly set above > properties and it can work with spark.network.timeout. > > > > > > > > Please let me know if my understanding is correct > > > > > > > > Regards > > > > Pralabh Kumar >
Setting spark.kubernetes.driver.connectionTimeout, spark.kubernetes.submission.connectionTimeout to default spark.network.timeout
Hi Dev team, Since spark.network.timeout is the default for all network transactions, shouldn't spark.kubernetes.driver.connectionTimeout and spark.kubernetes.submission.connectionTimeout default to spark.network.timeout? Users migrating from YARN to K8s are familiar with spark.network.timeout, and if a timeout occurs on K8s they need to explicitly set the above two properties. If those properties defaulted to spark.network.timeout, users would not need to set them explicitly and could keep working with spark.network.timeout alone. Please let me know if my understanding is correct. Regards Pralabh Kumar
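Until such a default exists, the workaround implied above is to set the two K8s client timeouts explicitly at submit time so they line up with spark.network.timeout; a minimal sketch, assuming the usual 120s network timeout (the K8s properties are expressed in milliseconds by default):

spark-submit \
  --conf spark.network.timeout=120s \
  --conf spark.kubernetes.driver.connectionTimeout=120000 \
  --conf spark.kubernetes.submission.connectionTimeout=120000 \
  ...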
Spark-39755 Review/comment
Hi Dev community, Please review/comment on https://issues.apache.org/jira/browse/SPARK-39755 Regards Pralabh Kumar
Re: Spark32 + Java 11 . Reading parquet java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'
Thx for the reply @Steve Loughran @Martin . It helps . However just a minor suggestion - Should we update the documentation https://spark.apache.org/docs/latest/#downloading , which talks about java.nio.DirectByteBuffer . We can add another case , where user will get the same error for Spark on K8s on Spark32 running on version < Hadoop 3.2 (since the default value in Docker file for Spark32 is Java 11) Please let me know if it make sense to you. Regards Pralabh Kumar On Tue, Jun 14, 2022 at 4:21 PM Steve Loughran wrote: > hadoop 3.2.x is the oldest of the hadoop branch 3 branches which gets > active security patches, as was done last month. I would strongly recommend > using it unless there are other compatibility issues (hive?) > > On Tue, 14 Jun 2022 at 05:31, Pralabh Kumar > wrote: > >> Hi Steve / Dev team >> >> Thx for the help . Have a quick question , How can we fix the above >> error in Hadoop 3.1 . >> >>- Spark docker file have (Java 11) >> >> https://github.com/apache/spark/blob/branch-3.2/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile >> >>- Now if we build Spark32 , Spark image will be having Java 11 . If >>we run on a Hadoop version less than 3.2 , it will throw an exception. >> >> >> >>- Should there be a separate docker file for Spark32 for Java 8 for >>Hadoop version < 3.2 . Spark 3.0.1 have Java 8 in docker file which works >>fine in our environment (with Hadoop3.1) >> >> >> Regards >> Pralabh Kumar >> >> >> >> On Mon, Jun 13, 2022 at 3:25 PM Steve Loughran >> wrote: >> >>> >>> >>> On Mon, 13 Jun 2022 at 08:52, Pralabh Kumar >>> wrote: >>> >>>> Hi Dev team >>>> >>>> I have a spark32 image with Java 11 (Running Spark on K8s) . While >>>> reading a huge parquet file via spark.read.parquet("") . I am >>>> getting the following error . The same error is mentioned in Spark docs >>>> https://spark.apache.org/docs/latest/#downloading but w.r.t to apache >>>> arrow. >>>> >>>> >>>>- IMHO , I think the error is coming from Parquet 1.12.1 which is >>>>based on Hadoop 2.10 which is not java 11 compatible. >>>> >>>> >>> correct. see https://issues.apache.org/jira/browse/HADOOP-12760 >>> >>> >>> Please let me know if this understanding is correct and is there a way >>>> to fix it. >>>> >>> >>> >>> >>> upgrade to a version of hadoop with the fix. 
That's any version >= >>> hadoop 3.2.0 which shipped since 2018 >>> >>>> >>>> >>>> java.lang.NoSuchMethodError: 'sun.misc.Cleaner >>>> sun.nio.ch.DirectBuffer.cleaner()' >>>> >>>> at >>>> org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:41) >>>> >>>> at >>>> org.apache.hadoop.crypto.CryptoInputStream.freeBuffers(CryptoInputStream.java:687) >>>> >>>> at >>>> org.apache.hadoop.crypto.CryptoInputStream.close(CryptoInputStream.java:320) >>>> >>>> at java.base/java.io.FilterInputStream.close(Unknown >>>> Source) >>>> >>>> at >>>> org.apache.parquet.hadoop.util.H2SeekableInputStream.close(H2SeekableInputStream.java:50) >>>> >>>> at >>>> org.apache.parquet.hadoop.ParquetFileReader.close(ParquetFileReader.java:1299) >>>> >>>> at >>>> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:54) >>>> >>>> at >>>> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44) >>>> >>>> at >>>> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:467) >>>> >>>> at >>>> org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372) >>>> >>>> at >>>> scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659) >>>> >>>> at scala.util.Success.$anonfun$map$1(Try.scala:255) >>>> >>>> at scala.util.Success.map(Try.scala:213) >>>> >>>> at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) >>>> >>>> at >>>> scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) >>>> >>>> at >>>> scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) >>>> >>>> at >>>> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) >>>> >>>> at >>>> java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown >>>> Source) >>>> >>>> at >>>> java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source) >>>> >>>> at >>>> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown >>>> Source) >>>> >>>> at >>>> java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source) >>>> >>>> at >>>> java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source) >>>> >>>> at >>>> java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) >>>> >>>
Re: Spark32 + Java 11 . Reading parquet java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'
Hi Steve / Dev team Thx for the help . Have a quick question , How can we fix the above error in Hadoop 3.1 . - Spark docker file have (Java 11) https://github.com/apache/spark/blob/branch-3.2/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile - Now if we build Spark32 , Spark image will be having Java 11 . If we run on a Hadoop version less than 3.2 , it will throw an exception. - Should there be a separate docker file for Spark32 for Java 8 for Hadoop version < 3.2 . Spark 3.0.1 have Java 8 in docker file which works fine in our environment (with Hadoop3.1) Regards Pralabh Kumar On Mon, Jun 13, 2022 at 3:25 PM Steve Loughran wrote: > > > On Mon, 13 Jun 2022 at 08:52, Pralabh Kumar > wrote: > >> Hi Dev team >> >> I have a spark32 image with Java 11 (Running Spark on K8s) . While >> reading a huge parquet file via spark.read.parquet("") . I am getting >> the following error . The same error is mentioned in Spark docs >> https://spark.apache.org/docs/latest/#downloading but w.r.t to apache >> arrow. >> >> >>- IMHO , I think the error is coming from Parquet 1.12.1 which is >>based on Hadoop 2.10 which is not java 11 compatible. >> >> > correct. see https://issues.apache.org/jira/browse/HADOOP-12760 > > > Please let me know if this understanding is correct and is there a way to >> fix it. >> > > > > upgrade to a version of hadoop with the fix. That's any version >= hadoop > 3.2.0 which shipped since 2018 > >> >> >> java.lang.NoSuchMethodError: 'sun.misc.Cleaner >> sun.nio.ch.DirectBuffer.cleaner()' >> >> at >> org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:41) >> >> at >> org.apache.hadoop.crypto.CryptoInputStream.freeBuffers(CryptoInputStream.java:687) >> >> at >> org.apache.hadoop.crypto.CryptoInputStream.close(CryptoInputStream.java:320) >> >> at java.base/java.io.FilterInputStream.close(Unknown Source) >> >> at >> org.apache.parquet.hadoop.util.H2SeekableInputStream.close(H2SeekableInputStream.java:50) >> >> at >> org.apache.parquet.hadoop.ParquetFileReader.close(ParquetFileReader.java:1299) >> >> at >> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:54) >> >> at >> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44) >> >> at >> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:467) >> >> at >> org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372) >> >> at >> scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659) >> >> at scala.util.Success.$anonfun$map$1(Try.scala:255) >> >> at scala.util.Success.map(Try.scala:213) >> >> at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) >> >> at >> scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) >> >> at >> scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) >> >> at >> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) >> >> at >> java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown >> Source) >> >> at >> java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source) >> >> at >> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown >> Source) >> >> at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown >> Source) >> >> at >> java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source) >> >> at >> java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) 
>> >
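On the question of a separate Dockerfile for Java 8: one way to avoid maintaining one, sketched here under the assumption that the branch-3.2 Dockerfile still exposes the java_image_tag build argument, is to pass a Java 8 base image when building the Spark image:

./bin/docker-image-tool.sh -r <repo> -t spark-3.2-java8 \
  -b java_image_tag=8-jre-slim \
  build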
Re: Spark32 + Java 11 . Reading parquet java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'
Steve . Thx for your help ,please ignore last comment. Regards Pralabh Kumar On Mon, 13 Jun 2022, 15:43 Pralabh Kumar, wrote: > Hi steve > > Thx for help . We are on Hadoop3.2 ,however we are building Hadoop3.2 with > Java 8 . > > Do you suggest to build Hadoop with Java 11 > > Regards > Pralabh kumar > > On Mon, 13 Jun 2022, 15:25 Steve Loughran, wrote: > >> >> >> On Mon, 13 Jun 2022 at 08:52, Pralabh Kumar >> wrote: >> >>> Hi Dev team >>> >>> I have a spark32 image with Java 11 (Running Spark on K8s) . While >>> reading a huge parquet file via spark.read.parquet("") . I am getting >>> the following error . The same error is mentioned in Spark docs >>> https://spark.apache.org/docs/latest/#downloading but w.r.t to apache >>> arrow. >>> >>> >>>- IMHO , I think the error is coming from Parquet 1.12.1 which is >>>based on Hadoop 2.10 which is not java 11 compatible. >>> >>> >> correct. see https://issues.apache.org/jira/browse/HADOOP-12760 >> >> >> Please let me know if this understanding is correct and is there a way to >>> fix it. >>> >> >> >> >> upgrade to a version of hadoop with the fix. That's any version >= hadoop >> 3.2.0 which shipped since 2018 >> >>> >>> >>> java.lang.NoSuchMethodError: 'sun.misc.Cleaner >>> sun.nio.ch.DirectBuffer.cleaner()' >>> >>> at >>> org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:41) >>> >>> at >>> org.apache.hadoop.crypto.CryptoInputStream.freeBuffers(CryptoInputStream.java:687) >>> >>> at >>> org.apache.hadoop.crypto.CryptoInputStream.close(CryptoInputStream.java:320) >>> >>> at java.base/java.io.FilterInputStream.close(Unknown Source) >>> >>> at >>> org.apache.parquet.hadoop.util.H2SeekableInputStream.close(H2SeekableInputStream.java:50) >>> >>> at >>> org.apache.parquet.hadoop.ParquetFileReader.close(ParquetFileReader.java:1299) >>> >>> at >>> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:54) >>> >>> at >>> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44) >>> >>> at >>> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:467) >>> >>> at >>> org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372) >>> >>> at >>> scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659) >>> >>> at scala.util.Success.$anonfun$map$1(Try.scala:255) >>> >>> at scala.util.Success.map(Try.scala:213) >>> >>> at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) >>> >>> at >>> scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) >>> >>> at >>> scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) >>> >>> at >>> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) >>> >>> at >>> java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown >>> Source) >>> >>> at >>> java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source) >>> >>> at >>> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown >>> Source) >>> >>> at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown >>> Source) >>> >>> at >>> java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source) >>> >>> at >>> java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) >>> >>
Re: Spark32 + Java 11 . Reading parquet java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'
Hi steve Thx for help . We are on Hadoop3.2 ,however we are building Hadoop3.2 with Java 8 . Do you suggest to build Hadoop with Java 11 Regards Pralabh kumar On Mon, 13 Jun 2022, 15:25 Steve Loughran, wrote: > > > On Mon, 13 Jun 2022 at 08:52, Pralabh Kumar > wrote: > >> Hi Dev team >> >> I have a spark32 image with Java 11 (Running Spark on K8s) . While >> reading a huge parquet file via spark.read.parquet("") . I am getting >> the following error . The same error is mentioned in Spark docs >> https://spark.apache.org/docs/latest/#downloading but w.r.t to apache >> arrow. >> >> >>- IMHO , I think the error is coming from Parquet 1.12.1 which is >>based on Hadoop 2.10 which is not java 11 compatible. >> >> > correct. see https://issues.apache.org/jira/browse/HADOOP-12760 > > > Please let me know if this understanding is correct and is there a way to >> fix it. >> > > > > upgrade to a version of hadoop with the fix. That's any version >= hadoop > 3.2.0 which shipped since 2018 > >> >> >> java.lang.NoSuchMethodError: 'sun.misc.Cleaner >> sun.nio.ch.DirectBuffer.cleaner()' >> >> at >> org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:41) >> >> at >> org.apache.hadoop.crypto.CryptoInputStream.freeBuffers(CryptoInputStream.java:687) >> >> at >> org.apache.hadoop.crypto.CryptoInputStream.close(CryptoInputStream.java:320) >> >> at java.base/java.io.FilterInputStream.close(Unknown Source) >> >> at >> org.apache.parquet.hadoop.util.H2SeekableInputStream.close(H2SeekableInputStream.java:50) >> >> at >> org.apache.parquet.hadoop.ParquetFileReader.close(ParquetFileReader.java:1299) >> >> at >> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:54) >> >> at >> org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44) >> >> at >> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:467) >> >> at >> org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372) >> >> at >> scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659) >> >> at scala.util.Success.$anonfun$map$1(Try.scala:255) >> >> at scala.util.Success.map(Try.scala:213) >> >> at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) >> >> at >> scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) >> >> at >> scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) >> >> at >> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) >> >> at >> java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown >> Source) >> >> at >> java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source) >> >> at >> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown >> Source) >> >> at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown >> Source) >> >> at >> java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source) >> >> at >> java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) >> >
Spark32 + Java 11 . Reading parquet java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'
Hi Dev team I have a spark32 image with Java 11 (Running Spark on K8s) . While reading a huge parquet file via spark.read.parquet("") . I am getting the following error . The same error is mentioned in Spark docs https://spark.apache.org/docs/latest/#downloading but w.r.t to apache arrow. - IMHO , I think the error is coming from Parquet 1.12.1 which is based on Hadoop 2.10 which is not java 11 compatible. Please let me know if this understanding is correct and is there a way to fix it. java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()' at org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:41) at org.apache.hadoop.crypto.CryptoInputStream.freeBuffers(CryptoInputStream.java:687) at org.apache.hadoop.crypto.CryptoInputStream.close(CryptoInputStream.java:320) at java.base/java.io.FilterInputStream.close(Unknown Source) at org.apache.parquet.hadoop.util.H2SeekableInputStream.close(H2SeekableInputStream.java:50) at org.apache.parquet.hadoop.ParquetFileReader.close(ParquetFileReader.java:1299) at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:54) at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:467) at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372) at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659) at scala.util.Success.$anonfun$map$1(Try.scala:255) at scala.util.Success.map(Try.scala:213) at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown Source) at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source) at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source) at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source) at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source) at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
CVE-2020-13936
Hi Dev Team, Please let me know if there is a JIRA to track changes for this CVE with respect to Spark. I searched JIRA but couldn't find anything. Please help. Regards Pralabh Kumar
CVE-2021-22569
Hi Dev Team, Spark is using protobuf 2.5.0, which is vulnerable to CVE-2021-22569. The CVE advisory recommends upgrading to protobuf 3.19.2. Please let me know if there is a JIRA to track this update with respect to the CVE and Spark, or should I create one? Regards Pralabh Kumar
Re: Issue on Spark on K8s with Proxy user on Kerberized HDFS : Spark-25355
java.base/java.security.AccessController.doPrivileged(Native Method) at java.base/javax.security.auth.Subject.doAs(Unknown Source) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:163) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052) Again Thx for the input and advice regarding documentation and apologies for putting the wrong error stack earlier. Regards Pralabh Kumar On Tue, May 3, 2022 at 7:39 PM Steve Loughran wrote: > > Prablah, did you follow the URL provided in the exception message? i put a > lot of effort in to improving the diagnostics, where the wiki articles are > part of the troubleshooing process > https://issues.apache.org/jira/browse/HADOOP-7469 > > it's really disappointing when people escalate the problem to open source > developers before trying to fix the problem themselves, in this case, read > the error message. > > now, if there is some k8s related issue which makes this more common, you > are encouraged to update the wiki entry with a new cause. documentation is > an important contribution to open source projects, and if you have > discovered a new way to recreate the failure, it would be welcome. which > reminds me, i have to add something to connection reset and docker which > comes down to "turn off http keepalive in maven builds" > > -Steve > > > > > > On Sat, 30 Apr 2022 at 10:45, Gabor Somogyi > wrote: > >> Hi, >> >> Please be aware that ConnectionRefused exception is has nothing to do w/ >> authentication. See the description from Hadoop wiki: >> "You get a ConnectionRefused >> <https://cwiki.apache.org/confluence/display/HADOOP2/ConnectionRefused> >> Exception >> when there is a machine at the address specified, but there is no program >> listening on the specific TCP port the client is using -and there is no >> firewall in the way silently dropping TCP connection requests. If you do >> not know what a TCP connection request is, please consult the >> specification <http://www.ietf.org/rfc/rfc793.txt>." >> >> This means the namenode on host:port is not reachable in the TCP layer. >> Maybe there are multiple issues but I'm pretty sure that something is wrong >> in the K8S net config. >> >> BR, >> G >> >> >> On Fri, Apr 29, 2022 at 6:23 PM Pralabh Kumar >> wrote: >> >>> Hi dev Team >>> >>> Spark-25355 added the functionality of the proxy user on K8s . However >>> proxy user on K8s with Kerberized HDFS is not working . 
It is throwing >>> exception and >>> >>> 22/04/21 17:50:30 WARN Client: Exception encountered while connecting to >>> the server : org.apache.hadoop.security.AccessControlException: Client >>> cannot authenticate via:[TOKEN, KERBEROS] >>> >>> >>> Exception in thread "main" java.net.ConnectException: Call From >>> to failed on connection exception: >>> java.net.ConnectException: Connection refused; For more details see: http: >>> //wiki.apache.org/hadoop/ConnectionRefused >>> >>> at >>> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native >>> Method) >>> >>> at >>> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown >>> Source) >>> >>> at >>> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown >>> Source) >>> >>> at java.base/java.lang.reflect.Constructor.newInstance(Unknown >>> Source) >>> >>> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831) >>> >>> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:755) >>> >>> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1501) >>> >>> at org.apache.hadoop.ipc.Client.call(Client.java:1443) >>> >>> at org.apache.hadoop.ipc.Client.call(Client.java:1353) >>> >>> at >>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) >>> >>> at >>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) >>> >>> at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source) >>> >>> at >>> >>> >>> >>> On debugging deep , we found the proxy user doesn't have access to >>> delegation tokens in case of K8s .SparkSubmit.submit explicitly creating >>> the proxy user and this user doesn't have delegation token. >>> >>> >>> Please help me with the same. >>> >>> >>> Regards >>> >>> Pralabh Kumar >>> >>> >>> >>>
Issue on Spark on K8s with Proxy user on Kerberized HDFS : Spark-25355
Hi dev Team Spark-25355 added the functionality of the proxy user on K8s . However proxy user on K8s with Kerberized HDFS is not working . It is throwing exception and 22/04/21 17:50:30 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] Exception in thread "main" java.net.ConnectException: Call From to failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http:// wiki.apache.org/hadoop/ConnectionRefused at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:755) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1501) at org.apache.hadoop.ipc.Client.call(Client.java:1443) at org.apache.hadoop.ipc.Client.call(Client.java:1353) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source) at On debugging deep , we found the proxy user doesn't have access to delegation tokens in case of K8s .SparkSubmit.submit explicitly creating the proxy user and this user doesn't have delegation token. Please help me with the same. Regards Pralabh Kumar
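Since the debugging above points at the proxy user not having delegation tokens, one workaround worth trying is to hand Spark a pre-fetched token instead of relying on token creation inside submit. This is only a sketch, assuming the spark.kubernetes.kerberos.tokenSecret.* properties are available in this Spark version and that a Kubernetes secret named hadoop-tokens has been pre-populated with a valid HDFS delegation token under the key hadoop.token:

spark-submit \
  --master k8s://<api-server> \
  --deploy-mode cluster \
  --proxy-user <user> \
  --conf spark.kubernetes.kerberos.tokenSecret.name=hadoop-tokens \
  --conf spark.kubernetes.kerberos.tokenSecret.itemKey=hadoop.token \
  ...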
Spark3.2 on K8s with proxy-user kerberized environment
Hi dev team Please help me on the below problem I have kerberized cluster and am also doing the kinit . Problem is only coming when the proxy user is being used . > > Running Spark 3.2 on K8s with --proxy-user and getting below error and > then the job fails . However when running without a proxy user job is > running fine . Can anyone please help me with the same . > > > 22/04/21 17:50:30 WARN Client: Exception encountered while connecting to > the server : org.apache.hadoop.security.AccessControlException: Client > cannot authenticate via:[TOKEN, KERBEROS] > > 22/04/21 17:50:31 WARN Client: Exception encountered while connecting to > the server : org.apache.hadoop.security.AccessControlException: Client > cannot authenticate via:[TOKEN, KERBEROS] > > 22/04/21 17:50:37 WARN Client: Exception encountered while connecting to > the server : org.apache.hadoop.security.AccessControlException: Client > cannot authenticate via:[TOKEN, KERBEROS] > > 22/04/21 17:50:53 WARN Client: Exception encountered while connecting to > the server : org.apache.hadoop.security.AccessControlException: Client > cannot authenticate via:[TOKEN, KERBEROS] > > 22/04/21 17:51:32 WARN Client: Exception encountered while connecting to > the server : org.apache.hadoop.security.AccessControlException: Client > cannot authenticate via:[TOKEN, KERBEROS] > > 22/04/21 17:52:07 WARN Client: Exception encountered while connecting to > the server : org.apache.hadoop.security.AccessControlException: Client > cannot authenticate via:[TOKEN, KERBEROS] > > 22/04/21 17:52:27 WARN Client: Exception encountered while connecting to > the server : org.apache.hadoop.security.AccessControlException: Client > cannot authenticate via:[TOKEN, KERBEROS] > > 22/04/21 17:52:53 WARN Client: Exception encountered while connecting to > the server : org.apache.hadoop.security.AccessControlException: Client > cannot authenticate via:[TOKEN, KERBEROS] > >
Re: CVE -2020-28458, How to upgrade datatables dependency
Thx for the update. ( I was about to create the PR). Thx for looking into it. On Sun, 17 Apr 2022, 00:26 Sean Owen, wrote: > FWIW here's an update to 1.10.25: > https://github.com/apache/spark/pull/36226 > > > On Wed, Apr 13, 2022 at 8:28 AM Sean Owen wrote: > >> You can see the files in >> core/src/main/resources/org/apache/spark/ui/static - you can try dropping >> in the new minified versions and see if the UI is OK. >> You can open a pull request if it works to update it, in case this >> affects Spark. >> It looks like the smaller upgrade to 1.10.22 is also sufficient. >> >> On Wed, Apr 13, 2022 at 7:43 AM Pralabh Kumar >> wrote: >> >>> Hi Dev Team >>> >>> Spark 3.2 (and 3.3 might also) have CVE 2020-28458. Therefore in my >>> local repo of Spark I would like to update DataTables to 1.11.5. >>> >>> Can you please help me to point out where I should upgrade DataTables >>> dependency ?. >>> >>> Regards >>> Pralabh Kumar >>> >>
CVE -2020-28458, How to upgrade datatables dependency
Hi Dev Team, Spark 3.2 (and possibly 3.3) is affected by CVE-2020-28458. Therefore, in my local repo of Spark, I would like to update DataTables to 1.11.5. Can you please point out where I should upgrade the DataTables dependency? Regards Pralabh Kumar
Spark 3.0.1 and spark 3.2 compatibility
Hi Spark community, I have a quick question. I am planning to migrate from Spark 3.0.1 to Spark 3.2. Do I need to recompile my application with 3.2 dependencies, or will an application compiled with 3.0.1 work fine on 3.2? Regards Pralabh Kumar
Spark on K8s , some applications ended ungracefully
Hi Spark Team, Some of my Spark applications on K8s ended with the below error. These applications completed successfully (as per the SparkListenerApplicationEnd event at the end of the event log) but still have event files with a .inprogress suffix. This causes the applications to be shown as in-progress in the SHS. Spark version: 3.0.1 22/03/31 08:33:34 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException java.util.concurrent.TimeoutException at java.util.concurrent.FutureTask.get(FutureTask.java:205) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:68) 22/03/31 08:33:34 WARN SparkContext: Ignoring Exception while stopping SparkContext from shutdown hook java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475) at org.apache.spark.util.ThreadUtils$.shutdown(ThreadUtils.scala:348) Please let me know if there is a solution for it. Regards Pralabh Kumar
Skip single integration test case in Spark on K8s
Hi Spark team, I am running the Spark Kubernetes integration test suite against a cloud cluster. build/mvn install \ -f pom.xml \ -pl resource-managers/kubernetes/integration-tests -am -Pscala-2.12 -Phadoop-3.1.1 -Phive -Phive-thriftserver -Pyarn -Pkubernetes -Pkubernetes-integration-tests \ -Djava.version=8 \ -Dspark.kubernetes.test.sparkTgz= \ -Dspark.kubernetes.test.imageTag=<> \ -Dspark.kubernetes.test.imageRepo=<repo> \ -Dspark.kubernetes.test.deployMode=cloud \ -Dtest.include.tags=k8s \ -Dspark.kubernetes.test.javaImageTag= \ -Dspark.kubernetes.test.namespace= \ -Dspark.kubernetes.test.serviceAccountName=spark \ -Dspark.kubernetes.test.kubeConfigContext=<> \ -Dspark.kubernetes.test.master=<> \ -Dspark.kubernetes.test.jvmImage=<> \ -Dspark.kubernetes.test.pythonImage=<> \ -Dlog4j.logger.org.apache.spark=DEBUG I am able to run some test cases successfully and some are failing, e.g. "Run SparkRemoteFileTest using a Remote data file" in KubernetesSuite is failing. Is there a way to skip running some of the test cases? Please help me on the same. Regards Pralabh Kumar
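For picking individual cases by name, ScalaTest's -z substring filter is one option; a sketch via sbt follows, where the module name kubernetes-integration-tests is an assumption and the usual -Dspark.kubernetes.test.* properties would still need to be passed:

build/sbt -Pkubernetes -Pkubernetes-integration-tests \
  "kubernetes-integration-tests/testOnly *KubernetesSuite -- -z \"Run SparkPi\""

This runs only the test cases whose names contain the given substring, which indirectly skips the rest.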
Spark on K8s : property simillar to yarn.max.application.attempt
Hi Spark Team, I am running Spark on K8s and am looking for a property/mechanism similar to yarn.max.application.attempt. I know this is not really a Spark question, but I thought someone may have faced a similar issue. Basically, if my driver pod fails, I want it to be retried on a different machine. Is there a way to do the same? Regards Pralabh Kumar
Re: Spark on k8s : spark 3.0.1 spark.kubernetes.executor.deleteontermination issue
Does the property spark.kubernetes.executor.deleteOnTermination check whether the executor being deleted has shuffle data or not? On Tue, 18 Jan 2022, 11:20 Pralabh Kumar, wrote: > Hi spark team > > Have cluster wide property spark.kubernetes.executor.deleteOnTermination > to true. > During the long running job, some of the executor got deleted which have > shuffle data. Because of this, in the subsequent stage , we get lot of > spark shuffle fetch fail exceptions. > > > Please let me know , is there a way to fix it. Note if setting above > property to false , I face no shuffle fetch exception. > > > Regards > Pralabh >
Spark on k8s : spark 3.0.1 spark.kubernetes.executor.deleteontermination issue
Hi Spark team, We have the cluster-wide property spark.kubernetes.executor.deleteOnTermination set to true. During a long-running job, some of the executors that held shuffle data got deleted. Because of this, in the subsequent stage we get a lot of shuffle fetch failed exceptions. Please let me know if there is a way to fix it. Note that if I set the above property to false, I see no shuffle fetch exceptions. Regards Pralabh
Difference in behavior for Spark 3.0 vs Spark 3.1 "create database "
Hi Spark Team, When creating a database via Spark 3.0 on Hive: 1) spark.sql("create database test location '/user/hive'") creates the database location on HDFS, as expected. 2) When running the same command on 3.1, the database is created on the local file system by default; I have to prefix the path with hdfs:// to create the database on HDFS. Why is there a difference in behavior? Can you please point me to the JIRA which caused this change? Note: spark.sql.warehouse.dir and hive.metastore.warehouse.dir both have default values (not explicitly set). Regards Pralabh Kumar
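For anyone hitting the same difference, a small sketch of the workaround described above: make the scheme explicit so the location is unambiguous regardless of the default filesystem (the namenode address and path are placeholders):

spark.sql("CREATE DATABASE test LOCATION 'hdfs://<namenode>:8020/user/hive/test.db'")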
ivy unit test case filing for Spark
Hi Spark Team, I am building Spark inside a VPN, but the unit test case below is failing. It points to an Ivy location which cannot be reached from within the VPN. Any help would be appreciated. test("SPARK-33084: Add jar support Ivy URI -- default transitive = true") { sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local-cluster[3, 1, 1024]")) sc.addJar("ivy://org.apache.hive:hive-storage-api:2.7.0") assert(sc.listJars().exists(_.contains("org.apache.hive_hive-storage-api-2.7.0.jar"))) assert(sc.listJars().exists(_.contains("commons-lang_commons-lang-2.6.jar"))) } Error - SPARK-33084: Add jar support Ivy URI -- default transitive = true *** FAILED *** java.lang.RuntimeException: [unresolved dependency: org.apache.hive#hive-storage-api;2.7.0: not found] at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1447) at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:185) at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:159) at org.apache.spark.SparkContext.addJar(SparkContext.scala:1996) at org.apache.spark.SparkContext.addJar(SparkContext.scala:1928) at org.apache.spark.SparkContextSuite.$anonfun$new$115(SparkContextSuite.scala:1041) at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) Regards Pralabh Kumar
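One avenue for VPN-restricted environments, noted here only as a sketch and not a verified fix for this particular suite, is to point Ivy resolution at an internal mirror via an Ivy settings file through the existing spark.jars.ivySettings option; whether SparkContextSuite picks this up in your build is worth confirming:

val conf = new SparkConf()
  .setAppName("test")
  .setMaster("local-cluster[3, 1, 1024]")
  .set("spark.jars.ivySettings", "/path/to/ivysettings.xml")  // file declares an internal Maven/Ivy mirror reachable inside the VPN
sc = new SparkContext(conf)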
Log4j 1.2.17 spark CVE
Hi developers, users, Spark is built using log4j 1.2.17. Is there a plan to upgrade based on the recently detected CVE? Regards Pralabh Kumar
https://issues.apache.org/jira/browse/SPARK-36622
Hi Spark dev Community Please let me know your opinion about https://issues.apache.org/jira/browse/SPARK-36622 Regards Pralabh Kumar
Spark Thriftserver is failing for when submitting command from beeline
Hi Dev Environment details Hadoop 3.2 Hive 3.1 Spark 3.0.3 Cluster : Kerborized . 1) Hive server is running fine 2) Spark sql , sparkshell, spark submit everything is working as expected. 3) Connecting Hive through beeline is working fine (after kinit) beeline -u "jdbc:hive2://:/default;principal= Now launched Spark thrift server and try to connect it through beeline. beeline client perfectly connects with STS . 4) beeline -u "jdbc:hive2://:/default;principal= a) Log says connected to Spark sql Drive : Hive JDBC Now when I run any commands ("show tables") it fails . Log ins STS says *21/08/19 19:30:12 DEBUG UserGroupInformation: PrivilegedAction as: (auth:PROXY) via (auth:KERBEROS) from:org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Client.createClientTransport(HadoopThriftAuthBridge.java:208)* *21/08/19 19:30:12 DEBUG UserGroupInformation: PrivilegedAction as:* ** * (auth:PROXY) via * ** * (auth:KERBEROS) from:org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)* 21/08/19 19:30:12 DEBUG TSaslTransport: opening transport org.apache.thrift.transport.TSaslClientTransport@f43fd2f 21/08/19 19:30:12 ERROR TSaslTransport: SASL negotiation failure javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:95) at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271) at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:38) at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52) at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:480) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:247) at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1707) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:83) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:133) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3600) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3652) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3632) at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1556) at 
org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1545) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$databaseExists$1(HiveClientImpl.scala:384) My guess is authorization through proxy is not working . Please help Regards Pralabh Kumar
Re: Hive on Spark vs Spark on Hive(HiveContext)
Hi mich Thx for replying.your answer really helps. The comparison was done in 2016. I would like to know the latest comparison with spark 3.0 Also what you are suggesting is to migrate queries to Spark ,which is hivecontxt or hive on spark, which is what Facebook also did . Is that understanding correct ? Regards Pralabh On Thu, 1 Jul 2021, 15:44 Mich Talebzadeh, wrote: > Hi Prahabh, > > This question has been asked before :) > > Few years ago (late 2016), I made a presentation on running Hive Queries > on the Spark execution engine for Hortonworks. > > > https://www.slideshare.net/MichTalebzadeh1/query-engines-for-hive-mr-spark-tez-with-llap-considerations > > The issue you will face will be compatibility problems with versions of > Hive and Spark. > > My suggestion would be to use Spark as a massive parallel processing and > Hive as a storage layer. However, you need to test what can be migrated or > not. > > HTH > > > Mich > > >view my Linkedin profile > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Thu, 1 Jul 2021 at 10:52, Pralabh Kumar wrote: > >> Hi Dev >> >> I am having thousands of legacy hive queries . As a plan to move to >> Spark , we are planning to migrate Hive queries on Spark . Now there are >> two approaches >> >> >>1. One is Hive on Spark , which is similar to changing the execution >>engine in hive queries like TEZ. >>2. Another one is migrating hive queries to Hivecontext/sparksql , an >>approach used by Facebook and presented in Spark conference. >> >> https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention >>. >> >> >> Can you please guide me which option to go for . I am personally inclined >> to go for option 2 . It also allows the use of the latest spark . >> >> Please help me on the same , as there are not much comparisons online >> available keeping Spark 3.0 in perspective. >> >> Regards >> Pralabh Kumar >> >> >>
Hive on Spark vs Spark on Hive(HiveContext)
Hi Dev I am having thousands of legacy hive queries . As a plan to move to Spark , we are planning to migrate Hive queries on Spark . Now there are two approaches 1. One is Hive on Spark , which is similar to changing the execution engine in hive queries like TEZ. 2. Another one is migrating hive queries to Hivecontext/sparksql , an approach used by Facebook and presented in Spark conference. https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention . Can you please guide me which option to go for . I am personally inclined to go for option 2 . It also allows the use of the latest spark . Please help me on the same , as there are not much comparisons online available keeping Spark 3.0 in perspective. Regards Pralabh Kumar
Unable to pickle pySpark PipelineModel
Hi Dev, User, I want to store Spark ML models in a database so that I can reuse them later, but I am unable to pickle them. However, while using Scala I am able to convert them into a byte array stream. So, for example, I am able to do something like the following in Scala but not in Python: val modelToByteArray = new ByteArrayOutputStream() val oos = new ObjectOutputStream(modelToByteArray) oos.writeObject(model) oos.close() oos.flush() spark.sparkContext.parallelize(Seq((model.uid, "my-neural-network-model", modelToByteArray.toByteArray))) .saveToCassandra("dfsdfs", "models", SomeColumns("uid", "name", "model")) But pickle.dumps(model) in PySpark throws the error: cannot pickle '_thread.RLock' object. Please help on the same. Regards Pralabh
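A route that avoids Java/pickle serialization entirely, sketched here with placeholder paths and names, is Spark ML's own writer/reader: it persists a fitted PipelineModel to any Hadoop-compatible path and can be loaded back from either the Scala or the Python API, and the written files could then be copied into a database blob column if a filesystem path alone is not acceptable:

import org.apache.spark.ml.PipelineModel

// persist the fitted model with the built-in MLWriter
model.write.overwrite().save("hdfs:///models/my-neural-network-model")

// reload it later (PipelineModel.load also exists in pyspark.ml)
val restored = PipelineModel.load("hdfs:///models/my-neural-network-model")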
Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:
I am performing join operation , if I convert reduce side join to map side (no shuffle will happen) and I assume in that case this error shouldn't come. Let me know if this understanding is correct On Tue, May 1, 2018 at 9:37 PM, Ryan Blue wrote: > This is usually caused by skew. Sometimes you can work around it by in > creasing the number of partitions like you tried, but when that doesn’t > work you need to change the partitioning that you’re using. > > If you’re aggregating, try adding an intermediate aggregation. For > example, if your query is select sum(x), a from t group by a, then try select > sum(partial), a from (select sum(x) as partial, a, b from t group by a, b) > group by a. > > rb > > > On Tue, May 1, 2018 at 4:21 AM, Pralabh Kumar > wrote: > >> Hi >> >> I am getting the above error in Spark SQL . I have increase (using 5000 ) >> number of partitions but still getting the same error . >> >> My data most probably is skew. >> >> >> >> org.apache.spark.shuffle.FetchFailedException: Too large frame: 4247124829 >> at >> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:419) >> at >> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:349) >> >> > > > -- > Ryan Blue > Software Engineer > Netflix >
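For the map-side route mentioned above, a minimal sketch with the DataFrame API, assuming the smaller side fits in executor memory (table and column names are placeholders):

import org.apache.spark.sql.functions.broadcast

// broadcast the small table so the join happens map-side, without a shuffle
val joined = largeDF.join(broadcast(smallDF), Seq("join_key"))

This avoids the shuffle for the join itself, so the oversized shuffle frame should not be requested for that stage; it does not help if the large side still needs a shuffle for a later aggregation, which is where the intermediate aggregation suggested above applies.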
org.apache.spark.shuffle.FetchFailedException: Too large frame:
Hi, I am getting the above error in Spark SQL. I have increased the number of partitions (to 5000) but am still getting the same error. My data is most probably skewed. org.apache.spark.shuffle.FetchFailedException: Too large frame: 4247124829 at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:419) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:349)
Re: Best way to Hive to Spark migration
Hi I have lot of ETL jobs (complex ones) , since they are SLA critical , I am planning them to migrate to spark. On Thu, Apr 5, 2018 at 10:46 AM, Jörn Franke wrote: > You need to provide more context on what you do currently in Hive and what > do you expect from the migration. > > On 5. Apr 2018, at 05:43, Pralabh Kumar wrote: > > Hi Spark group > > What's the best way to Migrate Hive to Spark > > 1) Use HiveContext of Spark > 2) Use Hive on Spark (https://cwiki.apache.org/ > confluence/display/Hive/Hive+on+Spark%3A+Getting+Started) > 3) Migrate Hive to Calcite to Spark SQL > > > Regards > >
Best way to Hive to Spark migration
Hi Spark group, What's the best way to migrate Hive to Spark? 1) Use HiveContext of Spark 2) Use Hive on Spark ( https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started ) 3) Migrate Hive to Calcite to Spark SQL Regards
Are there any alternatives to Hive "stored by" clause as Spark 2.0 does not support it
Hi, Spark 2.0 doesn't support the STORED BY clause. Is there any alternative to achieve the same?
Re: Kryo serialization failed: Buffer overflow : Broadcast Join
I am using spark 2.1.0 On Fri, Feb 2, 2018 at 5:08 PM, Pralabh Kumar wrote: > Hi > > I am performing broadcast join where my small table is 1 gb . I am > getting following error . > > I am using > > > org.apache.spark.SparkException: > . Available: 0, required: 28869232. To avoid this, increase > spark.kryoserializer.buffer.max value > > > > I increase the value to > > spark.conf.set("spark.kryoserializer.buffer.max","2g") > > > But I am still getting the error . > > Please help > > Thx >
Kryo serialization failed: Buffer overflow : Broadcast Join
Hi, I am performing a broadcast join where my small table is 1 GB. I am getting the following error: org.apache.spark.SparkException: . Available: 0, required: 28869232. To avoid this, increase spark.kryoserializer.buffer.max value I increased the value with spark.conf.set("spark.kryoserializer.buffer.max","2g") but I am still getting the error. Please help. Thx
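Two things that commonly trip this up, sketched here as assumptions to check rather than a confirmed diagnosis: the Kryo buffer settings are read when the executors start, so they need to be on the SparkConf (or passed via spark-submit --conf) before the SparkSession/SparkContext is created rather than set through spark.conf.set at runtime, and spark.kryoserializer.buffer.max must stay below 2048m:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "512m") // comfortably above the ~28 MB required here, below the 2048m cap

val spark = SparkSession.builder().config(conf).getOrCreate()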
Does Spark and Hive use Same SQL parser : ANTLR
Hi, Do Hive and Spark use the same ANTLR-based SQL parser? Do they generate the same logical plan? Please help on the same. Regards Pralabh Kumar
Spark build is failing in amplab Jenkins
Hi Dev, The Spark build is failing in Jenkins: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83353/consoleFull Python versions prior to 2.7 are not supported. Build step 'Execute shell' marked build as failure Archiving artifacts Recording test results ERROR: Step 'Publish JUnit test result report' failed: No test report files were found. Configuration error? Please help. Regards Pralabh Kumar
[SPARK-20199][ML] : Provided featureSubsetStrategy to GBTClassifier and GBTRegressor
Hi Developers, Can somebody look into this pull request? It is being reviewed by MLnick <https://github.com/apache/spark/pull/18118/files/79702933d321051222073057b25305831df84c6d>, sethah <https://github.com/apache/spark/pull/18118/files/12d83aaa5f24e17304de760c5664b8ede55d678a>, and mpjlu <https://github.com/apache/spark/pull/18118/files/16ccbdfd8862c528c90fdde94c8ec20d6631126e>. Please review it. Regards Pralabh Kumar
Re: How to tune the performance of Tpch query5 within Spark
Hi To read file parallely , you can follow the below code. case class readData (fileName : String , spark : SparkSession) extends Callable[Dataset[Row]]{ override def call(): Dataset[Row] = { spark.read.parquet(fileName) // spark.read.csv(fileName) } } val spark = SparkSession.builder() .appName("practice") .config("spark.scheduler.mode","FAIR") .enableHiveSupport().getOrCreate() val pool = Executors.newFixedThreadPool(6) val list = new util.ArrayList[Future[Dataset[Row]]]() for(fileName<-"orders,lineitem,customer,supplier,region,nation".split(",")){ val o1 = new readData(fileName,spark) //pool.submit(o1). list.add(pool.submit(o1)) } val rddList = new ArrayBuffer[Dataset[Row]]() for(result <- list){ rddList += result.get() } pool.shutdown() pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS) for(finalData<-rddList){ finalData.show() } This will read data in parallel ,which I think is your main bottleneck. Regards Pralabh Kumar On Mon, Jul 17, 2017 at 6:25 PM, vaquar khan wrote: > Could you please let us know your Spark version? > > > Regards, > vaquar khan > > > On Jul 17, 2017 12:18 AM, "163" wrote: > >> I change the UDF but the performance seems still slow. What can I do else? >> >> >> 在 2017年7月14日,下午8:34,Wenchen Fan 写道: >> >> Try to replace your UDF with Spark built-in expressions, it should be as >> simple as `$”x” * (lit(1) - $”y”)`. >> >> On 14 Jul 2017, at 5:46 PM, 163 wrote: >> >> I modify the tech query5 to DataFrame: >> >> val forders = >> spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/orders*”*).filter("o_orderdate >> < 1995-01-01 and o_orderdate >= 1994-01-01").select("o_custkey", >> "o_orderkey") >> val flineitem = >> spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/lineitem") >> val fcustomer = >> spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/customer") >> val fsupplier = >> spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/supplier") >> val fregion = >> spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/region*”*).where("r_name >> = 'ASIA'").select($"r_regionkey") >> val fnation = >> spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/nation*”*) >> >> val decrease = udf { (x: Double, y: Double) => x * (1 - y) } >> >> val res = flineitem.join(forders, $"l_orderkey" === forders("o_orderkey")) >> .join(fcustomer, $"o_custkey" === fcustomer("c_custkey")) >> .join(fsupplier, $"l_suppkey" === fsupplier("s_suppkey") && >> $"c_nationkey" === fsupplier("s_nationkey")) >> .join(fnation, $"s_nationkey" === fnation("n_nationkey")) >> .join(fregion, $"n_regionkey" === fregion("r_regionkey")) >> .select($"n_name", decrease($"l_extendedprice", >> $"l_discount").as("value")) >> .groupBy($"n_name") >> .agg(sum($"value").as("revenue")) >> .sort($"revenue".desc).show() >> >> >> My environment is one master(Hdfs-namenode), four workers(HDFS-datanode), >> each with 40 cores and 128GB memory. TPCH 100G stored on HDFS using parquet >> format. >> >> It executed about 1.5m, I found that read these 6 tables using >> spark.read.parqeut is sequential, How can I made this to run parallelly ? >> >> I’ve already set data locality and spark.default.parallelism, >> spark.serializer, using G1, But the runtime is still not reduced. >> >> And is there any advices for me to tuning this performance? >> >> Thank you. >> >> Wenting He >> >> >> >> >>
Re: Memory issue in pyspark for 1.6 mb file
Hi Naga Is it failing because of driver memory full or executor memory full ? can you please try setting this property spark.cleaner.ttl ? . So that older RDDs /metadata should also get clear automatically. Can you please provide the complete error stacktrace and code snippet ?. Regards Pralabh Kumar On Sun, Jun 18, 2017 at 12:06 AM, Naga Guduru wrote: > Hi, > > I am trying to load 1.6 mb excel file which has 16 tabs. We converted > excel to csv and loaded 16 csv files to 8 tables. Job was running > successful in 1st run in pyspark. When trying to run the same job 2 time, > container getting killed due to memory issues. > > I am using unpersist and clearcache on all rdds and dataframes after each > file loaded into table. Each csv file is loaded in sequence process ( for > loop) as some of the files should go to same table. Job will run 15 min if > it was success and 12-15 min if it was failed. If i increase the driver > memory and executor memory to more than 5 gb, its getting success. > > My assumption is driver memory full, and unpersist clear cache not working. > > Error: physical memory of 2 gb used and virtual memory of 4.6 gb used. > > Spark 1.6 version running in Cloudera Enterprise . > > Please let me know, if you need any info. > > > Thanks > >
Re: featureSubsetStrategy parameter for GradientBoostedTreesModel
Hi everyone Currently GBT doesn't expose featureSubsetStrategy as exposed by Random Forest. . GradientBoostedTrees in Spark have hardcoded feature subset strategy to "all" while calling random forest in DecisionTreeRegressor.scala val trees = RandomForest.run(data, oldStrategy, numTrees = 1, featureSubsetStrategy = "all", It should provide functionality to the user to set featureSubsetStrategy ("auto", "all" ,"sqrt" , "log2" , "onethird") , the way random forest does. This will help GBT to have randomness at feature level. Jira SPARK-20199 <https://issues.apache.org/jira/browse/SPARK-20199> Please let me know , if my understanding is correct. Regards Pralabh Kumar On Fri, Jun 16, 2017 at 7:53 AM, Pralabh Kumar wrote: > Hi everyone > > Currently GBT doesn't expose featureSubsetStrategy as exposed by Random > Forest. > > . > GradientBoostedTrees in Spark have hardcoded feature subset strategy to > "all" while calling random forest in DecisionTreeRegressor.scala > > val trees = RandomForest.run(data, oldStrategy, numTrees = 1, > featureSubsetStrategy = "all", > > > It should provide functionality to the user to set featureSubsetStrategy > ("auto", "all" ,"sqrt" , "log2" , "onethird") , > the way random forest does. > > This will help GBT to have randomness at feature level. > > Jira SPARK-20199 <https://issues.apache.org/jira/browse/SPARK-20199> > > Please let me know , if my understanding is correct. > > Regards > Pralabh Kumar >
featureSubsetStrategy parameter for GradientBoostedTreesModel
Hi everyone, Currently GBT doesn't expose featureSubsetStrategy the way Random Forest does. GradientBoostedTrees in Spark hardcodes the feature subset strategy to "all" when calling random forest in DecisionTreeRegressor.scala: val trees = RandomForest.run(data, oldStrategy, numTrees = 1, featureSubsetStrategy = "all", ... It should let the user set featureSubsetStrategy ("auto", "all", "sqrt", "log2", "onethird"), the way Random Forest does. This would give GBT randomness at the feature level. Jira SPARK-20199 <https://issues.apache.org/jira/browse/SPARK-20199> Please let me know if my understanding is correct. Regards Pralabh Kumar