Re: AWS EMR SPARK 3.1.1 date issues
Hi Nicolas,

thanks a ton for your kind response, I will surely try this out.

Regards,
Gourav Sengupta

On Sun, Aug 29, 2021 at 11:01 PM Nicolas Paris wrote:
> as a workaround, turn off pruning:
>
>     spark.sql.hive.metastorePartitionPruning false
>     spark.sql.hive.convertMetastoreParquet false
>
> see https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/issues/45
>
> On Tue, Aug 24, 2021 at 9:18 AM CEST, Gourav Sengupta wrote:
> > Hi,
> >
> > I received a response from AWS: this is an issue with EMR, and I believe they are working on resolving it.
> >
> > Thanks and Regards,
> > Gourav Sengupta
> >
> > On Mon, Aug 23, 2021 at 1:35 PM Gourav Sengupta <gourav.sengupta.develo...@gmail.com> wrote:
> > > Hi,
> > >
> > > the query still gives the same error if we write "SELECT * FROM table_name WHERE data_partition > CURRENT_DATE() - INTERVAL 10 DAYS".
> > >
> > > Also, the queries work fine in Spark 3.0.x and in EMR 6.2.0.
> > >
> > > Thanks and Regards,
> > > Gourav Sengupta
> > >
> > > On Mon, Aug 23, 2021 at 1:16 PM Sean Owen wrote:
> > > > Date handling was tightened up in Spark 3. I think you need to compare to a date literal, not a string literal.
> > > >
> > > > On Mon, Aug 23, 2021 at 5:12 AM Gourav Sengupta <gourav.sengupta.develo...@gmail.com> wrote:
> > > > > Hi,
> > > > >
> > > > > while I am running a simple query in EMR 6.3.0 (Spark 3.1.1) such as "SELECT * FROM <table_name> WHERE <data_partition> > '2021-03-01'", the query fails with the error:
> > > > >
> > > > > pyspark.sql.utils.AnalysisException:
> > > > > org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported expression '2021 - 03 - 01' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: dd3549c2-2eeb-4616-8dc5-5887ba43dd22; Proxy: null)
> > > > >
> > > > > The above query works fine in all previous versions of Spark.
> > > > >
> > > > > Is this the expected behaviour in Spark 3.1.1? If so, can someone please let me know how to write this query?
> > > > >
> > > > > Also, if this is the expected behaviour, I think a lot of users will have to change their existing code, which makes the transition to Spark 3.1.1 expensive.
> > > > >
> > > > > Regards,
> > > > > Gourav Sengupta
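Sean's suggestion above (compare against a date literal, not a string literal) can be sketched as follows. This is a minimal illustration, not the thread author's actual code: the table and column names were elided in the original message, so `table_name` and `data_partition` below are taken from the sibling message in this thread and should be treated as placeholders. A typed `DATE '...'` literal keeps the predicate unambiguous when Spark pushes partition pruning down to the Glue catalog:

```python
from datetime import date, timedelta

# Hypothetical names; the real table/column were elided in the thread.
# Spark 3.x is stricter about comparing a DATE partition column to a raw
# string, so embed a typed DATE literal in the SQL text instead.
cutoff = date.today() - timedelta(days=10)
query = (
    "SELECT * FROM table_name "
    f"WHERE data_partition > DATE '{cutoff.isoformat()}'"
)
print(query)
# With the DataFrame API the comparison can instead be made against a
# datetime.date object directly, e.g. (pyspark, sketch only):
#   df.where(col("data_partition") > lit(cutoff))
```

Whether this avoids the Glue `InvalidInputException` depends on the EMR/Glue client fix the thread mentions; the literal form is the portable part.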
Performance Degradation in Spark 3.0.2 compared to Spark 3.0.1
Seasonal Greetings,

We're doing TPC-DS query tests using Spark 3.0.2 on Kubernetes, with data on HDFS, and we're observing delays in query execution time compared to Spark 3.0.1 on the same environment. We've observed that some stages fail, but it looks like it takes some time to detect the failure and re-trigger these stages. We observe the same behaviour with Spark 3.0.3. Please let us know if anyone has observed similar issues.

The configuration we use for the Spark driver (sensitive values redacted):

spark.io.compression.codec=snappy
spark.sql.parquet.filterPushdown=true
spark.sql.inMemoryColumnarStorage.batchSize=15000
spark.shuffle.file.buffer=1024k
spark.ui.retainedStages=1
spark.kerberos.keytab=
spark.speculation=false
spark.submit.deployMode=cluster
spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true
spark.sql.orc.filterPushdown=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.sql.crossJoin.enabled=true
spark.kubernetes.kerberos.keytab=
spark.sql.adaptive.enabled=true
spark.kryo.unsafe=true
spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=
spark.executor.cores=2
spark.ui.retainedTasks=20
spark.network.timeout=2400
spark.rdd.compress=true
spark.executor.memoryoverhead=3G
spark.master=k8s\:
spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=
spark.kubernetes.driver.limit.cores=6144m
spark.kubernetes.submission.waitAppCompletion=false
spark.kerberos.principal=
spark.kubernetes.kerberos.enabled=true
spark.kubernetes.allocation.batch.size=5
spark.kubernetes.authenticate.driver.serviceAccountName=
spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true
spark.reducer.maxSizeInFlight=1024m
spark.storage.memoryFraction=0.25
spark.kubernetes.namespace=
spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=
spark.rpc.numRetries=5
spark.shuffle.consolidateFiles=true
spark.sql.shuffle.partitions=400
spark.kubernetes.kerberos.krb5.path=/
spark.sql.codegen=true
spark.ui.strictTransportSecurity=max-age\=31557600
spark.ui.retainedJobs=1
spark.driver.port=7078
spark.shuffle.io.backLog=256
spark.ssl.ui.enabled=true
spark.kubernetes.memoryOverheadFactor=0.1
spark.driver.blockManager.port=7079
spark.kubernetes.executor.limit.cores=4096m
spark.submit.pyFiles=
spark.kubernetes.container.image=
spark.shuffle.io.numConnectionsPerPeer=10
spark.sql.broadcastTimeout=7200
spark.driver.cores=3
spark.executor.memory=9g
spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=dfbd9c75-3771-4392-928e-10bf28d94099
spark.driver.maxResultSize=4g
spark.sql.parquet.mergeSchema=false
spark.sql.inMemoryColumnarStorage.compressed=true
spark.rpc.retry.wait=5
spark.hadoop.parquet.enable.summary-metadata=false
spark.kubernetes.allocation.batch.delay=9
spark.driver.memory=16g
spark.sql.starJoinOptimization=true
spark.kubernetes.submitInDriver=true
spark.shuffle.compress=true
spark.memory.useLegacyMode=true
spark.jars=
spark.kubernetes.resource.type=java
spark.locality.wait=0s
spark.kubernetes.driver.ui.svc.port=4040
spark.sql.orc.splits.include.file.footer=true
spark.kubernetes.kerberos.principal=
spark.sql.orc.cache.stripe.details.size=1
spark.executor.instances=22
spark.hadoop.fs.hdfs.impl.disable.cache=true
spark.sql.hive.metastorePartitionPruning=true

Thanks and Regards
Prakash
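When chasing a regression between two Spark versions, a useful first step is to dump the effective configuration of both runs (for example via `spark.sparkContext.getConf().getAll()`) and diff them, since changed defaults between releases are a common culprit. A minimal sketch, assuming both dumps have been loaded into key/value dicts; the two snapshot dicts below are made-up examples, not the reporter's real values:

```python
def diff_confs(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for every key whose value
    differs between the two config snapshots; None marks a missing key."""
    changed = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changed[key] = (old.get(key), new.get(key))
    return changed

# Tiny made-up snapshots standing in for full getConf().getAll() dumps:
conf_301 = {"spark.sql.adaptive.enabled": "true",
            "spark.network.timeout": "2400"}
conf_302 = {"spark.sql.adaptive.enabled": "true",
            "spark.network.timeout": "120s",
            "spark.memory.useLegacyMode": "true"}
print(diff_confs(conf_301, conf_302))
```

Keys that appear only on one side (shown with `None`) are often the most telling, since they reveal options a new version started reading or stopped honouring.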
Re: AWS EMR SPARK 3.1.1 date issues
as a workaround, turn off pruning:

    spark.sql.hive.metastorePartitionPruning false
    spark.sql.hive.convertMetastoreParquet false

see https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/issues/45

On Tue, Aug 24, 2021 at 9:18 AM CEST, Gourav Sengupta wrote:
> Hi,
>
> I received a response from AWS: this is an issue with EMR, and I believe they are working on resolving it.
>
> Thanks and Regards,
> Gourav Sengupta
>
> On Mon, Aug 23, 2021 at 1:35 PM Gourav Sengupta <gourav.sengupta.develo...@gmail.com> wrote:
> > Hi,
> >
> > the query still gives the same error if we write "SELECT * FROM table_name WHERE data_partition > CURRENT_DATE() - INTERVAL 10 DAYS".
> >
> > Also, the queries work fine in Spark 3.0.x and in EMR 6.2.0.
> >
> > Thanks and Regards,
> > Gourav Sengupta
> >
> > On Mon, Aug 23, 2021 at 1:16 PM Sean Owen wrote:
> > > Date handling was tightened up in Spark 3. I think you need to compare to a date literal, not a string literal.
> > >
> > > On Mon, Aug 23, 2021 at 5:12 AM Gourav Sengupta <gourav.sengupta.develo...@gmail.com> wrote:
> > > > Hi,
> > > >
> > > > while I am running a simple query in EMR 6.3.0 (Spark 3.1.1) such as "SELECT * FROM <table_name> WHERE <data_partition> > '2021-03-01'", the query fails with the error:
> > > >
> > > > pyspark.sql.utils.AnalysisException:
> > > > org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported expression '2021 - 03 - 01' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: dd3549c2-2eeb-4616-8dc5-5887ba43dd22; Proxy: null)
> > > >
> > > > The above query works fine in all previous versions of Spark.
> > > >
> > > > Is this the expected behaviour in Spark 3.1.1? If so, can someone please let me know how to write this query?
> > > >
> > > > Also, if this is the expected behaviour, I think a lot of users will have to change their existing code, which makes the transition to Spark 3.1.1 expensive.
> > > >
> > > > Regards,
> > > > Gourav Sengupta
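The two pruning settings in the workaround above can be supplied at submit time without touching application code. A sketch of the invocation; the application jar, main class, and cluster endpoint are placeholders, not values from the thread:

```shell
# Sketch: disable Hive metastore partition pruning and the Parquet
# conversion path, per the workaround above. Placeholders in <angle
# brackets> must be replaced with real values for your cluster.
spark-submit \
  --conf spark.sql.hive.metastorePartitionPruning=false \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --class <main_class> \
  <application.jar>
```

Note the trade-off: with pruning disabled, Spark asks the metastore for all partitions and filters them client-side, so queries over heavily partitioned tables may slow down; this is a stopgap until the EMR/Glue fix lands.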