Re: AWS EMR SPARK 3.1.1 date issues

2021-08-29 Thread Gourav Sengupta
Hi Nicolas,

Thanks a ton for your kind response; I will surely try this out.

Regards,
Gourav Sengupta

On Sun, Aug 29, 2021 at 11:01 PM Nicolas Paris wrote:

> As a workaround, turn off pruning:
>
> spark.sql.hive.metastorePartitionPruning false
> spark.sql.hive.convertMetastoreParquet false
>
> see
> https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/issues/45


Performance Degradation in Spark 3.0.2 compared to Spark 3.0.1

2021-08-29 Thread Sharma, Prakash (Nokia - IN/Bangalore)
Seasonal greetings,
 We are running TPC-DS query tests with Spark 3.0.2 on Kubernetes, with data
on HDFS, and query execution times are longer than with Spark 3.0.1 in the same
environment. We have observed that some stages fail, but it seems to take some
time for the failure to be detected and the stages to be re-triggered. The
configuration we used for the Spark driver is included below. We observe the
same behaviour with Spark 3.0.3.
Please let us know if anyone has observed similar issues.

Configuration we use for the Spark driver:
spark.io.compression.codec=snappy
spark.sql.parquet.filterPushdown=true
spark.sql.inMemoryColumnarStorage.batchSize=15000
spark.shuffle.file.buffer=1024k
spark.ui.retainedStages=1
spark.kerberos.keytab=
spark.speculation=false
spark.submit.deployMode=cluster
spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true
spark.sql.orc.filterPushdown=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.sql.crossJoin.enabled=true
spark.kubernetes.kerberos.keytab=
spark.sql.adaptive.enabled=true
spark.kryo.unsafe=true
spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=
spark.executor.cores=2
spark.ui.retainedTasks=20
spark.network.timeout=2400
spark.rdd.compress=true
spark.executor.memoryOverhead=3G
spark.master=k8s\:
spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=
spark.kubernetes.driver.limit.cores=6144m
spark.kubernetes.submission.waitAppCompletion=false
spark.kerberos.principal=
spark.kubernetes.kerberos.enabled=true
spark.kubernetes.allocation.batch.size=5
spark.kubernetes.authenticate.driver.serviceAccountName=
spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true
spark.reducer.maxSizeInFlight=1024m
spark.storage.memoryFraction=0.25
spark.kubernetes.namespace=
spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=
spark.rpc.numRetries=5
spark.shuffle.consolidateFiles=true
spark.sql.shuffle.partitions=400
spark.kubernetes.kerberos.krb5.path=/
spark.sql.codegen=true
spark.ui.strictTransportSecurity=max-age\=31557600
spark.ui.retainedJobs=1
spark.driver.port=7078
spark.shuffle.io.backLog=256
spark.ssl.ui.enabled=true
spark.kubernetes.memoryOverheadFactor=0.1
spark.driver.blockManager.port=7079
spark.kubernetes.executor.limit.cores=4096m
spark.submit.pyFiles=
spark.kubernetes.container.image=
spark.shuffle.io.numConnectionsPerPeer=10
spark.sql.broadcastTimeout=7200
spark.driver.cores=3
spark.executor.memory=9g
spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=dfbd9c75-3771-4392-928e-10bf28d94099
spark.driver.maxResultSize=4g
spark.sql.parquet.mergeSchema=false
spark.sql.inMemoryColumnarStorage.compressed=true
spark.rpc.retry.wait=5
spark.hadoop.parquet.enable.summary-metadata=false
spark.kubernetes.allocation.batch.delay=9
spark.driver.memory=16g
spark.sql.starJoinOptimization=true
spark.kubernetes.submitInDriver=true
spark.shuffle.compress=true
spark.memory.useLegacyMode=true
spark.jars=
spark.kubernetes.resource.type=java
spark.locality.wait=0s
spark.kubernetes.driver.ui.svc.port=4040
spark.sql.orc.splits.include.file.footer=true
spark.kubernetes.kerberos.principal=
spark.sql.orc.cache.stripe.details.size=1
spark.executor.instances=22
spark.hadoop.fs.hdfs.impl.disable.cache=true
spark.sql.hive.metastorePartitionPruning=true
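
For anyone reproducing the comparison, a minimal PySpark timing sketch (the
queries/ directory, single-statement .sql files, and pre-registered TPC-DS
tables are assumptions for illustration): run it once per Spark version and
diff the per-query latencies.

import glob
import time

from pyspark.sql import SparkSession

# Assumes the TPC-DS tables are already registered in the metastore and
# that each file under queries/ holds one statement (hypothetical layout).
spark = SparkSession.builder.appName("tpcds-timing").enableHiveSupport().getOrCreate()

for path in sorted(glob.glob("queries/*.sql")):
    with open(path) as f:
        query = f.read()
    start = time.time()
    spark.sql(query).collect()  # force full execution of the plan
    print(f"{path}: {time.time() - start:.1f}s")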

Thanks and Regards
Prakash



Re: AWS EMR SPARK 3.1.1 date issues

2021-08-29 Thread Nicolas Paris
As a workaround, turn off pruning:

spark.sql.hive.metastorePartitionPruning false
spark.sql.hive.convertMetastoreParquet false

see 
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/issues/45
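
A minimal PySpark sketch applying the workaround at session startup (only the
two property names come from the thread; the app name and everything else is
illustrative):

from pyspark.sql import SparkSession

# Disable metastore partition pruning and the Parquet conversion path,
# per the workaround above.
spark = (
    SparkSession.builder
    .appName("glue-pruning-workaround")  # hypothetical app name
    .config("spark.sql.hive.metastorePartitionPruning", "false")
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .enableHiveSupport()
    .getOrCreate()
)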

On Tue Aug 24, 2021 at 9:18 AM CEST, Gourav Sengupta wrote:
> Hi,
>
> I received a response from AWS: this is an issue with EMR, and I believe
> they are working on resolving it.
>
> Thanks and Regards,
> Gourav Sengupta
>
> On Mon, Aug 23, 2021 at 1:35 PM Gourav Sengupta <
> gourav.sengupta.develo...@gmail.com> wrote:
>
> > Hi,
> >
> > the query still gives the same error if we write "SELECT * FROM table_name
> > WHERE data_partition > CURRENT_DATE() - INTERVAL 10 DAYS".
> >
> > Also, the queries work fine in SPARK 3.0.x and in EMR 6.2.0.
> >
> >
> > Thanks and Regards,
> > Gourav Sengupta
> >
> > On Mon, Aug 23, 2021 at 1:16 PM Sean Owen  wrote:
> >
> >> Date handling was tightened up in Spark 3. I think you need to compare to
> >> a date literal, not a string literal.
> >>
> >> On Mon, Aug 23, 2021 at 5:12 AM Gourav Sengupta <
> >> gourav.sengupta.develo...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> while running on EMR 6.3.0 (SPARK 3.1.1), a simple query such as "SELECT
> >>> * FROM  WHERE  > '2021-03-01'" fails with the
> >>> error:
> >>>
> >>> ---
> >>> pyspark.sql.utils.AnalysisException:
> >>> org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported
> >>> expression '2021 - 03 - 01' (Service: AWSGlue; Status Code: 400; Error
> >>> Code: InvalidInputException; Request ID:
> >>> dd3549c2-2eeb-4616-8dc5-5887ba43dd22; Proxy: null)
> >>>
> >>> ---
> >>>
> >>> The above query works fine in all previous versions of SPARK.
> >>>
> >>> Is this the expected behaviour in SPARK 3.1.1? If so, can someone please
> >>> let me know how to write this query?
> >>>
> >>> Also, if this is the expected behaviour, a lot of users will have to
> >>> change their existing code, making the transition to SPARK 3.1.1
> >>> expensive.
> >>>
> >>> Regards,
> >>> Gourav Sengupta
> >>>
> >>
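
Following Sean's suggestion, a sketch of the filter rewritten with an explicit
DATE literal, reusing the spark session from the sketch above (table_name and
date_col are placeholders; per the thread, this may still trip the Glue catalog
unless pruning is disabled as shown earlier):

# String-literal comparison rejected by Spark 3.1 + Glue:
#   SELECT * FROM table_name WHERE date_col > '2021-03-01'
# Explicit DATE literal instead:
df = spark.sql("SELECT * FROM table_name WHERE date_col > DATE '2021-03-01'")
df.show()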

