Spark schema conflict behavior: records being silently dropped
hi all,

I noticed some weird behavior when Spark parses nested JSON with a schema conflict. Spark apparently "fixed" this in the most recent release, 3.5.0, but since I'm working with AWS services, namely:

* EMR 6: Spark 3.3.*, Spark 3.4.*
* Glue 3: Spark 3.1.1
* Glue 4: Spark 3.3.0

https://docs.aws.amazon.com/glue/latest/dg/release-notes.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-app-versions-6.x.html

...we're still facing this issue company-wide.

The problem is that Spark silently drops records (or even whole files) when there is a schema conflict or an empty-string value (https://kb.databricks.com/notebooks/json-reader-parses-value-as-null). My whole concern here is that Spark does not even emit a warning, error, or exception when such cases occur.

To reproduce (on Amazon Linux 2 with Python 3.7.16 and pyspark==3.4.1):

echo '{"Records":[{"evtid":"1","requestParameters":{"DescribeHostsRequest":{"MaxResults":500}}}]}' > one.json
echo '{"Records":[{"evtid":"2","requestParameters":{"lol":{},"lol2":{}}},{"evtid":"3","requestParameters":{"DescribeHostsRequest":""}}]}' > two.json

Output of:

> spark.read.json(["one.json", "two.json"]).show()

Using Spark 3.1.0 and 3.3.0:

+--------------+
|       Records|
+--------------+
|          null|
|[{1, {{500}}}]|
+--------------+

It drops the second file (two.json).

Using Spark 3.4.0:

+--------------+
|       Records|
+--------------+
|          null|
|[{1, {{500}}}]|
+--------------+

It completely drops the second file (two.json).

Using Spark 3.5.0:

+--------------------+
|             Records|
+--------------------+
|[{2, {NULL}}, {3,...|
|      [{1, {{500}}}]|
+--------------------+

It reads both files but completely drops the "requestParameters" content of all the records in the second file (two.json):
{"evtid":"2","requestParameters":{}}  <-- not good
{"evtid":"3","requestParameters":{}}  <-- not good
{"evtid":"1","requestParameters":{"DescribeHostsRequest":{"MaxResults":500}}}

Enabling

spark.conf.set("spark.sql.legacy.json.allowEmptyString.enabled", True)

as suggested by https://kb.databricks.com/notebooks/json-reader-parses-value-as-null in Spark 3.1 and 3.3 yields the same result seen in Spark 3.5, which is not ideal if one wants to later fetch the records as-is.

So far the only solution I found was to explicitly enforce the schema when reading.

That said, does anyone know the exact thread or changelog entry where this issue was handled? I checked the links below but they were inconclusive:

https://spark.apache.org/docs/latest/sql-migration-guide.html
https://spark.apache.org/releases/spark-release-3-5-0.html

Another question: how would you handle this scenario? I could not find a clue even after enabling full verbose logging. Maybe I could force Spark to raise an exception when such a case is encountered, or send those bad/broken records to another file or bucket (DLQ-ish).

best regards,
c.
Re: automatically/dynamically renew AWS temporary token
hi all,

thank you for your replies.

> Can’t you attach the cross account permission to the glue job role? Why the detour via AssumeRole?

Yes Jörn, I also believe this is the best approach, but here we're dealing with company policies and all the bureaucracy that comes along; in parallel I'm trying to argue for that path. Right now even requesting an increase in the session duration is a struggle. But at the moment, since I was only allowed the AssumeRole approach, I'm figuring out a way down this path.

> https://github.com/zillow/aws-custom-credential-provider

Thank you Pol, I'll take a look at the project.

regards,
c.

On Mon, Oct 23, 2023 at 7:03 AM Pol Santamaria wrote:

> Hi Carlos!
>
> Take a look at this project, it's 6 years old but the approach is still
> valid:
>
> https://github.com/zillow/aws-custom-credential-provider
>
> The credential provider gets called each time S3 or the Glue Catalog is
> accessed, and then you can decide whether to use a cached token or renew.
>
> Best,
>
> Pol Santamaria
>
> On Mon, Oct 23, 2023 at 8:08 AM Jörn Franke wrote:
>
>> Can’t you attach the cross account permission to the glue job role? Why
>> the detour via AssumeRole?
>>
>> AssumeRole can make sense if you use an AWS IAM user and STS
>> authentication, but this makes little sense within AWS for cross-account
>> access, as attaching the permissions to the Glue job role is more secure (no
>> need for static credentials; permissions renew automatically in a shorter
>> time without any specific configuration in Spark).
>>
>> Have you checked with AWS support?
>>
>> Am 22.10.2023 um 21:14 schrieb Carlos Aguni:
>>
>> hi all,
>>
>> I have a scenario where I need to assume a cross-account role to get S3
>> bucket access.
>>
>> The problem is that this role only allows a 1h time span (no
>> negotiation).
>>
>> That said, does anyone know a way to tell Spark to automatically renew
>> the token, or to dynamically renew the token on each node?
>> I'm currently using Spark on AWS Glue.
>>
>> I wonder what options I have.
>>
>> regards,
>> c.
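For reference, Hadoop's S3A connector (Hadoop 3.1+) ships a built-in assumed-role credential provider that re-assumes the role itself whenever the temporary credentials near expiry, so no manual renewal loop is needed on the executors. One big caveat: Glue routes s3:// paths through its own connector rather than S3A, so this may only apply to s3a:// URIs on EMR or plain Spark. A sketch, with a hypothetical role ARN:

```python
# Hadoop S3A configuration keys for automatic AssumeRole renewal.
# The ARN below is a placeholder; the 1h duration matches the
# non-negotiable session limit mentioned in the thread.
s3a_assume_role_conf = {
    "spark.hadoop.fs.s3a.aws.credentials.provider":
        "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider",
    "spark.hadoop.fs.s3a.assumed.role.arn":
        "arn:aws:iam::123456789012:role/cross-account-s3-access",  # hypothetical ARN
    "spark.hadoop.fs.s3a.assumed.role.session.duration": "1h",
}

# Applied when building the session, e.g.:
#   builder = SparkSession.builder.appName("cross-account-read")
#   for k, v in s3a_assume_role_conf.items():
#       builder = builder.config(k, v)
#   spark = builder.getOrCreate()
```

The provider caches the STS token and re-assumes the role before expiry, which is essentially what the Zillow custom credential provider linked above does by hand.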
Re: Maximum executors in EC2 Machine
Hi,

I would refer to the documentation to better understand the concepts behind the cluster overview and submitting applications:

- https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types
- https://spark.apache.org/docs/latest/submitting-applications.html

When using local[*] you get as many worker threads as you have cores, all inside the same JVM that runs your driver, not separate executors. If you want to test against a real cluster, you can look into using standalone mode.

HTH,
Riccardo

On Mon, Oct 23, 2023 at 5:31 PM KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:

> Hi,
>
> I am running a Spark job on a Spark EC2 machine which has 40 cores. Driver
> and executor memory is 16 GB. I am using local[*] but I still get only one
> executor (the driver). Is there a way to get more executors with this config?
>
> I am not using YARN or Mesos in this case. Only one machine, which is
> enough for our workload, but the data has increased.
>
> Thanks,
> Asmath
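To make the standalone suggestion concrete, here is a sketch of running a single-machine standalone cluster on that EC2 box so the 40 cores are split into real executor processes rather than driver threads. The core/memory split and the script name are illustrative, not tuned for the poster's workload:

```shell
# Start a standalone master and a worker on the same EC2 machine
$SPARK_HOME/sbin/start-master.sh                        # master URL: spark://<host>:7077
$SPARK_HOME/sbin/start-worker.sh spark://$(hostname):7077

# Submit against it: 40 cores / 8 cores per executor = 5 executor processes
$SPARK_HOME/bin/spark-submit \
  --master spark://$(hostname):7077 \
  --executor-cores 8 \
  --executor-memory 6g \
  --total-executor-cores 40 \
  my_job.py    # illustrative script name
```

Whether several smaller executors beat one big local[*] JVM depends on the workload; the usual wins are GC isolation and smaller per-JVM heaps rather than raw parallelism, which local[*] already provides.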