spark schema conflict behavior: records being silently dropped

2023-10-24 Thread Carlos Aguni
hi all,

i noticed some weird behavior when spark parses nested json with a schema
conflict.

i also just noticed that spark "fixed" this in the most recent release,
3.5.0, but since i'm working with AWS services, namely:
* EMR 6: spark 3.3.*, spark 3.4.*
* Glue 3: spark 3.1.1
* Glue 4: spark 3.3.0
https://docs.aws.amazon.com/glue/latest/dg/release-notes.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-app-versions-6.x.html

...we're still facing this issue company-wide.

the problem is that spark silently drops records (or even whole
files) when there's a schema conflict or empty string values (
https://kb.databricks.com/notebooks/json-reader-parses-value-as-null).

My whole concern here is that spark doesn't even emit a warning, error, or
exception when such cases occur.

To reproduce:
i'm using Amazon Linux 2 with Python 3.7.16 and pyspark==3.4.1.

echo '{"Records":[{"evtid":"1","requestParameters":{"DescribeHostsRequest":{"MaxResults":500}}}]}' > one.json
echo '{"Records":[{"evtid":"2","requestParameters":{"lol":{},"lol2":{}}},{"evtid":"3","requestParameters":{"DescribeHostsRequest":""}}]}' > two.json

Output of the command:
> spark.read.json(["one.json", "two.json"]).show()

using Spark 3.1.0 and 3.3.0:
+--------------+
|       Records|
+--------------+
|          null|
|[{1, {{500}}}]|
+--------------+
it drops the second file (two.json); its row comes back as null.

using Spark 3.4.0:
+--------------+
|       Records|
+--------------+
|          null|
|[{1, {{500}}}]|
+--------------+
it also completely drops the second (two.json) file.

using Spark 3.5.0:
+--------------------+
|             Records|
+--------------------+
|[{2, {NULL}}, {3,...|
|      [{1, {{500}}}]|
+--------------------+
it reads both files but completely drops the "requestParameters" content of
all the records in the second (two.json) file:
{"evtid":"2","requestParameters":{}} <-- not good
{"evtid":"3","requestParameters":{}} <-- not good
{"evtid":"1","requestParameters":{"DescribeHostsRequest":{"MaxResults":500}}}

enabling spark.conf.set("spark.sql.legacy.json.allowEmptyString.enabled",
True), as suggested by
https://kb.databricks.com/notebooks/json-reader-parses-value-as-null, in
spark 3.1 and 3.3 yields the same result seen in spark 3.5, which is not
ideal if one wants to later fetch the records as-is.
given that, the only solution I found was to explicitly enforce the schema
when reading.
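
for reference, this is roughly what the enforced-schema read looks like (a
minimal sketch based on the repro above; spark is an active SparkSession and
the types are my guess at the real events):

from pyspark.sql.types import (
    StructType, StructField, ArrayType, StringType, LongType)

# schema matching the repro files; with an explicit schema spark no
# longer has to merge conflicting inferred types, and fields that
# still can't be coerced come back as null instead of taking the
# whole record (or file) down with them
schema = StructType([
    StructField("Records", ArrayType(StructType([
        StructField("evtid", StringType()),
        StructField("requestParameters", StructType([
            StructField("DescribeHostsRequest", StructType([
                StructField("MaxResults", LongType()),
            ])),
        ])),
    ]))),
])

df = spark.read.schema(schema).json(["one.json", "two.json"])
df.show(truncate=False)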

that said,
does anyone know the exact thread or changelog entry where this issue was
handled? i've checked the links below but they were inconclusive:
https://spark.apache.org/docs/latest/sql-migration-guide.html
https://spark.apache.org/releases/spark-release-3-5-0.html

another question:
how would you guys handle this scenario?
I could not see a clue even after enabling full verbose logging.
I could maybe force spark to raise an exception when such a case is
encountered.
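
e.g. something like this (a sketch using the json reader's standard "mode"
option; FAILFAST raises on the first record that doesn't fit the schema
instead of silently nulling it, though i haven't verified it fires for this
exact schema-conflict case on every version):

# fail loudly instead of silently dropping/nulling records
df = (spark.read
      .option("mode", "FAILFAST")
      .json(["one.json", "two.json"]))
df.show()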

or maybe send those bad/broken records to another file or bucket (dlq-ish)
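
a rough sketch of that dlq-ish flow (assuming the explicit schema from the
earlier sketch, extended with a catch-all column; i'd still need to confirm
that the schema-conflict rows above actually land in _corrupt_record on each
version, and the dlq path is a placeholder):

from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType

# PERMISSIVE mode keeps the raw text of any row it can't parse in a
# dedicated string column, but only if that column is in the schema
dlq_schema = StructType(
    schema.fields + [StructField("_corrupt_record", StringType())])

df = (spark.read
      .schema(dlq_schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json(["one.json", "two.json"])
      .cache())  # some versions refuse to select only _corrupt_record without caching first

good = df.filter(col("_corrupt_record").isNull()).drop("_corrupt_record")
bad = df.filter(col("_corrupt_record").isNotNull()).select("_corrupt_record")
bad.write.mode("append").text("s3://my-dlq-bucket/bad-json/")  # placeholder bucket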

best regards,
c.


Re: automatically/dynamically renew aws temporary token

2023-10-24 Thread Carlos Aguni
hi all,

thank you for your replies.

> Can’t you attach the cross account permission to the glue job role? Why
> the detour via AssumeRole?
yes Jörn, i also believe this is the best approach, but here we're dealing
with company policies and all the bureaucracy that comes with them.
in parallel i'm trying to argue for that path; for now, even requesting an
increase to the session duration is a struggle.
but at the moment, since I was only allowed the AssumeRole approach, i'm
figuring out a way through this path.
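
the most promising lead i have so far is letting the s3a connector handle
the AssumeRole (and the renewal) itself. a minimal sketch of what i'm about
to test (untested on glue; the role arn and bucket are placeholders, and i'm
not sure glue's classpath ships a recent enough hadoop-aws, it needs >= 3.1):

# hand the cross-account role to the s3a connector; the provider
# re-issues the STS session credentials before they expire, so the
# 1h cap stops being the job's problem
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set("fs.s3a.aws.credentials.provider",
         "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
conf.set("fs.s3a.assumed.role.arn",
         "arn:aws:iam::111122223333:role/cross-account-role")

df = spark.read.json("s3a://cross-account-bucket/some/prefix/")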

> https://github.com/zillow/aws-custom-credential-provider
thank you Pol. I'll take a look into the project.

regards,
c.

On Mon, Oct 23, 2023 at 7:03 AM Pol Santamaria  wrote:

> Hi Carlos!
>
> Take a look at this project, it's 6 years old but the approach is still
> valid:
>
> https://github.com/zillow/aws-custom-credential-provider
>
> The credential provider gets called each time an S3 or Glue Catalog is
> accessed, and then you can decide whether to use a cached token or renew.
>
> Best,
>
> *Pol Santamaria*
>
>
> On Mon, Oct 23, 2023 at 8:08 AM Jörn Franke  wrote:
>
>> Can’t you attach the cross account permission to the glue job role? Why
>> the detour via AssumeRole ?
>>
>> Assumerole can make sense if you use an AWS IAM user and STS
>> authentication, but this would make no sense within AWS for cross-account
>> access as attaching the permissions to the Glue job role is more secure (no
>> need for static credentials, automatically renew permissions in shorter
>> time without any specific configuration in Spark).
>>
>> Have you checked with AWS support?
>>
>> On 22.10.2023 at 21:14, Carlos Aguni  wrote:
>>
>> 
>> hi all,
>>
>> i have a scenario where I need to assume a cross-account role to get S3
>> bucket access.
>>
>> the problem is that this role only allows a 1h session duration (no
>> negotiation).
>>
>> that said,
>> does anyone know a way to tell spark to automatically renew the token
>> or to dynamically renew the token on each node?
>> i'm currently using spark on AWS glue.
>>
>> wonder what options I have.
>>
>> regards,
>> c.
>>
>>


automatically/dynamically renew aws temporary token

2023-10-22 Thread Carlos Aguni
hi all,

i have a scenario where I need to assume a cross-account role to get S3
bucket access.

the problem is that this role only allows a 1h session duration (no negotiation).

that said,
does anyone know a way to tell spark to automatically renew the token
or to dynamically renew the token on each node?
i'm currently using spark on AWS glue.

wonder what options I have.

regards,
c.