Spark schema conflict behavior: records being silently dropped
hi all,

I noticed some weird behavior when Spark parses nested JSON with a schema conflict. Spark apparently "fixed" this in the most recent release, 3.5.0, but since I'm working with AWS services, namely:

* EMR 6: Spark 3.3.*, Spark 3.4.*
* Glue 3: Spark 3.1.1
* Glue 4: Spark 3.3.0

https://docs.aws.amazon.com/glue/latest/dg/release-notes.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-app-versions-6.x.html

...we're still facing this issue company-wide.

The problem is that Spark silently drops records (or even whole files) when there is a schema conflict or an empty-string value (https://kb.databricks.com/notebooks/json-reader-parses-value-as-null). My whole concern here is that Spark does not even emit a warning, error, or exception when such cases occur.

To reproduce (on Amazon Linux 2 with Python 3.7.16 and pyspark==3.4.1):

echo '{"Records":[{"evtid":"1","requestParameters":{"DescribeHostsRequest":{"MaxResults":500}}}]}' > one.json
echo '{"Records":[{"evtid":"2","requestParameters":{"lol":{},"lol2":{}}},{"evtid":"3","requestParameters":{"DescribeHostsRequest":""}}]}' > two.json

Output of:

> spark.read.json(["one.json", "two.json"]).show()

Using Spark 3.1.0 and 3.3.0:

+--------------+
|       Records|
+--------------+
|          null|
|[{1, {{500}}}]|
+--------------+

It drops the second file (two.json).

Using Spark 3.4.0:

+--------------+
|       Records|
+--------------+
|          null|
|[{1, {{500}}}]|
+--------------+

It completely drops the second file (two.json).

Using Spark 3.5.0:

+--------------------+
|             Records|
+--------------------+
|[{2, {NULL}}, {3,...|
|      [{1, {{500}}}]|
+--------------------+

It reads both files but completely drops the "requestParameters" content of all the records in the second file (two.json):
{"evtid":"2","requestParameters":{}}  <-- not good
{"evtid":"3","requestParameters":{}}  <-- not good
{"evtid":"1","requestParameters":{"DescribeHostsRequest":{"MaxResults":500}}}

Enabling

spark.conf.set("spark.sql.legacy.json.allowEmptyString.enabled", True)

as suggested by https://kb.databricks.com/notebooks/json-reader-parses-value-as-null in Spark 3.1 and 3.3 yields the same result seen in Spark 3.5, which is not ideal if one wants to later fetch the records as-is.

So far the only solution I found was to explicitly enforce the schema when reading.

That said, does anyone know the exact thread or changelog entry where this issue was handled? I checked the links below but they were inconclusive:

https://spark.apache.org/docs/latest/sql-migration-guide.html
https://spark.apache.org/releases/spark-release-3-5-0.html

Another question: how would you handle this scenario? I could not find a clue even after enabling full verbose logging. Maybe I could force Spark to raise an exception when such a case is encountered, or send those bad/broken records to another file or bucket (DLQ-ish).

best regards,
c.
Re: automatically/dynamically renew AWS temporary token
hi all,

thank you for your replies.

> Can’t you attach the cross account permission to the glue job role? Why the detour via AssumeRole?

Yes Jörn, I also believe this is the best approach, but here we're dealing with company policies and all the bureaucracy that comes along; in parallel I'm trying to argue for that path. Right now even requesting an increase in the session duration is a struggle. But at the moment, since I was only allowed the AssumeRole approach, I'm figuring out a way down this path.

> https://github.com/zillow/aws-custom-credential-provider

Thank you Pol, I'll take a look at the project.

regards,
c.

On Mon, Oct 23, 2023 at 7:03 AM Pol Santamaria wrote:

> Hi Carlos!
>
> Take a look at this project, it's 6 years old but the approach is still
> valid:
>
> https://github.com/zillow/aws-custom-credential-provider
>
> The credential provider gets called each time S3 or the Glue Catalog is
> accessed, and then you can decide whether to use a cached token or renew.
>
> Best,
>
> Pol Santamaria
>
> On Mon, Oct 23, 2023 at 8:08 AM Jörn Franke wrote:
>
>> Can’t you attach the cross account permission to the glue job role? Why
>> the detour via AssumeRole?
>>
>> AssumeRole can make sense if you use an AWS IAM user and STS
>> authentication, but this makes little sense within AWS for cross-account
>> access, as attaching the permissions to the Glue job role is more secure (no
>> need for static credentials; permissions renew automatically in a shorter
>> time without any specific configuration in Spark).
>>
>> Have you checked with AWS support?
>>
>> Am 22.10.2023 um 21:14 schrieb Carlos Aguni:
>>
>> hi all,
>>
>> I have a scenario where I need to assume a cross-account role to get S3
>> bucket access.
>>
>> The problem is that this role only allows a 1h time span (no
>> negotiation).
>>
>> That said, does anyone know a way to tell Spark to automatically renew
>> the token, or to dynamically renew the token on each node?
>> I'm currently using Spark on AWS Glue.
>>
>> I wonder what options I have.
>>
>> regards,
>> c.
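For reference, Hadoop's S3A connector (Hadoop 3.1+) ships a built-in assumed-role credential provider that re-assumes the role itself whenever the temporary credentials near expiry, so no manual renewal loop is needed on the executors. One big caveat: Glue routes s3:// paths through its own connector rather than S3A, so this may only apply to s3a:// URIs on EMR or plain Spark. A sketch, with a hypothetical role ARN:

```python
# Hadoop S3A configuration keys for automatic AssumeRole renewal.
# The ARN below is a placeholder; the 1h duration matches the
# non-negotiable session limit mentioned in the thread.
s3a_assume_role_conf = {
    "spark.hadoop.fs.s3a.aws.credentials.provider":
        "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider",
    "spark.hadoop.fs.s3a.assumed.role.arn":
        "arn:aws:iam::123456789012:role/cross-account-s3-access",  # hypothetical ARN
    "spark.hadoop.fs.s3a.assumed.role.session.duration": "1h",
}

# Applied when building the session, e.g.:
#   builder = SparkSession.builder.appName("cross-account-read")
#   for k, v in s3a_assume_role_conf.items():
#       builder = builder.config(k, v)
#   spark = builder.getOrCreate()
```

The provider caches the STS token and re-assumes the role before expiry, which is essentially what the Zillow custom credential provider linked above does by hand.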
Re: Maximum executors in EC2 Machine
Hi,

I would refer to the documentation to better understand the concepts behind the cluster overview and submitting applications:

- https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types
- https://spark.apache.org/docs/latest/submitting-applications.html

When using local[*] you get as many worker threads as you have cores, all inside the same JVM that runs your driver, not separate executors. If you want to test against a real cluster, you can look into using standalone mode.

HTH,
Riccardo

On Mon, Oct 23, 2023 at 5:31 PM KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:

> Hi,
>
> I am running a Spark job on a Spark EC2 machine which has 40 cores. Driver
> and executor memory is 16 GB. I am using local[*] but I still get only one
> executor (the driver). Is there a way to get more executors with this config?
>
> I am not using YARN or Mesos in this case. Only one machine, which is
> enough for our workload, but the data has increased.
>
> Thanks,
> Asmath
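To make the standalone suggestion concrete, here is a sketch of running a single-machine standalone cluster on that EC2 box so the 40 cores are split into real executor processes rather than driver threads. The core/memory split and the script name are illustrative, not tuned for the poster's workload:

```shell
# Start a standalone master and a worker on the same EC2 machine
$SPARK_HOME/sbin/start-master.sh                        # master URL: spark://<host>:7077
$SPARK_HOME/sbin/start-worker.sh spark://$(hostname):7077

# Submit against it: 40 cores / 8 cores per executor = 5 executor processes
$SPARK_HOME/bin/spark-submit \
  --master spark://$(hostname):7077 \
  --executor-cores 8 \
  --executor-memory 6g \
  --total-executor-cores 40 \
  my_job.py    # illustrative script name
```

Whether several smaller executors beat one big local[*] JVM depends on the workload; the usual wins are GC isolation and smaller per-JVM heaps rather than raw parallelism, which local[*] already provides.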