Inquiry Regarding Security Compliance of Apache Spark Docker Image

2024-06-05 Thread Tonmoy Sagar
Dear Apache Team, I hope this email finds you well. We are a team from Ernst and Young LLP - India, dedicated to providing innovative supply chain solutions for a diverse range of clients. Our team recently encountered a pivotal use case necessitating the utilization of PySpark for a project a

Do we need partitioning while loading data from JDBC sources?

2024-06-05 Thread Perez
Hello experts, I was wondering whether I could use the approach below to speed up data loading in Spark. def extract_data_from_mongodb(mongo_config): df = glueContext.create_dynamic_frame.from_options( connection_type="mongodb", connection_options=mongo_config ) return df m

Re: Terabytes data processing via Glue

2024-06-05 Thread Perez
Thanks Nitin and Russell for your responses. Much appreciated. On Mon, Jun 3, 2024 at 9:47 PM Russell Jurney wrote: > You could use either Glue or Spark for your job. Use what you’re more > comfortable with. > > Thanks, > Russell Jurney @rjurney > russell.jur...@gmail

[SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-05 Thread Chhavi Bansal
Hello team, I was exploring the feature transformers exposed via MLlib on a nested dataset and encountered an error when applying any transformer to a column named with dot notation. I thought of raising a ticket on Spark, https://issues.apache.org/jira/browse/SPARK-48463, where I have mentioned the e

[SPARK-48423] Unable to save ML Pipeline to azure blob storage

2024-06-05 Thread Chhavi Bansal
Hello team, I was exploring how to save an ML pipeline to Azure Blob Storage, but was set back by an issue where it complains that `fs.azure.account.key` is not found in the configuration, even though I have provided the values in the pipelineModel.option(key1,value1) field. I considered raising a t