Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Vikas Kumar
Have you tried the DROPMALFORMED mode option? On Mon, Jul 3, 2023, 1:34 PM Shashank Rao wrote: > Update: Got it working by using the *_corrupt_record* field for the first > case (record 4) > > schema = schema.add("_corrupt_record", DataTypes.StringType); > Dataset<Row> ds =
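For reference, a minimal PySpark sketch of the DROPMALFORMED suggestion (the schema fields and the S3 path are illustrative assumptions, not taken from the thread):

    # Sketch only: with mode=DROPMALFORMED, rows that fail to parse against
    # the supplied schema are silently dropped rather than surfaced.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType

    spark = SparkSession.builder.getOrCreate()

    schema = StructType([
        StructField("x", LongType()),
        StructField("y", LongType()),
    ])

    df = (spark.read
          .schema(schema)
          .option("mode", "DROPMALFORMED")
          .json("s3://bucket/prefix/"))  # hypothetical path

Note that DROPMALFORMED only drops rows that fail to parse; records that parse but carry extra or missing fields are still accepted, which is why the thread falls back to the _corrupt_record approach below.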

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Gavin Ray
Wow, really neat -- thanks for sharing! On Mon, Jul 3, 2023 at 8:12 PM Gengliang Wang wrote: > Dear Apache Spark community, > > We are delighted to announce the launch of a groundbreaking tool that aims > to make Apache Spark more user-friendly and accessible - the English SDK >

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Hyukjin Kwon
The demo was really amazing. On Tue, 4 Jul 2023 at 09:17, Farshid Ashouri wrote: > This is wonderful news! > > On Tue, 4 Jul 2023 at 01:14, Gengliang Wang wrote: > >> Dear Apache Spark community, >> >> We are delighted to announce the launch of a groundbreaking tool that >> aims to make Apache

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Farshid Ashouri
This is wonderful news! On Tue, 4 Jul 2023 at 01:14, Gengliang Wang wrote: > Dear Apache Spark community, > > We are delighted to announce the launch of a groundbreaking tool that aims > to make Apache Spark more user-friendly and accessible - the English SDK >

Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Gengliang Wang
Dear Apache Spark community, We are delighted to announce the launch of a groundbreaking tool that aims to make Apache Spark more user-friendly and accessible - the English SDK. Powered by the application of Generative AI, the English SDK
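For context, the English SDK is distributed as the pyspark-ai package; a hedged sketch of basic usage, following the project's early published examples (the prompt strings are illustrative, and an LLM API key is assumed to be configured in the environment):

    # Assumes the pyspark-ai package is installed and an LLM backend is
    # configured; names follow the project's published examples.
    from pyspark_ai import SparkAI

    spark_ai = SparkAI()
    spark_ai.activate()  # adds the .ai helper namespace to DataFrames

    # Create a DataFrame from a plain-English description (illustrative prompt)
    df = spark_ai.create_df("the top 10 tech companies by revenue")
    df.ai.transform("sort by revenue in descending order").show()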

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Shashank Rao
Update: Got it working by using the *_corrupt_record* field for the first case (record 4): schema = schema.add("_corrupt_record", DataTypes.StringType); Dataset<Row> ds = spark.read().schema(schema).option("mode", "PERMISSIVE").json("path"); ds =
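A Python rendering of the approach above for readers following along in PySpark (the thread's snippet is Java; the schema fields and path here are illustrative):

    # PERMISSIVE mode keeps every row; anything that fails to parse lands in
    # the extra _corrupt_record column, so a null there means a clean parse.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, LongType,
                                   StringType)

    spark = SparkSession.builder.getOrCreate()

    schema = StructType([
        StructField("x", LongType()),
        StructField("y", LongType()),
        StructField("_corrupt_record", StringType()),
    ])

    df = (spark.read
          .schema(schema)
          .option("mode", "PERMISSIVE")
          .json("s3://bucket/prefix/"))  # hypothetical path
    df.cache()  # some Spark versions disallow queries that reference only
                # the corrupt-record column unless the result is cached first

    good = df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record")
    bad = df.filter(df["_corrupt_record"].isNotNull())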

Re: [Spark SQL] Data objects from query history

2023-07-03 Thread Jack Wells
Hi Ruben, I’m not sure if this answers your question, but if you’re interested in exploring the underlying tables, you could always try something like the below in a Databricks notebook: display(spark.read.table('samples.nyctaxi.trips')) (For vanilla Spark users, it would be
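The message is truncated at the archive boundary; for non-Databricks users, a minimal sketch of the same read with plain show() instead of the notebook-only display() helper (assuming the table is registered in the metastore):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.read.table("samples.nyctaxi.trips").show()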

Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Shashank Rao
Hi all, I'm trying to read around 1,000,000 JSONL files present in S3 using Spark. Once read, I need to write them to BigQuery. I have a schema that may not be an exact match with all the records. How can I filter out records where there isn't an exact schema match? E.g., if my records were: {"x": 1,

CFP for the 2nd Performance Engineering track at Community over Code NA 2023

2023-07-03 Thread Brebner, Paul
Hi Apache Spark people - There are only 10 days left to submit a talk proposal (title and abstract only) for Community over Code NA 2023. The 2nd Performance Engineering track is on this year, so any Apache project-related performance and scalability talks are welcome. Here's the CFP for more

PySpark error java.lang.IllegalArgumentException

2023-07-03 Thread elango vaidyanathan
Hi all, I am reading a parquet file like this, and it gives java.lang.IllegalArgumentException. However, I can work with other parquet files (such as the NYC taxi parquet files) without any issue. I have copied the full error log as well. Can you please take a look and let me know how to fix this?