Re: Profiling data quality with Spark

2022-12-28 Thread infa elance
You can also look at Informatica Data Quality, which runs on Spark. Of course it's not free, but you can sign up for a 30-day free trial. They have both profiling and prebuilt data quality rules and accelerators.

Re: EXT: Re: Check if shuffle is caused for repartitioned pyspark dataframes

2022-12-28 Thread Vibhor Gupta
Hi Shivam, I think what you are looking for is bucket optimization: the execution engine (Spark) knows how the data was shuffled before persisting it. Unfortunately this is not supported when you use vanilla parquet files. Try saving the dataframe as a bucketed table instead; see the sketch below.
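A minimal sketch of that approach (Scala; the PySpark DataFrameWriter exposes the same bucketBy/sortBy/saveAsTable methods; the bucket count, key column, and table name are assumptions):

    // Persist with bucketing metadata so a later join or aggregation on the
    // same key can reuse the existing layout instead of shuffling again.
    // Bucketing info only survives through the metastore, which is why this
    // goes through saveAsTable rather than a plain parquet path.
    df.write
      .bucketBy(16, "user_id")         // hypothetical key column and bucket count
      .sortBy("user_id")
      .format("parquet")
      .saveAsTable("events_bucketed")  // hypothetical table name

Reading the table back through the metastore then lets the optimizer see the bucketing and skip the shuffle on that key.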

Cannot build Apache Spark 3.3.1 with Apache Hive 3.1.2 and Apache Hadoop 3.1.1

2022-12-28 Thread שוהם יהודה
Hi team, I have a problem building Apache Spark so that it is compatible with Apache Hive 3.1.2. I believe Apache Spark supports Hive 3.1.2, as I saw it in the docs. https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html I also saw in the docs the following guide to build Spark:
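Note that Spark 3.3.1 ships with a built-in Hive 2.3.9 client, so rebuilding Spark may not be necessary: the page linked above also describes pointing a stock build at an external Hive 3.1.2 metastore purely through configuration. A minimal sketch of that route (Scala; the jars location is an assumption for your environment):

    import org.apache.spark.sql.SparkSession

    // Keep the stock Spark 3.3.1 build, but load Hive 3.1.2 client jars at
    // runtime so Spark can talk to a 3.1.2 metastore.
    val spark = SparkSession.builder()
      .appName("hive-3.1.2-metastore")
      .config("spark.sql.hive.metastore.version", "3.1.2")
      .config("spark.sql.hive.metastore.jars", "path")
      .config("spark.sql.hive.metastore.jars.path", "/opt/hive-3.1.2/lib/*") // assumed path
      .enableHiveSupport()
      .getOrCreate()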

Re: Profiling data quality with Spark

2022-12-28 Thread vaquar khan
@ Gourav Sengupta, why are you sending unnecessary emails? If you think Snowflake is good, please use it; the question here was different and you are talking about a totally different topic. Please respect the group guidelines. Regards, Vaquar khan

Re: Profiling data quality with Spark

2022-12-28 Thread vaquar khan
Here you can find all the details; you just need to pass a Spark dataframe, and Deequ also generates recommendations for rules; you can also write custom complex rules. https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/ Regards, Vaquar khan
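A minimal sketch of that recommendation flow (Scala, as in the AWS post above; the input table and its columns are assumptions):

    import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}

    val df = spark.table("orders")  // hypothetical input DataFrame

    // Profile the dataframe and let Deequ propose a constraint set per column.
    val suggestionResult = ConstraintSuggestionRunner()
      .onData(df)
      .addConstraintRules(Rules.DEFAULT)
      .run()

    // Each suggestion carries a readable description plus the exact code
    // to paste into a verification suite.
    suggestionResult.constraintSuggestions.foreach { case (column, suggestions) =>
      suggestions.foreach(s => println(s"$column: ${s.description} -> ${s.codeForConstraint}"))
    }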

Re: Profiling data quality with Spark

2022-12-28 Thread rajat kumar
Thanks for the input, folks. Hi Vaquar, I saw that we have various types of checks in GE and Deequ. Could you please suggest what types of checks you used for metric-based columns? Regards, Rajat
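For numeric, metric-style columns, Deequ checks along these lines are common; a hedged sketch (Scala; column names and thresholds are assumptions, not Vaquar's actual rules):

    import com.amazon.deequ.VerificationSuite
    import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

    val df = spark.table("orders")  // hypothetical input DataFrame

    val result = VerificationSuite()
      .onData(df)
      .addCheck(
        Check(CheckLevel.Error, "metric column checks")
          .isComplete("order_id")                // key column has no nulls
          .isUnique("order_id")                  // key column is unique
          .isNonNegative("amount")               // metric is never negative
          .hasMax("amount", _ <= 1000000.0)      // sanity ceiling on the metric
          .hasCompleteness("amount", _ >= 0.95)) // at most 5% missing values
      .run()

    if (result.status != CheckStatus.Success) println("Data quality checks failed")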

Re: [Spark Core] [Advanced] [How-to] How to map any external field to job ids spawned by Spark.

2022-12-28 Thread Gourav Sengupta
Hi Khalid, just out of curiosity, does the API help us set job IDs, or just job descriptions? Regards, Gourav Sengupta

Re: [Spark Core] [Advanced] [How-to] How to map any external field to job ids spawned by Spark.

2022-12-28 Thread Khalid Mammadov
There is a feature in SparkContext to set local properties (setLocalProperty) where you can set your request ID and then, using a SparkListener instance, read that ID together with the job ID via the onJobStart event. Hope this helps.
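A minimal sketch of that pattern (Scala; the property key and request id are assumptions; to Gourav's question above, this maps an external ID onto the Spark-assigned job IDs rather than setting the job ID itself):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Local properties set on the driver thread are attached to every job that
    // thread submits and arrive in SparkListenerJobStart.properties.
    class RequestIdListener extends SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
        val requestId = Option(jobStart.properties.getProperty("external.request.id"))
        println(s"Spark job ${jobStart.jobId} -> request ${requestId.getOrElse("<none>")}")
      }
    }

    spark.sparkContext.addSparkListener(new RequestIdListener)

    // Tag all jobs submitted from this thread with the caller's request id.
    spark.sparkContext.setLocalProperty("external.request.id", "req-42") // hypothetical id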