Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
I do not think the issue is with DROP MATERIALIZED VIEW only, but also with
CREATE MATERIALIZED VIEW, because neither is supported in Spark SQL. I guess
you must have created the view from Hive and are trying to drop it from
Spark, which is why you are running into the issue with DROP first.

There is some work in the Iceberg community to add this support to Spark
through SQL extensions, as well as Iceberg support for views and
materialization tables. Some recent discussions can be found here [1], along
with a WIP Iceberg-Spark PR.
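
In the meantime, a common workaround is to emulate the materialized view with a
regular table that is rebuilt on a schedule, since CTAS, INSERT OVERWRITE, and
DROP TABLE are all supported by Spark SQL. A rough sketch (the database, table,
and query below are only placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Placeholder names: test.events is the source, test.mv_emulated the "view"
    spark.sql("""
        CREATE TABLE IF NOT EXISTS test.mv_emulated
        USING parquet
        AS SELECT id, count(*) AS cnt FROM test.events GROUP BY id
    """)

    # "Refresh" the materialization by rewriting it
    spark.sql("""
        INSERT OVERWRITE TABLE test.mv_emulated
        SELECT id, count(*) AS cnt FROM test.events GROUP BY id
    """)

    # Dropping it is then a plain DROP TABLE, which Spark parses fine
    spark.sql("DROP TABLE IF EXISTS test.mv_emulated")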

[1] https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc

Thanks,
Walaa.

On Thu, May 2, 2024 at 4:55 PM Mich Talebzadeh 
wrote:

> I encountered an issue while working with Materialized Views in Spark SQL.
> There appears to be an inconsistency between the behavior of
> Materialized Views in Spark SQL and Hive.
>
> When attempting to execute a statement like DROP MATERIALIZED VIEW IF
> EXISTS test.mv in Spark SQL, I encountered a syntax error indicating that
> the keyword MATERIALIZED is not recognized. However, the same statement
> executes successfully in Hive without any errors.
>
> pyspark.errors.exceptions.captured.ParseException:
> [PARSE_SYNTAX_ERROR] Syntax error at or near 'MATERIALIZED'.(line 1, pos 5)
>
> == SQL ==
> DROP MATERIALIZED VIEW IF EXISTS test.mv
> -^^^
>
> Here are the versions I am using:
>
> Hive: 3.1.1
> Spark: 3.4
>
> My Spark session:
>
> spark = SparkSession.builder \
>   .appName("test") \
>   .enableHiveSupport() \
>   .getOrCreate()
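>
> For reference, a quick way to see how Spark's catalog classifies the object
> (a rough check, assuming a Hive-enabled session like the one above):
>
> spark.sql("SHOW TABLES IN test").show()
> for t in spark.catalog.listTables("test"):
>     print(t.name, t.tableType, t.isTemporary)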
>
> Has anyone seen this behaviour or encountered a similar issue, or does anyone
> have insights into why this discrepancy exists between Spark SQL and Hive?
>
> Thanks
>
> Mich Talebzadeh,
>
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>
> London
> United Kingdom
>
>
> view my Linkedin profile
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand
> expert opinions" (Werner Von Braun).
>


Re: Profiling data quality with Spark

2022-12-27 Thread Walaa Eldin Moustafa
Rajat,

You might want to read about Data Sentinel, a data validation tool on Spark
developed at LinkedIn.

https://engineering.linkedin.com/blog/2020/data-sentinel-automating-data-validation

The project is not open source, but the blog post might give you insights
about how such a system could be built.
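
If you want something lightweight to start with, a first profiling pass can be
written directly in PySpark before committing to a framework. A minimal sketch
(the input path and columns are placeholders):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("dq-profile-sketch").getOrCreate()
    df = spark.read.parquet("/data/events")  # placeholder input

    # Null count per column
    df.select([
        F.count(F.when(F.col(c).isNull(), c)).alias(c + "_nulls")
        for c in df.columns
    ]).show()

    # Exact distinct count per column (can be expensive on wide tables)
    df.select([
        F.countDistinct(c).alias(c + "_distinct") for c in df.columns
    ]).show()

    # Basic stats (count, mean, stddev, min, max) for numeric columns
    df.describe().show()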

Thanks,
Walaa.

On Tue, Dec 27, 2022 at 8:13 PM Sean Owen  wrote:

> I think this is kind of mixed up. Data warehouses are simple SQL
> creatures; Spark is (also) a distributed compute framework. Kind of like
> comparing maybe a web server to Java.
> Are you thinking of Spark SQL? Then, sure, you may well find it more
> complicated, but it's also just a data-warehousey SQL surface.
>
> But none of that relates to the question of data quality tools. You could
> use GE with Redshift, or indeed with Spark - are you familiar with it? It's
> probably one of the most common tools people use with Spark for this in
> fact. It's just a Python lib at heart and you can apply it with Spark, but
> _not_ with a data warehouse, so I'm not sure what you're getting at.
>
> Deequ is also commonly seen. It's actually built on Spark, so again,
> confused about this "use Redshift or Snowflake not Spark".
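>
> For example, a minimal Deequ check driven from PySpark looks roughly like this
> (via pydeequ, assuming the matching deequ jar is on the Spark classpath and
> that spark and df already exist, so treat it as a sketch):
>
> from pydeequ.checks import Check, CheckLevel
> from pydeequ.verification import VerificationSuite, VerificationResult
>
> check = Check(spark, CheckLevel.Error, "basic checks")
> result = (VerificationSuite(spark)
>           .onData(df)
>           .addCheck(check.hasSize(lambda n: n > 0)
>                          .isComplete("id")       # no nulls in id
>                          .isUnique("id"))        # no duplicate ids
>           .run())
> VerificationResult.checkResultsAsDataFrame(spark, result).show()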
>
> On Tue, Dec 27, 2022 at 9:55 PM Gourav Sengupta 
> wrote:
>
>> Hi,
>>
>> SPARK is just another querying engine with a lot of hype.
>>
>> I would highly suggest using Redshift (storage and compute decoupled
>> mode) or Snowflake without all this super complicated understanding of
>> containers/disk space, mind-numbing variables, rocket-science tuning,
>> hair-splitting failure scenarios, etc. After that, try solutions like
>> Athena or Trino/Presto, and then come to SPARK.
>>
>> Try out solutions like "great expectations" if you are looking for data
>> quality and are not entirely sucked into the world of SPARK and want to keep
>> your options open.
>>
>> Don't get me wrong, SPARK used to be great in 2016-2017, but there are
>> superb alternatives now, and the industry, in this recession, should focus
>> on getting more value for every single dollar it spends.
>>
>> Best of luck.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Tue, Dec 27, 2022 at 7:30 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Well, you need to qualify your statement on data quality. Are you
>>> talking about data lineage here?
>>>
>>> HTH
>>>
>>>
>>>
>>> view my Linkedin profile
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 27 Dec 2022 at 19:25, rajat kumar 
>>> wrote:
>>>
 Hi Folks,
 Hoping you are doing well. I want to implement data quality checks to detect
 issues in data in advance. I have heard about a few frameworks like GE/Deequ.
 Can anyone please suggest which one is good and how I can get started with it?

 Regards
 Rajat

>>>


Re: [EXTERNAL] Parse Execution Plan from PySpark

2022-05-03 Thread Walaa Eldin Moustafa
Hi Pablo,

Do you mean an in-memory plan? You can access one by implementing a Spark
Listener. Here is an example from the Datahub project [1].

If you end up parsing the SQL plan string, you may consider using/extending
Coral [2, 3]. There is already a POC for that. See some test cases [4].
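
If all you need is the plan object itself rather than the explain() string, you
can also reach Catalyst's QueryExecution from PySpark through the JVM gateway.
A rough sketch, relying on the private _jdf handle, so it may change between
Spark versions:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    geometry = spark.read.csv("geometry.csv", header=True)
    geometry = geometry.withColumn("area", F.col("width") * F.col("height"))

    qe = geometry._jdf.queryExecution()   # Catalyst QueryExecution (py4j object)
    optimized = qe.optimizedPlan()        # the optimized LogicalPlan
    print(optimized.toString())           # same text as shown by explain()
    print(optimized.prettyJson())         # a JSON tree you can walk instead of
                                          # parsing the string representation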

Thanks,
Walaa.

[1]
https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/spark-lineage/src/main/java/datahub/spark/DatahubSparkListener.java
[2] https://engineering.linkedin.com/blog/2020/coral
[3] https://github.com/linkedin/coral
[4]
https://github.com/linkedin/coral/blob/master/coral-spark-plan/src/test/java/com/linkedin/coral/sparkplan/SparkPlanToIRRelConverterTest.java


On Tue, May 3, 2022 at 1:18 AM Shay Elbaz  wrote:

> Hi Pablo,
>
>
>
> As you probably know, Spark SQL generates custom Java code for the SQL
> functions. You can use geometry.debugCodegen() to print out the generated
> code.
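>
> From PySpark, a quick way to get the same output (assuming Spark 3.x) should be:
>
> geometry.explain(mode="codegen")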
>
>
>
> Shay
>
>
>
> *From:* Pablo Alcain 
> *Sent:* Tuesday, May 3, 2022 6:07 AM
> *To:* user@spark.apache.org
> *Subject:* [EXTERNAL] Parse Execution Plan from PySpark
>
>
>
> Hello all! I'm working with PySpark trying to reproduce some of the
> results we see on batch through streaming processes, just as a PoC for now.
> For this, I'm thinking of trying to interpret the execution plan and
> eventually write it back to Python (I'm doing something similar with pandas
> as well, and I'd like both approaches to be as similar as possible).
>
>
>
> Let me clarify with an example: suppose that starting with a
> `geometry.csv` file with `width` and `height` I want to calculate the
> `area` doing this:
>
>
>
> >>> geometry = spark.read.csv('geometry.csv', header=True)
>
> >>> geometry = geometry.withColumn('area', F.col('width') * F.col('height'))
>
>
>
> I would like to extract from the execution plan the fact that area is
> calculated as the product of width * height. One possibility would be to
> parse the execution plan:
>
>
>
> >>> geometry.explain(True)
>
>
>
> ...
>
> == Optimized Logical Plan ==
>
> Project [width#45, height#46, (cast(width#45 as double) * cast(height#46
> as double)) AS area#64]
> +- Relation [width#45,height#46] csv
>
> ...
>
>
>
> From the first line of the Logical Plan we can parse the formula "area =
> height * width" and then write the function back in any language.
>
>
>
> However, even though I'm getting the logical plan as a string, there has
> to be some internal representation that I could leverage and avoid
> the string parsing. Do you know if/how I can access that internal
> representation from Python? I've been trying to navigate the scala source
> code to find it, but this is definitely beyond my area of expertise, so any
> pointers would be more than welcome.
>
>
>
> Thanks in advance,
>
> Pablo
>