Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Sadly, it sounds like Apache Spark has nothing to do with materialised views. I was hoping it could read them! >>> *spark.sql("SELECT * FROM test.mv").show()* Traceback (most recent call last): File "", line 1, in File "/opt/spark/p

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
quote "one test result is worth one-thousand expert opinions (Werner <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". On Fri, 3 May 2024 at 00:54, Mich Talebzadeh wrote: > An issue I encountered while wor

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Jungtaek Lim
e. You can initiate a feature request and wish the community to include that into the roadmap. On Fri, May 3, 2024 at 12:01 PM Mich Talebzadeh wrote: > An issue I encountered while working with Materialized Views in Spark SQL. > It appears that there is an inconsistency between the behavior o

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
There is some work in the Iceberg community to add the support to Spark through SQL extensions, and Iceberg support for views and materialization tables. Some recent discussions can be found here [1] along with a WIP Iceberg-Spark PR. [1] https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc

Issue with Materialized Views in Spark SQL

2024-05-02 Thread Mich Talebzadeh
An issue I encountered while working with Materialized Views in Spark SQL. It appears that there is an inconsistency between the behavior of Materialized Views in Spark SQL and Hive. When attempting to execute a statement like DROP MATERIALIZED VIEW IF EXISTS test.mv in Spark SQL, I encountered
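
A minimal reproduction sketch of the behavior discussed in this thread (assuming stock Spark 3.x with Hive support and no SQL extensions; Spark's parser simply has no MATERIALIZED VIEW syntax, so the statement fails before anything reaches Hive):

```
from pyspark.sql import SparkSession
from pyspark.sql.utils import ParseException

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

try:
    # Hive understands this DDL, but Spark's own SQL parser does not
    spark.sql("DROP MATERIALIZED VIEW IF EXISTS test.mv")
except ParseException as e:
    print(e)
```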

How to use Structured Streaming in Spark SQL

2024-04-22 Thread ????
In Flink, you can create stream-processing tables using Flink SQL and connect directly with SQL through CDC and Kafka. How can I do stream computation with SQL in Spark? 308027...@qq.com
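
The closest Spark equivalent is to register the streaming source as a temp view and express the computation in SQL. A minimal sketch, with an assumed broker address and topic name:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# the Kafka source itself is still declared through the DataFrame API
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
       .option("subscribe", "events")                     # assumed topic
       .load())
raw.createOrReplaceTempView("events")

# the streaming computation can then be written as plain SQL
counts = spark.sql(
    "SELECT CAST(value AS STRING) AS v, count(*) AS cnt "
    "FROM events GROUP BY CAST(value AS STRING)")
query = counts.writeStream.outputMode("complete").format("console").start()
```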

[Spark SQL][How-To] Remove builtin function support from Spark

2024-04-17 Thread Matthew McMillian
-sql called FunctionRegistry that seems to act as an allowlist on what functions Spark can execute. If I remove a function from the registry, is that enough to guarantee that the function can "never" be invoked in Spark, or are there other areas that would need to be changed as well? Thank

[Spark SQL] xxhash64 default seed of 42 confusion

2024-04-16 Thread Igor Calabria
Hi all, I've noticed that Spark's xxhash64 output doesn't match other tools' due to using seed=42 as a default. I've looked at a few libraries and they use 0 as a default seed: - python https://github.com/ifduyue/python-xxhash - java https://github.com/OpenHFT/Zero-Allocation-Hashing/ - java
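
A small sketch of the behavior in question: the seed is hard-coded to 42 inside Spark's XxHash64 expression and the SQL/DataFrame function exposes no seed parameter, so matching other tools means reproducing seed 42 on their side.

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("spark",)], ["s"])

# no seed argument is exposed; 42 is used internally
df.select(F.xxhash64("s").alias("h")).show(truncate=False)

# Rough cross-check with python-xxhash (assumptions: Spark hashes the UTF-8
# bytes of the string, and returns a signed 64-bit long while python-xxhash
# returns an unsigned digest, so a two's-complement conversion is needed):
# import xxhash
# h = xxhash.xxh64(b"spark", seed=42).intdigest()
# signed = h - (1 << 64) if h >= (1 << 63) else h
```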

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-11 Thread Ashley McManamon
Hi Mich, Thanks for the reply. I did come across that file but it didn't align with the appearance of `PartitionedFile`: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala In fact, the code snippet you shared also

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread Mich Talebzadeh
interesting. So below should be the corrected code with the suggestion in the [SPARK-47718] .sql() does not recognize watermark defined upstream - ASF JIRA (apache.org) <https://issues.apache.org/jira/browse/SPARK-47718> # Define schema for parsing Kafka messages schema = Stru
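
A condensed sketch of the pattern this thread converges on: define the watermark on the DataFrame, register a view, then aggregate via .sql(). Kafka options and column names are illustrative, and whether .sql() honors the upstream watermark is exactly what SPARK-47718 tracks.

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

parsed = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # assumed
          .option("subscribe", "temperature")                # assumed
          .load()
          .select(F.col("timestamp"),
                  F.col("value").cast("string").cast("double").alias("temperature")))

# the watermark is defined upstream, on the DataFrame
parsed.withWatermark("timestamp", "5 minutes").createOrReplaceTempView("readings")

avg_temp = spark.sql("""
    SELECT window(timestamp, '5 minutes') AS w, avg(temperature) AS avg_temp
    FROM readings
    GROUP BY window(timestamp, '5 minutes')
""")
```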

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread 刘唯
>>> col("parsed_value.temperature").alias("temperature")) >>> """ >>> We work out the window and the AVG(temperature) in the window's >>> timeframe below

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Mich Talebzadeh
Hi, I believe this is the package https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala And the code case class FilePartition(index: Int, files: Array[PartitionedFile]) extends Partition

[Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Ashley McManamon
Hi All, I've been diving into the source code to get a better understanding of how file splitting works from a user perspective. I've hit a dead end at `PartitionedFile`, for which I cannot seem to find a definition. It appears as though it should be found at

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-06 Thread 刘唯
("timestamp", "5 minutes"). \ >> > groupBy(window(resultC.timestamp, "5 minutes", "5 >> > minutes")). \ >> > avg('temperature') >> > >> > # We take the above DataF

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
ultMF = resultM. \ > select( \ > F.col("window.start").alias("startOfWindow") \ > , F.col("window.end").alias("endOfWindow") \ > , F.col("

RE: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
as a string and used as the key. > We take all the columns of the DataFrame and serialize them as > a JSON string, putting the results in the "value" of the record. > """ > result = resultMF.withColumn("uuid", uui

Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
ure)) AS value") \ .writeStream \ .outputMode('complete') \ .format("kafka") \ .option("kafka.bootstrap.servers", config['MDVariables']['bootstrapServers'],) \ .option("topic", &qu

[Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
Hello! I am attempting to write a streaming pipeline that would consume data from a Kafka source, manipulate the data, and then write results to a downstream sink (Kafka, Redis, etc). I want to write fully formed SQL instead of using the function API that Spark offers. I read a few guides

Re: Bugs with joins and SQL in Structured Streaming

2024-03-11 Thread Andrzej Zera
red Streaming in production for almost a year >>> already and I want to share the bugs I found in this time. I created a test >>> for each of the issues and put them all here: >>> https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala

Re: Bugs with joins and SQL in Structured Streaming

2024-02-27 Thread Andrzej Zera
>> https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala >> >> I split the issues into three groups: outer joins on event time, interval >> joins and Spark SQL. >> >> Issues related to outer joins: >> >> - When joining three or mor

Re: Bugs with joins and SQL in Structured Streaming

2024-02-26 Thread Mich Talebzadeh
f the issues and put them all here: > https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala > > I split the issues into three groups: outer joins on event time, interval > joins and Spark SQL. > > Issues related to outer joins: > >- When joining three or

Bugs with joins and SQL in Structured Streaming

2024-02-26 Thread Andrzej Zera
into three groups: outer joins on event time, interval joins and Spark SQL. Issues related to outer joins: - When joining three or more input streams on event time, if two or more streams don't contain an event for a join key (which is event time), no row will be output even if other

[Spark SQL]: Crash when attempting to select PostgreSQL bpchar without length specifier in Spark 3.5.0

2024-01-29 Thread Lily Hahn
Hi, I’m currently migrating an ETL project to Spark 3.5.0 from 3.2.1 and ran into an issue with some of our queries that read from PostgreSQL databases. Any attempt to run a Spark SQL query that selects a bpchar without a length specifier from the source DB seems to crash

Re: Validate spark sql

2023-12-26 Thread Gourav Sengupta
Dear friend, thanks a ton, I was looking for linting for SQL for a long time; it looks like https://sqlfluff.com/ is something that can be used :) Thank you so much, and wish you all a wonderful new year. Regards, Gourav On Tue, Dec 26, 2023 at 4:42 AM Bjørn Jørgensen wrote: > You can try sqlfl

Re: Validate spark sql

2023-12-26 Thread Mich Talebzadeh
Worth trying the EXPLAIN statement <https://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html> as suggested by @tianlangstudio HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.co

Re: Validate spark sql

2023-12-25 Thread Bjørn Jørgensen
You can try sqlfluff <https://sqlfluff.com/>, it's a linter for SQL code and it seems to have support for sparksql <https://pypi.org/project/sqlfluff/>. On Mon, 25 Dec 2023 at 17:13, ram manickam wrote: > Thanks Mich, Nicholas. I tried looking over the stack overflow post and
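
A quick sketch of checking a statement under the sparksql dialect with sqlfluff's Python API (assuming sqlfluff is installed; parse failures and style issues both surface as violations):

```
import sqlfluff

sql = "SELECT id FROM my_table WHERE"  # deliberately incomplete
violations = sqlfluff.lint(sql, dialect="sparksql")
for v in violations:
    print(v)
```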

Re: Validate spark sql

2023-12-25 Thread Bjørn Jørgensen
or column existence. >> >> is not correct. When you call spark.sql(…), Spark will lookup the table >> references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them. >> >> Also, when you run DDL via spark.sql(…), Spark will actually run it. So >> spark.sql(

Re: Validate spark sql

2023-12-25 Thread tianlangstudio
What about EXPLAIN? https://spark.apache.org/docs/3.5.0/sql-ref-syntax-qry-explain.html#content Fusion Zhu <https://www.tianlang.tech/>

Re: Validate spark sql

2023-12-24 Thread ram manickam
re not validating against table or column existence. >> >> is not correct. When you call spark.sql(…), Spark will lookup the table >> references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them. >> >> Also, when you run DDL via spark.sql(…), Spark will actually r

Re: Validate spark sql

2023-12-24 Thread Mich Talebzadeh
p the table > references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them. > > Also, when you run DDL via spark.sql(…), Spark will actually run it. So > spark.sql(“drop table my_table”) will actually drop my_table. It’s not a > validation-only operation. > > This question of validati

Re: Validate spark sql

2023-12-24 Thread Nicholas Chammas
ces and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them. Also, when you run DDL via spark.sql(…), Spark will actually run it. So spark.sql(“drop table my_table”) will actually drop my_table. It’s not a validation-only operation. This question of validating SQL is already discussed on St
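
A sketch of the behavior described above: analysis runs eagerly when spark.sql() is called, so missing relations fail immediately, while DDL is executed rather than merely validated.

```
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()

try:
    spark.sql("SELECT * FROM no_such_table")  # analysis fails eagerly
except AnalysisException as e:
    print(e)  # TABLE_OR_VIEW_NOT_FOUND

# By contrast, DDL really runs: spark.sql("DROP TABLE my_table")
# drops my_table rather than merely validating the statement.
```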

[sql] how to connect query stage to Spark job/stages?

2023-11-29 Thread Chenghao Lyu
Hi, I am seeking advice on measuring the performance of each QueryStage (QS) when AQE is enabled in Spark SQL. Specifically, I need help to automatically map a QS to its corresponding jobs (or stages) to get the QS runtime metrics. I recorded the QS structure via a customized injected Query

[Spark-sql 3.2.4] Wrong Statistics Info From 'ANALYZE TABLE' Command

2023-11-24 Thread Nick Luo
Hi, all The ANALYZE TABLE command is run from Spark on a Hive table. Question: Before I ran the 'ANALYZE TABLE' command on the Spark-sql client, I ran the 'ANALYZE TABLE' command on the Hive client, and the wrong statistics info showed up. For example 1. run the analyze table command on the hive client - create table

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-11-07 Thread Suyash Ajmera
> > On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, > wrote: > >> I have upgraded my spark job from spark 3.3.1 to spark 3.5.0, I am >> querying to Mysql Database and applying >> >> `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working >>

[Spark SQL] [Bug] Adding `checkpoint()` causes "column [...] cannot be resolved" error

2023-11-05 Thread Robin Zimmerman
Hi all, Wondering if anyone has run into this as I can't find any similar issues in JIRA, mailing list archives, Stack Overflow, etc. I had a query that was running successfully, but the query planning time was extremely long (4+ hours). To fix this I added `checkpoint()` calls earlier in the

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-13 Thread Suyash Ajmera
ark 3.3.1 to spark 3.5.0, I am > querying to Mysql Database and applying > > `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working > as expected in spark 3.3.1 , but not working with 3.5.0. > > Where Condition :: `*UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR

[ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-12 Thread Suyash Ajmera
I have upgraded my spark job from spark 3.3.1 to spark 3.5.0. I am querying a Mysql database and applying `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working as expected in spark 3.3.1, but not working with 3.5.0. Where Condition :: `*UPPER(vn) = 'ERICSSON' AND (upper(st

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-18 Thread Mich Talebzadeh
>>> server, which I launch like so: >>> >>> ~/spark/sbin/start-thriftserver.sh --master spark://10.0.50.1:7077 >>> >>> The cluster runs in standalone mode and does not use Yarn for resource >>> management. As a result, the Spark Thrift server acquir

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
e only application that runs on the cluster is the Spark Thrift server, >> which I launch like so: >> >> ~/spark/sbin/start-thriftserver.sh --master spark://10.0.50.1:7077 >> >> The cluster runs in standalone mode and does not use Yarn for resource >> manageme

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
s is okay; as of right now, I am the > only user of the cluster. If I add more users, they will also be SQL users, > submitting queries through the Thrift server. > > Let me know if you have any other questions or thoughts. > > Thanks, > > Patrick > > On Thu, Au

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
acquires all available cluster resources when it starts. This is okay; as of right now, I am the only user of the cluster. If I add more users, they will also be SQL users, submitting queries through the Thrift server. Let me know if you have any other questions or thoughts. Thanks, Patrick On Thu

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
, although I couldn't >>>>> figure out how to get it to use the metastore_db from Spark. >>>>> >>>>> After turning my attention back to Spark, I determined the issue. >>>>> After much troubleshooting, I discovered that if I performed a C

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
er >>>> much troubleshooting, I discovered that if I performed a COUNT(*) using >>>> the same JOINs, the problem query worked. I removed all the columns from >>>> the SELECT statement and added them one by one until I found the culprit. >>>> It's

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
mpletes. If I >>> remove all explicit references to this column, the query works fine. Since >>> I need this column in the results, I went back to the ETL and extracted the >>> values to a dimension table. I replaced the text column in the source table >>> with a

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
On the topic of Hive, does anyone have any detailed resources for how to >> set up Hive from scratch? Aside from the official site, since those >> instructions didn't work for me. I'm starting to feel uneasy about building >> my process around Spark. There really shouldn't be a

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Mich Talebzadeh
> my process around Spark. There really shouldn't be any instances where I > ask Spark to run legal ANSI SQL code and it just does nothing. In the past > 4 days I've run into 2 of these instances, and the solution was more voodoo > and magic than examining errors/logs and fixing cod

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Patrick Tucci
uneasy about building my process around Spark. There really shouldn't be any instances where I ask Spark to run legal ANSI SQL code and it just does nothing. In the past 4 days I've run into 2 of these instances, and the solution was more voodoo and magic than examining errors/logs and fixing code. I

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
n no case be liable for any monetary damages >> arising from such loss, damage or destruction. >> >> >> >> >> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci >> wrote: >> >>> Hi Mich, >>> >>> Thanks for the feedback. My orig

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci wrote: >> Hi Mich, >> >> Thanks for the feedback. My original intention after reading your >> response was to stick to Hive for managing tables. Unfortunately, I'm >> running into another

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
uction. On Sat, 12 Aug 2023 at 12:03, Patrick Tucci wrote: > Hi Mich, > > Thanks for the feedback. My original intention after reading your response > was to stick to Hive for managing tables. Unfortunately, I'm running into > another case of SQL scripts hanging. Since all table

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Hi Mich, Thanks for the feedback. My original intention after reading your response was to stick to Hive for managing tables. Unfortunately, I'm running into another case of SQL scripts hanging. Since all tables are already Parquet, I'm out of troubleshooting options. I'm going to migrate

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
connect to Spark through Thrift server and have it write tables > using Delta Lake instead of Hive. From this StackOverflow question, it > looks like this is possible: > https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Patrick Tucci
, so I need to be able to connect to Spark through Thrift server and have it write tables using Delta Lake instead of Hive. From this StackOverflow question, it looks like this is possible: https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
Steve may have a valid point. You raised an issue with concurrent writes before, if I recall correctly. This limitation may be due to the Hive metastore, as by default Spark uses Apache Derby for its database persistence. *However it is limited to only one Spark session at any time for the purposes

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Stephen Coy
Hi Patrick, When this has happened to me in the past (admittedly via spark-submit) it has been because another job was still running and had already claimed some of the resources (cores and memory). I think this can also happen if your configuration tries to claim resources that will never be

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
and sounds like there is no >> password! >> >> Once inside that host, hive logs are kept in your case >> /tmp/hadoop/hive.log or go to /tmp and do >> >> /tmp> find ./ -name hive.log. It should be under /tmp/hive.log >> >> Try running the s

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
n your case and sounds like there is no > password! > > Once inside that host, hive logs are kept in your case > /tmp/hadoop/hive.log or go to /tmp and do > > /tmp> find ./ -name hive.log. It should be under /tmp/hive.log > > Try running the sql inside hive and see what it says > >

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
are kept in your case /tmp/hadoop/hive.log or go to /tmp and do /tmp> find ./ -name hive.log. It should be under /tmp/hive.log Try running the sql inside hive and see what it says HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile <

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
hadoop -f command.sql Thanks again for your help. Patrick On Thu, Aug 10, 2023 at 2:24 PM Mich Talebzadeh wrote: > Can you run this sql query through hive itself? > > Are you using this command or similar for your thrift server? > > beeline -u jdbc:hive2:/

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Can you run this sql query through hive itself? Are you using this command or similar for your thrift server? beeline -u jdbc:hive2:///1/default org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my

Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hello, I'm attempting to run a query on Spark 3.4.0 through the Spark ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in standalone mode using HDFS for storage. The query is as follows: SELECT ME.*, MB.BenefitID FROM MemberEnrollment ME JOIN MemberBenefits MB ON ME.ID =

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Mich Talebzadeh
y utilizing an open table format with concurrency control. Several >> formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast >> Format, offer this capability. All of them provide advanced features that >> will work better in different use cases according to the

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Patrick Tucci
023 at 4:28 PM Mich Talebzadeh > wrote: > >> It is not Spark SQL that throws the error. It is the underlying Database >> or layer that throws the error. >> >> Spark acts as an ETL tool. What is the underlying DB where the table >> resides? Is concurrency supported. Pleas

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Pol Santamaria
that will work better in different use cases according to the writing pattern, type of queries, data characteristics, etc. *Pol Santamaria* On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh wrote: > It is not Spark SQL that throws the error. It is the underlying Database > or layer that

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Mich Talebzadeh
It is not Spark SQL that throws the error. It is the underlying database or layer that throws the error. Spark acts as an ETL tool. What is the underlying DB where the table resides? Is concurrency supported? Please send the error to this list HTH Mich Talebzadeh, Solutions Architect

Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Patrick Tucci
Hello, I'm building an application on Spark SQL. The cluster is set up in standalone mode with HDFS as storage. The only Spark application running is the Spark Thrift Server using FAIR scheduling mode. Queries are submitted to Thrift Server using beeline. I have multiple queries that insert rows

Re: [Spark SQL] Data objects from query history

2023-07-03 Thread Jack Wells
e. > > I have been exploring the capabilities of Spark SQL and Databricks, and I > have encountered a challenge related to accessing the data objects used by > queries from the query history. I am aware that Databricks provides a > comprehensive query history that contains valuable inf

[Spark SQL] Data objects from query history

2023-06-30 Thread Ruben Mennes
exploring the capabilities of Spark SQL and Databricks, and I have encountered a challenge related to accessing the data objects used by queries from the query history. I am aware that Databricks provides a comprehensive query history that contains valuable information about executed queries. However

[Spark-SQL] Dataframe write saveAsTable failed

2023-06-26 Thread Anil Dasari
Hi, We have upgraded Spark from 2.4.x to 3.3.1 recently and managed table creation while writing dataframe as saveAsTable failed with below error. Can not create the managed table(``) The associated location('hdfs:') already exists. On high level our code does below before writing dataframe as

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
he process to > go faster. > > Patrick > > On Mon, Jun 26, 2023 at 12:52 PM Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: > >> OK for now have you analyzed statistics in Hive external table >> >> spark-sql (default)> ANALYZE TABLE test.stg_t

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
: > OK for now have you analyzed statistics in Hive external table > > spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL > COLUMNS; > spark-sql (default)> DESC EXTENDED test.stg_t2; > > Hive external tables have little optimization > > HTH >

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
OK for now have you analyzed statistics in Hive external table spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL COLUMNS; spark-sql (default)> DESC EXTENDED test.stg_t2; Hive external tables have little optimization HTH Mich Talebzadeh, Solutions Architect/Engin

Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
Hello, I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master node has 2 cores and 8GB of RAM. There is a single worker node with 8 cores and 64GB of RAM. I'm trying to process a large pipe delimited file that has been compressed with gzip (9.2GB zipped, ~58GB unzipped, ~241m
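
One commonly suggested mitigation for this shape of problem, as a sketch (paths and options are illustrative): gzip is not a splittable codec, so the whole file is decompressed by a single task regardless of cluster size; re-staging the data in a splittable format first lets the subsequent CTAS parallelize.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# the .gz file is necessarily read by one task
raw = (spark.read.option("sep", "|").option("header", "true")
       .csv("hdfs:///staging/big_file.txt.gz"))  # assumed path

# rewrite once as splittable Parquet, then run the heavy SQL against that
raw.repartition(64).write.mode("overwrite").parquet("hdfs:///staging/big_file_pq")
```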

RE: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-10 Thread Vijay B
eason, I can ONLY do this in Spark SQL, instead of either Scala or > PySpark environment. > > I want to aggregate an array into a Map of element count, within that array, > but in Spark SQL. > I know that there is an aggregate function available like > > aggregate(expr, start,

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-09 Thread Yong Zhang
acc -> acc) AS feq_cnt Here are my questions: * Is using "map()" above the best way? The "start" structure in this case should be Map.empty[String, Int], but of course, it won't work in pure Spark SQL, so the best solution I can think of is "map()", and it is

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-06 Thread Mich Talebzadeh
you can create DF from your SQL RS and work with that in Python the way you want ## you don't need all these import findspark findspark.init() from pyspark.sql import SparkSession from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql.functions import udf, col

Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-05 Thread Yong Zhang
Hi, This is on Spark 3.1 environment. For some reason, I can ONLY do this in Spark SQL, instead of either Scala or PySpark environment. I want to aggregate an array into a Map of element count, within that array, but in Spark SQL. I know that there is an aggregate function available like
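
One possible pure-SQL approach, as a sketch (not necessarily the best one): build the map from the distinct elements and count each with a higher-order filter, which sidesteps the typed Map.empty start value that aggregate() would need.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT map_from_arrays(
             array_distinct(arr),
             transform(array_distinct(arr),
                       k -> size(filter(arr, x -> x = k)))
           ) AS freq_cnt
    FROM (SELECT array('a', 'b', 'a') AS arr) t
""").show(truncate=False)
# expected: {a -> 2, b -> 1}
```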

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Oliver Ruebenacker
e gene is located upstream or downstream of the variant. >>>> >>>> On Thu, 23 Feb 2023 at 20:48, Russell Jurney <russell.jur...@gmail.com> wrote: >>>>

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Russell Jurney
; >>>> Usually, the solution to these problems is to do less per line, break >>>> it out and perform each minute operation as a field, then combine those >>>> into a final answer. Can you do that here? >>>> >>>> Thanks, >>>> Russell Jurney

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Oliver Ruebenacker
>> Russell Jurney @rjurney <http://twitter.com/rjurney> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB <http://facebook.com/jurney> datasyndrome.com Book a time on Calendly <https://calendly.com/rjurney_personal/30mi

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
@broadinstitute.org> wrote: >> >>> Here is the complete error: >>> >>> ``` >>> Traceback (most recent call last): >>> File "nearest-gene.py", line 74, in >>> main() >>> File "nearest-gene.py

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Bjørn Jørgensen
File "nearest-gene.py", line 62, in main >> distances = joined.withColumn("distance", max(col("start") - >> col("position"), col("position") - col("end"), 0)) >> File >> "/mnt/yarn/usercache/hadoop/appcache/applicat

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Russell Jurney
quot;position") - col("end"), 0)) > File > "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_01/pyspark.zip/pyspark/sql/column.py", > line 907, in __nonzero__ > ValueError: Cannot convert column into bool

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Oliver Ruebenacker
ol("position") - col("end"), 0)) File "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_01/pyspark.zip/pyspark/sql/column.py", line 907, in __nonzero__ ValueError: Cannot convert column into bool: please use '&

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That error sounds like it's from pandas not spark. Are you sure it's this line? On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > > Hello, > > I'm trying to calculate the distance between a gene (with start and end) > and a variant (with position),

[PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Oliver Ruebenacker
Hello, I'm trying to calculate the distance between a gene (with start and end) and a variant (with position), so I joined gene and variant data by chromosome and then tried to calculate the distance like this: ``` distances = joined.withColumn("distance", max(col("start") -
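
The ValueError quoted in the replies above comes from Python's builtin max(), which tries to coerce Column objects to bool; a sketch of the likely intended element-wise maximum using pyspark's greatest() instead (the schema here is illustrative):

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
joined = spark.createDataFrame(
    [(100, 200, 150), (100, 200, 300)],
    ["start", "end", "position"])

distances = joined.withColumn(
    "distance",
    F.greatest(F.col("start") - F.col("position"),
               F.col("position") - F.col("end"),
               F.lit(0)))
distances.show()
```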

Re: Upgrading from Spark SQL 3.2 to 3.3 failed

2023-02-15 Thread lk_spark
I need to use the cast function to surround the computed expression, then the EXPLAIN of the SQL is ok, for example: cast(a.Split_Amt * b.percent / 100 as decimal(20,8)) as split_amt I don't know why. Is there a config property that could restore compatibility with spark 3.2? At 2023-02-16 13:47:25

Upgrading from Spark SQL 3.2 to 3.3 failed

2023-02-15 Thread lk_spark
hi, all: I have a sql statement which can be run on spark 3.2.1 but not on spark 3.3.1. When I try to explain it, I get an error with the message: org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to org.apache.spark.sql.catalyst.expressions.AnsiCast execute the sql, error stack

Fwd: [Spark SQL] : Delete is only supported on V2 tables.

2023-02-09 Thread Jeevan Chhajed
-- Forwarded message - From: Jeevan Chhajed Date: Tue, 7 Feb 2023, 15:16 Subject: [Spark SQL] : Delete is only supported on V2 tables. To: Hi, How do we create V2 tables? I tried a couple of things using sql but was unable to do so. Can you share links/content

[Spark SQL]: Spark 3.2 generates different results to query when columns name have mixed casing vs when they have same casing

2023-02-08 Thread Amit Singh Rathore
Hi Team, I am running a query in Spark 3.2. val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4","col5") val op_cols_same_case = List("id","col2","col3","col4","col5","id") val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*)
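
A repro sketch of the comparison the thread describes (assuming spark.sql.caseSensitive is left at its default of false, so "id" and "ID" should resolve to the same column):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, 2, 3, 4, 5)],
                            ["id", "col2", "col3", "col4", "col5"])

same_case = ["id", "col2", "col3", "col4", "col5", "id"]
mixed_case = ["id", "col2", "col3", "col4", "col5", "ID"]

# the report is that these two selects behave differently in Spark 3.2
df1.select(*same_case).show()
df1.select(*mixed_case).show()
```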

[Spark SQL] : Delete is only supported on V2 tables.

2023-02-07 Thread Jeevan Chhajed
Hi, How do we create V2 tables? I tried a couple of things using sql but was unable to do so. Can you share links/content it will be of much help. Is delete support on V2 tables still under dev ? Thanks, Jeevan

SQL GROUP BY alias with dots, was: Spark SQL question

2023-02-07 Thread Enrico Minack
show() spark.sql("SELECT `an.id` FROM ids_with_struct GROUP BY `an.id`").show() This does not feel very consistent. Enrico Am 28.01.23 um 00:34 schrieb Kohki Nishio: this SQL works select 1 as *`data.group`* from tbl group by *data.group* Since there's no such field as *data,* I tho

Re: Create table before inserting in SQL

2023-02-02 Thread Harut Martirosyan
Thank you very much. I understand the performance implications and that Spark will download it before modifying. The JDBC database is just extremely small, it’s the BI/aggregated layer. What’s interesting is that here it says I can use JDBC https://spark.apache.org/docs/3.3.1/sql-ref-syntax

Re: Create table before inserting in SQL

2023-02-02 Thread Mich Talebzadeh
you may be able to do so in Python or Scala but I don't know the way in pure SQL. If your JDBC database is Hive you can do so easily HTH view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer

Re: Create table before inserting in SQL

2023-02-02 Thread Harut Martirosyan
Generally, the problem is that I don’t find a way to automatically create a JDBC table in the JDBC database when I want to insert data into it using Spark SQL only, not DataFrames API. > On 2 Feb 2023, at 21:22, Harut Martirosyan > wrote: > > Hi, thanks for the reply. > >

Re: Create table before inserting in SQL

2023-02-02 Thread Harut Martirosyan
Hi, thanks for the reply. Let’s imagine we have a parquet based table called parquet_table, now I want to insert it into a new JDBC table, all using pure SQL. If the JDBC table already exists, it’s easy, we do CREATE TABLE USING JDBC and then we do INSERT INTO that table. If the table doesn’t
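
A sketch of the two halves described above (connection options are placeholders): pure SQL covers registering and inserting into a table that already exists on the JDBC side, while creating the remote table itself appears to require the DataFrame API.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# works when the target table already exists in the remote database
spark.sql("""
    CREATE TABLE jdbc_target
    USING org.apache.spark.sql.jdbc
    OPTIONS (
      url 'jdbc:postgresql://host:5432/bi',
      dbtable 'public.target',
      user 'u',
      password 'p'
    )
""")
spark.sql("INSERT INTO jdbc_target SELECT * FROM parquet_table")

# auto-creating the remote table falls back to the DataFrame API:
# spark.table("parquet_table").write.mode("overwrite") \
#      .jdbc("jdbc:postgresql://host:5432/bi", "public.target",
#            properties={"user": "u", "password": "p"})
```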

Re: Create table before inserting in SQL

2023-02-01 Thread Mich Talebzadeh
Hi, It is not very clear your statement below: ".. If the table existed, I would create a table using JDBC in spark SQL and then insert into it, but I can't create a table if it doesn't exist in JDBC database..." If the table exists in your JDBC database, why do you need to create it

Create table before inserting in SQL

2023-02-01 Thread Harut Martirosyan
I have a resultset (defined in SQL), and I want to insert it into my JDBC database using only SQL, not dataframes API. If the table existed, I would create a table using JDBC in spark SQL and then insert into it, but I can't create a table if it doesn't exist in JDBC database. How to do
