Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
quote "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>). On Fri, 3 May 2024 at 00:54, Mich Talebzadeh wrote: > An issue I encountered while wor

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Jungtaek Lim
e. You can initiate a feature request and wish the community to include that into the roadmap. On Fri, May 3, 2024 at 12:01 PM Mich Talebzadeh wrote: > An issue I encountered while working with Materialized Views in Spark SQL. > It appears that there is an inconsistency between the behavior o

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
Thanks, Walaa. On Thu, May 2, 2024 at 4:55 PM Mich Talebzadeh wrote: > An issue I encountered while working with Materialized Views in Spark SQL. > It appears that there is an inconsistency between the behavior of > Materialized Views in Spark SQL and Hive. > > When attemp

Issue with Materialized Views in Spark SQL

2024-05-02 Thread Mich Talebzadeh
An issue I encountered while working with Materialized Views in Spark SQL. It appears that there is an inconsistency between the behavior of Materialized Views in Spark SQL and Hive. When attempting to execute a statement like DROP MATERIALIZED VIEW IF EXISTS test.mv in Spark SQL, I encountered
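For context, Spark SQL's parser has no MATERIALIZED VIEW statements at all, which is why the DROP fails. A rough, hedged approximation (no automatic refresh; table and view names are illustrative) is an ordinary view plus an explicit cache:

```sql
-- Spark SQL has no materialized views; an ordinary view plus an explicit
-- cache is the closest built-in approximation (refresh is manual).
CREATE OR REPLACE VIEW test.mv AS
SELECT customer_id, SUM(amount) AS total_amount
FROM test.orders
GROUP BY customer_id;

CACHE TABLE test.mv;            -- materialize the result in memory
-- ... run queries against test.mv ...
UNCACHE TABLE test.mv;
DROP VIEW IF EXISTS test.mv;    -- works, unlike DROP MATERIALIZED VIEW
```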

How to use Structured Streaming in Spark SQL

2024-04-22 Thread ????
In Flink, you can create stream-processing tables using Flink SQL and connect directly to CDC and Kafka sources through SQL. How can SQL be used for stream processing in Spark? 308027...@qq.com

[Spark SQL][How-To] Remove builtin function support from Spark

2024-04-17 Thread Matthew McMillian
Hello, I'm very new to the Spark ecosystem, apologies if this question is a bit simple. I want to modify a custom fork of Spark to remove function support. For example, I want to remove the query runner's ability to call reflect and java_method. I saw that there exists a data structure in spark

[Spark SQL] xxhash64 default seed of 42 confusion

2024-04-16 Thread Igor Calabria
Hi all, I've noticed that Spark's xxhash64 output doesn't match other tools' due to using seed=42 as a default. I've looked at a few libraries and they use 0 as a default seed: - python https://github.com/ifduyue/python-xxhash - java https://github.com/OpenHFT/Zero-Allocation-Hashing/ - java
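A quick sketch of the mismatch; the python-xxhash call shown in the comment is an assumption based on that library's documented seed parameter:

```sql
-- Spark's xxhash64 takes no seed argument in SQL; seed 42 is baked in.
SELECT xxhash64('spark');
-- To reproduce Spark's value in an external library, that library must
-- be told to use seed 42, e.g. python-xxhash: xxhash.xxh64('spark', seed=42)
```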

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-11 Thread Ashley McManamon
Hi Mich, Thanks for the reply. I did come across that file but it didn't align with the appearance of `PartitionedFile`: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala In fact, the code snippet you shared also

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread Mich Talebzadeh
interesting. So below should be the corrected code with the suggestion in the [SPARK-47718] .sql() does not recognize watermark defined upstream - ASF JIRA (apache.org) # Define schema for parsing Kafka messages schema = StructType([

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread 刘唯
Sorry this is not a bug but essentially a user error. Spark throws a really confusing error and I'm also confused. Please see the reply in the ticket for how to make things correct. https://issues.apache.org/jira/browse/SPARK-47718 刘唯 wrote on Sat, 6 Apr 2024 at 11:41: > This indeed looks like a bug. I will

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Mich Talebzadeh
Hi, I believe this is the package https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala And the code case class FilePartition(index: Int, files: Array[PartitionedFile]) extends Partition

[Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Ashley McManamon
Hi All, I've been diving into the source code to get a better understanding of how file splitting works from a user perspective. I've hit a deadend at `PartitionedFile`, for which I cannot seem to find a definition? It appears though it should be found at

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-06 Thread 刘唯
This indeed looks like a bug. I will take some time to look into it. Mich Talebzadeh wrote on Wed, 3 Apr 2024 at 01:55: > > hm. you are getting below > > AnalysisException: Append output mode not supported when there are > streaming aggregations on streaming DataFrames/DataSets without watermark; > > The

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
hm. you are getting below AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark; The problem seems to be that you are using the append output mode when writing the streaming query results to Kafka. This mode

RE: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
Hi Mich, Thank you so much for your response. I really appreciate your help! You mentioned "defining the watermark using the withWatermark function on the streaming_df before creating the temporary view” - I believe this is what I’m doing and it’s not working for me. Here is the exact code

Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
ok let us take it for a test. The original code of mine def fetch_data(self): self.sc.setLogLevel("ERROR") schema = StructType() \ .add("rowkey", StringType()) \ .add("timestamp", TimestampType()) \ .add("temperature", IntegerType())

[Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
Hello! I am attempting to write a streaming pipeline that would consume data from a Kafka source, manipulate the data, and then write results to a downstream sink (Kafka, Redis, etc). I want to write fully formed SQL instead of using the function API that Spark offers. I read a few guides on

[Spark SQL]: Crash when attempting to select PostgreSQL bpchar without length specifier in Spark 3.5.0

2024-01-29 Thread Lily Hahn
Hi, I’m currently migrating an ETL project to Spark 3.5.0 from 3.2.1 and ran into an issue with some of our queries that read from PostgreSQL databases. Any attempt to run a Spark SQL query that selects a bpchar without a length specifier from the source DB seems to crash

Re: Validate spark sql

2023-12-26 Thread Gourav Sengupta
Dear friend, thanks a ton, I was looking for linting for SQL for a long time; it looks like https://sqlfluff.com/ is something that can be used :) Thank you so much, and wish you all a wonderful new year. Regards, Gourav On Tue, Dec 26, 2023 at 4:42 AM Bjørn Jørgensen wrote: > You can try sqlfluff

Re: Validate spark sql

2023-12-26 Thread Mich Talebzadeh
Cc: Nicholas Chammas; user <user@spark.apache.org> > Subject: Re: Validate spark sql > > Thanks Mich, Nicholas. I tried looking over the stack overflow post and > none of them > seems to cover the syntax validation. Do you know if it's even possible to > do syntax validati

Re: Validate spark sql

2023-12-25 Thread Bjørn Jørgensen
You can try sqlfluff; it's a linter for SQL code and it seems to have support for sparksql. On Mon, 25 Dec 2023 at 17:13, ram manickam wrote: > Thanks Mich, Nicholas. I tried looking over the stack overflow post and > none of them >

Re: Validate spark sql

2023-12-25 Thread Bjørn Jørgensen
Mailing lists For broad, opinion based, ask for external resources, debug issues, bugs, contributing to the project, and scenarios, it is recommended you use the user@spark.apache.org mailing list. - user@spark.apache.org is for

Re: Validate spark sql

2023-12-25 Thread tianlangstudio
s://www.tianlang.tech/ > -- From: ram manickam Sent: Monday, 25 Dec 2023 12:58 To: Mich Talebzadeh Cc: Nicholas Chammas; user Subject: Re: Validate spark sql Thanks Mich, Nicholas. I tried looking over the stack overflow post and none of them seems to cov

Re: Validate spark sql

2023-12-24 Thread ram manickam
Thanks Mich, Nicholas. I tried looking over the stack overflow post and none of them seems to cover the syntax validation. Do you know if it's even possible to do syntax validation in spark? Thanks Ram On Sun, Dec 24, 2023 at 12:49 PM Mich Talebzadeh wrote: > Well, not to put too fine a point on

Re: Validate spark sql

2023-12-24 Thread Mich Talebzadeh
Well, not to put too fine a point on it: in a public forum, one ought to respect the importance of open communication. Everyone has the right to ask questions, seek information, and engage in discussions without facing unnecessary patronization. Mich Talebzadeh, Dad | Technologist | Solutions

Re: Validate spark sql

2023-12-24 Thread Nicholas Chammas
This is a user-list question, not a dev-list question. Moving this conversation to the user list and BCC-ing the dev list. Also, this statement > We are not validating against table or column existence. is not correct. When you call spark.sql(…), Spark will lookup the table references and
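Since spark.sql(...) analyzes table references as noted above, a common hedged trick for validation is EXPLAIN, which parses and analyzes without executing (table and column names here are illustrative):

```sql
-- EXPLAIN runs the parser and analyzer but not the query itself:
-- a syntax error fails at parse time; an unknown table or column
-- fails analysis with an AnalysisException.
EXPLAIN SELECT id, name FROM some_table WHERE id > 10;
```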

[Spark-sql 3.2.4] Wrong Statistic INFO From 'ANALYZE TABLE' Command

2023-11-24 Thread Nick Luo
Hi, all The ANALYZE TABLE command is run from Spark on a Hive table. Question: Before I ran the ANALYZE TABLE command on the Spark-sql client, I ran the ANALYZE TABLE command on the Hive client, and the wrong statistic info shows up. For example 1. run the analyze table command on the hive client - create table

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-11-07 Thread Suyash Ajmera
Any update on this? On Fri, 13 Oct, 2023, 12:56 pm Suyash Ajmera, wrote: > This issue is related to CharVarcharCodegenUtils readSidePadding method . > > Appending white spaces while reading ENUM data from mysql > > Causing issue in querying , writing the same data to Cassandra. > > On Thu, 12

[Spark SQL] [Bug] Adding `checkpoint()` causes "column [...] cannot be resolved" error

2023-11-05 Thread Robin Zimmerman
Hi all, Wondering if anyone has run into this as I can't find any similar issues in JIRA, mailing list archives, Stack Overflow, etc. I had a query that was running successfully, but the query planning time was extremely long (4+ hours). To fix this I added `checkpoint()` calls earlier in the

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-13 Thread Suyash Ajmera
This issue is related to the CharVarcharCodegenUtils readSidePadding method. Appending white spaces while reading ENUM data from MySQL is causing issues in querying and writing the same data to Cassandra. On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, wrote: > I have upgraded my spark job from spark

[ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-12 Thread Suyash Ajmera
I have upgraded my spark job from spark 3.3.1 to spark 3.5.0. I am querying a Mysql database and applying `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working as expected in spark 3.3.1, but not working with 3.5.0. Where Condition :: `*UPPER(vn) = 'ERICSSON' AND (upper(st)
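Given the read-side padding blamed in the follow-up message, one hedged workaround is trimming before comparing; the table name and IN-list values below are invented for illustration:

```sql
-- If CHAR-style padding appends trailing spaces to ENUM values read
-- from MySQL, trimming both comparison inputs restores the 3.3.x match.
SELECT *
FROM remote_table
WHERE UPPER(TRIM(vn)) = 'ERICSSON'
  AND UPPER(TRIM(st)) IN ('ACTIVE', 'INACTIVE');  -- values illustrative
```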

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-18 Thread Mich Talebzadeh
responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
tables as Delta tables, the issue persists. On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
On Sat, 12 Aug 2023 at 12:03, Patrick Tucci <patrick.tu...@gmail.com> wrote: > Hi Mich,

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
Thanks again for your feedback. Patrick On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
the Spark API to Hive which prefers Parquet. I found out a few years ago. From your point of view I suggest you stick to parquet format with

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
You can also use compression STORED AS PARQUET TBLPROPERTIES ("parquet.compression"="SNAPPY") ALSO
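The compression fragment quoted above, assembled into a complete statement as a sketch (table names are illustrative):

```sql
-- CTAS into Parquet with Snappy compression, per the suggestion above.
CREATE TABLE target_table
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
AS
SELECT * FROM source_table;
```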

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Mich Talebzadeh
arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Patrick Tucci
023 at 11:26, Patrick Tucci wrote: > Thanks for the reply Stephen and Mich. > Stephen, you're right, it feels like Spark is waiting for something, but I'm

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
kground. Mich, thank you so much, your suggestion worked. Storing the tables as Parquet solves the issue. Interestingly, I found that only the MemberEnrollment table needs to be Parquet

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
solves the issue. Interestingly, I found that only the MemberEnrollment table needs to be Parquet. The ID field in MemberEnrollment is an int calculated during load by a ROW_NUMBER() function. Further testing found that if I hard code

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
ROW_NUMBER() function, the query works without issue even if both tables are ORC. Should I infer from this issue that the Hive components prefer Parquet over ORC? Furthermore, should I consider using a different table storage framewor

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
ferent solution might be more robust and stable. The main condition is that my application operates solely through Thrift server, so I need to be able to connect to Spark through Thrift server and have it write tables using Delta Lake instead of Hive. From this StackO

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
connect to Spark through Thrift server and have it write tables using Delta Lake instead of Hive. From this StackOverflow question, it looks like this is possible: https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Patrick Tucci
, so I need to be able to connect to Spark through Thrift server and have it write tables using Delta Lake instead of Hive. From this StackOverflow question, it looks like this is possible: https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
Steve may have a valid point. You raised an issue with concurrent writes before, if I recall correctly; this limitation may be due to the Hive metastore. By default Spark uses Apache Derby for its database persistence. *However it is limited to only one Spark session at any time for the purposes

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Stephen Coy
Hi Patrick, When this has happened to me in the past (admittedly via spark-submit) it has been because another job was still running and had already claimed some of the resources (cores and memory). I think this can also happen if your configuration tries to claim resources that will never be

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hi Mich, I don't believe Hive is installed. I set up this cluster from scratch. I installed Hadoop and Spark by downloading them from their project websites. If Hive isn't bundled with Hadoop or Spark, I don't believe I have it. I'm running the Thrift server distributed with Spark, like so:

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
sorry host is 10.0.50.1 Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk. Any and all

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Hi Patrick That beeline on port 1 is a hive thrift server running on your hive host 10.0.50.1:1. If you can access that host, you should be able to log into hive by typing hive. The os user is hadoop in your case and it sounds like there is no password! Once inside that host, hive logs

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hi Mich, Thanks for the reply. Unfortunately I don't have Hive set up on my cluster. I can explore this if there are no other ways to troubleshoot. I'm using beeline to run commands against the Thrift server. Here's the command I use: ~/spark/bin/beeline -u jdbc:hive2://10.0.50.1:1 -n

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Can you run this sql query through hive itself? Are you using this command or similar for your thrift server? beeline -u jdbc:hive2:///1/default org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my

Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hello, I'm attempting to run a query on Spark 3.4.0 through the Spark ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in standalone mode using HDFS for storage. The query is as follows: SELECT ME.*, MB.BenefitID FROM MemberEnrollment ME JOIN MemberBenefits MB ON ME.ID =
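When a query hangs like this, comparing plans is a cheap first step; note the join condition below is completed as an assumption, since the original message is truncated at `ME.ID =`:

```sql
-- Inspect the physical plan without running the query; a
-- BroadcastNestedLoopJoin or a missing join key often explains a hang.
EXPLAIN FORMATTED
SELECT ME.*, MB.BenefitID
FROM MemberEnrollment ME
JOIN MemberBenefits MB
  ON ME.ID = MB.EnrollmentID;   -- join key assumed for illustration
```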

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Mich Talebzadeh
y utilizing an open table format with concurrency control. Several formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast Format, offer this capability. All of them provide advanced features that will work better in different use cases according to the

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Patrick Tucci
023 at 4:28 PM Mich Talebzadeh wrote: > It is not Spark SQL that throws the error. It is the underlying Database or layer that throws the error. > Spark acts as an ETL tool. What is the underlying DB where the table resides? Is concurrency supported? Pleas

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Pol Santamaria
that will work better in different use cases according to the writing pattern, type of queries, data characteristics, etc. *Pol Santamaria* On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh wrote: > It is not Spark SQL that throws the error. It is the underlying Database > or layer that

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Mich Talebzadeh
It is not Spark SQL that throws the error. It is the underlying database or layer that throws the error. Spark acts as an ETL tool. What is the underlying DB where the table resides? Is concurrency supported? Please send the error to this list HTH Mich Talebzadeh, Solutions Architect

Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Patrick Tucci
Hello, I'm building an application on Spark SQL. The cluster is set up in standalone mode with HDFS as storage. The only Spark application running is the Spark Thrift Server using FAIR scheduling mode. Queries are submitted to Thrift Server using beeline. I have multiple queries that insert rows

Re: [Spark SQL] Data objects from query history

2023-07-03 Thread Jack Wells
e. I have been exploring the capabilities of Spark SQL and Databricks, and I have encountered a challenge related to accessing the data objects used by queries from the query history. I am aware that Databricks provides a comprehensive query history that contains valuable inf

[Spark SQL] Data objects from query history

2023-06-30 Thread Ruben Mennes
exploring the capabilities of Spark SQL and Databricks, and I have encountered a challenge related to accessing the data objects used by queries from the query history. I am aware that Databricks provides a comprehensive query history that contains valuable information about executed queries. However

[Spark-SQL] Dataframe write saveAsTable failed

2023-06-26 Thread Anil Dasari
Hi, We have upgraded Spark from 2.4.x to 3.3.1 recently and managed table creation while writing dataframe as saveAsTable failed with below error. Can not create the managed table(``) The associated location('hdfs:') already exists. On high level our code does below before writing dataframe as

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
he process to go faster. Patrick On Mon, Jun 26, 2023 at 12:52 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > OK for now have you analyzed statistics in Hive external table > spark-sql (default)> ANALYZE TABLE test.stg_t

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
: OK for now have you analyzed statistics in Hive external table spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL COLUMNS; spark-sql (default)> DESC EXTENDED test.stg_t2; Hive external tables have little optimization HTH

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
OK for now have you analyzed statistics in Hive external table spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL COLUMNS; spark-sql (default)> DESC EXTENDED test.stg_t2; Hive external tables have little optimization HTH Mich Talebzadeh, Solutions Architect/Engin

Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
Hello, I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master node has 2 cores and 8GB of RAM. There is a single worker node with 8 cores and 64GB of RAM. I'm trying to process a large pipe delimited file that has been compressed with gzip (9.2GB zipped, ~58GB unzipped, ~241m
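One likely factor: a .gz file is not splittable, so a single task must decompress the whole ~58 GB stream. A hedged mitigation is to stage the data once into a splittable format; paths and schema below are invented:

```sql
-- Define the gzipped pipe-delimited file as an external CSV table ...
CREATE TABLE staging_raw (col1 STRING, col2 STRING)   -- schema illustrative
USING CSV
OPTIONS (path '/data/big_file.gz', sep '|');

-- ... then rewrite it once into Parquet; later queries parallelize
-- across the cluster instead of waiting on one gzip reader.
CREATE TABLE staging_parquet
USING PARQUET
AS SELECT * FROM staging_raw;
```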

RE: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-10 Thread Vijay B
eason, I can ONLY do this in Spark SQL, instead of either Scala or PySpark environment. I want to aggregate an array into a Map of element count, within that array, but in Spark SQL. I know that there is an aggregate function available like aggregate(expr, start,

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-09 Thread Yong Zhang
acc -> acc) AS feq_cnt Here are my questions: * Is using "map()" above the best way? The "start" structure in this case should be Map.empty[String, Int], but of course, it won't work in pure Spark SQL, so the best solution I can think of is "map()"

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-06 Thread Mich Talebzadeh
e author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Fri, 5 May 2023 at 20:33, Yong Zhang wrote: > Hi, This is on Spark 3.1 environment. > For some reason, I can ONLY do this in Spark SQL, instead of either Scala > or PySpark en

Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-05 Thread Yong Zhang
Hi, This is on Spark 3.1 environment. For some reason, I can ONLY do this in Spark SQL, instead of either Scala or PySpark environment. I want to aggregate an array into a Map of element count, within that array, but in Spark SQL. I know that there is an aggregate function available like
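One pure-SQL shape for the question as posed, as a sketch (table and column names are invented): explode the array, count per element, and rebuild a map:

```sql
-- For each row id, turn array column arr into a map of element -> count.
SELECT id,
       map_from_entries(collect_list(struct(elem, cnt))) AS freq_cnt
FROM (
  SELECT id, elem, COUNT(*) AS cnt
  FROM t
  LATERAL VIEW explode(arr) x AS elem
  GROUP BY id, elem
)
GROUP BY id;
```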

Re:Upgrading from Spark SQL 3.2 to 3.3 faild

2023-02-15 Thread lk_spark
.scala:176) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPla

Upgrading from Spark SQL 3.2 to 3.3 faild

2023-02-15 Thread lk_spark
) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30

Fwd: [Spark SQL] : Delete is only supported on V2 tables.

2023-02-09 Thread Jeevan Chhajed
-- Forwarded message - From: Jeevan Chhajed Date: Tue, 7 Feb 2023, 15:16 Subject: [Spark SQL] : Delete is only supported on V2 tables. To: Hi, How do we create V2 tables? I tried a couple of things using sql but was unable to do so. Can you share links/content

[Spark SQL]: Spark 3.2 generates different results to query when columns name have mixed casing vs when they have same casing

2023-02-08 Thread Amit Singh Rathore
Hi Team, I am running a query in Spark 3.2. val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5") val op_cols_same_case = List("id","col2","col3","col4", "col5", "id") val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
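The behavior described here hinges on spark.sql.caseSensitive, which defaults to false; a hedged way to probe the difference (view name illustrative, assuming the DataFrame is registered as a temp view):

```sql
SET spark.sql.caseSensitive=false;  -- the default: "id" and "ID" collide
SELECT id, ID FROM df1;             -- both names resolve to the same column
SET spark.sql.caseSensitive=true;   -- now "ID" must exist with that exact casing
```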

[Spark SQL] : Delete is only supported on V2 tables.

2023-02-07 Thread Jeevan Chhajed
Hi, How do we create V2 tables? I tried a couple of things using sql but was unable to do so. Can you share links/content? It will be of much help. Is delete support on V2 tables still under dev? Thanks, Jeevan

SQL GROUP BY alias with dots, was: Spark SQL question

2023-02-07 Thread Enrico Minack
Hi, you are right, that is an interesting question. Looks like GROUP BY is doing something funny / magic here (spark-shell 3.3.1 and 3.5.0-SNAPSHOT): With an alias, it behaves as you have pointed out: spark.range(3).createTempView("ids_without_dots") spark.sql("SELECT * FROM

Re: Spark SQL question

2023-01-28 Thread Bjørn Jørgensen
at 09:22, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > LOL > First one > spark-sql> select 1 as `data.group` from abc group by data.group; > 1 > Time taken: 0.198 seconds, Fetched 1 row(s) > means that you are assigning alias data.group to selec

Re: Spark SQL question

2023-01-28 Thread Mich Talebzadeh
LOL First one spark-sql> select 1 as `data.group` from abc group by data.group; 1 Time taken: 0.198 seconds, Fetched 1 row(s) means that you are assigning the alias data.group to the select and you are using that alias -> data.group in your group by statement This is equivalent to spark-sql>

Spark SQL question

2023-01-27 Thread Kohki Nishio
this SQL works select 1 as *`data.group`* from tbl group by *data.group* Since there's no such field as *data,* I thought the SQL had to look like this select 1 as *`data.group`* from tbl group by `*data.group`* But that gives an error (cannot resolve '`data.group`') ... I'm no expert in

[Spark SQL] Data duplicate or data lost with non-deterministic function

2023-01-14 Thread 李建伟
Hi All, I met one data duplicate issue when writing a table with shuffle data and a non-deterministic function. For example: insert overwrite table target_table partition(ds) select ... from a join b join c... distribute by ds, cast(rand()*10 as int) As rand() is non-deterministic, the order of
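A hedged fix is to make the shuffle key deterministic by deriving it from stable columns instead of rand(); column names below are invented:

```sql
-- rand() may be re-evaluated on task retry, routing the same row to a
-- different partition (duplicate or lost rows); hashing stable columns
-- keeps the routing deterministic across retries.
INSERT OVERWRITE TABLE target_table PARTITION (ds)
SELECT col1, col2, ds
FROM src
DISTRIBUTE BY ds, pmod(hash(col1), 10);   -- key columns illustrative
```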

Re: [Spark SQL]: unpredictable errors: java.io.IOException: can not read class org.apache.parquet.format.PageHeader

2022-12-19 Thread Eric Hanchrow
We’ve discovered a workaround for this; it’s described here: <https://issues.apache.org/jira/browse/HADOOP-18521>. From: Eric Hanchrow Date: Thursday, December 8, 2022 at 17:03 To: user@spark.apache.org Subject: [Spark SQL]: unpredictable errors: java.io.IOException: can not read

[Spark SQL]: unpredictable errors: java.io.IOException: can not read class org.apache.parquet.format.PageHeader

2022-12-08 Thread Eric Hanchrow
My company runs java code that uses Spark to read from, and write to, Azure Blob storage. This code runs more or less 24x7. Recently we've noticed a few failures that leave stack traces in our logs; what they have in common are exceptions that look variously like Caused by:

RE: Re: [Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-22 Thread Patrick Tucci
Thanks. How would I go about formally submitting a feature request for this? On 2022/11/21 23:47:16 Andrew Melo wrote: > I think this is the right place, just a hard question :) As far as I > know, there's no "case insensitive flag", so YMMV > > On Mon, Nov 21, 2022 at 5:40 PM Patrick Tucci

Re: [Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-21 Thread Andrew Melo
I think this is the right place, just a hard question :) As far as I know, there's no "case insensitive flag", so YMMV On Mon, Nov 21, 2022 at 5:40 PM Patrick Tucci wrote: > > Is this the wrong list for this type of question? > > On 2022/11/12 16:34:48 Patrick Tucci wrote: > > Hello, > > > >

RE: [Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-21 Thread Patrick Tucci
Is this the wrong list for this type of question? On 2022/11/12 16:34:48 Patrick Tucci wrote: > Hello, > > Is there a way to set string comparisons to be case-insensitive globally? I > understand LOWER() can be used, but my codebase contains 27k lines of SQL > and many string comparisons. I

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-18 Thread Sean Owen
Taking this off list. Start here: https://github.com/apache/spark/blob/70ec696bce7012b25ed6d8acec5e2f3b3e127f11/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala#L144 Look at subclasses of JdbcDialect too, like TeradataDialect. Note that you are using an old unsupported version
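For context, the base dialect's existence probe is roughly this shape (see the JdbcDialects source linked above; the exact text varies per dialect, so treat this as an approximation):

```sql
-- Approximate form of JdbcDialect.getTableExistsQuery: the query returns
-- no rows and only succeeds or fails, so the cost is one round trip
-- plus parsing on the remote database.
SELECT 1 FROM some_table WHERE 1 = 0
```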

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
It impacts the performance. Can we have any alternate solution for this? Thanks, Rama On Thu, Nov 17, 2022, 10:17 PM Sean Owen wrote:

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
xistence of the table upfront. It is nearly a no-op query; can it have a perf impact? On Thu, Nov 17, 2022 at 10:42 AM Ramakrishna Rayudu <ramakrishna560.ray...@gmail.com> wrote: > Hi Team,

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Ramakrishna Rayudu
On Thu, Nov 17, 2022 at 10:42 AM Ramakrishna Rayudu <ramakrishna560.ray...@gmail.com> wrote: > Hi Team, > I am facing one issue. Can you please help me on this. <https://stackoverflow

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
on this. <https://stackoverflow.com/posts/74477662/timeline> We are connecting Teradata from spark SQL with below API

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
his. <https://stackoverflow.com/posts/74477662/timeline> We are connecting Teradata from spark SQL with below API Dataset jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, connectionPropertie

[Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Ramakrishna Rayudu
Hi Team, I am facing one issue. Can you please help me on this. <https://stackoverflow.com/posts/74477662/timeline> We are connecting Teradata from spark SQL with below API: Dataset jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, connecti

[Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-12 Thread Patrick Tucci
Hello, Is there a way to set string comparisons to be case-insensitive globally? I understand LOWER() can be used, but my codebase contains 27k lines of SQL and many string comparisons. I would need to apply LOWER() to each string literal in the code base. I'd also need to change all the
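Absent a global switch (as the replies in this thread conclude), the standard pattern remains normalizing both sides of each comparison; names below are illustrative. Much newer Spark releases add per-column string collations, which is a separate mechanism:

```sql
-- Without a session-wide case-insensitivity flag, each comparison
-- normalizes explicitly:
SELECT * FROM customers
WHERE LOWER(status) = LOWER('Active');
```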

Re: Why the same INSERT OVERWRITE sql , final table file produced by spark sql is larger than hive sql?

2022-10-12 Thread Chartist
From: Sadha Chilukoori Date: 10/12/2022 08:27 To: Chartist <13289341...@163.com> Subject: Re: Why the same INSERT OVERWRITE sql, final table file produced by spark sql is larger than hive sql? I have faced the same problem, where hive and spark orc were using the

Re: Why the same INSERT OVERWRITE sql , final table file produced by spark sql is larger than hive sql?

2022-10-11 Thread Sadha Chilukoori
arts'='1', 'spark.sql.sources.schema.part.0'='xxx SOME OMITTED CONTENT xxx', 'spark.sql.sources.schema.partCol.0'='pt', 'transient_lastDdlTime'='1653484849') *ENV:* hive version 2.1.1 spark version 2.4.4 *hadoop fs -du -h Result:* *[hive sql]:* *735.2 M /user/

Why the same INSERT OVERWRITE sql , final table file produced by spark sql is larger than hive sql?

2022-10-11 Thread Chartist
’) ENV: hive version 2.1.1 spark version 2.4.4 hadoop fs -du -h Result: [hive sql]: 735.2 M /user/hive/warehouse/mytable/pt=20220518 [spark sql]: 1.1 G /user/hive/warehouse/mytable/pt=20220518 How could this happen? And could this be caused by the different versions of ORC? Any replies
