Sadly, Apache Spark sounds like it has nothing to do with materialised
views. I was hoping it could read them!
>>> spark.sql("SELECT * FROM test.mv").show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/p
quote "one test result is worth one-thousand
expert opinions (Werner <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
On Fri, 3 May 2024 at 00:54, Mich Talebzadeh
wrote:
> An issue I encountered while wor
e. You can file a feature request
and ask the community to include it in the roadmap.
On Fri, May 3, 2024 at 12:01 PM Mich Talebzadeh
wrote:
> An issue I encountered while working with Materialized Views in Spark SQL.
> It appears that there is an inconsistency between the behavior o
There is some work in the Iceberg community to add this support to Spark
through SQL extensions, together with Iceberg support for views and
materialization tables. Some recent discussions can be found here [1], along
with a WIP Iceberg-Spark PR.
[1] https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc
An issue I encountered while working with Materialized Views in Spark SQL.
It appears that there is an inconsistency between the behavior of
Materialized Views in Spark SQL and Hive.
When attempting to execute a statement like DROP MATERIALIZED VIEW IF
EXISTS test.mv in Spark SQL, I encountered
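For what it's worth, a minimal reproduction sketch (the exact message varies
by version; Spark's parser simply has no MATERIALIZED VIEW rule, so a
statement Hive accepts fails to parse in Spark):

# Works in Hive 3.x, where materialized views are a native feature:
#   DROP MATERIALIZED VIEW IF EXISTS test.mv;
# In Spark this typically raises a ParseException, since the grammar
# does not know the MATERIALIZED keyword:
spark.sql("DROP MATERIALIZED VIEW IF EXISTS test.mv")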
In Flink, you can create streaming tables with Flink SQL and connect directly
to CDC and Kafka sources in SQL. How can I do streaming computation with SQL
in Spark?
308027...@qq.com
-sql called
FunctionRegistry that seems to act as an allowlist on what functions Spark
can execute. If I remove a function from the registry, is that enough
guarantee that that function can "never" be invoked in Spark, or are there
other areas that would need to be changed as well?
Thank
Hi all,
I've noticed that Spark's xxhash64 output doesn't match other tools' due to
using seed=42 as a default. I've looked at a few libraries and they use 0
as a default seed:
- python https://github.com/ifduyue/python-xxhash
- java https://github.com/OpenHFT/Zero-Allocation-Hashing/
- java
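If it helps anyone comparing values across tools, a small sketch with
python-xxhash (assumes string input; note Spark returns a signed 64-bit
LongType, so the unsigned digest needs converting):

import xxhash

h = xxhash.xxh64(b"Spark", seed=42).intdigest()  # Spark's default seed is 42
signed = h - (1 << 64) if h >= (1 << 63) else h  # map unsigned to signed 64-bit
# should match: spark.sql("SELECT xxhash64('Spark')").first()[0]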
Hi Mich,
Thanks for the reply.
I did come across that file but it didn't align with the appearance of
`PartitionedFile`:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala
In fact, the code snippet you shared also
interesting. So below should be the corrected code, with the suggestion from
[SPARK-47718] ".sql() does not recognize watermark defined upstream"
<https://issues.apache.org/jira/browse/SPARK-47718> applied:
# Define schema for parsing Kafka messages
schema = Stru
>,
>>> col("parsed_value.temperature").alias("temperature"))
>>> >
>>> > """
>>> > We work out the window and the AVG(temperature) in the
>>> window's
>>> > timeframe below
Hi,
I believe this is the package
https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala
And the code
case class FilePartition(index: Int, files: Array[PartitionedFile])
extends Partition
Hi All,
I've been diving into the source code to get a better understanding of how
file splitting works from a user perspective. I've hit a deadend at
`PartitionedFile`, for which I cannot seem to find a definition? It appears
though it should be found at
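If it helps: in the 3.3-and-earlier source tree, PartitionedFile appears to be
declared inside FileScanRDD.scala in the same datasources package rather than
in a file of its own, which may be the dead end here. From the user
perspective, splitting behaviour can be observed through the file-source
configs (a sketch; the path is a placeholder):

# A smaller target split size => more, smaller input partitions for
# splittable formats such as Parquet:
spark.conf.set("spark.sql.files.maxPartitionBytes", 32 * 1024 * 1024)  # 32 MB
df = spark.read.parquet("/tmp/data")
print(df.rdd.getNumPartitions())  # higher than with the 128 MB default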
("timestamp", "5 minutes"). \
>> > groupBy(window(resultC.timestamp, "5 minutes", "5
>> > minutes")). \
>> > avg('temperature')
>> >
>> > # We take the above DataF
ultMF = resultM. \
> >select( \
> >
> F.col("window.start").alias("startOfWindow") \
> > , F.col("window.end").alias("endOfWindow") \
> > ,
> > F.col("
as a string and used as the key.
> We take all the columns of the DataFrame and serialize them as
> a JSON string, putting the results in the "value" of the record.
> """
> result = resultMF.withColumn("uuid",uui
ure)) AS value") \
.writeStream \
.outputMode('complete') \
.format("kafka") \
.option("kafka.bootstrap.servers",
config['MDVariables']['bootstrapServers'],) \
.option("topic", &qu
Hello!
I am attempting to write a streaming pipeline that would consume data from a
Kafka source, manipulate the data, and then write results to a downstream sink
(Kafka, Redis, etc). I want to write fully formed SQL instead of using the
function API that Spark offers. I read a few guides
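One pattern that appears to cover this: register the streaming DataFrame as a
temp view and run spark.sql over it; the result is itself a streaming
DataFrame that can be handed to writeStream (a sketch; broker and topic names
are placeholders):

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())
events.selectExpr("CAST(value AS STRING) AS value") \
      .createOrReplaceTempView("events")
result = spark.sql("SELECT upper(value) AS value FROM events")  # plain SQL
query = result.writeStream.format("console").start()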
red Streaming in production for almost a year
>>> already and I want to share the bugs I found in this time. I created a test
>>> for each of the issues and put them all here:
>>> https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala
>>>
into three groups: outer joins on event time, interval
joins and Spark SQL.
Issues related to outer joins:
- When joining three or more input streams on event time, if two or more
streams don't contain an event for a join key (which is event time), no row
will be output even if other
Hi,
I’m currently migrating an ETL project to Spark 3.5.0 from 3.2.1 and ran into
an issue with some of our queries that read from PostgreSQL databases.
Any attempt to run a Spark SQL query that selects a bpchar without a length
specifier from the source DB seems to crash
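One possible (untested) workaround sketch: cast on the PostgreSQL side via the
JDBC source's query option, so Spark never sees the length-less bpchar type
(all names and the URL are placeholders):

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")
      .option("query",
              "SELECT id, legacy_code::text AS legacy_code FROM src_table")
      .option("user", "user").option("password", "secret")
      .load())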
Dear friend,
thanks a ton was looking for linting for SQL for a long time, looks like
https://sqlfluff.com/ is something that can be used :)
Thank you so much, and wish you all a wonderful new year.
Regards,
Gourav
On Tue, Dec 26, 2023 at 4:42 AM Bjørn Jørgensen
wrote:
> You can try sqlfl
Worth trying the EXPLAIN statement
<https://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html>
as suggested by @tianlangstudio
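For example (a sketch; the table name is a placeholder):

spark.sql("EXPLAIN EXTENDED SELECT * FROM my_table WHERE id = 1") \
     .show(truncate=False)
# or, on a DataFrame: df.explain("extended")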
HTH
You can try sqlfluff <https://sqlfluff.com/> it's a linter for SQL code and
it seems to have support for sparksql <https://pypi.org/project/sqlfluff/>
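A quick usage sketch, assuming sqlfluff's simple Python API (there is also a
CLI: sqlfluff lint my_query.sql --dialect sparksql):

import sqlfluff  # pip install sqlfluff

violations = sqlfluff.lint("SELECT col FROM tbl\n", dialect="sparksql")
print(violations)  # a list of rule violations with positions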
On Mon, 25 Dec 2023 at 17:13, ram manickam wrote:
> Thanks Mich, Nicholas. I tried looking over the stack overflow post and
or column existence.
>>
>> is not correct. When you call spark.sql(…), Spark will lookup the table
>> references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them.
>>
>> Also, when you run DDL via spark.sql(…), Spark will actually run it. So
>> spark.sql(
What about EXPLAIN?
https://spark.apache.org/docs/3.5.0/sql-ref-syntax-qry-explain.html#content
Fusion Zhu
ces and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them.
Also, when you run DDL via spark.sql(…), Spark will actually run it. So
spark.sql(“drop table my_table”) will actually drop my_table. It’s not a
validation-only operation.
This question of validating SQL is already discussed on St
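If only syntax checking is wanted, a parse-only sketch via an internal,
version-dependent API (not a public contract):

def is_parseable(spark, sql):
    """Syntax check only -- does not validate tables or columns."""
    try:
        spark._jsparkSession.sessionState().sqlParser().parsePlan(sql)
        return True
    except Exception:
        return False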
Hi,
I am seeking advice on measuring the performance of each QueryStage (QS) when
AQE is enabled in Spark SQL. Specifically, I need help to automatically map a
QS to its corresponding jobs (or stages) to get the QS runtime metrics.
I recorded the QS structure via a customized injected Query
Hi, all
The ANALYZE TABLE command is run from Spark on a Hive table.
Question:
Before I ran the ANALYZE TABLE command on the Spark SQL client, I ran ANALYZE
TABLE on the Hive client, and the wrong statistics info showed up.
For example:
1. run the analyze table command on the hive client
- create table
>
> On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera,
> wrote:
>
>> I have upgraded my spark job from spark 3.3.1 to spark 3.5.0, I am
>> querying to Mysql Database and applying
>>
>> `UPPER(col) = UPPER(value)` in the subsequent sql query. It is working
>>
Hi all,
Wondering if anyone has run into this as I can't find any similar issues in
JIRA, mailing list archives, Stack Overflow, etc. I had a query that was
running successfully, but the query planning time was extremely long (4+
hours). To fix this I added `checkpoint()` calls earlier in the
ark 3.3.1 to spark 3.5.0, I am
> querying to Mysql Database and applying
>
> `UPPER(col) = UPPER(value)` in the subsequent sql query. It is working
> as expected in spark 3.3.1 , but not working with 3.5.0.
>
> Where Condition :: `UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR
I have upgraded my spark job from Spark 3.3.1 to Spark 3.5.0. I am querying a
MySQL database and applying
`UPPER(col) = UPPER(value)` in the subsequent sql query. It is working as
expected in Spark 3.3.1, but not working with 3.5.0.
Where Condition :: `UPPER(vn) = 'ERICSSON' AND (upper(st
t;> server, which I launch like so:
>>>
>>> ~/spark/sbin/start-thriftserver.sh --master spark://10.0.50.1:7077
>>>
>>> The cluster runs in standalone mode and does not use Yarn for resource
>>> management. As a result, the Spark Thrift server acquir
e only application that runs on the cluster is the Spark Thrift server,
>> which I launch like so:
>>
>> ~/spark/sbin/start-thriftserver.sh --master spark://10.0.50.1:7077
>>
>> The cluster runs in standalone mode and does not use Yarn for resource
>> manageme
s is okay; as of right now, I am the
> only user of the cluster. If I add more users, they will also be SQL users,
> submitting queries through the Thrift server.
>
> Let me know if you have any other questions or thoughts.
>
> Thanks,
>
> Patrick
>
> On Thu, Au
acquires all available
cluster resources when it starts. This is okay; as of right now, I am the
only user of the cluster. If I add more users, they will also be SQL users,
submitting queries through the Thrift server.
Let me know if you have any other questions or thoughts.
Thanks,
Patrick
On Thu
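For reference, a sketch of capping what the Thrift server claims at startup;
these are ordinary spark-submit flags passed through by the launch script
(values are placeholders):

~/spark/sbin/start-thriftserver.sh --master spark://10.0.50.1:7077 \
  --total-executor-cores 16 \
  --executor-memory 8g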
, although I couldn't
>>>>> figure out how to get it to use the metastore_db from Spark.
>>>>>
>>>>> After turning my attention back to Spark, I determined the issue.
>>>>> After much troubleshooting, I discovered that if I performed a C
er
>>>> much troubleshooting, I discovered that if I performed a COUNT(*) using
>>>> the same JOINs, the problem query worked. I removed all the columns from
>>>> the SELECT statement and added them one by one until I found the culprit.
>>>> It's
mpletes. If I
>>> remove all explicit references to this column, the query works fine. Since
>>> I need this column in the results, I went back to the ETL and extracted the
>>> values to a dimension table. I replaced the text column in the source table
>>> with a
On the topic of Hive, does anyone have any detailed resources for how to
>> set up Hive from scratch? Aside from the official site, since those
>> instructions didn't work for me. I'm starting to feel uneasy about building
>> my process around Spark. There really shouldn't be a
> my process around Spark. There really shouldn't be any instances where I
> ask Spark to run legal ANSI SQL code and it just does nothing. In the past
> 4 days I've run into 2 of these instances, and the solution was more voodoo
> and magic than examining errors/logs and fixing cod
uneasy about building
my process around Spark. There really shouldn't be any instances where I
ask Spark to run legal ANSI SQL code and it just does nothing. In the past
4 days I've run into 2 of these instances, and the solution was more voodoo
and magic than examining errors/logs and fixing code. I
>> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci
>> wrote:
>>
>>> Hi Mich,
>>>
>>> Thanks for the feedback. My orig
> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci
> wrote:
>
>> Hi Mich,
>>
>> Thanks for the feedback. My original intention after reading your
>> response was to stick to Hive for managing tables. Unfortunately, I'm
>> running into another
On Sat, 12 Aug 2023 at 12:03, Patrick Tucci wrote:
> Hi Mich,
>
> Thanks for the feedback. My original intention after reading your response
> was to stick to Hive for managing tables. Unfortunately, I'm running into
> another case of SQL scripts hanging. Since all table
Hi Mich,
Thanks for the feedback. My original intention after reading your response
was to stick to Hive for managing tables. Unfortunately, I'm running into
another case of SQL scripts hanging. Since all tables are already Parquet,
I'm out of troubleshooting options. I'm going to migrate
connect to Spark through Thrift server and have it write tables
> using Delta Lake instead of Hive. From this StackOverflow question, it
> looks like this is possible:
> https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc
, so I need to be
able to connect to Spark through Thrift server and have it write tables
using Delta Lake instead of Hive. From this StackOverflow question, it
looks like this is possible:
https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect
Steve may have a valid point. You raised an issue with concurrent writes
before, if I recall correctly; this limitation may be due to the Hive
metastore. By default, Spark uses Apache Derby for its database
persistence. However,
it is limited to only one Spark session at any time for the purposes
Hi Patrick,
When this has happened to me in the past (admittedly via spark-submit) it has
been because another job was still running and had already claimed some of the
resources (cores and memory).
I think this can also happen if your configuration tries to claim resources
that will never be
and sounds like there is no
>> password!
>>
>> Once inside that host, hive logs are kept in your case
>> /tmp/hadoop/hive.log or go to /tmp and do
>>
>> /tmp> find ./ -name hive.log. It should be under /tmp/hive.log
>>
>> Try running the s
n your case and sounds like there is no
> password!
>
> Once inside that host, hive logs are kept in your case
> /tmp/hadoop/hive.log or go to /tmp and do
>
> /tmp> find ./ -name hive.log. It should be under /tmp/hive.log
>
> Try running the sql inside hive and see what it says
>
>
are kept in your case /tmp/hadoop/hive.log
or go to /tmp and do
/tmp> find ./ -name hive.log. It should be under /tmp/hive.log
Try running the sql inside hive and see what it says
HTH
hadoop -f command.sql
Thanks again for your help.
Patrick
On Thu, Aug 10, 2023 at 2:24 PM Mich Talebzadeh
wrote:
> Can you run this sql query through hive itself?
>
> Are you using this command or similar for your thrift server?
>
> beeline -u jdbc:hive2:/
Can you run this sql query through hive itself?
Are you using this command or similar for your thrift server?
beeline -u jdbc:hive2:///1/default -d
org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx
HTH
Hello,
I'm attempting to run a query on Spark 3.4.0 through the Spark
ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
standalone mode using HDFS for storage.
The query is as follows:
SELECT ME.*, MB.BenefitID
FROM MemberEnrollment ME
JOIN MemberBenefits MB
ON ME.ID =
y utilizing an open table format with concurrency control. Several
>> formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast
>> Format, offer this capability. All of them provide advanced features that
>> will work better in different use cases according to the
023 at 4:28 PM Mich Talebzadeh
> wrote:
>
>> It is not Spark SQL that throws the error. It is the underlying Database
>> or layer that throws the error.
>>
>> Spark acts as an ETL tool. What is the underlying DB where the table
>> resides? Is concurrency supported. Pleas
that
will work better in different use cases according to the writing pattern,
type of queries, data characteristics, etc.
*Pol Santamaria*
On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh
wrote:
> It is not Spark SQL that throws the error. It is the underlying Database
> or layer that
It is not Spark SQL that throws the error. It is the underlying Database or
layer that throws the error.
Spark acts as an ETL tool. What is the underlying DB where the table
resides? Is concurrency supported? Please send the error to this list
HTH
Hello,
I'm building an application on Spark SQL. The cluster is set up in
standalone mode with HDFS as storage. The only Spark application running is
the Spark Thrift Server using FAIR scheduling mode. Queries are submitted
to Thrift Server using beeline.
I have multiple queries that insert rows
e.
>
> I have been exploring the capabilities of Spark SQL and Databricks, and I
> have encountered a challenge related to accessing the data objects used by
> queries from the query history. I am aware that Databricks provides a
> comprehensive query history that contains valuable inf
exploring the capabilities of Spark SQL and Databricks, and I have
encountered a challenge related to accessing the data objects used by queries
from the query history. I am aware that Databricks provides a comprehensive
query history that contains valuable information about executed queries.
However
Hi,
We have upgraded Spark from 2.4.x to 3.3.1 recently and managed table
creation while writing dataframe as saveAsTable failed with below error.
Can not create the managed table(``) The associated
location('hdfs:') already exists.
At a high level, our code does the below before writing the dataframe as
he process to
> go faster.
>
> Patrick
>
> On Mon, Jun 26, 2023 at 12:52 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> OK for now have you analyzed statistics in Hive external table
>>
>> spark-sql (default)> ANALYZE TABLE test.stg_t
:
> OK for now have you analyzed statistics in Hive external table
>
> spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL
> COLUMNS;
> spark-sql (default)> DESC EXTENDED test.stg_t2;
>
> Hive external tables have little optimization
>
> HTH
>
OK for now have you analyzed statistics in Hive external table
spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL
COLUMNS;
spark-sql (default)> DESC EXTENDED test.stg_t2;
Hive external tables have little optimization
HTH
Hello,
I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master node
has 2 cores and 8GB of RAM. There is a single worker node with 8 cores and
64GB of RAM.
I'm trying to process a large pipe delimited file that has been compressed
with gzip (9.2GB zipped, ~58GB unzipped, ~241m
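Worth noting for this setup: gzip is not a splittable codec, so Spark reads
the whole .gz file in a single task regardless of cluster size. A common
workaround sketch (path and partition count are placeholders):

df = spark.read.option("sep", "|").csv("/data/big_file.gz")
df = df.repartition(64)  # fan out after the single-task read so all cores work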
eason, I can ONLY do this in Spark SQL, instead of either Scala or
> PySpark environment.
>
> I want to aggregate an array into a Map of element count, within that array,
> but in Spark SQL.
> I know that there is an aggregate function available like
>
> aggregate(expr, start,
acc -> acc) AS feq_cnt
Here are my questions:
* Is using "map()" above the best way? The "start" structure in this case
should be Map.empty[String, Int], but of course, it won't work in pure Spark
SQL, so the best solution I can think of is "map()", and it is
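For the record, one pure-SQL shape that seems to work on Spark 3.x: cast an
empty map() to pin the accumulator type, and drop/re-insert the key so
map_concat never sees a duplicate (a sketch; t and arr are placeholders):

spark.sql("""
  SELECT aggregate(
           arr,
           CAST(map() AS MAP<STRING, INT>),        -- typed empty start value
           (acc, x) -> map_concat(
             map_filter(acc, (k, v) -> k != x),    -- drop any existing entry
             map(x, coalesce(acc[x], 0) + 1)),     -- re-insert with count + 1
           acc -> acc) AS freq_cnt
  FROM t
""").show(truncate=False)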
you can create DF from your SQL RS and work with that in Python the way you
want
## you don't need all these
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf, col
Hi, This is on Spark 3.1 environment.
For some reason, I can ONLY do this in Spark SQL, instead of either Scala or
PySpark environment.
I want to aggregate an array into a Map of element count, within that array,
but in Spark SQL.
I know that there is an aggregate function available like
e gene is located upstream or downstream of the variant.
>>>>
>>>>
>>>>
>>>> On Thu, 23 Feb 2023 at 20:48, Russell Jurney <
>>>> russell.jur...@gmail.com> wrote:
>>>>
>>>> Usually, the solution to these problems is to do less per line, break
>>>> it out and perform each minute operation as a field, then combine those
>>>> into a final answer. Can you do that here?
>>>>
>>>> Thanks,
>>>> Russell Jurney
@broadinstitute.org> wrote:
>>
>>> Here is the complete error:
>>>
>>> ```
>>> Traceback (most recent call last):
>>> File "nearest-gene.py", line 74, in <module>
>>> main()
>>> File "nearest-gene.py
File "nearest-gene.py", line 62, in main
>> distances = joined.withColumn("distance", max(col("start") -
>> col("position"), col("position") - col("end"), 0))
>> File
>> "/mnt/yarn/usercache/hadoop/appcache/applicat
quot;position") - col("end"), 0))
> File
> "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_01/pyspark.zip/pyspark/sql/column.py",
> line 907, in __nonzero__
> ValueError: Cannot convert column into bool
ol("position") - col("end"), 0))
File
"/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_01/pyspark.zip/pyspark/sql/column.py",
line 907, in __nonzero__
ValueError: Cannot convert column into bool: please use '&
That error sounds like it's from pandas not spark. Are you sure it's this
line?
On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:
>
> Hello,
>
> I'm trying to calculate the distance between a gene (with start and end)
> and a variant (with position),
Hello,
I'm trying to calculate the distance between a gene (with start and end)
and a variant (with position), so I joined gene and variant data by
chromosome and then tried to calculate the distance like this:
```
distances = joined.withColumn("distance", max(col("start") -
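The traceback in this thread comes from Python's builtin max() being applied
to Columns; the element-wise counterpart in Spark is greatest() (a sketch):

from pyspark.sql.functions import col, greatest, lit

distances = joined.withColumn(
    "distance",
    greatest(col("start") - col("position"),
             col("position") - col("end"),
             lit(0)))  # builtin max() tries bool conversion on Columns and fails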
I need to use the cast function to surround the computed expression; then
EXPLAIN on the SQL is ok, for example:
cast(a.Split_Amt * b.percent / 100 as decimal(20,8)) as split_amt
I don't know why. Is there a config property that could restore compatibility
with Spark 3.2?
At 2023-02-16 13:47:25
hi, all:
I have a sql statement which can be run on Spark 3.2.1 but not on Spark 3.3.1.
When I try to explain it, I get an error with the message:
org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to
org.apache.spark.sql.catalyst.expressions.AnsiCast
When I execute the sql, the error stack
-- Forwarded message -
From: Jeevan Chhajed
Date: Tue, 7 Feb 2023, 15:16
Subject: [Spark SQL] : Delete is only supported on V2 tables.
To:
Hi,
How do we create V2 tables? I tried a couple of things using sql but was
unable to do so.
Can you share links/content
Hi Team,
I am running a query in Spark 3.2.
val df1 =
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4",
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
Hi,
How do we create V2 tables? I tried a couple of things using sql but was
unable to do so.
Can you share links/content? It will be of much help.
Is delete support on V2 tables still under development?
Thanks,
Jeevan
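A sketch of one way to get a v2 table with DELETE support, assuming the
Iceberg Spark runtime jar is on the classpath and a catalog named demo is
configured (all names are placeholders; Delta Lake and Hudi offer similar
paths):

# assumes e.g. spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog
spark.sql("CREATE TABLE demo.db.events (id BIGINT, data STRING) USING iceberg")
spark.sql("DELETE FROM demo.db.events WHERE id = 1")  # row-level delete works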
show()
spark.sql("SELECT `an.id` FROM ids_with_struct GROUP BY `an.id`").show()
This does not feel very consistent.
Enrico
On 28.01.23 at 00:34, Kohki Nishio wrote:
this SQL works
select 1 as `data.group` from tbl group by data.group
Since there's no such field as data, I tho
Thank you very much.
I understand the performance implications and that Spark will download it
before modifying.
The JDBC database is just extremely small, it’s the BI/aggregated layer.
What’s interesting is that here it says I can use JDBC
https://spark.apache.org/docs/3.3.1/sql-ref-syntax
you may be able to do so in Python or Scala, but I don't know
the way in pure SQL.
if your JDBC database is Hive you can do so easily
HTH
Generally, the problem is that I can't find a way to automatically create a
table in the JDBC database when I want to insert data into it using Spark
SQL only, not the DataFrames API.
> On 2 Feb 2023, at 21:22, Harut Martirosyan
> wrote:
>
> Hi, thanks for the reply.
>
>
Hi, thanks for the reply.
Let’s imagine we have a parquet based table called parquet_table, now I want to
insert it into a new JDBC table, all using pure SQL.
If the JDBC table already exists, it’s easy, we do CREATE TABLE USING JDBC and
then we do INSERT INTO that table.
If the table doesn’t
Hi,
Your statement below is not very clear:
".. If the table existed, I would create a table using JDBC in spark SQL
and then insert into it, but I can't create a table if it doesn't exist in
JDBC database..."
If the table exists in your JDBC database, why do you need to create it
I have a resultset (defined in SQL), and I want to insert it into my JDBC
database using only SQL, not dataframes API.
If the table existed, I would create a table using JDBC in spark SQL and
then insert into it, but I can't create a table if it doesn't exist in JDBC
database.
How to do
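As the rest of the thread suggests, vanilla Spark has no pure-SQL statement
that creates the remote table for you, but the DataFrame writer creates it on
first write (a sketch; URL and credentials are placeholders):

(spark.table("parquet_table")
    .write.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")
    .option("dbtable", "public.target_table")
    .option("user", "user").option("password", "secret")
    .mode("errorifexists")  # default mode; creates the table if absent
    .save())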