OK, so as expected the underlying database is Hive, and Hive uses HDFS for storage.

You said you encountered limitations on concurrent writes. Those limitations
are introduced by the Hive metastore. Since this is all happening through
Spark, the default implementation of the Hive metastore
<https://hive.apache.org/#:~:text=The%20Hive%20Metastore%20(HMS)%20is,using%20the%20metastore%20service%20API.>
in Apache Spark uses Apache Derby for its database persistence. It is
available with no configuration required, as in your case. *However, it is
limited to only one Spark session at any time for the purposes of metadata
storage.* That makes it unsuitable for multi-concurrency situations, as you
observed. For an industrial-strength backend Hive metastore, you should host
the metastore on a multi-user, ACID-compliant relational database; any
current RDBMS will do. I use Oracle 12c myself, and others use MySQL or
PostgreSQL for this purpose.
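
As a minimal sketch (the JDBC property names are the standard Hive metastore
ones; the PostgreSQL host, database name and credentials below are
placeholders you would replace with your own), a hive-site.xml dropped into
$SPARK_HOME/conf along these lines points the metastore at an external
PostgreSQL instance instead of the embedded Derby database:

<configuration>
  <!-- JDBC connection to the external metastore database -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://metastore-host:5432/hive_metastore</value>
  </property>
  <!-- JDBC driver for that database -->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <!-- Credentials for the metastore schema owner -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>changeme</value>
  </property>
</configuration>

You would also need the matching JDBC driver jar on the Spark classpath and
the metastore schema created up front (Hive's schematool can initialise it).
Once the Thrift Server and any other Spark sessions share that external
metastore, the one-session-at-a-time Derby restriction no longer applies.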

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 30 Jul 2023 at 13:46, Patrick Tucci <patrick.tu...@gmail.com> wrote:

> Hi Mich and Pol,
>
> Thanks for the feedback. The database layer is Hadoop 3.3.5. The cluster
> restarted so I lost the stack trace in the application UI. In the snippets
> I saved, it looks like the exception being thrown was from Hive. Given the
> feedback you've provided, I suspect the issue is with how the Hive
> components are handling concurrent writes.
>
> While using a different format would likely help with this issue, I think
> I have found an easier solution for now. Currently I have many individual
> scripts that perform logic and insert the results separately. Instead of
> each script performing an insert, each script can instead create a view.
> After the views are created, one single script can perform one single
> INSERT, combining the views with UNION ALL statements.
>
> -- Old logic --
> -- Script 1
> INSERT INTO EventClaims
> /*Long, complicated query 1*/
>
> -- Script N
> INSERT INTO EventClaims
> /*Long, complicated query N*/
>
> -- New logic --
> -- Script 1
> CREATE VIEW Q1 AS
> /*Long, complicated query 1*/
>
> -- Script N
> CREATE VIEW QN AS
> /*Long, complicated query N*/
>
> -- Final script --
> INSERT INTO EventClaims
> SELECT * FROM Q1 UNION ALL
> SELECT * FROM QN
>
> The old approach had almost two dozen stages with relatively fewer tasks.
> The new approach requires only 3 stages. With fewer stages and more tasks,
> cluster utilization is much higher.
>
> Thanks again for your feedback. I suspect better concurrent writes will be
> valuable for my project in the future, so this is good information to have
> ready.
>
> Thanks,
>
> Patrick
>
> On Sun, Jul 30, 2023 at 5:30 AM Pol Santamaria <p...@qbeast.io> wrote:
>
>> Hi Patrick,
>>
>> You can have multiple writers simultaneously writing to the same table in
>> HDFS by utilizing an open table format with concurrency control. Several
>> formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast
>> Format, offer this capability. All of them provide advanced features that
>> will work better in different use cases according to the writing pattern,
>> type of queries, data characteristics, etc.
>>
>> *Pol Santamaria*
>>
>>
>> On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> It is not Spark SQL that throws the error. It is the underlying Database
>>> or layer that throws the error.
>>>
>>> Spark acts as an ETL tool.  What is the underlying DB  where the table
>>> resides? Is concurrency supported? Please send the error to this list.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 29 Jul 2023 at 12:02, Patrick Tucci <patrick.tu...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm building an application on Spark SQL. The cluster is set up in
>>>> standalone mode with HDFS as storage. The only Spark application running is
>>>> the Spark Thrift Server using FAIR scheduling mode. Queries are submitted
>>>> to Thrift Server using beeline.
>>>>
>>>> I have multiple queries that insert rows into the same table
>>>> (EventClaims). These queries work fine when run sequentially, however, some
>>>> individual queries don't fully utilize the resources available on the
>>>> cluster. I would like to run all of these queries concurrently to more
>>>> fully utilize available resources. When I attempt to do this, tasks
>>>> eventually begin to fail. The stack trace is pretty long, but here's what
>>>> looks like the most relevant parts:
>>>>
>>>>
>>>> org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:788)
>>>>
>>>> org.apache.hive.service.cli.HiveSQLException: Error running query:
>>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 28
>>>> in stage 128.0 failed 4 times, most recent failure: Lost task 28.3 in stage
>>>> 128.0 (TID 6578) (10.0.50.2 executor 0): org.apache.spark.SparkException:
>>>> [TASK_WRITE_FAILED] Task failed while writing rows to hdfs://
>>>> 10.0.50.1:8020/user/spark/warehouse/eventclaims.
>>>>
>>>> Is it possible to have multiple concurrent writers to the same table
>>>> with Spark SQL? Is there any way to make this work?
>>>>
>>>> Thanks for the help.
>>>>
>>>> Patrick
>>>>
>>>
