
Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Mich Talebzadeh
OK, so as expected the underlying database is Hive, and Hive uses HDFS storage.

You said you encountered limitations on concurrent writes. Those
limitations are introduced by the Hive metastore. Since this is all
happening through Spark: by default, the Hive metastore implementation
in Apache Spark uses Apache Derby for its database persistence. Derby is
available with no configuration required, as in your case. *However, it
is limited to a single Spark session at any time for the purposes of
metadata storage.* That obviously makes it unsuitable for
multi-concurrency situations, as you observed. For an industrial-strength
backend Hive metastore, you should consider a multi-user, ACID-compliant
relational database product to host the metastore. Any current RDBMS
should do; I use Oracle 12c myself, and others use MySQL or PostgreSQL
for this purpose.
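As a minimal sketch of what switching off embedded Derby looks like, the
standard Hive metastore connection properties can point at an external
PostgreSQL database via a hive-site.xml on Spark's classpath. The host,
database name, and credentials below are placeholders, and the PostgreSQL
JDBC driver must also be available to Spark:

```xml
<!-- hive-site.xml, e.g. in $SPARK_HOME/conf -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://metastore-host:5432/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>changeme</value>
  </property>
</configuration>
```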

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


view my Linkedin profile

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Patrick Tucci
Hi Mich and Pol,

Thanks for the feedback. The database layer is Hadoop 3.3.5. The cluster
restarted so I lost the stack trace in the application UI. In the snippets
I saved, it looks like the exception being thrown was from Hive. Given the
feedback you've provided, I suspect the issue is with how the Hive
components are handling concurrent writes.

While using a different format would likely help with this issue, I think I
have found an easier solution for now. Currently I have many individual
scripts that perform logic and insert the results separately. Instead of
each script performing an insert, each script can instead create a view.
After the views are created, one single script can perform one single
INSERT, combining the views with UNION ALL statements.

-- Old logic --
-- Script 1
INSERT INTO EventClaims
/*Long, complicated query 1*/

-- Script N
INSERT INTO EventClaims
/*Long, complicated query N*/

-- New logic --
-- Script 1
CREATE VIEW Q1 AS
/*Long, complicated query 1*/

-- Script N
CREATE VIEW QN AS
/*Long, complicated query N*/

-- Final script --
INSERT INTO EventClaims
SELECT * FROM Q1 UNION ALL
SELECT * FROM QN

The old approach had almost two dozen stages with relatively few tasks
each. The new approach requires only three stages. With fewer stages and
more tasks per stage, cluster utilization is much higher.

Thanks again for your feedback. I suspect better concurrent writes will be
valuable for my project in the future, so this is good information to have
ready.

Thanks,

Patrick


Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Pol Santamaria
Hi Patrick,

You can have multiple writers simultaneously writing to the same table in
HDFS by utilizing an open table format with concurrency control. Several
formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast
Format, offer this capability. All of them provide advanced features that
will work better in different use cases according to the writing pattern,
type of queries, data characteristics, etc.
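As an illustration (not from the thread), a Delta Lake table created
through Spark SQL looks much like an ordinary table; the table name,
schema, and path below are hypothetical, and the delta-spark package plus
its catalog/extension settings must be configured on the cluster:

```sql
-- Illustrative only: requires Delta Lake on the Spark classpath.
CREATE TABLE EventClaimsDelta (
  claim_id   BIGINT,
  event_date DATE,
  amount     DECIMAL(18, 2)
) USING delta
LOCATION 'hdfs://namenode:8020/user/spark/warehouse/eventclaims_delta';

-- Concurrent INSERT ... SELECT statements into a Delta table are
-- coordinated by the format's optimistic concurrency control rather
-- than by the Hive metastore's single-writer behavior.
INSERT INTO EventClaimsDelta
SELECT claim_id, event_date, amount FROM Q1;
```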

*Pol Santamaria*


On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh 
wrote:

> It is not Spark SQL that throws the error; it is the underlying database
> or storage layer that throws it.
>
> Spark acts as an ETL tool. What is the underlying DB where the table
> resides? Is concurrency supported? Please send the error to this list.
>
> HTH
>
> Mich Talebzadeh
>
> On Sat, 29 Jul 2023 at 12:02, Patrick Tucci 
> wrote:
>
>> Hello,
>>
>> I'm building an application on Spark SQL. The cluster is set up in
>> standalone mode with HDFS as storage. The only Spark application running is
>> the Spark Thrift Server using FAIR scheduling mode. Queries are submitted
>> to Thrift Server using beeline.
>>
>> I have multiple queries that insert rows into the same table
>> (EventClaims). These queries work fine when run sequentially, however, some
>> individual queries don't fully utilize the resources available on the
>> cluster. I would like to run all of these queries concurrently to more
>> fully utilize available resources. When I attempt to do this, tasks
>> eventually begin to fail. The stack trace is pretty long, but here's what
>> looks like the most relevant parts:
>>
>>
>> org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:788)
>>
>> org.apache.hive.service.cli.HiveSQLException: Error running query:
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 28
>> in stage 128.0 failed 4 times, most recent failure: Lost task 28.3 in stage
>> 128.0 (TID 6578) (10.0.50.2 executor 0): org.apache.spark.SparkException:
>> [TASK_WRITE_FAILED] Task failed while writing rows to
>> hdfs://10.0.50.1:8020/user/spark/warehouse/eventclaims.
>>
>> Is it possible to have multiple concurrent writers to the same table with
>> Spark SQL? Is there any way to make this work?
>>
>> Thanks for the help.
>>
>> Patrick
>>
>

