unsubscribe

2023-09-18 Thread Ghazi Naceur
unsubscribe


Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-18 Thread Jerry Peng
Hi Craig,

Thank you for sending us more information. Can you answer my previous
question, which I don't think the document addresses: how did you determine
that there were duplicates in the output, and how was the output data read?
The FileStreamSink provides exactly-once writes ONLY if you read the output
with the FileStreamSource or the FileSource (batch). A log is used to
determine which data is committed, and those sources know how to use that
log to read the data "exactly once". So there may be duplicated data written
on disk, and if you simply read the data files written to disk you may see
duplicates when there are failures. However, if you read the output location
with Spark you should get exactly-once results (unless there is a bug),
since Spark will know how to use the commit log to see which data files are
committed and which are not.
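
For illustration, a minimal PySpark sketch of that difference; the output
path, the parquet format, and the part-file pattern are assumptions, not
taken from your set-up:

    import glob
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("check-sink-output").getOrCreate()

    # Reading the sink directory with Spark consults the _spark_metadata commit
    # log, so only files from committed micro-batches are returned.
    committed = spark.read.parquet("/tmp/stream-output")  # hypothetical sink path

    # Globbing the part files directly bypasses the commit log, so files left
    # behind by failed or retried micro-batches show up as extra rows.
    raw = spark.read.parquet(*glob.glob("/tmp/stream-output/part-*.parquet"))

    print(committed.count(), raw.count())  # raw can be larger after failures

Consistent with the above, extra part files on disk after a failure are
expected; duplicates in the committed view read back through Spark would
indicate a genuine bug.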

Best,

Jerry

On Mon, Sep 18, 2023 at 1:18 PM Craig Alfieri 
wrote:

> Hi Russell/Jerry/Mich,
>
>
>
> Appreciate your patience on this.
>
>
>
> Attached are more details on how this duplication “error” was found.
>
> Since we’re still unsure, I am using “error” in quotes.
>
>
>
> We’d love the opportunity to work with any of you directly and/or the
> wider Spark community to triage this or get a better understanding of the
> nature of what we’re experiencing.
>
>
>
> Our platform provides the ability to fully reproduce this.
>
>
>
> Once you have had the chance to review the attached draft, let us know if
> there are any questions in the meantime. Again, we welcome the opportunity
> to work with the teams on this.
>
>
>
> Best-
>
> Craig
>
>
>
>
>
>
>
> *From: *Craig Alfieri 
> *Date: *Thursday, September 14, 2023 at 8:45 PM
> *To: *russell.spit...@gmail.com 
> *Cc: *Jerry Peng , Mich Talebzadeh <
> mich.talebza...@gmail.com>, user@spark.apache.org ,
> connor.mc...@antithesis.com 
> *Subject: *Re: Data Duplication Bug Found - Structured Streaming Versions
> 3.4.1, 3.2.4, and 3.3.2
>
> Hi Russell et al,
>
>
>
> Acknowledging receipt; we’ll get these answers back to the group.
>
>
>
> Follow-up forthcoming.
>
>
>
> Craig
>
>
>
>
>
>
>
> On Sep 14, 2023, at 6:38 PM, russell.spit...@gmail.com wrote:
>
> Exactly-once should be output-sink dependent; what sink was being used?
>
> Sent from my iPhone
>
>
>
> On Sep 14, 2023, at 4:52 PM, Jerry Peng 
> wrote:
>
> 
>
> Craig,
>
>
>
> Thanks! Please let us know the result!
>
>
>
> Best,
>
>
>
> Jerry
>
>
>
> On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
>
> Hi Craig,
>
>
>
> Can you please clarify what this bug is and provide sample code causing
> this issue?
>
>
>
> HTH
>
>
> Mich Talebzadeh,
>
> Distinguished Technologist, Solutions Architect & Engineer
>
> London
>
> United Kingdom
>
>
>
>  view my Linkedin profile
> 
>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Thu, 14 Sept 2023 at 17:48, Craig Alfieri 
> wrote:
>
> Hello Spark Community-
>
>
>
> As part of a research effort, our team here at Antithesis tests for
> correctness/fault tolerance of major OSS projects.
>
> Our team was recently testing Spark’s Structured Streaming, and we came
> across a data duplication bug that we’d like to work with the teams on to
> resolve.
>
>
>
> Our intention is to utilize this as a future case study for our platform,
> but prior to doing so we would like to have a resolution in place so that an
> announcement isn’t alarming to the user base.
>
>
>
> Attached is a high-level .pdf that reviews the High Availability set-up
> put under test.
>
> This was also tested across the three latest versions, and the same
> behavior was observed.
>
>
>
> We can reproduce this error readily since our environment is fully
> deterministic; we are just not Spark experts and would like to work with
> someone in the community to resolve this.
>
>
>
> Please let us know at your earliest convenience.
>
>
>
> Best
>
>
>
>
> *Craig Alfieri*
>
> c: 917.841.1652
>
> craig.alfi...@antithesis.com
>
> New York, NY.
>
> Antithesis.com
> 
>
>
>
> We can't talk about most of the bugs that we've found for our customers,
>
> but some customers like to speak about their work with us:
>
> https://github.com/mongodb/mongo/wiki/Testing-MongoDB-with-Antithesis
>
>
>
>
>
>
>
> 

Re: getting emails in different order!

2023-09-18 Thread Mich Talebzadeh
OK, thanks Sean. Not a big issue for me. It normally happens in the morning,
GMT/London time. I see the email trail but not the thread owner's email
first; normally the responses arrive first.


Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 18 Sept 2023 at 17:16, Sean Owen  wrote:

> I have seen this, and I'm not sure if it's just the ASF mailer being weird
> or, more likely, because emails are moderated and we inadvertently moderate
> them out of order.
>
> On Mon, Sep 18, 2023 at 10:59 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> I use gmail to receive spark user group emails.
>>
>> On occasions, I get the latest emails first and later in the day I
>> receive the original email.
>>
>> Has anyone else seen this behaviour recently?
>>
>> Thanks
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Re: getting emails in different order!

2023-09-18 Thread Sean Owen
I have seen this, and I'm not sure if it's just the ASF mailer being weird
or, more likely, because emails are moderated and we inadvertently moderate
them out of order.

On Mon, Sep 18, 2023 at 10:59 AM Mich Talebzadeh 
wrote:

> Hi,
>
> I use gmail to receive spark user group emails.
>
> On occasions, I get the latest emails first and later in the day I receive
> the original email.
>
> Has anyone else seen this behaviour recently?
>
> Thanks
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Spark stand-alone mode

2023-09-18 Thread Ilango
Thanks all for your suggestions. Noted with thanks.
Just wanted to share a few more details about the environment:
1. We use NFS for data storage and the data is in parquet format.
2. All HPC nodes are connected and already work as a cluster for Studio
workbench. I can set up passwordless SSH if it does not exist already.
3. We will stick with NFS and standalone mode for now, then maybe explore
HDFS and YARN.

Can you please confirm whether multiple users can run Spark jobs at the
same time?
If so, I will start working on it and let you know how it goes.
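
For reference, a minimal PySpark sketch of a job submitted to a standalone
master with a per-application core cap, which is what lets several users run
jobs on the cluster at the same time; the master URL, resource numbers, and
NFS path below are assumptions:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("standalone-smoke-test")
        .master("spark://hpc-node1:7077")   # assumed standalone master host:port
        .config("spark.executor.cores", "8")
        .config("spark.executor.memory", "60g")
        .config("spark.cores.max", "16")    # cap this app so other users' jobs still get cores
        .getOrCreate()
    )

    # Parquet on NFS works as long as every node mounts the share at the same path.
    df = spark.read.parquet("/nfs/data/example_table")  # hypothetical path
    df.groupBy("some_column").count().show()

Without spark.cores.max (or spark.deploy.defaultCores on the master), the
first standalone application takes all available cores and later submissions
queue until it finishes.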

Mich, the link to Hadoop is not working. Can you please check and let me
know the correct link? I would like to explore the Hadoop option as well.



Thanks,
Elango

On Sat, Sep 16, 2023, 4:20 AM Bjørn Jørgensen 
wrote:

> You need to set up SSH without a password; use a key instead. See: How to
> connect without a password using SSH (passwordless)
> 
>
> fre. 15. sep. 2023 kl. 20:55 skrev Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>> Hi,
>>
>> Can these 4 nodes talk to each other through ssh as trusted hosts (on top
>> of the network that Sean already mentioned)? Otherwise you need to set it
>> up. You can install a LAN if you have another free port at the back of your
>> HPC nodes. They should
>>
>> You ought to be able to set up a Hadoop cluster pretty easily. Check this
>> old article of mine for the Hadoop set-up.
>>
>>
>> https://www.linkedin.com/pulse/diy-festive-season-how-install-configure-big-data-so-mich/?trackingId=z7n5tx7tQOGK9tcG9VClkw%3D%3D
>>
>> Hadoop will provide you with a common storage layer (HDFS) that these
>> nodes will be able to share and talk to. YARN is your best bet as the
>> resource manager given the reasonably powerful hosts you have. However, for
>> now the standalone mode will do. Make sure that the Metastore you choose (by
>> default it will use the embedded Hive Metastore backed by Derby :( ) is
>> something respectable, like a Postgres DB, that can handle multiple
>> concurrent Spark jobs.
>>
>> HTH
>>
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 15 Sept 2023 at 07:04, Ilango  wrote:
>>
>>>
>>> Hi all,
>>>
>>> We have 4 HPC nodes and have installed Spark individually on all nodes.
>>>
>>> Spark is used in local mode (each driver/executor will have 8 cores and
>>> 65 GB) from sparklyr/PySpark using RStudio/Posit Workbench. Slurm is used as
>>> the scheduler.
>>>
>>> As this is local mode, we are facing performance issues (as there is only
>>> one executor) when it comes to dealing with large datasets.
>>>
>>> Can I convert these 4 nodes into a Spark standalone cluster? We don't have
>>> Hadoop, so YARN mode is out of scope.
>>>
>>> Shall I follow the official documentation for setting up a standalone
>>> cluster? Will it work? Do I need to be aware of anything else?
>>> Can you please share your thoughts?
>>>
>>> Thanks,
>>> Elango
>>>
>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>


getting emails in different order!

2023-09-18 Thread Mich Talebzadeh
Hi,

I use gmail to receive spark user group emails.

On occasions, I get the latest emails first and later in the day I receive
the original email.

Has anyone else seen this behaviour recently?

Thanks

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


[ANNOUNCE] Apache Kyuubi released 1.7.2

2023-09-18 Thread Zhen Wang
Hi all,

The Apache Kyuubi community is pleased to announce that
Apache Kyuubi 1.7.2 has been released!

Apache Kyuubi is a distributed and multi-tenant gateway to provide
serverless SQL on data warehouses and lakehouses.

Kyuubi provides a pure SQL gateway through a Thrift JDBC/ODBC interface
for end-users to manipulate large-scale data with pre-programmed and
extensible Spark SQL engines.
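
As a rough illustration (not part of the announcement), any
HiveServer2-compatible client can connect to that gateway, for example PyHive
from Python; the host, username, and table here are assumptions, and 10009 is
Kyuubi's default Thrift binary frontend port:

    from pyhive import hive  # third-party HiveServer2/Thrift client

    # Hypothetical endpoint and credentials.
    conn = hive.Connection(host="kyuubi.example.com", port=10009, username="alice")
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM my_db.my_table")
    print(cur.fetchall())
    conn.close()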

We are aiming to make Kyuubi an "out-of-the-box" tool for data warehouses
and lakehouses.

This "out-of-the-box" model minimizes the barriers and costs for end-users
to use Spark at the client side.

On the server side, the Kyuubi server and engine's multi-tenant architecture
provides administrators with a way to achieve computing resource isolation,
data security, high availability, high client concurrency, etc.

The full release notes and download links are available at:
Release Notes: https://kyuubi.apache.org/release/1.7.2.html

To learn more about Apache Kyuubi, please see
https://kyuubi.apache.org/

Kyuubi Resources:
- Issue: https://github.com/apache/kyuubi/issues
- Mailing list: d...@kyuubi.apache.org

We would like to thank all contributors of the Kyuubi community
who made this release possible!

Thanks,
On behalf of Apache Kyuubi community

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org