unsubscribe

2023-09-19 Thread Danilo Sousa
unsubscribe




unsubscribe

2023-09-19 Thread Ghousia
unsubscribe


Re: Discrepancy in sample standard deviation between PySpark and Excel

2023-09-19 Thread Mich Talebzadeh
Hi Helene,

Assuming you want to calculate stddev_samp, Spark correctly points STDDEV
to STDDEV_SAMP.

In the query below, replace sales with your table name and AMOUNT_SOLD with
the column you want to run the calculation on.

SELECT
  SQRT((SUM(POWER(AMOUNT_SOLD, 2)) - (COUNT(1) * POWER(AVG(AMOUNT_SOLD), 2))) / (COUNT(1) - 1)) AS MYSTDDEV,
  STDDEV(amount_sold)      AS STDDEV,
  STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
  STDDEV_POP(amount_sold)  AS STDDEV_POP
FROM sales;

For me it returned:

+--------------------+---------------------+--------------------+---------------------+
|      mystddev      |       stddev        |    stddev_samp     |     stddev_pop      |
+--------------------+---------------------+--------------------+---------------------+
| 260.7270919450411  | 260.7270722861637   | 260.7270722861637  | 260.72704617042166  |
+--------------------+---------------------+--------------------+---------------------+
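For cross-checking, here is a rough PySpark equivalent of the same query (a
sketch only, assuming a DataFrame named sales with an amount_sold column; it is
not part of the original reply):

```python
from pyspark.sql import functions as F

# `sales` is assumed to be a DataFrame with an amount_sold column.
sales.select(
    F.sqrt(
        (F.sum(F.pow("amount_sold", 2)) - F.count(F.lit(1)) * F.pow(F.avg("amount_sold"), 2))
        / (F.count(F.lit(1)) - 1)
    ).alias("mystddev"),
    F.stddev("amount_sold").alias("stddev"),            # STDDEV is an alias of STDDEV_SAMP
    F.stddev_samp("amount_sold").alias("stddev_samp"),
    F.stddev_pop("amount_sold").alias("stddev_pop"),
).show()
```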

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 19 Sept 2023 at 13:14, Helene Bøe wrote:

> Hi!
>
>
>
> I am applying the stddev function (so actually stddev_samp), however when
> comparing with the sample standard deviation in Excel the results do not
> match.
>
> I cannot find in your documentation any more specifics on how the sample
> standard deviation is calculated, so I cannot compare the difference against
> Excel, which uses [formula image not included].
>
> I am trying to avoid using Excel at all costs, but if the stddev_samp
> function is not calculating the standard deviation correctly I have a
> problem.
>
> I hope you can help me resolve this issue.
>
>
>
> Kindest regards,
>
>
>
> *Helene Bøe*
> *Graduate Project Engineer*
> Recycling Process & Support
>
> M: +47 980 00 887
> helene.b...@hydro.com
> 
>
> Norsk Hydro ASA
> Drammensveien 264
> NO-0283 Oslo, Norway
> www.hydro.com
> 
>
>
> NOTICE: This e-mail transmission, and any documents, files or previous
> e-mail messages attached to it, may contain confidential or privileged
> information. If you are not the intended recipient, or a person responsible
> for delivering it to the intended recipient, you are hereby notified that
> any disclosure, copying, distribution or use of any of the information
> contained in or attached to this message is STRICTLY PROHIBITED. If you
> have received this transmission in error, please immediately notify the
> sender and delete the e-mail and attached documents. Thank you.
>


Re: Discrepancy in sample standard deviation between PySpark and Excel

2023-09-19 Thread Bjørn Jørgensen
from pyspark.sql import SparkSession
from pyspark.sql.functions import stddev_samp, stddev_pop

spark = SparkSession.builder.getOrCreate()

data = [(52.7,), (45.3,), (60.2,), (53.8,), (49.1,), (44.6,), (58.0,),
(56.5,), (47.9,), (50.3,)]
df = spark.createDataFrame(data, ["value"])

df.select(stddev_samp("value").alias("sample_stddev")).show()

+-----------------+
|    sample_stddev|
+-----------------+
|5.320025062597606|
+-----------------+



In MS Excel 365 Norwegian

[image: image.png]


=STDAVVIKA(B1:B10)

=STDAV.S(B1:B10)

Both return
5,32002506

which is the same value PySpark produces.





On Tue, 19 Sep 2023 at 14:15, Helene Bøe wrote:

> Hi!
>
>
>
> I am applying the stddev function (so actually stddev_samp), however when
> comparing with the sample standard deviation in Excel the results do not
> match.
>
> I cannot find in your documentation any more specifics on how the sample
> standard deviation is calculated, so I cannot compare the difference against
> Excel, which uses [formula image not included].
>
> I am trying to avoid using Excel at all costs, but if the stddev_samp
> function is not calculating the standard deviation correctly I have a
> problem.
>
> I hope you can help me resolve this issue.
>
>
>
> Kindest regards,
>
>
>
> *Helene Bøe*
> *Graduate Project Engineer*
> Recycling Process & Support
>
> M: +47 980 00 887
> helene.b...@hydro.com
> 
>
> Norsk Hydro ASA
> Drammensveien 264
> NO-0283 Oslo, Norway
> www.hydro.com
> 
>
>
> NOTICE: This e-mail transmission, and any documents, files or previous
> e-mail messages attached to it, may contain confidential or privileged
> information. If you are not the intended recipient, or a person responsible
> for delivering it to the intended recipient, you are hereby notified that
> any disclosure, copying, distribution or use of any of the information
> contained in or attached to this message is STRICTLY PROHIBITED. If you
> have received this transmission in error, please immediately notify the
> sender and delete the e-mail and attached documents. Thank you.
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Create an external table with DataFrameWriterV2

2023-09-19 Thread Christophe Préaud

Hi,

I usually create an external Delta table with the command below, using the
DataFrameWriter API:


df.write
  .format("delta")
  .option("path", "<table path>")
  .saveAsTable("<table name>")

Now I would like to use the DataFrameWriterV2 API.
I have tried the following command:

df.writeTo("<table name>")
  .using("delta")
  .option("path", "<table path>")
  .createOrReplace()

but it creates a managed table, not an external one.

Can you tell me the correct syntax for creating an external table with the
DataFrameWriterV2 API?


Thanks,
Christophe.


Re: Discrepancy in sample standard deviation between PySpark and Excel

2023-09-19 Thread Sean Owen
PySpark follows SQL databases here. stddev is stddev_samp, and sample
standard deviation is the calculation with the Bessel correction, i.e. n-1 in
the denominator. stddev_pop is the population standard deviation, with n in
the denominator.
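
To make the difference concrete, here is a minimal plain-Python sketch (an
illustration added here, not part of Sean's reply), using the same ten values
as Bjørn's example:

```python
import math

values = [52.7, 45.3, 60.2, 53.8, 49.1, 44.6, 58.0, 56.5, 47.9, 50.3]
n = len(values)
mean = sum(values) / n
ss = sum((x - mean) ** 2 for x in values)  # sum of squared deviations

stddev_samp = math.sqrt(ss / (n - 1))  # Bessel correction: divide by n - 1
stddev_pop = math.sqrt(ss / n)         # population form: divide by n

print(stddev_samp)  # ~5.3200..., matches stddev/stddev_samp and Excel's sample stddev
print(stddev_pop)   # slightly smaller, matches stddev_pop
```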

On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe wrote:

> Hi!
>
>
>
> I am applying the stddev function (so actually stddev_samp), however when
> comparing with the sample standard deviation in Excel the results do not
> match.
>
> I cannot find in your documentation any more specifics on how the sample
> standard deviation is calculated, so I cannot compare the difference against
> Excel, which uses [formula image not included].
>
> I am trying to avoid using Excel at all costs, but if the stddev_samp
> function is not calculating the standard deviation correctly I have a
> problem.
>
> I hope you can help me resolve this issue.
>
>
>
> Kindest regards,
>
>
>
> *Helene Bøe*
> *Graduate Project Engineer*
> Recycling Process & Support
>
> M: +47 980 00 887
> helene.b...@hydro.com
> 
>
> Norsk Hydro ASA
> Drammensveien 264
> NO-0283 Oslo, Norway
> www.hydro.com
> 
>
>
> NOTICE: This e-mail transmission, and any documents, files or previous
> e-mail messages attached to it, may contain confidential or privileged
> information. If you are not the intended recipient, or a person responsible
> for delivering it to the intended recipient, you are hereby notified that
> any disclosure, copying, distribution or use of any of the information
> contained in or attached to this message is STRICTLY PROHIBITED. If you
> have received this transmission in error, please immediately notify the
> sender and delete the e-mail and attached documents. Thank you.
>


Spark streaming sourceArchiveDir does not move file to archive directory

2023-09-19 Thread Yunus Emre Gürses
Hello everyone,

I'm using Scala and Spark 3.4.1 on Windows 10. While streaming with Spark, I
set the `cleanSource` option to "archive" and the `sourceArchiveDir` option to
"archived", as in the code below.

```
spark.readStream
  .option("cleanSource", "archive")
  .option("sourceArchiveDir", "archived")
  .option("enforceSchema", false)
  .option("header", includeHeader)
  .option("inferSchema", inferSchema)
  .options(otherOptions)
  .schema(csvSchema.orNull)
  .csv(FileUtils.getPath(sourceSettings.dataFolderPath, mappingSource.path).toString)
```

The code ```FileUtils.getPath(sourceSettings.dataFolderPath, 
mappingSource.path)``` returns a relative path like: 
test-data\streaming-folder\patients

When I start the stream, Spark does not move the source CSV to the archive
folder. After working on it a bit, I started debugging the Spark source code. I
found the ```override protected def cleanTask(entry: FileEntry): Unit``` method
in the `FileStreamSource.scala` file in the `org.apache.spark.sql.execution.streaming`
package.
On line 569, the ```!fileSystem.rename(curPath, newPath)``` call is supposed to
move the source file to the archive folder. However, when I debugged, I noticed
that the curPath and newPath values were as follows:

**curPath**: 
`file:/C:/dev/be/data-integration-suite/test-data/streaming-folder/patients/patients-success.csv`

**newPath**: 
`file:/C:/dev/be/data-integration-suite/archived/C:/dev/be/data-integration-suite/test-data/streaming-folder/patients/patients-success.csv`

It seems the absolute path of the CSV file was appended when creating `newPath`,
because `C:/dev/be/data-integration-suite` appears twice in newPath. This is why
Spark archiving does not work. Instead, newPath should be:
`file:/C:/dev/be/data-integration-suite/archived/test-data/streaming-folder/patients/patients-success.csv`.
I guess this is more of a Spark library issue and maybe a Spark bug? Is there
any workaround or Spark config to overcome this problem?
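
For what it's worth, the doubled prefix looks consistent with the documented
cleanSource behaviour, where the source file's own path is appended underneath
sourceArchiveDir. The sketch below (plain Python, an illustration only, and an
assumption about the mechanism rather than a verified reading of the Spark
source) reproduces the observed newPath:

```python
from pathlib import PurePosixPath

def archived_location(source_archive_dir: str, source_file_path: str) -> str:
    # Assumption: the archive target is formed by appending the source file's
    # full path underneath the (resolved) sourceArchiveDir.
    return str(PurePosixPath(source_archive_dir) / source_file_path.lstrip("/"))

print(archived_location(
    "/C:/dev/be/data-integration-suite/archived",
    "/C:/dev/be/data-integration-suite/test-data/streaming-folder/patients/patients-success.csv",
))
# -> /C:/dev/be/data-integration-suite/archived/C:/dev/be/data-integration-suite/...
```

If that is indeed what happens, the Windows drive-letter segment ("C:") landing
in the middle of the new path may be what makes the rename fail, but that is
speculation rather than a confirmed diagnosis.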

Thanks
Best regards,
Yunus Emre


Discrepancy in sample standard deviation between PySpark and Excel

2023-09-19 Thread Helene Bøe
Hi!

I am applying the stddev function (so actually stddev_samp), however when
comparing with the sample standard deviation in Excel the results do not match.
I cannot find in your documentation any more specifics on how the sample
standard deviation is calculated, so I cannot compare the difference against
Excel, which uses [formula image not included].
I am trying to avoid using Excel at all costs, but if the stddev_samp function
is not calculating the standard deviation correctly I have a problem.
I hope you can help me resolve this issue.

Kindest regards,

Helene Bøe
Graduate Project Engineer
Recycling Process & Support
M: +47 980 00 887
helene.b...@hydro.com

Norsk Hydro ASA
Drammensveien 264
NO-0283 Oslo, Norway
www.hydro.com


NOTICE: This e-mail transmission, and any documents, files or previous e-mail 
messages attached to it, may contain confidential or privileged information. If 
you are not the intended recipient, or a person responsible for delivering it 
to the intended recipient, you are hereby notified that any disclosure, 
copying, distribution or use of any of the information contained in or attached 
to this message is STRICTLY PROHIBITED. If you have received this transmission 
in error, please immediately notify the sender and delete the e-mail and 
attached documents. Thank you.


Re: Spark stand-alone mode

2023-09-19 Thread Patrick Tucci
Multiple applications can run at once, but you need to either configure
Spark or your applications to allow that. In stand-alone mode, each
application attempts to take all resources available by default. This
section of the documentation has more details:

https://spark.apache.org/docs/latest/spark-standalone.html#resource-scheduling

Explicitly setting the resources per application limits the resources to
the configured values for the lifetime of the application. You can use
dynamic allocation to let Spark scale resources up and down per
application based on load, but the configuration is more complex:

https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
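
As a rough starting point, here is a minimal sketch of both approaches from
PySpark (the master URL and all values are illustrative assumptions, not
recommendations for your cluster):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("capped-app")
    .master("spark://master-host:7077")        # standalone master URL (hostname assumed)
    # Static caps so one application does not grab the whole cluster:
    .config("spark.cores.max", "16")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    # Or dynamic allocation, letting Spark scale executors with load:
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "4")
    .getOrCreate()
)
```

The same settings can also go into spark-defaults.conf or be passed with --conf
to spark-submit instead of being set in code.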

On Mon, Sep 18, 2023 at 3:53 PM Ilango  wrote:

>
> Thanks all for your suggestions. Noted with thanks.
> Just wanted share few more details about the environment
> 1. We use NFS for data storage and data is in parquet format
> 2. All HPC nodes are connected and already work as a cluster for Studio
> workbench. I can set up passwordless SSH if it does not already exist.
> 3. We will stick with NFS and standalone mode for now, then maybe explore
> HDFS and YARN.
>
> Can you please confirm whether multiple users can run spark jobs at the
> same time?
> If so I will start working on it and let you know how it goes
>
> Mich, the link to Hadoop is not working. Can you please check and let me
> know the correct link. Would like to explore Hadoop option as well.
>
>
>
> Thanks,
> Elango
>
> On Sat, Sep 16, 2023, 4:20 AM Bjørn Jørgensen 
> wrote:
>
>> You need to set up SSH without a password; use a key instead: How to connect
>> without password using SSH (passwordless)
>> 
>>
>> fre. 15. sep. 2023 kl. 20:55 skrev Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>> Hi,
>>>
>>> Can these 4 nodes talk to each other through ssh as trusted hosts (on
>>> top of the network that Sean already mentioned)? Otherwise you need to set
>>> it up. You can install a LAN if you have another free port at the back of
>>> your HPC nodes. They should
>>>
>>> You ought to try to set up a Hadoop cluster pretty easily. Check this
>>> old article of mine for Hadoop set-up.
>>>
>>>
>>> https://www.linkedin.com/pulse/diy-festive-season-how-install-configure-big-data-so-mich/?trackingId=z7n5tx7tQOGK9tcG9VClkw%3D%3D
>>>
>>> Hadoop will provide you with a common storage layer (HDFS) that these
>>> nodes will be able to share and talk. Yarn is your best bet as the resource
>>> manager with reasonably powerful hosts you have. However, for now the Stand
>>> Alone mode will do. Make sure that the metastore you choose (by default it
>>> will use the Derby-backed Hive Metastore :( ) is something respectable like
>>> a Postgres DB that can handle multiple concurrent Spark jobs.
>>>
>>> HTH
>>>
>>>
>>> Mich Talebzadeh,
>>> Distinguished Technologist, Solutions Architect & Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 15 Sept 2023 at 07:04, Ilango  wrote:
>>>

 Hi all,

 We have 4 HPC nodes and installed spark individually in all nodes.

 Spark is used in local mode (each driver/executor has 8 cores and
 65 GB) in sparklyr/PySpark using RStudio/Posit Workbench. Slurm is used as
 the scheduler.

 As this is local mode, we are facing performance issues (as there is only one
 executor) when it comes to dealing with large datasets.

 Can I convert these 4 nodes into a Spark standalone cluster? We don't have
 Hadoop, so YARN mode is out of scope.

 Shall I follow the official documentation for setting up a standalone
 cluster? Will it work? Do I need to be aware of anything else?
 Can you please share your thoughts?

 Thanks,
 Elango

>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>


Urgent: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem

2023-09-19 Thread Karthick
Subject: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem

Dear Spark Community,

I recently reached out to the Apache Flink community for assistance with a
critical issue we are facing in our IoT platform, which relies on Apache
Kafka and real-time data processing. We received some valuable insights and
suggestions from the Apache Flink community, and now, we would like to seek
your expertise and guidance on the same problem.

In our IoT ecosystem, we are dealing with data streams from thousands of
devices, each uniquely identified. To maintain data integrity and ordering,
we have configured a Kafka topic with ten partitions, ensuring that each
device's data is directed to its respective partition based on its unique
identifier. While this architectural choice has been effective in
maintaining data order, it has unveiled a significant challenge:

*Slow Consumer and Data Skew Problem:* When a single device experiences
processing delays, it acts as a bottleneck within the Kafka partition,
leading to delays in processing data from other devices sharing the same
partition. This issue severely affects the efficiency and scalability of
our entire data processing pipeline.

Here are some key details:

- Number of Devices: 1000 (with potential growth)
- Target Message Rate: 1000 messages per second (with expected growth)
- Kafka Partitions: 10 (some partitions are overloaded)
- We are planning to migrate from Apache Storm to Apache Flink/Spark.

We are actively seeking guidance on the following aspects:

*1. Independent Device Data Processing*: We require a strategy that
guarantees one device's processing speed does not affect other devices in
the same Kafka partition. In other words, we need a solution that ensures
the independent processing of each device's data.

*2. Custom Partitioning Strategy:* We are looking for a custom partitioning
strategy to distribute the load evenly across Kafka partitions. Currently,
we are using Murmur hashing with the device's unique identifier, but we are
open to exploring alternative partitioning strategies (see the sketch after
this list).

*3. Determining Kafka Partition Count:* We seek guidance on how to
determine the optimal number of Kafka partitions to handle the target
message rate efficiently.

*4. Handling Data Skew:* Strategies or techniques for handling data skew
within Apache Flink.
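
Not an answer to all of the above, but as a small illustration of the kind of
key-based assignment and "salting" that point 2 touches on, here is a plain
Python sketch (not tied to any particular Kafka client; the hash choice and
bucket counts are assumptions for illustration only):

```python
import hashlib

NUM_PARTITIONS = 10

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stable hash of the key, mapped onto a partition. MD5 is used here purely
    # for illustration; the Kafka Java client's default partitioner uses murmur2.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def salted_key(device_id: str, event_seq: int, salt_buckets: int = 4) -> str:
    # Spread one hot device over `salt_buckets` sub-keys. Ordering is then only
    # guaranteed per sub-key, no longer per device, which may or may not be
    # acceptable for the use case described above.
    return f"{device_id}#{event_seq % salt_buckets}"

print(partition_for("device-0042"))                  # fixed partition per device
print(partition_for(salted_key("device-0042", 17)))  # hot device spread across buckets
```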

We believe that many in your community may have faced similar challenges or
possess valuable insights into addressing them. Your expertise and
experiences can greatly benefit our team and the broader community dealing
with real-time data processing.

If you have any knowledge, solutions, or references to open-source
projects, libraries, or community-contributed solutions that align with our
requirements, we would be immensely grateful for your input.

We appreciate your prompt attention to this matter and eagerly await your
responses and insights. Your support will be invaluable in helping us
overcome this critical challenge.

Thank you for your time and consideration.

Thanks & regards,
Karthick.