Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Kezhi Xiong
Oh, I saw it now. Thanks!

On Wed, Sep 20, 2023 at 1:04 PM Sean Owen  wrote:

> [ External sender. Exercise caution. ]
>
> I think the announcement mentioned there were some issues with PyPI and
> the upload size this time. I am sure it's intended to be there when
> possible.
>
> On Wed, Sep 20, 2023, 3:00 PM Kezhi Xiong 
> wrote:
>
>> Hi,
>>
>> Are there any plans to upload PySpark 3.5.0 to PyPI (
>> https://pypi.org/project/pyspark/)? It's still 3.4.1.
>>
>> Thanks,
>> Kezhi
>>
>>
>>


Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Sean Owen
I think the announcement mentioned there were some issues with PyPI and the
upload size this time. I am sure it's intended to be there when possible.

On Wed, Sep 20, 2023, 3:00 PM Kezhi Xiong  wrote:

> Hi,
>
> Are there any plans to upload PySpark 3.5.0 to PyPI (
> https://pypi.org/project/pyspark/)? It's still 3.4.1.
>
> Thanks,
> Kezhi
>
>
>


PySpark 3.5.0 on PyPI

2023-09-20 Thread Kezhi Xiong
Hi,

Are there any plans to upload PySpark 3.5.0 to PyPI (
https://pypi.org/project/pyspark/)? It's still 3.4.1.

Thanks,
Kezhi


[Spark 3.5.0] Is the protobuf-java JAR no longer shipped with Spark?

2023-09-20 Thread Gijs Hendriksen

Hi all,

This week, I tried upgrading to Spark 3.5.0, as it contains some fixes
for spark-protobuf that I need for my project. However, my code no
longer runs under Spark 3.5.0.


My build.sbt file is configured as follows:

val sparkV  = "3.5.0"
val hadoopV = "3.3.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % sparkV  % "provided",
  "org.apache.spark" %% "spark-sql"    % sparkV  % "provided",
  "org.apache.hadoop"    %  "hadoop-client"    % hadoopV % "provided",
  "org.apache.spark" %% "spark-protobuf"   % sparkV,
)

I am using sbt-assembly to build a fat JAR, but I exclude the Spark and
Hadoop JARs to limit the assembled JAR's size. Spark (and its
dependencies) are supplied in our environment by the jars/ directory
included in the Spark distribution.


However, when running my application (which uses protobuf-java's 
CodedOutputStream for writing delimited protobuf files) with Spark 
3.5.0, I now get the following error:


...
Caused by: java.lang.ClassNotFoundException: 
com.google.protobuf.CodedOutputStream

    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 22 more

When inspecting the jars/ directory in the newest Spark release
(spark-3.5.0-bin-hadoop3), I noticed the protobuf-java JAR is no longer
included, while it was present in Spark 3.4.1. My code still compiles
because protobuf-java remains a dependency of spark-core:3.5.0, but since
the JAR is no longer bundled, the class cannot be found at runtime.


Is this expected/intentional behaviour? I was able to resolve the issue
by manually adding protobuf-java as a dependency to my own project and
including it in the fat JAR (see the sketch below), but it seems odd that
it is no longer shipped with Spark as of the newest release. I also could
not find any mention of this change in the release notes or elsewhere, but
perhaps I missed something.
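
For reference, a minimal sketch of that workaround in build.sbt; the
protobuf-java version shown is only an assumption and should be matched to
whatever version your generated protobuf classes and spark-protobuf expect:

// Hypothetical explicit dependency so sbt-assembly bundles protobuf-java
// into the fat JAR; adjust the version to your environment.
libraryDependencies += "com.google.protobuf" % "protobuf-java" % "3.23.4"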


Thanks in advance for any help!

Cheers,
Gijs


Re: Discrepancy in sample standard deviation between PySpark and Excel

2023-09-20 Thread Sean Owen
This has turned into a big thread for a simple thing and has been answered
3 times over now.

Neither is better; they just calculate different things. That the 'default'
is the sample stddev is just convention.
stddev_pop is the simple standard deviation of a set of numbers.
stddev_samp is used when the set of numbers is a sample from a notional
larger population, and you estimate the stddev of that population from the
sample.

They differ only in the denominator. Neither is more efficient, nor is
either more or less sensitive to outliers.
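
A minimal sketch that makes the denominator difference visible on a tiny
data set (shown with the Scala API purely for illustration; the thread is
about PySpark, but stddev_samp and stddev_pop behave the same way there):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{stddev_pop, stddev_samp}

object StddevDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stddev-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0).toDF("x")

    // stddev (an alias of stddev_samp) divides by n - 1; stddev_pop divides by n.
    df.select(stddev_samp($"x"), stddev_pop($"x")).show()
    // For this data: sample stddev ~= 2.138, population stddev = 2.0.

    spark.stop()
  }
}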

On Wed, Sep 20, 2023 at 3:06 AM Mich Talebzadeh 
wrote:

> Spark uses the sample standard deviation stddev_samp by default, whereas
> *Hive* uses population standard deviation stddev_pop as default.
>
> My understanding is that spark uses sample standard deviation by default
> because
>
>- It is more commonly used.
>- It is more efficient to calculate.
>- It is less sensitive to outliers. (data points that differ
>significantly from other observations in a dataset. They can be caused by a
>variety of factors, such as measurement errors or edge events.)
>
> The sample standard deviation is less sensitive to outliers because it
> divides by N-1 instead of N. This means that a single outlier will have a
> smaller impact on the sample standard deviation than it would on the
> population standard deviation.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 19 Sept 2023 at 21:50, Sean Owen  wrote:
>
>> Pyspark follows SQL databases here. stddev is stddev_samp, and sample
>> standard deviation is the calculation with the Bessel correction, n-1 in
>> the denominator. stddev_pop is simply standard deviation, with n in the
>> denominator.
>>
>> On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe 
>> wrote:
>>
>>> Hi!
>>>
>>>
>>>
>>> I am applying the stddev function (so actually stddev_samp), however
>>> when comparing with the sample standard deviation in Excel the results do
>>> not match.
>>>
>>> I cannot find in your documentation any more specifics on how the sample
>>> standard deviation is calculated, so I cannot compare the difference against
>>> Excel, which uses [formula image not included in this archive].
>>>
>>> I am trying to avoid using Excel at all costs, but if the stddev_samp
>>> function is not calculating the standard deviation correctly I have a
>>> problem.
>>>
>>> I hope you can help me resolve this issue.
>>>
>>>
>>>
>>> Kindest regards,
>>>
>>>
>>>
>>> *Helene Bøe*
>>> *Graduate Project Engineer*
>>> Recycling Process & Support
>>>
>>> M: +47 980 00 887
>>> helene.b...@hydro.com
>>> 
>>>
>>> Norsk Hydro ASA
>>> Drammensveien 264
>>> NO-0283 Oslo, Norway
>>> www.hydro.com
>>> 
>>>
>>>
>>> NOTICE: This e-mail transmission, and any documents, files or previous
>>> e-mail messages attached to it, may contain confidential or privileged
>>> information. If you are not the intended recipient, or a person responsible
>>> for delivering it to the intended recipient, you are hereby notified that
>>> any disclosure, copying, distribution or use of any of the information
>>> contained in or attached to this message is STRICTLY PROHIBITED. If you
>>> have received this transmission in error, please immediately notify the
>>> sender and delete the e-mail and attached documents. Thank you.
>>>
>>


Re: Urgent: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem

2023-09-20 Thread Gowtham S
Hi Spark Community,

Thank you for bringing up this issue. We've also encountered the same
challenge and are actively working on finding a solution. It's reassuring
to know that we're not alone in this.

If you have any insights or suggestions regarding how to address this
problem, please feel free to share them.

Looking forward to hearing from others who might have encountered similar
issues.


Thanks and regards,
Gowtham S


On Tue, 19 Sept 2023 at 17:23, Karthick  wrote:

> Subject: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem
>
> Dear Spark Community,
>
> I recently reached out to the Apache Flink community for assistance with a
> critical issue we are facing in our IoT platform, which relies on Apache
> Kafka and real-time data processing. We received some valuable insights and
> suggestions from the Apache Flink community, and now, we would like to seek
> your expertise and guidance on the same problem.
>
> In our IoT ecosystem, we are dealing with data streams from thousands of
> devices, each uniquely identified. To maintain data integrity and ordering,
> we have configured a Kafka topic with ten partitions, ensuring that each
> device's data is directed to its respective partition based on its unique
> identifier. While this architectural choice has been effective in
> maintaining data order, it has unveiled a significant challenge:
>
> *Slow Consumer and Data Skew Problem:* When a single device experiences
> processing delays, it acts as a bottleneck within the Kafka partition,
> leading to delays in processing data from other devices sharing the same
> partition. This issue severely affects the efficiency and scalability of
> our entire data processing pipeline.
>
> Here are some key details:
>
> - Number of Devices: 1000 (with potential growth)
> - Target Message Rate: 1000 messages per second (with expected growth)
> - Kafka Partitions: 10 (some partitions are overloaded)
> - We are planning to migrate from Apache Storm to Apache Flink/Spark.
>
> We are actively seeking guidance on the following aspects:
>
> *1. Independent Device Data Processing*: We require a strategy that
> guarantees one device's processing speed does not affect other devices in
> the same Kafka partition. In other words, we need a solution that ensures
> the independent processing of each device's data.
>
> *2. Custom Partitioning Strategy:* We are looking for a custom
> partitioning strategy to distribute the load evenly across Kafka
> partitions. Currently, we are using Murmur hashing with the device's unique
> identifier, but we are open to exploring alternative partitioning
> strategies (a sketch of one option appears after this quoted message).
>
> *3. Determining Kafka Partition Count:* We seek guidance on how to
> determine the optimal number of Kafka partitions to handle the target
> message rate efficiently.
>
> *4. Handling Data Skew:* Strategies or techniques for handling data skew
> within Apache Flink.
>
> We believe that many in your community may have faced similar challenges
> or possess valuable insights into addressing them. Your expertise and
> experiences can greatly benefit our team and the broader community dealing
> with real-time data processing.
>
> If you have any knowledge, solutions, or references to open-source
> projects, libraries, or community-contributed solutions that align with our
> requirements, we would be immensely grateful for your input.
>
> We appreciate your prompt attention to this matter and eagerly await your
> responses and insights. Your support will be invaluable in helping us
> overcome this critical challenge.
>
> Thank you for your time and consideration.
>
> Thanks & regards,
> Karthick.
>
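
Not an authoritative recommendation, but a minimal sketch of one shape the
custom partitioning from point 2 above could take: a producer-side Kafka
Partitioner that pins a few known-slow ("hot") devices to dedicated
partitions so their backlog cannot stall other devices, while every other
device id keeps murmur-hash routing. The device ids and the pinned
partition map are hypothetical placeholders.

import java.util
import org.apache.kafka.clients.producer.Partitioner
import org.apache.kafka.common.Cluster
import org.apache.kafka.common.utils.Utils

class HotDevicePartitioner extends Partitioner {
  // Hypothetical: in practice this mapping could come from producer config
  // or from consumer-lag metrics rather than being hard-coded.
  private val hotDevices = Map("device-0042" -> 0, "device-0077" -> 1)
  private val reserved   = hotDevices.values.toSet

  override def configure(configs: util.Map[String, _]): Unit = ()

  override def partition(topic: String, key: Any, keyBytes: Array[Byte],
                         value: Any, valueBytes: Array[Byte],
                         cluster: Cluster): Int = {
    val total = cluster.partitionCountForTopic(topic)
    hotDevices.get(String.valueOf(key)) match {
      // Known-slow device: isolate it on its dedicated partition.
      case Some(p) => p
      // Everyone else: murmur-hash the device id onto the non-reserved
      // partitions, preserving per-device ordering within a partition.
      case None =>
        val candidates = (0 until total).filterNot(reserved.contains)
        candidates(Math.floorMod(Utils.murmur2(keyBytes), candidates.size))
    }
  }

  override def close(): Unit = ()
}

It would be enabled via the producer's partitioner.class setting pointing at
the class above; whether isolating hot keys like this or simply adding more
partitions is the better trade-off depends on how skewed the devices are.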


Re: Discrepancy in sample standard deviation between PySpark and Excel

2023-09-20 Thread Mich Talebzadeh
Spark uses the sample standard deviation stddev_samp by default, whereas
*Hive* uses population standard deviation stddev_pop as default.

My understanding is that spark uses sample standard deviation by default
because

   - It is more commonly used.
   - It is more efficient to calculate.
   - It is less sensitive to outliers. (data points that differ
   significantly from other observations in a dataset. They can be caused by a
   variety of factors, such as measurement errors or edge events.)

The sample standard deviation is less sensitive to outliers because it
divides by N-1 instead of N. This means that a single outlier will have a
smaller impact on the sample standard deviation than it would on the
population standard deviation.
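
For reference, the two estimators differ only in the denominator (written
here in LaTeX notation):

s_{\text{samp}} = \sqrt{\tfrac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2},
\qquad
s_{\text{pop}} = \sqrt{\tfrac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2}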

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 19 Sept 2023 at 21:50, Sean Owen  wrote:

> Pyspark follows SQL databases here. stddev is stddev_samp, and sample
> standard deviation is the calculation with the Bessel correction, n-1 in
> the denominator. stddev_pop is simply standard deviation, with n in the
> denominator.
>
> On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe 
> wrote:
>
>> Hi!
>>
>>
>>
>> I am applying the stddev function (so actually stddev_samp), however when
>> comparing with the sample standard deviation in Excel the results do not
>> match.
>>
>> I cannot find in your documentation any more specifics on how the sample
>> standard deviation is calculated, so I cannot compare the difference against
>> Excel, which uses [formula image not included in this archive].
>>
>> I am trying to avoid using Excel at all costs, but if the stddev_samp
>> function is not calculating the standard deviation correctly I have a
>> problem.
>>
>> I hope you can help me resolve this issue.
>>
>>
>>
>> Kindest regards,
>>
>>
>>
>> *Helene Bøe*
>> *Graduate Project Engineer*
>> Recycling Process & Support
>>
>> M: +47 980 00 887
>> helene.b...@hydro.com
>> 
>>
>> Norsk Hydro ASA
>> Drammensveien 264
>> NO-0283 Oslo, Norway
>> www.hydro.com
>> 
>>
>>
>> NOTICE: This e-mail transmission, and any documents, files or previous
>> e-mail messages attached to it, may contain confidential or privileged
>> information. If you are not the intended recipient, or a person responsible
>> for delivering it to the intended recipient, you are hereby notified that
>> any disclosure, copying, distribution or use of any of the information
>> contained in or attached to this message is STRICTLY PROHIBITED. If you
>> have received this transmission in error, please immediately notify the
>> sender and delete the e-mail and attached documents. Thank you.
>>
>