Re: Spark Compatibility with Spring Boot 3.x

2023-10-05 Thread Sean Owen
I think we already updated this in Spark 4. However, for now you would also
have to include a JAR with the jakarta.* classes instead.
You are welcome to try Spark 4 now by building from master, but it's far
from release.

On Thu, Oct 5, 2023 at 11:53 AM Ahmed Albalawi
 wrote:

> Hello team,
>
> We are in the process of upgrading one of our apps to Spring Boot 3.x
> while using Spark, and we have encountered an issue with Spark
> compatibility, specifically with Jakarta Servlet. Spring Boot 3.x uses
> Jakarta Servlet while Spark uses Javax Servlet. Can we get some guidance on
> how to upgrade to Spring Boot 3.x while continuing to use Spark?
>
> The specific error is listed below:
>
> java.lang.NoClassDefFoundError: javax/servlet/Servlet
> at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:239)
> at org.apache.spark.SparkContext.(SparkContext.scala:503)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2888)
> at org.apache.spark.SparkContext.getOrCreate(SparkContext.scala)
>
> The error comes up when we try to run a mvn clean install, and the issue is 
> in our test cases. This issue happens specifically when we build our spark 
> session. The line of code it traces down to is as follows:
>
> *session = 
> SparkSession.builder().sparkContext(SparkContext.getOrCreate(sparkConf)).getOrCreate();*
>
> What we have tried:
>
> - We noticed, according to this post, that there are no compatible versions
> of Spark using version 5 of the Jakarta Servlet API.
>
> - We've tried using the maven shade plugin to use jakarta instead of javax,
> but are running into some other issues with this.
> - We've also looked at the following approach to use jakarta 4.x with jersey
> 2.x and still have an issue with the servlet.
>
>
> Please let us know if there are any solutions to this issue. Thanks!
>
>
> --
> *Ahmed Albalawi*
>
> Senior Associate Software Engineer • EP2 Tech - CuRE
>
> 571-668-3911 •  1680 Capital One Dr.
> --
>
> The information contained in this e-mail may be confidential and/or
> proprietary to Capital One and/or its affiliates and may only be used
> solely in performance of work or services for Capital One. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed. If the reader of this message is not the intended
> recipient, you are hereby notified that any review, retransmission,
> dissemination, distribution, copying or other use of, or taking of any
> action in reliance upon this information is strictly prohibited. If you
> have received this communication in error, please contact the sender and
> delete the material from your computer.
>
>
>
>
>


Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Sean Owen
I think the announcement mentioned there were some issues with pypi and the
upload size this time. I am sure it's intended to be there when possible.

On Wed, Sep 20, 2023, 3:00 PM Kezhi Xiong  wrote:

> Hi,
>
> Are there any plans to upload PySpark 3.5.0 to PyPI (
> https://pypi.org/project/pyspark/)? It's still 3.4.1.
>
> Thanks,
> Kezhi
>
>
>


Re: Discrepancy in sample standard deviation between pyspark and Excel

2023-09-20 Thread Sean Owen
This has turned into a big thread for a simple thing and has been answered
3 times over now.

Neither is better; they just calculate different things. That the 'default'
is sample stddev is just convention.
stddev_pop is the simple standard deviation of a set of numbers
stddev_samp is used when the set of numbers is a sample from a notional
larger population, and you estimate the stddev of the population from the
sample.

They only differ in the denominator. Neither is more efficient at all or
more/less sensitive to outliers.
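For illustration, a minimal PySpark sketch of the two functions (the column
name and values here are made up, not taken from this thread):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (2.0,), (3.0,), (4.0,)], ["x"])

    df.select(
        F.stddev("x").alias("stddev_default"),    # alias of stddev_samp
        F.stddev_samp("x").alias("stddev_samp"),  # divides by n - 1
        F.stddev_pop("x").alias("stddev_pop"),    # divides by n
    ).show()
    # For these four values: stddev_samp ~ 1.2910, stddev_pop ~ 1.1180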

On Wed, Sep 20, 2023 at 3:06 AM Mich Talebzadeh 
wrote:

> Spark uses the sample standard deviation stddev_samp by default, whereas
> *Hive* uses population standard deviation stddev_pop as default.
>
> My understanding is that spark uses sample standard deviation by default
> because
>
>- It is more commonly used.
>- It is more efficient to calculate.
>- It is less sensitive to outliers. (data points that differ
>significantly from other observations in a dataset. They can be caused by a
>variety of factors, such as measurement errors or edge events.)
>
> The sample standard deviation is less sensitive to outliers because it
> divides by N-1 instead of N. This means that a single outlier will have a
> smaller impact on the sample standard deviation than it would on the
> population standard deviation.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 19 Sept 2023 at 21:50, Sean Owen  wrote:
>
>> Pyspark follows SQL databases here. stddev is stddev_samp, and sample
>> standard deviation is the calculation with the Bessel correction, n-1 in
>> the denominator. stddev_pop is simply standard deviation, with n in the
>> denominator.
>>
>> On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe 
>> wrote:
>>
>>> Hi!
>>>
>>>
>>>
>>> I am applying the stddev function (so actually stddev_samp), however
>>> when comparing with the sample standard deviation in Excel the results do
>>> not match.
>>>
>>> I cannot find in your documentation any more specifics on how the sample
>>> standard deviation is calculated, so I cannot compare the difference with
>>> Excel, which uses the formula shown in an embedded image.
>>>
>>> I am trying to avoid using Excel at all costs, but if the stddev_samp
>>> function is not calculating the standard deviation correctly I have a
>>> problem.
>>>
>>> I hope you can help me resolve this issue.
>>>
>>>
>>>
>>> Kindest regards,
>>>
>>>
>>>
>>> *Helene Bøe*
>>> *Graduate Project Engineer*
>>> Recycling Process & Support
>>>
>>> M: +47 980 00 887
>>> helene.b...@hydro.com
>>> <https://intra.hydro.com/EPiServer/CMS/Content/en/%2c%2c9/?epieditmode=False>
>>>
>>> Norsk Hydro ASA
>>> Drammensveien 264
>>> NO-0283 Oslo, Norway
>>> www.hydro.com
>>> <https://intra.hydro.com/EPiServer/CMS/Content/en/%2c%2c9/?epieditmode=False>
>>>
>>>
>>> NOTICE: This e-mail transmission, and any documents, files or previous
>>> e-mail messages attached to it, may contain confidential or privileged
>>> information. If you are not the intended recipient, or a person responsible
>>> for delivering it to the intended recipient, you are hereby notified that
>>> any disclosure, copying, distribution or use of any of the information
>>> contained in or attached to this message is STRICTLY PROHIBITED. If you
>>> have received this transmission in error, please immediately notify the
>>> sender and delete the e-mail and attached documents. Thank you.
>>>
>>


Re: Discrepancy in sample standard deviation between pyspark and Excel

2023-09-19 Thread Sean Owen
Pyspark follows SQL databases here. stddev is stddev_samp, and sample
standard deviation is the calculation with the Bessel correction, n-1 in
the denominator. stddev_pop is simply standard deviation, with n in the
denominator.

On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe 
wrote:

> Hi!
>
>
>
> I am applying the stddev function (so actually stddev_samp), however when
> comparing with the sample standard deviation in Excel the results do not
> match.
>
> I cannot find in your documentation any more specifics on how the sample
> standard deviation is calculated, so I cannot compare the difference with
> Excel, which uses the formula shown in an embedded image.
>
> I am trying to avoid using Excel at all costs, but if the stddev_samp
> function is not calculating the standard deviation correctly I have a
> problem.
>
> I hope you can help me resolve this issue.
>
>
>
> Kindest regards,
>
>
>
> *Helene Bøe*
> *Graduate Project Engineer*
> Recycling Process & Support
>
> M: +47 980 00 887
> helene.b...@hydro.com
> 
>
> Norsk Hydro ASA
> Drammensveien 264
> NO-0283 Oslo, Norway
> www.hydro.com
> 
>
>
> NOTICE: This e-mail transmission, and any documents, files or previous
> e-mail messages attached to it, may contain confidential or privileged
> information. If you are not the intended recipient, or a person responsible
> for delivering it to the intended recipient, you are hereby notified that
> any disclosure, copying, distribution or use of any of the information
> contained in or attached to this message is STRICTLY PROHIBITED. If you
> have received this transmission in error, please immediately notify the
> sender and delete the e-mail and attached documents. Thank you.
>


Re: getting emails in different order!

2023-09-18 Thread Sean Owen
I have seen this, and I'm not sure if it's just the ASF mailer being weird or,
more likely, because emails are moderated and we inadvertently moderate
them out of order.

On Mon, Sep 18, 2023 at 10:59 AM Mich Talebzadeh 
wrote:

> Hi,
>
> I use gmail to receive spark user group emails.
>
> On occasions, I get the latest emails first and later in the day I receive
> the original email.
>
> Has anyone else seen this behaviour recently?
>
> Thanks
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Spark stand-alone mode

2023-09-15 Thread Sean Owen
Yes, should work fine, just set up according to the docs. There needs to be
network connectivity between whatever the driver node is and these 4 nodes.
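As a rough, hedged sketch, once a standalone master and workers are running,
pointing a session at the cluster looks like this (the master host name and
resource sizes below are placeholders, not values from this thread):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("spark://master-host:7077")      # standalone master URL
             .appName("standalone-example")
             .config("spark.executor.cores", "8")     # per-executor cores
             .config("spark.executor.memory", "60g")  # per-executor memory
             .getOrCreate())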

On Thu, Sep 14, 2023 at 11:57 PM Ilango  wrote:

>
> Hi all,
>
> We have 4 HPC nodes and installed Spark individually on all nodes.
>
> Spark is used in local mode (each driver/executor will have 8 cores and 65
> GB) in sparklyr/pyspark using RStudio/Posit Workbench. Slurm is used as the
> scheduler.
>
> As this is local mode, we are facing a performance issue (as there is only
> one executor) when it comes to dealing with large datasets.
>
> Can I convert these 4 nodes into a Spark standalone cluster? We don't have
> Hadoop, so YARN mode is out of scope.
>
> Shall I follow the official documentation for setting up a standalone
> cluster? Will it work? Do I need to be aware of anything else?
> Can you please share your thoughts?
>
> Thanks,
> Elango
>


Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Sean Owen
I mean, have you checked if this is in your jar? Are you building an
assembly? Where do you expect elastic classes to be and are they there?
Need some basic debugging here

On Thu, Sep 7, 2023, 8:49 PM Dipayan Dev  wrote:

> Hi Sean,
>
> Removed the provided thing, but still the same issue.
>
> <dependency>
>     <groupId>org.elasticsearch</groupId>
>     <artifactId>elasticsearch-spark-30_${scala.compat.version}</artifactId>
>     <version>7.12.1</version>
> </dependency>
>
>
> On Fri, Sep 8, 2023 at 4:41 AM Sean Owen  wrote:
>
>> By marking it provided, you are not including this dependency with your
>> app. If it is also not somehow already provided by your spark cluster (this
>> is what it means), then yeah this is not anywhere on the class path at
>> runtime. Remove the provided scope.
>>
>> On Thu, Sep 7, 2023, 4:09 PM Dipayan Dev  wrote:
>>
>>> Hi,
>>>
>>> Can you please elaborate your last response? I don’t have any external
>>> dependencies added, and just updated the Spark version as mentioned below.
>>>
>>> Can someone help me with this?
>>>
>>> On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers  wrote:
>>>
>>>> could the provided scope be the issue?
>>>>
>>>> On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev 
>>>> wrote:
>>>>
>>>>> Using the following dependency for Spark 3 in POM file (My Scala
>>>>> version is 2.12.14)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> <dependency>
>>>>>     <groupId>org.elasticsearch</groupId>
>>>>>     <artifactId>elasticsearch-spark-30_2.12</artifactId>
>>>>>     <version>7.12.0</version>
>>>>>     <scope>provided</scope>
>>>>> </dependency>
>>>>>
>>>>>
>>>>> The code throws error at this line :
>>>>> df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name")
>>>>> The same code is working with Spark 2.4.0 and the following dependency
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> <dependency>
>>>>>     <groupId>org.elasticsearch</groupId>
>>>>>     <artifactId>elasticsearch-spark-20_2.12</artifactId>
>>>>>     <version>7.12.0</version>
>>>>> </dependency>
>>>>>
>>>>>
>>>>> On Mon, 28 Aug 2023 at 12:17 AM, Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> What’s the version of the ES connector you are using?
>>>>>>
>>>>>> On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> We're using Spark 2.4.x to write dataframe into the Elasticsearch
>>>>>>> index.
>>>>>>> As we're upgrading to Spark 3.3.0, it is throwing this error:
>>>>>>> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
>>>>>>> at
>>>>>>> java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
>>>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>>>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>>>>>>>
>>>>>>> Looking at a few responses from Stackoverflow
>>>>>>> <https://stackoverflow.com/a/66452149>. it seems this is not yet
>>>>>>> supported by Elasticsearch-hadoop.
>>>>>>>
>>>>>>> Does anyone have experience with this? Or faced/resolved this issue
>>>>>>> in Spark 3?
>>>>>>>
>>>>>>> Thanks in advance!
>>>>>>>
>>>>>>> Regards
>>>>>>> Dipayan
>>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>
>>>> CONFIDENTIALITY NOTICE: This electronic communication and any files
>>>> transmitted with it are confidential, privileged and intended solely for
>>>> the use of the individual or entity to whom they are addressed. If you are
>>>> not the intended recipient, you are hereby notified that any disclosure,
>>>> copying, distribution (electronic or otherwise) or forwarding of, or the
>>>> taking of any action in reliance on the contents of this transmission is
>>>> strictly prohibited. Please notify the sender immediately by e-mail if you
>>>> have received this email by mistake and delete this email from your system.
>>>>
>>>> Is it necessary to print this email? If you care about the environment
>>>> like we do, please refrain from printing emails. It helps to keep the
>>>> environment forested and litter-free.
>>>
>>>


Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Sean Owen
By marking it provided, you are not including this dependency with your
app. If it is also not somehow already provided by your spark cluster (this
is what it means), then yeah this is not anywhere on the class path at
runtime. Remove the provided scope.

On Thu, Sep 7, 2023, 4:09 PM Dipayan Dev  wrote:

> Hi,
>
> Can you please elaborate your last response? I don’t have any external
> dependencies added, and just updated the Spark version as mentioned below.
>
> Can someone help me with this?
>
> On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers  wrote:
>
>> could the provided scope be the issue?
>>
>> On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev 
>> wrote:
>>
>>> Using the following dependency for Spark 3 in POM file (My Scala version
>>> is 2.12.14)
>>>
>>>
>>>
>>>
>>>
>>>
>>> <dependency>
>>>     <groupId>org.elasticsearch</groupId>
>>>     <artifactId>elasticsearch-spark-30_2.12</artifactId>
>>>     <version>7.12.0</version>
>>>     <scope>provided</scope>
>>> </dependency>
>>>
>>>
>>> The code throws error at this line :
>>> df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name")
>>> The same code is working with Spark 2.4.0 and the following dependency
>>>
>>>
>>>
>>>
>>>
>>> <dependency>
>>>     <groupId>org.elasticsearch</groupId>
>>>     <artifactId>elasticsearch-spark-20_2.12</artifactId>
>>>     <version>7.12.0</version>
>>> </dependency>
>>>
>>>
>>> On Mon, 28 Aug 2023 at 12:17 AM, Holden Karau 
>>> wrote:
>>>
 What’s the version of the ES connector you are using?

 On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev 
 wrote:

> Hi All,
>
> We're using Spark 2.4.x to write dataframe into the Elasticsearch
> index.
> As we're upgrading to Spark 3.3.0, it is throwing this error:
> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
> at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>
> Looking at a few responses from Stackoverflow
> . it seems this is not yet
> supported by Elasticsearch-hadoop.
>
> Does anyone have experience with this? Or faced/resolved this issue in
> Spark 3?
>
> Thanks in advance!
>
> Regards
> Dipayan
>
 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>> CONFIDENTIALITY NOTICE: This electronic communication and any files
>> transmitted with it are confidential, privileged and intended solely for
>> the use of the individual or entity to whom they are addressed. If you are
>> not the intended recipient, you are hereby notified that any disclosure,
>> copying, distribution (electronic or otherwise) or forwarding of, or the
>> taking of any action in reliance on the contents of this transmission is
>> strictly prohibited. Please notify the sender immediately by e-mail if you
>> have received this email by mistake and delete this email from your system.
>>
>> Is it necessary to print this email? If you care about the environment
>> like we do, please refrain from printing emails. It helps to keep the
>> environment forested and litter-free.
>
>


Re: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Sean Owen
It's a dependency of some other HTTP library. Use mvn dependency:tree to
see where it comes from. It may be more straightforward to upgrade the
library that brings it in, assuming a later version brings in a later okio.
You can also manage up the version directly with a new entry in
<dependencyManagement>.

However, does this affect Spark? All else equal, it doesn't hurt to upgrade,
but wondering if there is even a theory that it needs to be updated.


On Thu, Aug 31, 2023 at 7:42 AM Agrawal, Sanket 
wrote:

> I don’t see an entry in pom.xml while building spark. I think it is being
> downloaded as part of some other dependency.
>
>
>
> *From:* Sean Owen 
> *Sent:* Thursday, August 31, 2023 5:10 PM
> *To:* Agrawal, Sanket 
> *Cc:* user@spark.apache.org
> *Subject:* [EXT] Re: Okio Vulnerability in Spark 3.4.1
>
>
>
> Does the vulnerability affect Spark?
>
> In any event, have you tried updating Okio in the Spark build? I don't
> believe you could just replace the JAR, as other libraries probably rely on
> it and compiled against the current version.
>
>
>
> On Thu, Aug 31, 2023 at 6:02 AM Agrawal, Sanket <
> sankeagra...@deloitte.com.invalid> wrote:
>
> Hi All,
>
>
>
> Amazon inspector has detected a vulnerability in okio-1.15.0.jar JAR in
> Spark 3.4.1. It suggests to upgrade the jar version to 3.4.0. But when we
> try this version of jar then the spark application is failing with below
> error:
>
>
>
> py4j.protocol.Py4JJavaError: An error occurred while calling
> None.org.apache.spark.api.java.JavaSparkContext.
>
> : java.lang.NoClassDefFoundError: okio/BufferedSource
>
> at okhttp3.internal.Util.(Util.java:62)
>
> at okhttp3.OkHttpClient.(OkHttpClient.java:127)
>
> at okhttp3.OkHttpClient$Builder.(OkHttpClient.java:475)
>
> at
> io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory.newOkHttpClientBuilder(OkHttpClientFactory.java:41)
>
> at
> io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory.newBuilder(OkHttpClientFactory.java:56)
>
> at
> io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory.newBuilder(OkHttpClientFactory.java:68)
>
> at
> io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory.newBuilder(OkHttpClientFactory.java:30)
>
> at
> io.fabric8.kubernetes.client.KubernetesClientBuilder.getHttpClient(KubernetesClientBuilder.java:88)
>
> at
> io.fabric8.kubernetes.client.KubernetesClientBuilder.build(KubernetesClientBuilder.java:78)
>
> at
> org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:120)
>
> at
> org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:111)
>
> at
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:3037)
>
> at org.apache.spark.SparkContext.(SparkContext.scala:568)
>
> at
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
>
> at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>
> at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
> Source)
>
> at
> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> Source)
>
> at java.base/java.lang.reflect.Constructor.newInstance(Unknown
> Source)
>
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
>
> at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
>
> at py4j.Gateway.invoke(Gateway.java:238)
>
> at
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
>
> at
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
>
> at
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>
> at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>
> at java.base/java.lang.Thread.run(Unknown Source)
>
> Caused by: java.lang.ClassNotFoundException: okio.BufferedSource
>
> at
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(Unknown Source)
>
> at
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(Unknown
> Source)
>
> at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
>
>   

Re: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Sean Owen
Does the vulnerability affect Spark?
In any event, have you tried updating Okio in the Spark build? I don't
believe you could just replace the JAR, as other libraries probably rely on
it and compiled against the current version.

On Thu, Aug 31, 2023 at 6:02 AM Agrawal, Sanket
 wrote:

> Hi All,
>
>
>
> Amazon inspector has detected a vulnerability in okio-1.15.0.jar JAR in
> Spark 3.4.1. It suggests to upgrade the jar version to 3.4.0. But when we
> try this version of jar then the spark application is failing with below
> error:
>
>
>
> py4j.protocol.Py4JJavaError: An error occurred while calling
> None.org.apache.spark.api.java.JavaSparkContext.
>
> : java.lang.NoClassDefFoundError: okio/BufferedSource
>
> at okhttp3.internal.Util.(Util.java:62)
>
> at okhttp3.OkHttpClient.(OkHttpClient.java:127)
>
> at okhttp3.OkHttpClient$Builder.(OkHttpClient.java:475)
>
> at
> io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory.newOkHttpClientBuilder(OkHttpClientFactory.java:41)
>
> at
> io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory.newBuilder(OkHttpClientFactory.java:56)
>
> at
> io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory.newBuilder(OkHttpClientFactory.java:68)
>
> at
> io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory.newBuilder(OkHttpClientFactory.java:30)
>
> at
> io.fabric8.kubernetes.client.KubernetesClientBuilder.getHttpClient(KubernetesClientBuilder.java:88)
>
> at
> io.fabric8.kubernetes.client.KubernetesClientBuilder.build(KubernetesClientBuilder.java:78)
>
> at
> org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:120)
>
> at
> org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:111)
>
> at
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:3037)
>
> at org.apache.spark.SparkContext.(SparkContext.scala:568)
>
> at
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
>
> at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>
> at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
> Source)
>
> at
> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> Source)
>
> at java.base/java.lang.reflect.Constructor.newInstance(Unknown
> Source)
>
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
>
> at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
>
> at py4j.Gateway.invoke(Gateway.java:238)
>
> at
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
>
> at
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
>
> at
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>
> at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>
> at java.base/java.lang.Thread.run(Unknown Source)
>
> Caused by: java.lang.ClassNotFoundException: okio.BufferedSource
>
> at
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(Unknown Source)
>
> at
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(Unknown
> Source)
>
> at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
>
> ... 26 more
>
>
>
> Replaced the existing jar with the JAR file at
> https://repo1.maven.org/maven2/com/squareup/okio/okio/3.4.0/okio-3.4.0.jar
>
>
>
>
>
> PFB, the vulnerability details:
>
> Link: https://nvd.nist.gov/vuln/detail/CVE-2023-3635
>
>
>
> Any guidance here would be of great help.
>
>
>
> Thanks,
>
> Sanket A.
>
> This message (including any attachments) contains confidential information
> intended for a specific individual and purpose, and is protected by law. If
> you are not the intended recipient, you should delete this message and any
> disclosure, copying, or distribution of this message, or the taking of any
> action based on it, by you is strictly prohibited.
>
> Deloitte refers to a Deloitte member firm, one of its related entities, or
> Deloitte Touche Tohmatsu Limited ("DTTL"). Each Deloitte member firm is a
> separate legal entity and a member of DTTL. DTTL does not provide services
> to clients. Please see www.deloitte.com/about to learn more.
>
> v.E.1
>


Re: error trying to save to database (Phoenix)

2023-08-21 Thread Sean Owen
It is. But you have a third party library in here which seems to require a
different version.

On Mon, Aug 21, 2023, 7:04 PM Kal Stevens  wrote:

> OK, it was my impression that scala was packaged with Spark to avoid a
> mismatch
> https://spark.apache.org/downloads.html
>
> It looks like spark 3.4.1 (my version) uses scala Scala 2.12
> How do I specify the scala version?
>
> On Mon, Aug 21, 2023 at 4:47 PM Sean Owen  wrote:
>
>> That's a mismatch in the version of scala that your library uses vs spark
>> uses.
>>
>> On Mon, Aug 21, 2023, 6:46 PM Kal Stevens  wrote:
>>
>>> I am having a hard time figuring out what I am doing wrong here.
>>> I am not sure if I have an incompatible version of something installed
>>> or something else.
>>> I can not find anything relevant in google to figure out what I am doing
>>> wrong
>>> I am using *spark 3.4.1*, and *python3.10*
>>>
>>> This is my code to save my dataframe
>>> urls = []
>>> pull_sitemap_xml(robot, urls)
>>> df = spark.createDataFrame(data=urls, schema=schema)
>>> df.write.format("org.apache.phoenix.spark") \
>>> .mode("overwrite") \
>>> .option("table", "property") \
>>> .option("zkUrl", "192.168.1.162:2181") \
>>> .save()
>>>
>>> urls is an array of maps, containing a "url" and a "last_mod" field.
>>>
>>> Here is the error that I am getting
>>>
>>> Traceback (most recent call last):
>>>
>>>   File "/home/kal/real-estate/pullhttp/pull_properties.py", line 65, in
>>> main
>>>
>>> .save()
>>>
>>>   File
>>> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
>>> line 1396, in save
>>>
>>> self._jwrite.save()
>>>
>>>   File
>>> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
>>> line 1322, in __call__
>>>
>>> return_value = get_return_value(
>>>
>>>   File
>>> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py",
>>> line 169, in deco
>>>
>>> return f(*a, **kw)
>>>
>>>   File
>>> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
>>> line 326, in get_return_value
>>>
>>> raise Py4JJavaError(
>>>
>>> py4j.protocol.Py4JJavaError: An error occurred while calling o636.save.
>>>
>>> : java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps
>>> scala.Predef$.refArrayOps(java.lang.Object[])'
>>>
>>> at
>>> org.apache.phoenix.spark.DataFrameFunctions.getFieldArray(DataFrameFunctions.scala:76)
>>>
>>> at
>>> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:35)
>>>
>>> at
>>> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:28)
>>>
>>> at
>>> org.apache.phoenix.spark.DefaultSource.createRelation(DefaultSource.scala:47)
>>>
>>> at
>>> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
>>>
>>> at
>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>>>
>>> at
>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>>>
>>


Re: error trying to save to database (Phoenix)

2023-08-21 Thread Sean Owen
That's a mismatch in the version of scala that your library uses vs spark
uses.

On Mon, Aug 21, 2023, 6:46 PM Kal Stevens  wrote:

> I am having a hard time figuring out what I am doing wrong here.
> I am not sure if I have an incompatible version of something installed or
> something else.
> I can not find anything relevant in google to figure out what I am doing
> wrong
> I am using *spark 3.4.1*, and *python3.10*
>
> This is my code to save my dataframe
> urls = []
> pull_sitemap_xml(robot, urls)
> df = spark.createDataFrame(data=urls, schema=schema)
> df.write.format("org.apache.phoenix.spark") \
> .mode("overwrite") \
> .option("table", "property") \
> .option("zkUrl", "192.168.1.162:2181") \
> .save()
>
> urls is an array of maps, containing a "url" and a "last_mod" field.
>
> Here is the error that I am getting
>
> Traceback (most recent call last):
>
>   File "/home/kal/real-estate/pullhttp/pull_properties.py", line 65, in
> main
>
> .save()
>
>   File
> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
> line 1396, in save
>
> self._jwrite.save()
>
>   File
> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
> line 1322, in __call__
>
> return_value = get_return_value(
>
>   File
> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py",
> line 169, in deco
>
> return f(*a, **kw)
>
>   File
> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
> line 326, in get_return_value
>
> raise Py4JJavaError(
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o636.save.
>
> : java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps
> scala.Predef$.refArrayOps(java.lang.Object[])'
>
> at
> org.apache.phoenix.spark.DataFrameFunctions.getFieldArray(DataFrameFunctions.scala:76)
>
> at
> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:35)
>
> at
> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:28)
>
> at
> org.apache.phoenix.spark.DefaultSource.createRelation(DefaultSource.scala:47)
>
> at
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>


Re: Spark Vulnerabilities

2023-08-14 Thread Sean Owen
Yeah, we generally don't respond to "look at the output of my static
analyzer".
Some of these are already addressed in a later version.
Some don't affect Spark.
Some are possibly an issue but hard to change without breaking lots of
things - they are really issues with upstream dependencies.

But for any you find that seem possibly relevant, that are directly
fixable, yes please open a PR with the change and your reasoning.

On Mon, Aug 14, 2023 at 7:42 AM Bjørn Jørgensen 
wrote:

> I have added links to the GitHub PRs, or a comment for those that I have not
> seen before.
>
> Apache Spark has very many dependencies, some can easily be upgraded while
> others are very hard to fix.
>
> Please feel free to open a PR if you wanna help.
>
> man. 14. aug. 2023 kl. 14:06 skrev Sankavi Nagalingam
> :
>
>> Hi Team,
>>
>>
>>
>> We could see there are many dependent vulnerabilities present in the
>> latest spark-core:3.4.1.jar. PFA
>>
>> Could you please let us know when the fix version will be available for
>> the users.
>>
>>
>>
>> Thanks,
>>
>> Sankavi
>>
>>
>>
>> The information in this e-mail and any attachments is confidential and
>> may be legally privileged. It is intended solely for the addressee or
>> addressees. Any use or disclosure of the contents of this
>> e-mail/attachments by a not intended recipient is unauthorized and may be
>> unlawful. If you have received this e-mail in error please notify the
>> sender. Please note that any views or opinions presented in this e-mail are
>> solely those of the author and do not necessarily represent those of
>> TEMENOS. We recommend that you check this e-mail and any attachments
>> against viruses. TEMENOS accepts no liability for any damage caused by any
>> malicious code or virus transmitted by this e-mail.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: conver panda image column to spark dataframe

2023-08-03 Thread Sean Owen
pp4 has one row, I'm guessing - containing an array of 10 images. You want
10 rows of 1 image each.
But, just don't do this. Pass the bytes of the image as an array,
along with width/height/channels, and reshape it on use. It's just easier.
That is how the Spark image representation works anyway
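A hedged sketch of that flattened representation (the fake image, schema and
column names are illustrative assumptions, not code from this thread):

    import numpy as np
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, BinaryType,
                                   IntegerType)

    spark = SparkSession.builder.getOrCreate()

    img = np.zeros((500, 333, 3), dtype=np.uint8)   # stand-in for one image
    row = (bytearray(img.tobytes()), 500, 333, 3)   # bytes + dimensions

    schema = StructType([
        StructField("data", BinaryType(), False),
        StructField("height", IntegerType(), False),
        StructField("width", IntegerType(), False),
        StructField("channels", IntegerType(), False),
    ])
    df = spark.createDataFrame([row], schema)

    # Reshape back to an ndarray only where the pixels are actually needed
    r = df.first()
    restored = np.frombuffer(bytes(r["data"]), dtype=np.uint8).reshape(
        r["height"], r["width"], r["channels"])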

On Thu, Aug 3, 2023 at 8:43 PM second_co...@yahoo.com.INVALID
 wrote:

> Hello Adrian,
>
>   here is the snippet
>
> import tensorflow_datasets as tfds
>
> (ds_train, ds_test), ds_info = tfds.load(
> dataset_name, data_dir='',  split=["train",
> "test"], with_info=True, as_supervised=True
> )
>
> schema = StructType([
> StructField("image",
> ArrayType(ArrayType(ArrayType(IntegerType()))), nullable=False),
> StructField("label", IntegerType(), nullable=False)
> ])
> pp4 =
> spark.createDataFrame(pd.DataFrame(tfds.as_dataframe(ds_train.take(4),
> ds_info)), schema)
>
>
>
> raised error
>
> , TypeError: field image: ArrayType(ArrayType(ArrayType(IntegerType(), True), 
> True), True) can not accept object array([[[14, 14, 14],
> [14, 14, 14],
> [14, 14, 14],
> ...,
> [19, 17, 20],
> [19, 17, 20],
> [19, 17, 20]],
>
>
>
>
>
> On Thursday, August 3, 2023 at 11:34:08 PM GMT+8, Adrian Pop-Tifrea <
> poptifreaadr...@gmail.com> wrote:
>
>
> Hello,
>
> can you also please show us how you created the pandas dataframe? I mean,
> how you added the actual data into the dataframe. It would help us for
> reproducing the error.
>
> Thank you,
> Pop-Tifrea Adrian
>
> On Mon, Jul 31, 2023 at 5:03 AM second_co...@yahoo.com <
> second_co...@yahoo.com> wrote:
>
> i changed to
>
> ArrayType(ArrayType(ArrayType(IntegerType()))) , still get same error
>
> Thank you for responding
>
> On Thursday, July 27, 2023 at 06:58:09 PM GMT+8, Adrian Pop-Tifrea <
> poptifreaadr...@gmail.com> wrote:
>
>
> Hello,
>
> when you said your pandas Dataframe has 10 rows, does that mean it
> contains 10 images? Because if that's the case, then you'd want ro only use
> 3 layers of ArrayType when you define the schema.
>
> Best regards,
> Adrian
>
>
>
> On Thu, Jul 27, 2023, 11:04 second_co...@yahoo.com.INVALID
>  wrote:
>
> i have panda dataframe with column 'image' using numpy.ndarray. shape is (500,
> 333, 3) per image. my panda dataframe has 10 rows, thus, shape is (10,
> 500, 333, 3)
>
> when using spark.createDataframe(panda_dataframe, schema), i need to
> specify the schema,
>
> schema = StructType([
> StructField("image",
> ArrayType(ArrayType(ArrayType(ArrayType(IntegerType())))), nullable=False)
> ])
>
>
> i get error
>
> raise TypeError(
> , TypeError: field image: 
> ArrayType(ArrayType(ArrayType(ArrayType(IntegerType(), True), True), True), 
> True) can not accept object array([[[14, 14, 14],
>
> ...
>
> Can advise how to set schema for image with numpy.ndarray ?
>
>
>
>


Re: Interested in contributing to SPARK-24815

2023-08-03 Thread Sean Owen
Formally, an ICLA is required, and you can read more here:
https://www.apache.org/licenses/contributor-agreements.html

In practice, it's unrealistic to collect and verify an ICLA for every PR
contributed by 1000s of people. We have not gated on that.
But, contributions are in all cases governed by the same terms, even
without a signed ICLA. That's the verbiage you're referring to.
A CLA is a good idea, for sure, if there are any questions about the terms
of your contribution.

Here there does seem to be a question - retaining Twilio copyright headers
in source code. That is generally not what would happen for your everyday
contributions to an ASF project, as the copyright header (and CLAs) already
describe the relevant questions of rights: it has been licensed to the ASF.
(There are other situations where retaining a distinct copyright header is
required, typically when adding code licensed under another OSS license,
but I don't think they apply here)

I would say you should review and execute a CCLA for Twilio (assuming you
agree with the terms) to avoid doubt.


On Thu, Aug 3, 2023 at 6:34 PM Rinat Shangeeta 
wrote:

> (Adding my manager Eugene Kim who will cover me as I plan to be out of the
> office soon)
>
> Hi Kent and Sean,
>
> Nice to meet you. I am working on the OSS legal aspects with Pavan who is
> planning to make the contribution request to the Spark project. I saw that
> Sean mentioned in his email that the contributions would be governed under
> the ASF CCLA. In the Spark contribution guidelines
> <https://spark.apache.org/contributing.html>, there is no mention of
> having to sign a CCLA. In fact, this is what I found in the contribution
> guidelines:
>
> Contributing code changes
>
> Please review the preceding section before proposing a code change. This
> section documents how to do so.
>
> When you contribute code, you affirm that the contribution is your
> original work and that you license the work to the project under the
> project’s open source license. Whether or not you state this explicitly,
> by submitting any copyrighted material via pull request, email, or other
> means you agree to license the material under the project’s open source
> license and warrant that you have the legal authority to do so.
>
> Can you please point us to an authoritative source about the process?
>
> Also, is there a way to find out if a signed CCLA already exists for
> Twilio from your end? Thanks and appreciate your help!
>
>
> Best,
> Rinat
>
> *Rinat Shangeeta*
> Sr. Patent/Open Source Counsel
> [image: Twilio] <https://www.twilio.com/?utm_source=email_signature>
>
>
> On Wed, Jul 26, 2023 at 2:27 PM Pavan Kotikalapudi <
> pkotikalap...@twilio.com> wrote:
>
>> Thanks for the response with all the information Sean and Kent.
>>
>> Is there a way to figure out if my employer (Twilio) part of CCLA?
>>
>> cc'ing: @Rinat Shangeeta  our Open Source Counsel
>> at twilio
>>
>> Thank you,
>>
>> Pavan
>>
>> On Tue, Jul 25, 2023 at 10:48 PM Kent Yao  wrote:
>>
>>> Hi Pavan,
>>>
>>> Refer to the ASF Source Header and Copyright Notice Policy[1], code
>>> directly submitted to ASF should include the Apache license header
>>> without any additional copyright notice.
>>>
>>>
>>> Kent Yao
>>>
>>> [1]
>>> https://urldefense.com/v3/__https://www.apache.org/legal/src-headers.html*headers__;Iw!!NCc8flgU!c_mZKzBbSjJtYRjillV20gRzzzDOgW2ooH6ctfrqaJA8Eu4D5yfA7OlQnGm5JpdAZIU_doYmrsufzUc$
>>>
>>> Sean Owen  于2023年7月25日周二 07:22写道:
>>>
>>> >
>>> > When contributing to an ASF project, it's governed by the terms of the
>>> ASF ICLA:
>>> https://urldefense.com/v3/__https://www.apache.org/licenses/icla.pdf__;!!NCc8flgU!c_mZKzBbSjJtYRjillV20gRzzzDOgW2ooH6ctfrqaJA8Eu4D5yfA7OlQnGm5JpdAZIU_doYmZDPppZg$
>>> or CCLA:
>>> https://urldefense.com/v3/__https://www.apache.org/licenses/cla-corporate.pdf__;!!NCc8flgU!c_mZKzBbSjJtYRjillV20gRzzzDOgW2ooH6ctfrqaJA8Eu4D5yfA7OlQnGm5JpdAZIU_doYmUNwE-5A$
>>> >
>>> > I don't believe ASF projects ever retain an original author copyright
>>> statement, but rather source files have a statement like:
>>> >
>>> > ...
>>> >  * Licensed to the Apache Software Foundation (ASF) under one or more
>>> >  * contributor license agreements.  See the NOTICE file distributed
>>> with
>>> >  * this work for additional information regarding copyright ownership.
>>> > ...
>>> >
>>> > While it's conceivable that such a statement could live in a NOTICE
>>> file, I don't bel

Re: spark context list_packages()

2023-07-27 Thread Sean Owen
There is no such method in Spark. I think that's some EMR-specific
modification.

On Wed, Jul 26, 2023 at 11:06 PM second_co...@yahoo.com.INVALID
 wrote:

> I ran the following code
>
> spark.sparkContext.list_packages()
>
> on spark 3.4.1 and i get below error
>
> An error was encountered:
> AttributeError
> [Traceback (most recent call last):
> ,   File "/tmp/spark-3d66c08a-08a3-4d4e-9fdf-45853f65e03d/shell_wrapper.py", 
> line 113, in exec
> self._exec_then_eval(code)
> ,   File "/tmp/spark-3d66c08a-08a3-4d4e-9fdf-45853f65e03d/shell_wrapper.py", 
> line 106, in _exec_then_eval
> exec(compile(last, '', 'single'), self.globals)
> ,   File "", line 1, in 
> , AttributeError: 'SparkContext' object has no attribute 'list_packages'
> ]
>
>
> Is list_packages and install_pypi_package available for vanilla spark or
> only available for AWS services?
>
>
> Thank you
>


Re: Interested in contributing to SPARK-24815

2023-07-24 Thread Sean Owen
When contributing to an ASF project, it's governed by the terms of the ASF
ICLA: https://www.apache.org/licenses/icla.pdf or CCLA:
https://www.apache.org/licenses/cla-corporate.pdf

I don't believe ASF projects ever retain an original author copyright
statement, but rather source files have a statement like:

...
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
...

While it's conceivable that such a statement could live in a NOTICE file, I
don't believe that's been done for any of the thousands of other
contributors. That's really more for noting the license of
non-Apache-licensed code. Code directly contributed to the project is
assumed to have been licensed per above already.

It might be wise to review the CCLA with Twilio and consider establishing
that to govern contributions.

On Mon, Jul 24, 2023 at 6:10 PM Pavan Kotikalapudi
 wrote:

> Hi Spark Dev,
>
> My name is Pavan Kotikalapudi, I work at Twilio.
>
> I am looking to contribute to this spark issue
> https://issues.apache.org/jira/browse/SPARK-24815.
>
> There is a clause from the company's OSS saying
>
> - The proposed contribution is about 100 lines of code modification in the
> Spark project, involving two files - this is considered a large
> contribution. An appropriate Twilio copyright notice needs to be added for
> the portion of code that is newly added.
>
> Please let me know if that is acceptable?
>
> Thank you,
>
> Pavan
>
>


Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
No, a pandas on Spark DF is distributed.

On Tue, Jun 20, 2023, 1:45 PM Mich Talebzadeh 
wrote:

> Thanks but if you create a Spark DF from Pandas DF that Spark DF is not
> distributed and remains on the driver. I recall a while back we had this
> conversation. I don't think anything has changed.
>
> Happy to be corrected
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 20 Jun 2023 at 20:09, Bjørn Jørgensen 
> wrote:
>
>> Pandas API on spark is an API so that users can use spark as they use
>> pandas. This was known as koalas.
>>
>> Is this limitation still valid for Pandas?
>> For pandas, yes. But what I did show was the pandas API on Spark, so it's Spark.
>>
>>  Additionally when we convert from Panda DF to Spark DF, what process is
>> involved under the bonnet?
>> I guess pyarrow, and dropping the index column.
>>
>> Have a look at
>> https://github.com/apache/spark/tree/master/python/pyspark/pandas
>>
>> tir. 20. juni 2023 kl. 19:05 skrev Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>> Whenever someone mentions Pandas I automatically think of it as an excel
>>> sheet for Python.
>>>
>>> OK my point below needs some qualification
>>>
>>> Why Spark here. Generally, parallel architecture comes into play when
>>> the data size is significantly large which cannot be handled on a single
>>> machine, hence, the use of Spark becomes meaningful. In cases where (the
>>> generated) data size is going to be very large (which is often the norm rather
>>> than the exception these days), the data cannot be processed and stored in
>>> Pandas data frames as these data frames store data in RAM. Then, the whole
>>> dataset from a storage like HDFS or cloud storage cannot be collected,
>>> because it will take significant time and space and probably won't fit in a
>>> single machine RAM. (in this the driver memory)
>>>
>>> Is this limitation still valid for Pandas? Additionally when we convert
>>> from Panda DF to Spark DF, what process is involved under the bonnet?
>>>
>>> Thanks
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 20 Jun 2023 at 13:07, Bjørn Jørgensen 
>>> wrote:
>>>
 This is pandas API on spark

 from pyspark import pandas as ps
 df = ps.read_excel("testexcel.xlsx")
 [image: image.png]
 this will convert it to pyspark
 [image: image.png]

 tir. 20. juni 2023 kl. 13:42 skrev John Paul Jayme
 :

> Good day,
>
>
>
> I have a task to read excel files in databricks but I cannot seem to
> proceed. I am referencing the API documents -  read_excel
> 
> , but there is an error sparksession object has no attribute
> 'read_excel'. Can you advise?
>
>
>
> *JOHN PAUL JAYME*
> Data Engineer
>
> m. +639055716384  w. www.tdcx.com
>
>
>
> *Winner of over 350 Industry Awards*
>
> [image: Linkedin]  [image:
> Facebook]  [image: Twitter]
>  [image: Youtube]
>  [image: Instagram]
> 
>
>
>
> This is a confidential email that may be privileged or legally
> protected. You are not authorized to copy or disclose the contents of this
> email. If you are not the intended addressee, please inform the sender and
> delete this email.
>
>
>
>
>


 --
 Bjørn Jørgensen
 Vestre Aspehaug 4, 6010 Ålesund
 Norge

 +47 480 94 297

>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>


Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
It is indeed not part of SparkSession. See the link you cite. It is part of
the pyspark pandas API
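A minimal sketch of what that looks like (the file name is a placeholder, and
reading .xlsx this way also needs the openpyxl package installed):

    import pyspark.pandas as ps

    psdf = ps.read_excel("some_file.xlsx")  # pandas-on-Spark DataFrame
    sdf = psdf.to_spark()                   # plain Spark DataFrame, if needed
    sdf.show()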

On Tue, Jun 20, 2023, 5:42 AM John Paul Jayme 
wrote:

> Good day,
>
>
>
> I have a task to read excel files in databricks but I cannot seem to
> proceed. I am referencing the API documents -  read_excel
> 
> , but there is an error sparksession object has no attribute
> 'read_excel'. Can you advise?
>
>
>
> *JOHN PAUL JAYME*
> Data Engineer
>
> m. +639055716384  w. www.tdcx.com
>
>
>
> *Winner of over 350 Industry Awards*
>
> [image: Linkedin]  [image:
> Facebook]  [image: Twitter]
>  [image: Youtube]
>  [image: Instagram]
> 
>
>
>
> This is a confidential email that may be privileged or legally protected.
> You are not authorized to copy or disclose the contents of this email. If
> you are not the intended addressee, please inform the sender and delete
> this email.
>
>
>
>
>


Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Sean Owen
You sure it is not just that it's displaying in your local TZ? Check the
actual value as a long for example. That is likely the same time.

On Thu, Jun 8, 2023, 5:50 PM karan alang  wrote:

> ref :
> https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-from-mongodb-correctly
>
> Hello All,
> I've data stored in MongoDB collection and the timestamp column is not
> being read by Apache Spark correctly. I'm running Apache Spark on GCP
> Dataproc.
>
> Here is sample data :
>
> -
>
> In Mongo :
>
> timeslot   | timeslot_date
> -----------+------------------------
> 1683527400 | {2023-05-08T06:30:00Z}
>
>
> When I use pyspark to read this :
>
> timeslot   | timeslot_date
> -----------+---------------------
> 1683527400 | 2023-05-07 23:30:00
>
> -
>
> My understanding is that the data in Mongo is in UTC format, i.e. 2023-05-08T06:30:00Z
> is UTC. I'm in the PST timezone. I'm not clear why Spark is reading it in
> a different timezone (neither PST nor UTC). Note - it is not reading it
> as the PST timezone; if it was doing that it would advance the time by 7 hours,
> instead it is doing the opposite.
>
> Where is the default timezone format taken from, when Spark is reading data 
> from MongoDB ?
>
> Any ideas on this ?
>
> tia!
>
>
>
>
>


Re: JDK version support information

2023-05-29 Thread Sean Owen
Per docs, it is Java 8. It's possible Java 11 partly works with 2.x but not
supported. But then again 2.x is not supported either.

On Mon, May 29, 2023, 6:43 AM Poorna Murali  wrote:

> We are currently using JDK 11 and spark 2.4.5.1 is working fine with that.
> So, we wanted to check the maximum JDK version supported for 2.4.5.1.
>
> On Mon, 29 May, 2023, 5:03 pm Aironman DirtDiver, 
> wrote:
>
>> Spark version 2.4.5.1 is based on Apache Spark 2.4.5. According to the
>> official Spark documentation for version 2.4.5, the maximum supported JDK
>> (Java Development Kit) version is JDK 8 (Java 8).
>>
>> Spark 2.4.5 is not compatible with JDK versions higher than Java 8.
>> Therefore, you should use JDK 8 to ensure compatibility and avoid any
>> potential issues when running Spark 2.4.5.
>>
>> El lun, 29 may 2023 a las 13:28, Poorna Murali ()
>> escribió:
>>
>>> Hi,
>>>
>>> We are using spark version 2.4.5.1. We would like to know the maximum
>>> JDK version supported for the same.
>>>
>>> Thanks,
>>> Poorna
>>>
>>
>>
>> --
>> Alonso Isidoro Roman
>> [image: https://]about.me/alonso.isidoro.roman
>>
>> 
>>
>


Re: [MLlib] how-to find implementation of Decision Tree Regressor fit function

2023-05-25 Thread Sean Owen
Are you looking for
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala


On Thu, May 25, 2023 at 6:54 AM Max 
wrote:

> Good day, I'm working on an implementation of Joint Probability Trees
> (JPT) using the Spark framework. For this to be as efficient as possible, I
> tried to find the implementation of the code for the fit function of
> Decision Tree Regressors. Unfortunately, I had little success in finding
> the specifics on how to handle the RDDs in the documentation or on
> GitHub. Hence, I am asking for a pointer on where these documents are. For
> context, here are the links to the JPT GitHub and the article:
> https://github.com/joint-probability-trees
> https://arxiv.org/abs/2302.07167 Thanks in advance. Sincerely, Maximilian
> Neumann
>


Re: Tensorflow on Spark CPU

2023-04-30 Thread Sean Owen
There is a large overhead to distributing this type of workload. I imagine
that for a small problem, the overhead dominates. You do not nearly need to
distribute a problem of this size, so more workers are probably just worse.

On Sun, Apr 30, 2023 at 1:46 AM second_co...@yahoo.com <
second_co...@yahoo.com> wrote:

> I re-tested with the cifar10 example and below is the result. Can you
> advise why a lower num_slots is faster compared with more slots?
>
> num_slots=20
>
> 231 seconds
>
>
> num_slots=5
>
> 52 seconds
>
>
> num_slot=1
>
> 34 seconds
>
> the code is at below
> https://gist.github.com/cometta/240bbc549155e22f80f6ba670c9a2e32
>
> Do you have an example of tensorflow+big dataset that I can test?
>
>
>
>
>
>
>
> On Saturday, April 29, 2023 at 08:44:04 PM GMT+8, Sean Owen <
> sro...@gmail.com> wrote:
>
>
> You don't want to use CPUs with Tensorflow.
> If it's not scaling, you may have a problem that is far too small to
> distribute.
>
> On Sat, Apr 29, 2023 at 7:30 AM second_co...@yahoo.com.INVALID
>  wrote:
>
> Has anyone successfully run native tensorflow on Spark? I tested the example at
> https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor
> on Kubernetes CPU, running it on multiple workers' CPUs. I do not see
> any speed up in training time by setting the number of slots from 1 to 10. The
> time taken to train is still the same. Has anyone tested tensorflow training on
> Spark distributed workers with CPUs? Can you share your working example?
>
>
>
>
>
>


Re: Tensorflow on Spark CPU

2023-04-29 Thread Sean Owen
You don't want to use CPUs with Tensorflow.
If it's not scaling, you may have a problem that is far too small to
distribute.

On Sat, Apr 29, 2023 at 7:30 AM second_co...@yahoo.com.INVALID
 wrote:

> Has anyone successfully run native tensorflow on Spark? I tested the example at
> https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor
> on Kubernetes CPU, running it on multiple workers' CPUs. I do not see
> any speed up in training time by setting the number of slots from 1 to 10. The
> time taken to train is still the same. Has anyone tested tensorflow training on
> Spark distributed workers with CPUs? Can you share your working example?
>
>
>
>
>
>


Re: Looping through a series of telephone numbers

2023-04-02 Thread Sean Owen
That won't work, you can't use Spark within Spark like that.
If it were exact matches, the best solution would be to load both datasets
and join on telephone number.
For this case, I think your best bet is a UDF that contains the telephone
numbers as a list and decides whether a given number matches something in
the set. Then use that to filter, then work with the data set.
There are probably clever fast ways of efficiently determining if a string
is a prefix of a group of strings in Python you could use too.
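A rough sketch of that UDF idea (the numbers, column name and matching rule
are invented for illustration; adjust the prefix logic to your data):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import BooleanType

    spark = SparkSession.builder.getOrCreate()

    lookup = ["331222555", "331223666"]       # e.g. loaded from the CSV file

    def matches_any(number):
        if number is None:
            return False
        digits = number.lstrip("+")           # tolerate a leading "+"
        return any(tel in digits for tel in lookup)

    matches_udf = F.udf(matches_any, BooleanType())

    ds = spark.createDataFrame([("+331222555",), ("0700000000",)], ["tel"])
    ds.filter(matches_udf("tel")).show()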

On Sun, Apr 2, 2023 at 3:17 AM Philippe de Rochambeau 
wrote:

> Many thanks, Mich.
> Is « foreach »  the best construct to  lookup items is a dataset  such as
> the below «  telephonedirectory » data set?
>
> val telrdd = spark.sparkContext.parallelize(Seq("tel1", "tel2",
> "tel3", ...)) // the telephone sequence
>
> // was read from a CSV file
>
> val ds = spark.read.parquet("/path/to/telephonedirectory")
>
>   rdd.foreach(tel => {
>     longAcc.select("*").rlike("+" + tel)
>   })
>
>
>
>
> Le 1 avr. 2023 à 22:36, Mich Talebzadeh  a
> écrit :
>
> This may help
>
> Spark rlike() Working with Regex Matching Examples
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau 
> wrote:
>
>> Hello,
>> I’m looking for an efficient way in Spark to search for a series of
>> telephone numbers, contained in a CSV file, in a data set column.
>>
>> In pseudo code,
>>
>> for tel in [tel1, tel2, ... tel40,000]
>> search for tel in dataset using .like("%tel%")
>> end for
>>
>> I’m using the like function because the telephone numbers in the data set
>> may contain prefixes, such as "+"; e.g., "+331222".
>>
>> Any suggestions would be welcome.
>>
>> Many thanks.
>>
>> Philippe
>>
>>
>>
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: What is the range of the PageRank value of graphx

2023-03-28 Thread Sean Owen
From the docs:

 * Note that this is not the "normalized" PageRank and as a consequence
pages that have no
 * inlinks will have a PageRank of alpha. In particular, the pageranks may
have some values
 * greater than 1.
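
For illustration, if you want values on the normalized scale (summing to 1) you
can divide by the total yourself; a small sketch with toy numbers standing in
for the (id, pagerank) output of pageRank:

# Sketch: rescaling un-normalized PageRank values so they sum to 1.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

ranks = spark.createDataFrame([("a", 1.7), ("b", 0.9), ("c", 0.4)], ["id", "pagerank"])

total = ranks.agg(F.sum("pagerank")).first()[0]
ranks.withColumn("pagerank_normalized", F.col("pagerank") / F.lit(total)).show()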

On Tue, Mar 28, 2023 at 9:11 AM lee  wrote:

> When I calculate pagerank using HugeGraph, each pagerank value is less
> than 1, and the total of pageranks is 1. However, the PageRank value of
> graphx is often greater than 1, so what is the range of the PageRank value
> of graphx?
>
>
>
>
>
>
> 李杰
> leedd1...@163.com
>  
> 
>
>


Re: Question related to asynchronously map transformation using java spark structured streaming

2023-03-26 Thread Sean Owen
What do you mean by asynchronously here?

On Sun, Mar 26, 2023, 10:22 AM Emmanouil Kritharakis <
kritharakismano...@gmail.com> wrote:

> Hello again,
>
> Do we have any news for the above question?
> I would really appreciate it.
>
> Thank you,
>
> --
>
> Emmanouil (Manos) Kritharakis
>
> Ph.D. candidate in the Department of Computer Science
> 
>
> Boston University
>
>
> On Tue, Mar 14, 2023 at 12:04 PM Emmanouil Kritharakis <
> kritharakismano...@gmail.com> wrote:
>
>> Hello,
>>
>> I hope this email finds you well!
>>
>> I have a simple dataflow in which I read from a kafka topic, perform a
>> map transformation and then I write the result to another topic. Based on
>> your documentation here
>> ,
>> I need to work with Dataset data structures. Even though my solution works,
>> I need to utilize map transformation asynchronously. So my question is how
>> can I asynchronously call map transformation with Dataset data structures
>> in a java structured streaming environment? Can you please share a working
>> example?
>>
>> I am looking forward to hearing from you as soon as possible. Thanks in
>> advance!
>>
>> Kind regards
>>
>> --
>>
>> Emmanouil (Manos) Kritharakis
>>
>> Ph.D. candidate in the Department of Computer Science
>> 
>>
>> Boston University
>>
>


Re: Kind help request

2023-03-25 Thread Sean Owen
It is telling you that the UI can't bind to any port. I presume that's
because of container restrictions?
If you don't want the UI at all, just set spark.ui.enabled to false
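
For illustration, a minimal sketch of disabling it when you build the session
yourself (the same setting can be passed as --conf spark.ui.enabled=false to
whatever launches Spark for you; the app name is a placeholder):

# Sketch: spark.ui.enabled must be set before the SparkContext is created.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("no-ui")                      # placeholder name
         .config("spark.ui.enabled", "false")   # never tries to bind a UI port
         .getOrCreate())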

On Sat, Mar 25, 2023 at 8:28 AM Lorenzo Ferrando <
lorenzo.ferra...@edu.unige.it> wrote:

> Dear Spark team,
>
> I am Lorenzo from University of Genoa. I am currently using (ubuntu 18.04)
> the nextflow/sarek pipeline to analyse genomic data through a singularity
> container. One of the step of the pipeline uses GATK4 and it implements
>  Spark. However, after some time I get the following error:
>
>
> 23:27:48.112 INFO  NativeLibraryLoader - Loading libgkl_compression.so from 
> jar:file:/gatk/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
> 23:27:48.523 INFO  ApplyBQSRSpark - 
> 
> 23:27:48.524 INFO  ApplyBQSRSpark - The Genome Analysis Toolkit (GATK) 
> v4.2.6.1
> 23:27:48.524 INFO  ApplyBQSRSpark - For support and documentation go to 
> https://software.broadinstitute.org/gatk/
> 23:27:48.525 INFO  ApplyBQSRSpark - Executing as ferrandl@alucard on Linux 
> v5.4.0-91-generic amd64
> 23:27:48.525 INFO  ApplyBQSRSpark - Java runtime: OpenJDK 64-Bit Server VM 
> v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
> 23:27:48.526 INFO  ApplyBQSRSpark - Start Date/Time: March 24, 2023 11:27:47 
> PM GMT
> 23:27:48.526 INFO  ApplyBQSRSpark - 
> 
> 23:27:48.526 INFO  ApplyBQSRSpark - 
> 
> 23:27:48.527 INFO  ApplyBQSRSpark - HTSJDK Version: 2.24.1
> 23:27:48.527 INFO  ApplyBQSRSpark - Picard Version: 2.27.1
> 23:27:48.527 INFO  ApplyBQSRSpark - Built for Spark Version: 2.4.5
> 23:27:48.527 INFO  ApplyBQSRSpark - HTSJDK Defaults.COMPRESSION_LEVEL : 2
> 23:27:48.527 INFO  ApplyBQSRSpark - HTSJDK 
> Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
> 23:27:48.527 INFO  ApplyBQSRSpark - HTSJDK 
> Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
> 23:27:48.527 INFO  ApplyBQSRSpark - HTSJDK 
> Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
> 23:27:48.527 INFO  ApplyBQSRSpark - Deflater: IntelDeflater
> 23:27:48.528 INFO  ApplyBQSRSpark - Inflater: IntelInflater
> 23:27:48.528 INFO  ApplyBQSRSpark - GCS max retries/reopens: 20
> 23:27:48.528 INFO  ApplyBQSRSpark - Requester pays: disabled
> 23:27:48.528 WARN  ApplyBQSRSpark -
>
>
>
>Warning: ApplyBQSRSpark is a BETA tool and is not yet ready for use in 
> production
>
>
>
>
> 23:27:48.528 INFO  ApplyBQSRSpark - Initializing engine
> 23:27:48.528 INFO  ApplyBQSRSpark - Done initializing engine
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 23/03/24 23:27:49 INFO SparkContext: Running Spark version 2.4.5
> 23/03/24 23:27:49 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 23/03/24 23:27:50 INFO SparkContext: Submitted application: ApplyBQSRSpark
> 23/03/24 23:27:50 INFO SecurityManager: Changing view acls to: ferrandl
> 23/03/24 23:27:50 INFO SecurityManager: Changing modify acls to: ferrandl
> 23/03/24 23:27:50 INFO SecurityManager: Changing view acls groups to:
> 23/03/24 23:27:50 INFO SecurityManager: Changing modify acls groups to:
> 23/03/24 23:27:50 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(ferrandl); 
> groups with view permissions: Set(); users  with modify permissions: 
> Set(ferrandl); groups with modify permissions: Set()
> 23/03/24 23:27:50 INFO Utils: Successfully started service 'sparkDriver' on 
> port 46757.
> 23/03/24 23:27:50 INFO SparkEnv: Registering MapOutputTracker
> 23/03/24 23:27:50 INFO SparkEnv: Registering BlockManagerMaster
> 23/03/24 23:27:50 INFO BlockManagerMasterEndpoint: Using 
> org.apache.spark.storage.DefaultTopologyMapper for getting topology 
> information
> 23/03/24 23:27:50 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint 
> up
> 23/03/24 23:27:50 INFO DiskBlockManager: Created local directory at 
> /home/ferrandl/projects/ribas_reanalysis/sarek/work/27/89b7451fcac6fd31461885b5774752/blockmgr-e76f7d59-da0b-4e62-8a99-3cdb23f11ae6
> 23/03/24 23:27:50 INFO MemoryStore: MemoryStore started with capacity 2004.6 
> MB
> 23/03/24 23:27:50 INFO SparkEnv: Registering OutputCommitCoordinator
> 23/03/24 23:27:51 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> 23/03/24 23:27:51 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
> Attempting port 4042.
> 23/03/24 23:27:51 WARN Utils: Service 'SparkUI' could not bind on port 4042. 
> Attempting port 4043.
> 23/03/24 23:27:51 WARN Utils: Service 'SparkUI' could not bind on port 4043. 
> Attempting port 4044.
> 23/03/24 23:27:51 WARN Utils: Service 

Re: Question related to parallelism using structured streaming parallelism

2023-03-21 Thread Sean Owen
Yes more specifically, you can't ask for executors once the app starts,
in SparkConf like that. You set this when you launch it against a Spark
cluster in spark-submit or otherwise.
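
For illustration, a sketch of where such settings belong: on spark-submit
(--num-executors / --conf), or on the builder before the session exists. The
numbers below are placeholders.

# Sketch: executor resources are requested at launch, not from inside a running app.
# Equivalent to: spark-submit --conf spark.executor.instances=4 ...
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parallelism-example")            # placeholder
         .config("spark.executor.instances", "4")   # honoured only at launch
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "4g")
         .getOrCreate())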

On Tue, Mar 21, 2023 at 4:23 AM Mich Talebzadeh 
wrote:

> Hi Emmanouil,
>
> This means that your job is running on the driver as a single JVM, hence
> active(1)
>
>


Re: Understanding executor memory behavior

2023-03-16 Thread Sean Owen
All else equal it is better to have the same resources in fewer executors.
More tasks are local to other tasks which helps perf. There is more
possibility of 'borrowing' extra mem and CPU in a task.

On Thu, Mar 16, 2023, 2:14 PM Nikhil Goyal  wrote:

> Hi folks,
> I am trying to understand what would be the difference in running 8G 1
> core executor vs 40G 5 core executors. I see that on yarn it can cause bin
> fitting issues but other than that are there any pros and cons on using
> either?
>
> Thanks
> Nikhil
>


Re: logging pickle files on local run of spark.ml Pipeline model

2023-03-15 Thread Sean Owen
Pickle won't work. But the others should. I think you are specifying an
invalid path in both cases but hard to say without more detail
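
For illustration, a sketch of the two routes that should work for a fitted
pyspark.ml PipelineModel (pickling it won't): the model's own save/load, and
MLflow's Spark flavor. The tiny pipeline, paths and names are placeholders, and
the MLflow call assumes the mlflow package is installed.

# Sketch: save/load a fitted PipelineModel, or log it through MLflow's Spark flavor.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.0, 1.0, 0.0), (1.0, 0.0, 1.0)], ["x1", "x2", "label"])
model = Pipeline(stages=[
    VectorAssembler(inputCols=["x1", "x2"], outputCol="features"),
    LogisticRegression()]).fit(df)

# 1) Native save/load (writes a directory of files, not a single pickle)
model.write().overwrite().save("file:///tmp/my_pipeline_model")
reloaded = PipelineModel.load("file:///tmp/my_pipeline_model")

# 2) MLflow's Spark flavor (also stored as a directory artifact)
import mlflow.spark
mlflow.spark.log_model(model, artifact_path="model", registered_model_name="myModel")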

On Wed, Mar 15, 2023, 9:13 AM Mnisi, Caleb 
wrote:

> Good Day
>
>
>
> I am having trouble saving a spark.ml Pipeline model to a pickle file,
> when running locally on my PC.
>
> I’ve tried a few ways to save the model:
>
>1. mlflow.spark.log_model(artifact_path=experiment.artifact_location,
>spark_model= model, registered_model_name="myModel")
>   1. with error that the spark model is multiple files
>2. pickle.dump(model, file): with error - TypeError: cannot pickle
>'_thread.RLock' object
>3. model.save(‘path’): with Java errors:
>   1. at
>   
> org.apache.hadoop.mapred.OutputCommitter.commitJob(OutputCommitter.java:291)
>   2. at
>   
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:182)
>   3. at
>   
> org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:99)
>   ... 67 more
>
>
>
> Your assistance on this would be much appreciated.
>
> Regards,
>
>
>
> *Caleb Mnisi*
>
> Consultant | Deloitte Analytics | Cognitive Advantage
>
> Deloitte & Touche
>
> 5th floor, 5 Magwa Crescent, Waterfall City, 2090
>
> M: +27 72 170 8779
>
> *cmn...@deloitte.co.za * | www2.deloitte.com/za
> 
>
>
>
>
>
> Please consider the environment before printing.
>
>
> *Disclaimer:* This email is subject to important restrictions,
> qualifications and disclaimers ("the Disclaimer") that must be accessed and
> read by visiting our website and viewing the webpage at the following
> address: http://www.deloitte.com/za/disclaimer. The Disclaimer forms part
> of the content of this email. If you cannot access the Disclaimer, please
> obtain a copy thereof from us by sending an email to
> zaitserviced...@deloitte.co.za. Deloitte refers to a Deloitte member
> firm, one of its related entities, or Deloitte Touche Tohmatsu Limited
> (“DTTL”). Each Deloitte member firm is a separate legal entity and a member
> of DTTL. DTTL does not provide services to clients. Please see
> www.deloitte.com/about to learn more.
>


Re: Question related to parallelism using structured streaming parallelism

2023-03-14 Thread Sean Owen
That's incorrect, it's spark.default.parallelism, but as the name suggests,
that is merely a default. You control partitioning directly with
.repartition()
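
For illustration, a sketch of doing exactly that in a Kafka-to-Kafka structured
streaming job (it needs the spark-sql-kafka package; topic names, servers and
the stand-in "map" are placeholders):

# Sketch: repartition() controls how many tasks process each micro-batch.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("repartition-example").getOrCreate()

source = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "host:9092")
          .option("subscribe", "in-topic")
          .load())

transformed = (source.repartition(16)   # 16 parallel tasks per micro-batch
               .select(F.upper(F.col("value").cast("string")).alias("value")))

query = (transformed.writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "host:9092")
         .option("topic", "out-topic")
         .option("checkpointLocation", "/tmp/chk")   # placeholder
         .start())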

On Tue, Mar 14, 2023 at 11:37 AM Mich Talebzadeh 
wrote:

> Check this link
>
>
> https://sparkbyexamples.com/spark/difference-between-spark-sql-shuffle-partitions-and-spark-default-parallelism/
>
> You can set it
>
> spark.conf.set("sparkDefaultParallelism", value])
>
>
> Have a look at Streaming statistics in Spark GUI, especially *Processing
> Tim*e, defined by Spark GUI as Time taken to process all jobs of a batch.
>  *The **Scheduling Dela*y and *the **Total Dela*y are additional
> indicators of health.
>
>
> then decide how to set the value.
>
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 14 Mar 2023 at 16:04, Emmanouil Kritharakis <
> kritharakismano...@gmail.com> wrote:
>
>> Yes I need to check the performance of my streaming job in terms of
>> latency and throughput. Is there any working example of how to increase the
>> parallelism with spark structured streaming  using Dataset data structures?
>> Thanks in advance.
>>
>> Kind regards,
>>
>> --
>>
>> Emmanouil (Manos) Kritharakis
>>
>> Ph.D. candidate in the Department of Computer Science
>> 
>>
>> Boston University
>>
>>
>> On Tue, Mar 14, 2023 at 12:01 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> What benefits are you going with increasing parallelism? Better througput
>>>
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 14 Mar 2023 at 15:58, Emmanouil Kritharakis <
>>> kritharakismano...@gmail.com> wrote:
>>>
 Hello,

 I hope this email finds you well!

 I have a simple dataflow in which I read from a kafka topic, perform a
 map transformation and then I write the result to another topic. Based on
 your documentation here
 ,
 I need to work with Dataset data structures. Even though my solution works,
 I need to increase the parallelism. The spark documentation includes a lot
 of parameters that I can change based on specific data structures like
 *spark.default.parallelism* or *spark.sql.shuffle.partitions*. The
 former is the default number of partitions in RDDs returned by
 transformations like join, reduceByKey while the later is not recommended
 for structured streaming as it is described in documentation: "Note: For
 structured streaming, this configuration cannot be changed between query
 restarts from the same checkpoint location".

 So my question is how can I increase the parallelism for a simple
 dataflow based on datasets with a map transformation only?

 I am looking forward to hearing from you as soon as possible. Thanks in
 advance!

 Kind regards,

 --

 Emmanouil (Manos) Kritharakis

 Ph.D. candidate in the Department of Computer Science
 

 Boston University

>>>


Re: Question related to parallelism using structured streaming parallelism

2023-03-14 Thread Sean Owen
Are you just looking for DataFrame.repartition()?

On Tue, Mar 14, 2023 at 10:57 AM Emmanouil Kritharakis <
kritharakismano...@gmail.com> wrote:

> Hello,
>
> I hope this email finds you well!
>
> I have a simple dataflow in which I read from a kafka topic, perform a map
> transformation and then I write the result to another topic. Based on your
> documentation here
> ,
> I need to work with Dataset data structures. Even though my solution works,
> I need to increase the parallelism. The spark documentation includes a lot
> of parameters that I can change based on specific data structures like
> *spark.default.parallelism* or *spark.sql.shuffle.partitions*. The former
> is the default number of partitions in RDDs returned by transformations
> like join, reduceByKey while the later is not recommended for structured
> streaming as it is described in documentation: "Note: For structured
> streaming, this configuration cannot be changed between query restarts from
> the same checkpoint location".
>
> So my question is how can I increase the parallelism for a simple dataflow
> based on datasets with a map transformation only?
>
> I am looking forward to hearing from you as soon as possible. Thanks in
> advance!
>
> Kind regards,
>
> --
>
> Emmanouil (Manos) Kritharakis
>
> Ph.D. candidate in the Department of Computer Science
> 
>
> Boston University
>


Re: Spark 3.3.2 not running with Antlr4 runtime latest version

2023-03-14 Thread Sean Owen
You want Antlr 3 and Spark is on 4? no I don't think Spark would downgrade.
You can shade your app's dependencies maybe.

On Tue, Mar 14, 2023 at 8:21 AM Sahu, Karuna
 wrote:

> Hi Team
>
>
>
> We are upgrading a legacy application using Spring boot , Spark and
> Hibernate. While upgrading Hibernate to 6.1.6.Final version there is a
> mismatch for antlr4 runtime jar with Hibernate and latest Spark version.
> Details for the issue are posted on StackOverflow as well:
>
> Issue in running Spark 3.3.2 with Antlr 4.10.1 - Stack Overflow
> 
>
>
>
> Please let us know if upgrades for this is being planned for latest Spark
> version.
>
>
>
> Thanks
>
> Karuna
>
> --
>
> This message is for the designated recipient only and may contain
> privileged, proprietary, or otherwise confidential information. If you have
> received it in error, please notify the sender immediately and delete the
> original. Any other use of the e-mail by you is prohibited. Where allowed
> by local law, electronic communications with Accenture and its affiliates,
> including e-mail and instant messaging (including content), may be scanned
> by our systems for the purposes of information security and assessment of
> internal compliance with Accenture policy. Your privacy is important to us.
> Accenture uses your personal data only in compliance with data protection
> laws. For further information on how Accenture processes your personal
> data, please see our privacy statement at
> https://www.accenture.com/us-en/privacy-policy.
>
> __
>
> www.accenture.com
>


Re: How to share a dataset file across nodes

2023-03-09 Thread Sean Owen
Put the file on HDFS, if you have a Hadoop cluster?
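
For illustration, a minimal sketch once the file is on HDFS (the path is a
placeholder); every executor can then read it, so no file-distribution settings
are needed:

# Done once, outside Spark:  hdfs dfs -put /local/path/data.csv /data/stg/data.csv
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true").csv("hdfs:///data/stg/data.csv")
df.show(5)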

On Thu, Mar 9, 2023 at 3:02 PM sam smith  wrote:

> Hello,
>
> I use Yarn client mode to submit my driver program to Hadoop. The dataset
> I load is from the local file system, and when I invoke load("file://path")
> Spark complains about the csv file not being found, which I totally
> understand, since the dataset is not on any of the workers or the
> applicationMaster but only where the driver program resides.
> I tried to share the file using the configurations:
>
>> *spark.yarn.dist.files* OR *spark.files *
>
> but both ain't working.
> My question is how to share the csv dataset across the nodes at the
> specified path?
>
> Thanks.
>


Re: Re: Build SPARK from source with SBT failed

2023-03-07 Thread Sean Owen
No, it's that JAVA_HOME wasn't set to .../Home. It is simply not finding
javac, in the error. Zulu supports M1.

On Tue, Mar 7, 2023 at 9:05 AM Artemis User  wrote:

> Looks like Maven build did find the javac, just can't run it.  So it's not
> a path problem but a compatibility problem.  Are you doing this on a Mac
> with M1/M2?  I don't think that Zulu JDK supports Apple silicon.   Your
> best option would be to use homebrew to install the dev tools (including
> OpenJDK) on Mac.  On Ubuntu, it seems still the compatibility problem.  Try
> to use the apt to install your dev tools, don't do it manually.  If you
> manually install JDK, it doesn't install hardware-optimized JVM libraries.
>
> On 3/7/23 8:21 AM, ckgppl_...@sina.cn wrote:
>
> No. I haven't installed Apple Developer Tools. I have installed Zulu
> OpenJDK 11.0.17 manually.
> So I need to install Apple Developer Tools?
> - Original Message -
> From: Sean Owen  
> To: ckgppl_...@sina.cn
> Cc: user  
> Subject: Re: Build SPARK from source with SBT failed
> Date: 2023-03-07 20:58
>
> This says you don't have the java compiler installed. Did you install the
> Apple Developer Tools package?
>
> On Tue, Mar 7, 2023 at 1:42 AM  wrote:
>
> Hello,
>
> I have tried to build SPARK source codes with SBT in my local dev
> environment (MacOS 13.2.1). But it reported following error:
> [error] java.io.IOException: Cannot run program
> "/Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/bin/javac" (in
> directory "/Users/username/spark-remotemaster"): error=2, No such file or
> directory
>
> [error] at
> java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
>
> [error] at
> java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
>
> [error] at
> scala.sys.process.ProcessBuilderImpl$Simple.run(ProcessBuilderImpl.scala:75)
> [error] at
> scala.sys.process.ProcessBuilderImpl$AbstractBuilder.run(ProcessBuilderImpl.scala:106)
>
> I need to export JAVA_HOME to let it run successfully. But if I use maven
> then I don't need to export JAVA_HOME. I have also tried to build SPARK
> with SBT in Ubuntu X86_64 environment. It reported similar error.
>
> The official SPARK
> documentation  haven't mentioned export JAVA_HOME operation. So I think
> this is a bug which needs documentation or scripts change. Please correct
> me if I am wrong.
>
> Thanks
>
> Liang
>
>
>


Re: Pandas UDFs vs Inbuilt pyspark functions

2023-03-07 Thread Sean Owen
It's hard to evaluate without knowing what you're doing. Generally, using a
built-in function will be fastest. pandas UDFs can be faster than normal
UDFs if you can take advantage of processing multiple rows at once.
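
For illustration, the same toy transformation three ways, roughly fastest to
slowest (the pandas UDF path needs pandas and pyarrow on the cluster):

# Sketch: built-in function vs pandas UDF vs plain Python UDF.
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "x")

# 1) Built-in: stays in the JVM, usually fastest
builtin = df.withColumn("y", F.sqrt("x"))

# 2) pandas UDF: processes whole Arrow batches at once in Python
@F.pandas_udf("double")
def sqrt_pandas(x: pd.Series) -> pd.Series:
    return x.pow(0.5)

vectorized = df.withColumn("y", sqrt_pandas("x"))

# 3) Plain UDF: one row at a time, with per-row serialization overhead
@F.udf("double")
def sqrt_plain(x):
    return float(x) ** 0.5

rowwise = df.withColumn("y", sqrt_plain("x"))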

On Tue, Mar 7, 2023 at 6:47 AM neha garde  wrote:

> Hello All,
>
> I need help deciding on what is better, pandas udfs or inbuilt functions
> I have to perform a transformation where I managed to compare the two for
> a few thousand records
> and pandas_udf in fact performed better.
> Given the complexity of the transformation, I also found pandas_udf makes
> it more readable.
> I also found a lot of comparisons made between normal udfs and pandas_udfs
>
> What I am looking forward to is whether pandas_udfs will behave like normal
> pyspark in-built functions.
> How do pandas_udfs work internally, and will they be equally performant on
> bigger sets of data?
> I did go through a few documents but wasn't able to get a clear idea.
> I am mainly looking from the performance perspective.
>
> Thanks in advance
>
>
> Regards,
> Neha R.Garde.
>


Re: Build SPARK from source with SBT failed

2023-03-07 Thread Sean Owen
This says you don't have the java compiler installed. Did you install the
Apple Developer Tools package?

On Tue, Mar 7, 2023 at 1:42 AM  wrote:

> Hello,
>
> I have tried to build SPARK source codes with SBT in my local dev
> environment (MacOS 13.2.1). But it reported following error:
> [error] java.io.IOException: Cannot run program
> "/Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/bin/javac" (in
> directory "/Users/username/spark-remotemaster"): error=2, No such file or
> directory
>
> [error] at
> java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
>
> [error] at
> java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
>
> [error] at
> scala.sys.process.ProcessBuilderImpl$Simple.run(ProcessBuilderImpl.scala:75)
> [error] at
> scala.sys.process.ProcessBuilderImpl$AbstractBuilder.run(ProcessBuilderImpl.scala:106)
>
> I need to export JAVA_HOME to let it run successfully. But if I use maven
> then I don't need to export JAVA_HOME. I have also tried to build SPARK
> with SBT in Ubuntu X86_64 environment. It reported similar error.
>
> The official SPARK
> documentation  haven't mentioned export JAVA_HOME operation. So I think
> this is a bug which needs documentation or scripts change. Please correct
> me if I am wrong.
>
> Thanks
>
> Liang
>
>


Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Sean Owen
I don't quite get it - aren't you applying to the same stream, and batches?
Worst case, why not apply these as one function?
Otherwise, how do you mean to associate one call to another?
globals don't help here. They aren't global beyond the driver, and, which
one would be which batch?

On Sat, Mar 4, 2023 at 3:02 PM Mich Talebzadeh 
wrote:

> Thanks. they are different batchIds
>
> From sendToControl, newtopic batchId is 76
> From sendToSink, md, batchId is 563
>
> As a matter of interest, why does a global variable not work?
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 4 Mar 2023 at 20:13, Sean Owen  wrote:
>
>> It's the same batch ID already, no?
>> Or why not simply put the logic of both in one function? or write one
>> function that calls both?
>>
>> On Sat, Mar 4, 2023 at 2:07 PM Mich Talebzadeh 
>> wrote:
>>
>>>
>>> This is probably pretty straightforward but somehow it does not look
>>> that way
>>>
>>>
>>>
>>> On Spark Structured Streaming,  "foreachBatch" performs custom write
>>> logic on each micro-batch through a call function. Example,
>>>
>>> foreachBatch(sendToSink) expects 2 parameters, first: micro-batch as
>>> DataFrame or Dataset and second: unique id for each batch
>>>
>>>
>>>
>>> In my case I simultaneously read two topics through two separate
>>> functions
>>>
>>>
>>>
>>>1. foreachBatch(sendToSink). \
>>>2. foreachBatch(sendToControl). \
>>>
>>> This is  the code
>>>
>>> def sendToSink(df, batchId):
>>> if(len(df.take(1))) > 0:
>>> print(f"""From sendToSink, md, batchId is {batchId}, at
>>> {datetime.now()} """)
>>> #df.show(100,False)
>>> df. persist()
>>> # write to BigQuery batch table
>>> #s.writeTableToBQ(df, "append",
>>> config['MDVariables']['targetDataset'],config['MDVariables']['targetTable'])
>>> df.unpersist()
>>> #print(f"""wrote to DB""")
>>>else:
>>> print("DataFrame md is empty")
>>>
>>> def sendToControl(dfnewtopic, batchId2):
>>> if(len(dfnewtopic.take(1))) > 0:
>>> print(f"""From sendToControl, newtopic batchId is {batchId2}""")
>>> dfnewtopic.show(100,False)
>>> queue = dfnewtopic.first()[2]
>>> status = dfnewtopic.first()[3]
>>> print(f"""testing queue is {queue}, and status is {status}""")
>>> if((queue == config['MDVariables']['topic']) & (status ==
>>> 'false')):
>>>   spark_session = s.spark_session(config['common']['appName'])
>>>   active = spark_session.streams.active
>>>   for e in active:
>>>  name = e.name
>>>  if(name == config['MDVariables']['topic']):
>>> print(f"""\n==> Request terminating streaming process
>>> for topic {name} at {datetime.now()}\n """)
>>> e.stop()
>>> else:
>>> print("DataFrame newtopic is empty")
>>>
>>>
>>> The problem I have is to share batchID from the first function in the
>>> second function sendToControl(dfnewtopic, batchId2) so I can print it
>>> out.
>>>
>>>
>>> Defining a global did not work.. So it sounds like I am missing
>>> something rudimentary here!
>>>
>>>
>>> Thanks
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>


Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Sean Owen
It's the same batch ID already, no?
Or why not simply put the logic of both in one function? or write one
function that calls both?

On Sat, Mar 4, 2023 at 2:07 PM Mich Talebzadeh 
wrote:

>
> This is probably pretty straightforward but somehow it does not look
> that way
>
>
>
> On Spark Structured Streaming,  "foreachBatch" performs custom write logic
> on each micro-batch through a call function. Example,
>
> foreachBatch(sendToSink) expects 2 parameters, first: micro-batch as
> DataFrame or Dataset and second: unique id for each batch
>
>
>
> In my case I simultaneously read two topics through two separate functions
>
>
>
>1. foreachBatch(sendToSink). \
>2. foreachBatch(sendToControl). \
>
> This is  the code
>
> def sendToSink(df, batchId):
> if(len(df.take(1))) > 0:
> print(f"""From sendToSink, md, batchId is {batchId}, at
> {datetime.now()} """)
> #df.show(100,False)
> df. persist()
> # write to BigQuery batch table
> #s.writeTableToBQ(df, "append",
> config['MDVariables']['targetDataset'],config['MDVariables']['targetTable'])
> df.unpersist()
> #print(f"""wrote to DB""")
>else:
> print("DataFrame md is empty")
>
> def sendToControl(dfnewtopic, batchId2):
> if(len(dfnewtopic.take(1))) > 0:
> print(f"""From sendToControl, newtopic batchId is {batchId2}""")
> dfnewtopic.show(100,False)
> queue = dfnewtopic.first()[2]
> status = dfnewtopic.first()[3]
> print(f"""testing queue is {queue}, and status is {status}""")
> if((queue == config['MDVariables']['topic']) & (status ==
> 'false')):
>   spark_session = s.spark_session(config['common']['appName'])
>   active = spark_session.streams.active
>   for e in active:
>  name = e.name
>  if(name == config['MDVariables']['topic']):
> print(f"""\n==> Request terminating streaming process for
> topic {name} at {datetime.now()}\n """)
> e.stop()
> else:
> print("DataFrame newtopic is empty")
>
>
> The problem I have is to share batchID from the first function in the
> second function sendToControl(dfnewtopic, batchId2) so I can print it
> out.
>
>
> Defining a global did not work.. So it sounds like I am missing something
> rudimentary here!
>
>
> Thanks
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That's pretty impressive. I'm not sure it's quite right - not clear that
the intent is taking a minimum of absolute values (is it? that'd be wild).
But I think it might have pointed in the right direction. I'm not quite
sure why that error pops out, but I think 'max' is the wrong function.
That's an aggregate function. "greatest" is the function that returns the
max of several cols. Try that?
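
For illustration, a sketch with greatest and a toy stand-in for the joined
gene/variant table:

# Sketch: greatest() takes the element-wise max across columns; lit(0) is the constant term.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

joined = spark.createDataFrame(
    [(100, 200, 150), (100, 200, 250), (100, 200, 40)],
    ["start", "end", "position"])

distances = joined.withColumn(
    "distance",
    F.greatest(F.col("start") - F.col("position"),
               F.col("position") - F.col("end"),
               F.lit(0)))
distances.show()   # 0 when the variant falls inside the gene, otherwise the gap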

On Thu, Feb 23, 2023 at 2:25 PM Bjørn Jørgensen 
wrote:

> I'm trying to learn how to use chatgpt for coding.
>
> So after a lite chat I got this.
>
>
> The code you provided seems to calculate the distance between a gene and a
> variant by finding the maximum value between the difference of the variant
> position and the gene start position, the difference of the gene end
> position and the variant position, and 0. However, the result of this
> calculation may not always be correct.
>
> Here's an alternative approach to calculating the distance between a gene
> and a variant:
>
>
> from pyspark.sql.functions import abs
>
> # Calculate the absolute difference between the variant position and the
> gene start and end positions
> distances = joined.withColumn("distance_start", abs(col("position") -
> col("start"))) \
>  .withColumn("distance_end", abs(col("position") -
> col("end")))
>
> # Take the minimum distance between the variant and the gene
> distances = distances.withColumn("distance", least(col("distance_start"),
> col("distance_end")))
>
>
> In this approach, we first calculate the absolute difference between the
> variant position and the gene start and end positions using the abs
> function. Then, we use the least function to take the minimum distance
> between the variant and the gene. This should give you a more accurate
> distance calculation.
>
>
>
>
> Yes, that's correct. The original code you provided calculates the
> distance between a gene and a variant as the maximum of three terms: the
> difference between the variant position and the gene start position, the
> difference between the gene end position and the variant position, and 0.
>
> While this approach might work in some cases, it has a few issues. First,
> it doesn't take into account the possibility that the variant might be
> inside the gene (i.e., between the start and end positions). Second, it
> assumes that the gene is always located upstream or downstream of the
> variant, which may not be the case.
>
> The alternative approach I provided earlier calculates the distance as the
> minimum of two absolute differences: the absolute difference between the
> variant position and the gene start position, and the absolute difference
> between the variant position and the gene end position. This approach
> should work in all cases, including when the variant is inside the gene or
> when the gene is located upstream or downstream of the variant.
>
>
>
> tor. 23. feb. 2023 kl. 20:48 skrev Russell Jurney <
> russell.jur...@gmail.com>:
>
>> Usually, the solution to these problems is to do less per line, break it
>> out and perform each minute operation as a field, then combine those into a
>> final answer. Can you do that here?
>>
>> Thanks,
>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>> <http://facebook.com/jurney> datasyndrome.com Book a time on Calendly
>> <https://calendly.com/rjurney_personal/30min>
>>
>>
>> On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>> Here is the complete error:
>>>
>>> ```
>>> Traceback (most recent call last):
>>>   File "nearest-gene.py", line 74, in 
>>> main()
>>>   File "nearest-gene.py", line 62, in main
>>> distances = joined.withColumn("distance", max(col("start") -
>>> col("position"), col("position") - col("end"), 0))
>>>   File
>>> "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_01/pyspark.zip/pyspark/sql/column.py",
>>> line 907, in __nonzero__
>>> ValueError: Cannot convert column into bool: please use '&' for 'and',
>>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>> ```
>>>
>>> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen  wrote:
>>>
>>>> That error sounds like it's from pandas not spark. 

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That error sounds like it's from pandas not spark. Are you sure it's this
line?

On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   I'm trying to calculate the distance between a gene (with start and end)
> and a variant (with position), so I joined gene and variant data by
> chromosome and then tried to calculate the distance like this:
>
> ```
> distances = joined.withColumn("distance", max(col("start") -
> col("position"), col("position") - col("end"), 0))
> ```
>
>   Basically, the distance is the maximum of three terms.
>
>   This line causes an obscure error:
>
> ```
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
> ```
>
>   How can I do this? Thanks!
>
>  Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network , 
> Flannick
> Lab , Broad Institute
> 
>


Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread Sean Owen
It doesn't work because it's an aggregate function. You have to groupBy()
(group by nothing) to make that work, but, you can't assign that as a
column. Folks those approaches don't make sense semantically in SQL or
Spark or anything.
They just mean use threads to collect() distinct values for each col in
parallel using threads in your program. You don't have to but you could.
What else are we looking for here, the answer has been given a number of
times I think.
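
For illustration, a sketch of that: one distinct query per column, run from a
small thread pool, with the result dumped as JSON (the path and pool size are
placeholders):

# Sketch: distinct values per column, collected in parallel threads, output as JSON.
import json
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true").csv("/pathToCSV").cache()

def distinct_values(col_name):
    rows = (df.select(col_name)
              .where(df[col_name].isNotNull())
              .distinct()
              .collect())
    return col_name, [r[0] for r in rows]

with ThreadPoolExecutor(max_workers=4) as pool:
    result = dict(pool.map(distinct_values, df.columns))

print(json.dumps(result, indent=2))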


On Sun, Feb 12, 2023 at 2:28 PM sam smith 
wrote:

> OK, what do you mean by " do your outer for loop in parallel "?
> btw this didn't work:
> for (String columnName : df.columns()) {
> df= df.withColumn(columnName,
> collect_set(col(columnName)).as(columnName));
> }
>
>
> Le dim. 12 févr. 2023 à 20:36, Enrico Minack  a
> écrit :
>
>> That is unfortunate, but 3.4.0 is around the corner, really!
>>
>> Well, then based on your code, I'd suggest two improvements:
>> - cache your dataframe after reading, this way, you don't read the entire
>> file for each column
>> - do your outer for loop in parallel, then you have N parallel Spark jobs
>> (only helps if your Spark cluster is not fully occupied by a single column)
>>
>> Your withColumn-approach does not work because withColumn expects a
>> column as the second argument, but df.select(columnName).distinct() is a
>> DataFrame and .col is a column in *that* DataFrame, it is not a column
>> of the dataframe that you call withColumn on.
>>
>> It should read:
>>
>> Scala:
>> df.select(df.columns.map(column => collect_set(col(column)).as(column)):
>> _*).show()
>>
>> Java:
>> for (String columnName : df.columns()) {
>> df= df.withColumn(columnName,
>> collect_set(col(columnName)).as(columnName));
>> }
>>
>> Then you have a single DataFrame that computes all columns in a single
>> Spark job.
>>
>> But this reads all distinct values into a single partition, which has the
>> same downside as collect, so this is as bad as using collect.
>>
>> Cheers,
>> Enrico
>>
>>
>> Am 12.02.23 um 18:05 schrieb sam smith:
>>
>> @Enrico Minack  Thanks for "unpivot" but I am
>> using version 3.3.0 (you are taking it way too far as usual :) )
>> @Sean Owen  Pls then show me how it can be improved by
>> code.
>>
>> Also, why such an approach (using withColumn() ) doesn't work:
>>
>> for (String columnName : df.columns()) {
>> df= df.withColumn(columnName,
>> df.select(columnName).distinct().col(columnName));
>> }
>>
>> Le sam. 11 févr. 2023 à 13:11, Enrico Minack  a
>> écrit :
>>
>>> You could do the entire thing in DataFrame world and write the result to
>>> disk. All you need is unpivot (to be released in Spark 3.4.0, soon).
>>>
>>> Note this is Scala but should be straightforward to translate into Java:
>>>
>>> import org.apache.spark.sql.functions.collect_set
>>>
>>> val df = Seq((1, 10, 123), (2, 20, 124), (3, 20, 123), (4, 10,
>>> 123)).toDF("a", "b", "c")
>>>
>>> df.unpivot(Array.empty, "column", "value")
>>>   .groupBy("column")
>>>   .agg(collect_set("value").as("distinct_values"))
>>>
>>> The unpivot operation turns
>>> +---+---+---+
>>> |  a|  b|  c|
>>> +---+---+---+
>>> |  1| 10|123|
>>> |  2| 20|124|
>>> |  3| 20|123|
>>> |  4| 10|123|
>>> +---+---+---+
>>>
>>> into
>>>
>>> +--+-+
>>> |column|value|
>>> +--+-+
>>> | a|1|
>>> | b|   10|
>>> | c|  123|
>>> | a|2|
>>> | b|   20|
>>> | c|  124|
>>> | a|3|
>>> | b|   20|
>>> | c|  123|
>>> | a|4|
>>> | b|   10|
>>> | c|  123|
>>> +--+-+
>>>
>>> The groupBy("column").agg(collect_set("value").as("distinct_values"))
>>> collects distinct values per column:
>>> +--+---+
>>>
>>> |column|distinct_values|
>>> +--+---+
>>> | c| [123, 124]|
>>> | b|   [20, 10]|
>>> | a|   [1, 2, 3, 4]|
>>> +--+---+
>>>
>>> Note that unpivot only works if all columns have a "common" type. Then
>>> all columns are cast to that common type. If you 

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread Sean Owen
That's the answer, except, you can never select a result set into a column
right? you just collect() each of those results. Or, what do you want? I'm
not clear.

On Sun, Feb 12, 2023 at 10:59 AM sam smith 
wrote:

> @Enrico Minack  Thanks for "unpivot" but I am using
> version 3.3.0 (you are taking it way too far as usual :) )
> @Sean Owen  Pls then show me how it can be improved by
> code.
>
> Also, why such an approach (using withColumn() ) doesn't work:
>
> for (String columnName : df.columns()) {
> df= df.withColumn(columnName,
> df.select(columnName).distinct().col(columnName));
> }
>
> Le sam. 11 févr. 2023 à 13:11, Enrico Minack  a
> écrit :
>
>> You could do the entire thing in DataFrame world and write the result to
>> disk. All you need is unpivot (to be released in Spark 3.4.0, soon).
>>
>> Note this is Scala but should be straightforward to translate into Java:
>>
>> import org.apache.spark.sql.functions.collect_set
>>
>> val df = Seq((1, 10, 123), (2, 20, 124), (3, 20, 123), (4, 10,
>> 123)).toDF("a", "b", "c")
>>
>> df.unpivot(Array.empty, "column", "value")
>>   .groupBy("column")
>>   .agg(collect_set("value").as("distinct_values"))
>>
>> The unpivot operation turns
>> +---+---+---+
>> |  a|  b|  c|
>> +---+---+---+
>> |  1| 10|123|
>> |  2| 20|124|
>> |  3| 20|123|
>> |  4| 10|123|
>> +---+---+---+
>>
>> into
>>
>> +--+-+
>> |column|value|
>> +--+-+
>> | a|1|
>> | b|   10|
>> | c|  123|
>> | a|2|
>> | b|   20|
>> | c|  124|
>> | a|3|
>> | b|   20|
>> | c|  123|
>> | a|4|
>> | b|   10|
>> | c|  123|
>> +--+-+
>>
>> The groupBy("column").agg(collect_set("value").as("distinct_values"))
>> collects distinct values per column:
>> +--+---+
>>
>> |column|distinct_values|
>> +--+---+
>> | c| [123, 124]|
>> | b|   [20, 10]|
>> | a|   [1, 2, 3, 4]|
>> +--+---+
>>
>> Note that unpivot only works if all columns have a "common" type. Then
>> all columns are cast to that common type. If you have incompatible types
>> like Integer and String, you would have to cast them all to String first:
>>
>> import org.apache.spark.sql.types.StringType
>>
>> df.select(df.columns.map(col(_).cast(StringType)): _*).unpivot(...)
>>
>> If you want to preserve the type of the values and have multiple value
>> types, you cannot put everything into a DataFrame with one
>> distinct_values column. You could still have multiple DataFrames, one
>> per data type, and write those, or collect the DataFrame's values into Maps:
>>
>> import scala.collection.immutable
>>
>> import org.apache.spark.sql.DataFrame
>> import org.apache.spark.sql.functions.collect_set
>>
>> // if all you columns have the same type
>> def distinctValuesPerColumnOneType(df: DataFrame): immutable.Map[String,
>> immutable.Seq[Any]] = {
>>   df.unpivot(Array.empty, "column", "value")
>> .groupBy("column")
>> .agg(collect_set("value").as("distinct_values"))
>> .collect()
>> .map(row => row.getString(0) -> row.getSeq[Any](1).toList)
>> .toMap
>> }
>>
>>
>> // if your columns have different types
>> def distinctValuesPerColumn(df: DataFrame): immutable.Map[String,
>> immutable.Seq[Any]] = {
>>   df.schema.fields
>> .groupBy(_.dataType)
>> .mapValues(_.map(_.name))
>> .par
>> .map { case (dataType, columns) => df.select(columns.map(col): _*) }
>> .map(distinctValuesPerColumnOneType)
>> .flatten
>> .toList
>> .toMap
>> }
>>
>> val df = Seq((1, 10, "one"), (2, 20, "two"), (3, 20, "one"), (4, 10,
>> "one")).toDF("a", "b", "c")
>> distinctValuesPerColumn(df)
>>
>> The result is: (list values are of original type)
>> Map(b -> List(20, 10), a -> List(1, 2, 3, 4), c -> List(one, two))
>>
>> Hope this helps,
>> Enrico
>>
>>
>> Am 10.02.23 um 22:56 schrieb sam smith:
>>
>> Hi Apotolos,
>> Can you suggest a better approach while keeping values within a dataframe?
>>
>>

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread Sean Owen
Why would csv or a temp table change anything here? You don't need
windowing for distinct values either

On Fri, Feb 10, 2023, 6:01 PM Mich Talebzadeh 
wrote:

> on top of my head, create a dataframe reading CSV file.
>
> This is python
>
>  listing_df =
> spark.read.format("com.databricks.spark.csv").option("inferSchema",
> "true").option("header", "true").load(csv_file)
>  listing_df.printSchema()
>  listing_df.createOrReplaceTempView("temp")
>
> ## do your distinct columns using windowing functions on temp table with
> SQL
>
>  HTH
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 10 Feb 2023 at 21:59, sam smith 
> wrote:
>
>> I am not sure i understand well " Just need to do the cols one at a
>> time". Plus I think Apostolos is right, this needs a dataframe approach not
>> a list approach.
>>
>> Le ven. 10 févr. 2023 à 22:47, Sean Owen  a écrit :
>>
>>> For each column, select only that col and get distinct values. Similar
>>> to what you do here. Just need to do the cols one at a time. Your current
>>> code doesn't do what you want.
>>>
>>> On Fri, Feb 10, 2023, 3:46 PM sam smith 
>>> wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> "You need to select the distinct values of each col one at a time", how
>>>> ?
>>>>
>>>> Le ven. 10 févr. 2023 à 22:40, Sean Owen  a écrit :
>>>>
>>>>> That gives you all distinct tuples of those col values. You need to
>>>>> select the distinct values of each col one at a time. Sure just collect()
>>>>> the result as you do here.
>>>>>
>>>>> On Fri, Feb 10, 2023, 3:34 PM sam smith 
>>>>> wrote:
>>>>>
>>>>>> I want to get the distinct values of each column in a List (is it
>>>>>> good practice to use List here?), that contains as first element the 
>>>>>> column
>>>>>> name, and the other element its distinct values so that for a dataset we
>>>>>> get a list of lists, i do it this way (in my opinion no so fast):
>>>>>>
>>>>>> List> finalList = new ArrayList>();
>>>>>> Dataset df = spark.read().format("csv").option("header", 
>>>>>> "true").load("/pathToCSV");
>>>>>> String[] columnNames = df.columns();
>>>>>>  for (int i=0;i>>>>> List columnList = new ArrayList();
>>>>>>
>>>>>> columnList.add(columnNames[i]);
>>>>>>
>>>>>>
>>>>>> List columnValues = 
>>>>>> df.filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull()).select(columnNames[i]).distinct().collectAsList();
>>>>>> for (int j=0;j>>>>> columnList.add(columnValues.get(j).apply(0).toString());
>>>>>>
>>>>>> finalList.add(columnList);
>>>>>>
>>>>>>
>>>>>> How to improve this?
>>>>>>
>>>>>> Also, can I get the results in JSON format?
>>>>>>
>>>>>


Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread Sean Owen
That gives you all distinct tuples of those col values. You need to select
the distinct values of each col one at a time. Sure just collect() the
result as you do here.

On Fri, Feb 10, 2023, 3:34 PM sam smith  wrote:

> I want to get the distinct values of each column in a List (is it good
> practice to use List here?), that contains as first element the column
> name, and the other element its distinct values so that for a dataset we
> get a list of lists, i do it this way (in my opinion no so fast):
>
> List> finalList = new ArrayList>();
> Dataset df = spark.read().format("csv").option("header", 
> "true").load("/pathToCSV");
> String[] columnNames = df.columns();
>  for (int i=0;i List columnList = new ArrayList();
>
> columnList.add(columnNames[i]);
>
>
> List columnValues = 
> df.filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull()).select(columnNames[i]).distinct().collectAsList();
> for (int j=0;j columnList.add(columnValues.get(j).apply(0).toString());
>
> finalList.add(columnList);
>
>
> How to improve this?
>
> Also, can I get the results in JSON format?
>


Re: [PySPark] How to check if value of one column is in array of another column

2023-01-17 Thread Sean Owen
I think you want array_contains:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_contains.html
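
For illustration, a small sketch with toy data shaped like the question (gene:
string, nearest: array of strings); expr() is used so the value argument can be
another column:

# Sketch: array_contains(nearest, gene) flags rows where gene appears in the array.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

genes_joined = spark.createDataFrame(
    [("BRCA1", ["BRCA1", "TP53"]), ("EGFR", ["KRAS"])],
    ["gene", "nearest"])

genes_joined.withColumn("gene_in_nearest",
                        F.expr("array_contains(nearest, gene)")).show()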

On Tue, Jan 17, 2023 at 4:18 PM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   I have data originally stored as JSON. Column gene contains a string,
> column nearest an array of strings. How can I check whether the value of
> gene is an element of the array of nearest?
>
>   I tried: genes_joined.gene.isin(genes_joined.nearest)
>
>   But I get an error that says:
>
> pyspark.sql.utils.AnalysisException: cannot resolve '(gene IN (nearest))'
> due to data type mismatch: Arguments must be same type but were: string !=
> array;
>
>   How do I do this? Thanks!
>
>  Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network , 
> Flannick
> Lab , Broad Institute
> 
>


Re: pyspark.sql.dataframe.DataFrame versus pyspark.pandas.frame.DataFrame

2023-01-13 Thread Sean Owen
One is a normal Pyspark DataFrame, the other is a pandas work-alike wrapper
on a Pyspark DataFrame. They're the same thing with different APIs.
Neither has a 'storage format'.

spark-excel might be fine, and it's used with Spark DataFrames. Because it
emulates pandas's read_excel API, the Pyspark pandas DataFrame also has a
read_excel method that could work.
You can try both and see which works for you.
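
For illustration, a sketch of both options (spark-excel needs the
com.crealytics package on the classpath; pyspark.pandas read_excel needs pandas
plus an Excel engine such as openpyxl on the cluster); paths are placeholders:

# Sketch: two ways to read the Excel data, ending with a Spark DataFrame either way.
import pyspark.pandas as ps
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1) spark-excel data source -> pyspark.sql.DataFrame
sdf = (spark.read.format("com.crealytics.spark.excel")
       .option("header", "true")
       .load("/path/big_excel.xlsx"))

# 2) pandas-on-Spark -> pyspark.pandas.DataFrame, convertible to a plain Spark DataFrame
psdf = ps.read_excel("/path/big_excel.xlsx")
sdf2 = psdf.to_spark()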

On Thu, Jan 12, 2023 at 9:56 PM second_co...@yahoo.com.INVALID
 wrote:

>
> Good day,
>
> May I know what is the difference between pyspark.sql.dataframe.DataFrame
> and pyspark.pandas.frame.DataFrame? Are both stored in Spark dataframe
> format?
>
> I'm looking for a way to load a huge Excel file (4-10GB); I wonder whether I
> should use the third-party library spark-excel or just use native pyspark.pandas?
> I prefer to use Spark dataframe so that it uses the parallelization
> feature of Spark in the executors instead of running it on the driver.
>
> Can help to advice ?
>
>
> Detail
> ---
>
> df = spark.read \
>     .format("com.crealytics.spark.excel") \
>     .option("header", "true") \
>     .load("/path/big_excel.xls")
> print(type(df))  # output pyspark.sql.dataframe.DataFrame
>
>
> import pyspark.pandas as ps
> from pyspark.sql import DataFrame
>
> path = "/path/big-excel.xls"
> df = ps.read_excel(path)
>
> # output pyspark.pandas.frame.DataFrame
>
>
> Thank you.
>
>
>


Re: [pyspark/sparksql]: How to overcome redundant/repetitive code? Is a for loop over an sql statement with a variable a bad idea?

2023-01-06 Thread Sean Owen
Right, nothing wrong with a for loop here. Seems like just the right thing.
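
For illustration, a sketch of the loop (the table-name convention, output paths
and argument handling are placeholders). The loop itself only drives job
submission on the driver; each iteration still runs as a normal distributed
query.

# Sketch: the same parameterized query run once per model name from the command line.
import json
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("models-loop").getOrCreate()

models = json.loads(sys.argv[1])   # e.g. '["model1", "model2", "model3"]'
qry = "SELECT * FROM {table} WHERE model = '{model}'"

for m in models:
    df = spark.sql(qry.format(table=f"{m}_table", model=m))
    df.write.mode("overwrite").parquet(f"/output/{m}")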

On Fri, Jan 6, 2023, 3:20 PM Joris Billen 
wrote:

> Hello Community,
> I am working in pyspark with sparksql and have a very similar very complex
> list of dataframes that Ill have to execute several times for all the
> “models” I have.
> Suppose the code is exactly the same for all models, only the table it
> reads from and some values in the where statements will have the modelname
> in it.
> My question is how to prevent repetitive code.
> So instead of doing somethg like this (this is pseudocode, in reality it
> makes use of lots of complex dataframes) which also would require me to
> change the code every time I change it in the future:
>
> dfmodel1=sqlContext.sql("SELECT  FROM model1_table WHERE model ='model1'").write()
> dfmodel2=sqlContext.sql("SELECT  FROM model2_table WHERE model ='model2'").write()
> dfmodel3=sqlContext.sql("SELECT  FROM model3_table WHERE model ='model3'").write()
>
>
> For loops in spark sound like a bad idea (but that is mainly in terms of
> data, maybe nothing against looping over sql statements). Is it allowed
> to do something like this?
>
>
> spark-submit withloops.py ["model1","model2","model3"]
>
> code withloops.py
> models=sys.arg[1]
> qry="""SELECT  FROM {} WHERE model ='{}'"""
> for i in models:
>   FROM_TABLE=table_model
>   sqlContext.sql(qry.format(i,table_model )).write()
>
>
>
> I was trying to look up about refactoring in pyspark to prevent redundant
> code but didnt find any relevant links.
>
>
>
> Thanks for input!
>


Re: GPU Support

2023-01-05 Thread Sean Owen
Spark itself does not use GPUs, but you can write and run code on Spark
that uses GPUs. You'd typically use software like Tensorflow that uses CUDA
to access the GPU.
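
For illustration, a minimal sketch of that split of responsibilities, assuming
TensorFlow is installed on the executors: Spark schedules the Python tasks, and
each task asks TensorFlow (via CUDA) which GPUs its worker can see.

# Sketch: each Spark task reports the GPUs visible to TensorFlow on its worker.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def gpus_on_worker(_):
    import tensorflow as tf   # imported on the executor, not the driver
    yield [d.name for d in tf.config.list_physical_devices("GPU")]

print(spark.sparkContext.parallelize(range(4), 4)
           .mapPartitions(gpus_on_worker)
           .collect())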

On Thu, Jan 5, 2023 at 7:05 AM K B M Kaala Subhikshan <
kbmkaalasubhiks...@gmail.com> wrote:

> Is Gigabyte GeForce RTX 3080  GPU support for running machine learning in
> Spark?
>


Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Sean Owen
That does not appear to be the same input you used in your example. What is
the contents of test.csv?

On Wed, Jan 4, 2023 at 7:45 AM Saurabh Gulati 
wrote:

> Hi @Sean Owen 
> Probably the data is incorrect, and the source needs to fix it.
> But using python's csv parser returns the correct results.
>
> import csv
>
> with open("/tmp/test.csv") as c_file:
>
> csv_reader = csv.reader(c_file, delimiter=",")
> for row in csv_reader:
> print(row)
>
> ['a', 'b', 'c']
> ['1', '', ',see what "I did",\ni am still writing']
> ['2', '', 'abc']
>
> And also, I don't understand why there is a distinction in outputs from
> df.show() and df.select("c").show()
>
> Mvg/Regards
> Saurabh Gulati
> Data Platform
> --
> *From:* Sean Owen 
> *Sent:* 04 January 2023 14:25
> *To:* Saurabh Gulati 
> *Cc:* Mich Talebzadeh ; User <
> user@spark.apache.org>
> *Subject:* Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used
> within the data
>
> That input is just invalid as CSV for any parser. You end a quoted col
> without following with a col separator. What would the intended parsing be
> and how would it work?
>
> On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati 
> wrote:
>
>
> @Sean Owen  Also see the example below with quotes
> feedback:
>
> "a","b","c"
> "1","",",see what ""I did"","
> "2","","abc"
>
>


Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Sean Owen
That input is just invalid as CSV for any parser. You end a quoted col
without following with a col separator. What would the intended parsing be
and how would it work?

On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati 
wrote:

>
> @Sean Owen  Also see the example below with quotes
> feedback:
>
> "a","b","c"
> "1","",",see what ""I did"","
> "2","","abc"
>
>


Re: Incorrect csv parsing when delimiter used within the data

2023-01-03 Thread Sean Owen
Why does the data even need cleaning? That's all perfectly correct. The
error was setting quote to be an escape char.

On Tue, Jan 3, 2023, 2:32 PM Mich Talebzadeh 
wrote:

> if you take your source CSV as below
>
> "a","b","c"
> "1","",","
> "2","","abc"
>
>
> and define your code as below
>
>
>csv_file="hdfs://rhes75:9000/data/stg/test/testcsv.csv"
> # read hive table in spark
> listing_df =
> spark.read.format("com.databricks.spark.csv").option("inferSchema",
> "true").option("header", "true").load(csv_file)
> listing_df.printSchema()
> print(f"""\n Reading from Hive table {csv_file}\n""")
> listing_df.show(100,False)
> listing_df.select("c").show()
>
>
> results in
>
>
>  Reading from Hive table hdfs://rhes75:9000/data/stg/test/testcsv.csv
>
> +---++---+
> |a  |b   |c  |
> +---++---+
> |1  |null|,  |
> |2  |null|abc|
> +---++---+
>
> +---+
> |  c|
> +---+
> |  ,|
> |abc|
> +---+
>
>
> which assumes that "," is a value for column c in row 1
>
>
> This interpretation is correct. You ought to do data cleansing before.
>
>
> HTH
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 3 Jan 2023 at 17:03, Sean Owen  wrote:
>
>> No, you've set the escape character to double-quote, when it looks like
>> you mean for it to be the quote character (which it already is). Remove
>> this setting, as it's incorrect.
>>
>> On Tue, Jan 3, 2023 at 11:00 AM Saurabh Gulati
>>  wrote:
>>
>>> Hello,
>>> We are seeing a case with csv data when it parses csv data incorrectly.
>>> The issue can be replicated using the below csv data
>>>
>>> "a","b","c"
>>> "1","",","
>>> "2","","abc"
>>>
>>> and using the spark csv read command.
>>>
>>> df = spark.read.format("csv")\
>>> .option("multiLine", True)\
>>> .option("escape", '"')\
>>> .option("enforceSchema", False) \
>>> .option("header", True)\
>>> .load(f"/tmp/test.csv")
>>>
>>> df.show(100, False) # prints both rows
>>> |a  |b   |c  |
>>> +---++---+
>>> |1  |null|,  |
>>> |2  |null|abc|
>>>
>>> df.select("c").show() # merges last column of first row and first
>>> column of second row
>>> +--+
>>> | c|
>>> +--+
>>> |"\n"2"|
>>>
>>> print(df.count()) # prints 1, should be 2
>>>
>>>
>>> It feels like a bug and I thought of asking the community before
>>> creating a bug on jira.
>>>
>>> Mvg/Regards
>>> Saurabh
>>>
>>>


Re: Incorrect csv parsing when delimiter used within the data

2023-01-03 Thread Sean Owen
No, you've set the escape character to double-quote, when it looks like you
mean for it to be the quote character (which it already is). Remove this
setting, as it's incorrect.
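
For reference, a minimal PySpark sketch of the corrected read, assuming the
same sample file as in the snippet below; it simply drops the escape option so
the default quote handling applies:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-quote-example").getOrCreate()

# With the default escape character ('\'), a quoted "," value is read as a
# literal comma, so both rows of the sample parse as expected.
df = (spark.read.format("csv")
      .option("multiLine", True)
      .option("header", True)
      .load("/tmp/test.csv"))

df.show(truncate=False)   # two rows, including the "," value in column c
print(df.count())         # 2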

On Tue, Jan 3, 2023 at 11:00 AM Saurabh Gulati
 wrote:

> Hello,
> We are seeing a case with csv data when it parses csv data incorrectly.
> The issue can be replicated using the below csv data
>
> "a","b","c"
> "1","",","
> "2","","abc"
>
> and using the spark csv read command.
>
> df = spark.read.format("csv")\
> .option("multiLine", True)\
> .option("escape", '"')\
> .option("enforceSchema", False) \
> .option("header", True)\
> .load(f"/tmp/test.csv")
>
> df.show(100, False) # prints both rows
> |a  |b   |c  |
> +---++---+
> |1  |null|,  |
> |2  |null|abc|
>
> df.select("c").show() # merges last column of first row and first column
> of second row
> +--+
> | c|
> +--+
> |"\n"2"|
>
> print(df.count()) # prints 1, should be 2
>
>
> It feels like a bug and I thought of asking the community before creating
> a bug on jira.
>
> Mvg/Regards
> Saurabh
>
>


Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Sean Owen
Not true, you've never been able to use the SparkSession inside a Spark
task. You aren't actually using it, if the application worked in Spark 2.x.
Now, you need to avoid accidentally serializing it, which was the right
thing to do even in Spark 2.x. Just move the sesion inside main(), not a
member.
Or what other explanation do you have? I don't understand.

On Mon, Jan 2, 2023 at 10:10 AM Shrikant Prasad 
wrote:

> If that was the case and deserialized session would not work, the
> application would not have worked.
>
> As per the logs and debug prints, in spark 2.3 the main object is not
> getting deserialized in executor, otherise it would have failed then also.
>
> On Mon, 2 Jan 2023 at 9:15 PM, Sean Owen  wrote:
>
>> It silently allowed the object to serialize, though the
>> serialized/deserialized session would not work. Now it explicitly fails.
>>
>> On Mon, Jan 2, 2023 at 9:43 AM Shrikant Prasad 
>> wrote:
>>
>>> Thats right. But the serialization would be happening in Spark 2.3 also,
>>> why we dont see this error there?
>>>
>>> On Mon, 2 Jan 2023 at 9:09 PM, Sean Owen  wrote:
>>>
>>>> Oh, it's because you are defining "spark" within your driver object,
>>>> and then it's getting serialized because you are trying to use TestMain
>>>> methods in your program.
>>>> This was never correct, but now it's an explicit error in Spark 3. The
>>>> session should not be a member variable.
>>>>
>>>> On Mon, Jan 2, 2023 at 9:24 AM Shrikant Prasad 
>>>> wrote:
>>>>
>>>>> Please see these logs. The error is thrown in executor:
>>>>>
>>>>> 23/01/02 15:14:44 ERROR Executor: Exception in task 0.0 in stage 0.0
>>>>> (TID 0)
>>>>>
>>>>> java.lang.ExceptionInInitializerError
>>>>>
>>>>>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>
>>>>>at
>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>
>>>>>at
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>
>>>>>at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>
>>>>>at
>>>>> java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
>>>>>
>>>>>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>
>>>>>at
>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>
>>>>>at
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>
>>>>>at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>
>>>>>at
>>>>> java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1274)
>>>>>
>>>>>at
>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
>>>>>
>>>>>at
>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
>>>>>
>>>>>at
>>>>> java.io.ObjectInputStream.readArray(ObjectInputStream.java:2093)
>>>>>
>>>>>at
>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1655)
>>>>>
>>>>>at
>>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
>>>>>
>>>>>at
>>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
>>>>>
>>>>>at
>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
>>>>>
>>>>>at
>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
>>>>>
>>>>>at
>>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
>>>>>
>>>>>at
>>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
>>>>>
>>>>>at
>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
>>>>>
>>>>>at
>>>>

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Sean Owen
It silently allowed the object to serialize, though the
serialized/deserialized session would not work. Now it explicitly fails.

On Mon, Jan 2, 2023 at 9:43 AM Shrikant Prasad 
wrote:

> Thats right. But the serialization would be happening in Spark 2.3 also,
> why we dont see this error there?
>
> On Mon, 2 Jan 2023 at 9:09 PM, Sean Owen  wrote:
>
>> Oh, it's because you are defining "spark" within your driver object, and
>> then it's getting serialized because you are trying to use TestMain methods
>> in your program.
>> This was never correct, but now it's an explicit error in Spark 3. The
>> session should not be a member variable.
>>
>> On Mon, Jan 2, 2023 at 9:24 AM Shrikant Prasad 
>> wrote:
>>
>>> Please see these logs. The error is thrown in executor:
>>>
>>> 23/01/02 15:14:44 ERROR Executor: Exception in task 0.0 in stage 0.0
>>> (TID 0)
>>>
>>> java.lang.ExceptionInInitializerError
>>>
>>>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>
>>>at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>
>>>at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>
>>>at java.lang.reflect.Method.invoke(Method.java:498)
>>>
>>>at
>>> java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
>>>
>>>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>
>>>at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>
>>>at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>
>>>at java.lang.reflect.Method.invoke(Method.java:498)
>>>
>>>at
>>> java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1274)
>>>
>>>at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
>>>
>>>at
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
>>>
>>>at
>>> java.io.ObjectInputStream.readArray(ObjectInputStream.java:2093)
>>>
>>>at
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1655)
>>>
>>>at
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
>>>
>>>at
>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
>>>
>>>at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
>>>
>>>at
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
>>>
>>>at
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
>>>
>>>at
>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
>>>
>>>at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
>>>
>>>at
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
>>>
>>>at
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
>>>
>>>at
>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
>>>
>>>at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
>>>
>>>at
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
>>>
>>>at
>>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
>>>
>>>at
>>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
>>>
>>>at
>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>>>
>>>at
>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
>>>
>>>at
>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
>>>
>>>at org.apache.spark.scheduler.Task.run(Task.scala:127)
>>>
>>>at
>>> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
>>>
>>>at
>>> org.apache.spark.util.

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Sean Owen
Oh, it's because you are defining "spark" within your driver object, and
then it's getting serialized because you are trying to use TestMain methods
in your program.
This was never correct, but now it's an explicit error in Spark 3. The
session should not be a member variable.
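
The code in this thread is Scala, but the same principle carries over to
PySpark. A rough sketch of building the session inside the entry point instead
of as an object-level field (the names here are illustrative, not the original
application):

from pyspark.sql import SparkSession

def main():
    # Build the session inside the driver entry point so it is never
    # captured when task closures are serialized to executors.
    spark = (SparkSession.builder
             .appName("test")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.createDataFrame([("A", 1), ("B", 2)], ["_c1", "_c2"])
    # The lambda only closes over plain row data, not the session object.
    a = df.rdd.map(lambda row: str(row[0])).collect()
    print("|".join(a))

if __name__ == "__main__":
    main()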

On Mon, Jan 2, 2023 at 9:24 AM Shrikant Prasad 
wrote:

> Please see these logs. The error is thrown in executor:
>
> 23/01/02 15:14:44 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
> 0)
>
> java.lang.ExceptionInInitializerError
>
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
>at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
>at java.lang.reflect.Method.invoke(Method.java:498)
>
>at
> java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
>
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
>at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
>at java.lang.reflect.Method.invoke(Method.java:498)
>
>at
> java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1274)
>
>at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
>
>at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
>
>at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2093)
>
>at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1655)
>
>at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
>
>at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
>
>at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
>
>at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
>
>at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
>
>at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
>
>at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
>
>at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
>
>at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
>
>at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
>
>at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
>
>at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
>
>at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
>
>at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
>
>at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>
>at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
>
>at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
>
>at org.apache.spark.scheduler.Task.run(Task.scala:127)
>
>at
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
>
>at
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>
>at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
>
>at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>
>at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
>at java.lang.Thread.run(Thread.java:748)
>
> Caused by: org.apache.spark.SparkException: A master URL must be set in
> your configuration
>
>at org.apache.spark.SparkContext.(SparkContext.scala:385)
>
>at
> org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2574)
>
>    at
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:934)
>
>at scala.Option.getOrElse(Option.scala:189)
>
>at
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:928)
>
>at TestMain$.(TestMain.scala:12)
>
>at TestMain$.(TestMain.scala)
>
> On Mon, 2 Jan 2023 at 8:29 PM, Sean Owen  wrote:
>
>> It's not running on the executor; that's not the issue. See your stack
>> trace, where it clearly happens in the driver.
>>
>> On Mon, Jan 2, 2023

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Sean Owen
It's not running on the executor; that's not the issue. See your stack
trace, where it clearly happens in the driver.

On Mon, Jan 2, 2023 at 8:58 AM Shrikant Prasad 
wrote:

> Even if I set the master as yarn, it will not have access to rest of the
> spark confs. It will need spark.yarn.app.id.
>
> The main issue is if its working as it is in Spark 2.3 why its not working
> in Spark 3 i.e why the session is getting created on executor.
> Another thing we tried is removing the df to rdd conversion just for debug
> and it works in Spark 3.
>
> So, it might be something to do with df to rdd conversion or serialization
> behavior change from Spark 2.3 to Spark 3.0 if there is any. But couldn't
> find the root cause.
>
> Regards,
> Shrikant
>
> On Mon, 2 Jan 2023 at 7:54 PM, Sean Owen  wrote:
>
>> So call .setMaster("yarn"), per the error
>>
>> On Mon, Jan 2, 2023 at 8:20 AM Shrikant Prasad 
>> wrote:
>>
>>> We are running it in cluster deploy mode with yarn.
>>>
>>> Regards,
>>> Shrikant
>>>
>>> On Mon, 2 Jan 2023 at 6:15 PM, Stelios Philippou 
>>> wrote:
>>>
>>>> Can we see your Spark Configuration parameters ?
>>>>
>>>> The master URL refers to, as per Java,
>>>> new SparkConf().setMaster("local[*]")
>>>> according to where you want to run this
>>>>
>>>> On Mon, 2 Jan 2023 at 14:38, Shrikant Prasad 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to migrate one spark application from Spark 2.3 to 3.0.1.
>>>>>
>>>>> The issue can be reproduced using below sample code:
>>>>>
>>>>> object TestMain {
>>>>>
>>>>> val session =
>>>>> SparkSession.builder().appName("test").enableHiveSupport().getOrCreate()
>>>>>
>>>>> def main(args: Array[String]): Unit = {
>>>>>
>>>>> import session.implicits._
>>>>> val a = session.sparkContext.parallelize(Array
>>>>> (("A",1),("B",2))).toDF("_c1","_c2").rdd.map(x=>
>>>>> x(0).toString).collect()
>>>>> println(a.mkString("|"))
>>>>>
>>>>> }
>>>>> }
>>>>>
>>>>> It runs successfully in Spark 2.3 but fails with Spark 3.0.1 with
>>>>> below exception:
>>>>>
>>>>> Caused by: org.apache.spark.SparkException: A master URL must be set
>>>>> in your configuration
>>>>>
>>>>> at
>>>>> org.apache.spark.SparkContext.(SparkContext.scala:394)
>>>>>
>>>>> at
>>>>> org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
>>>>>
>>>>> at
>>>>> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
>>>>>
>>>>> at scala.Option.getOrElse(Option.scala:189)
>>>>>
>>>>> at
>>>>> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
>>>>>
>>>>> at TestMain$.(TestMain.scala:7)
>>>>>
>>>>> at TestMain$.(TestMain.scala)
>>>>>
>>>>>
>>>>> From the exception it appears that it tries to create spark session on
>>>>> executor also in Spark 3 whereas its not created again on executor in 
>>>>> Spark
>>>>> 2.3.
>>>>>
>>>>> Can anyone help in identfying why there is this change in behavior?
>>>>>
>>>>> Thanks and Regards,
>>>>>
>>>>> Shrikant
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Shrikant Prasad
>>>>>
>>>> --
>>> Regards,
>>> Shrikant Prasad
>>>
>> --
> Regards,
> Shrikant Prasad
>


Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Sean Owen
So call .setMaster("yarn"), per the error

On Mon, Jan 2, 2023 at 8:20 AM Shrikant Prasad 
wrote:

> We are running it in cluster deploy mode with yarn.
>
> Regards,
> Shrikant
>
> On Mon, 2 Jan 2023 at 6:15 PM, Stelios Philippou 
> wrote:
>
>> Can we see your Spark Configuration parameters ?
>>
>> The master URL refers to, as per Java,
>> new SparkConf().setMaster("local[*]")
>> according to where you want to run this
>>
>> On Mon, 2 Jan 2023 at 14:38, Shrikant Prasad 
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying to migrate one spark application from Spark 2.3 to 3.0.1.
>>>
>>> The issue can be reproduced using below sample code:
>>>
>>> object TestMain {
>>>
>>> val session =
>>> SparkSession.builder().appName("test").enableHiveSupport().getOrCreate()
>>>
>>> def main(args: Array[String]): Unit = {
>>>
>>> import session.implicits._
>>> val a = session.sparkContext.parallelize(Array
>>> (("A",1),("B",2))).toDF("_c1","_c2").rdd.map(x=>
>>> x(0).toString).collect()
>>> println(a.mkString("|"))
>>>
>>> }
>>> }
>>>
>>> It runs successfully in Spark 2.3 but fails with Spark 3.0.1 with below
>>> exception:
>>>
>>> Caused by: org.apache.spark.SparkException: A master URL must be set in
>>> your configuration
>>>
>>> at
>>> org.apache.spark.SparkContext.(SparkContext.scala:394)
>>>
>>> at
>>> org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
>>>
>>> at
>>> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
>>>
>>> at scala.Option.getOrElse(Option.scala:189)
>>>
>>> at
>>> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
>>>
>>> at TestMain$.(TestMain.scala:7)
>>>
>>> at TestMain$.(TestMain.scala)
>>>
>>>
>>> From the exception it appears that it tries to create spark session on
>>> executor also in Spark 3 whereas its not created again on executor in Spark
>>> 2.3.
>>>
>>> Can anyone help in identfying why there is this change in behavior?
>>>
>>> Thanks and Regards,
>>>
>>> Shrikant
>>>
>>> --
>>> Regards,
>>> Shrikant Prasad
>>>
>> --
> Regards,
> Shrikant Prasad
>


Re: Profiling data quality with Spark

2022-12-27 Thread Sean Owen
I think this is kind of mixed up. Data warehouses are simple SQL creatures;
Spark is (also) a distributed compute framework. Kind of like comparing
maybe a web server to Java.
Are you thinking of Spark SQL? then I dunno sure you may well find it more
complicated, but it's also just a data warehousey SQL surface.

But none of that relates to the question of data quality tools. You could
use GE with Redshift, or indeed with Spark - are you familiar with it? It's
probably one of the most common tools people use with Spark for this in
fact. It's just a Python lib at heart and you can apply it with Spark, but
_not_ with a data warehouse, so I'm not sure what you're getting at.

Deequ is also commonly seen. It's actually built on Spark, so again,
confused about this "use Redshift or Snowflake not Spark".
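
As a rough illustration of the kind of checks those libraries automate, a
hand-rolled plain-PySpark sketch (the input path and column names are
hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-sketch").getOrCreate()
df = spark.read.parquet("/data/orders")   # hypothetical input

total = df.count()
checks = df.agg(
    F.sum(F.when(F.col("order_id").isNull(), 1).otherwise(0)).alias("null_ids"),
    F.countDistinct("order_id").alias("distinct_ids"),
    F.sum(F.when(F.col("amount") < 0, 1).otherwise(0)).alias("negative_amounts"),
).collect()[0]

assert checks["null_ids"] == 0, "order_id must not be null"
assert checks["distinct_ids"] == total, "order_id must be unique"
assert checks["negative_amounts"] == 0, "amount must be non-negative"

Great Expectations and Deequ layer declarative rule definitions, profiling and
reporting on top of checks like these.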

On Tue, Dec 27, 2022 at 9:55 PM Gourav Sengupta 
wrote:

> Hi,
>
> SPARK is just another querying engine with a lot of hype.
>
> I would highly suggest using Redshift (storage and compute decoupled mode)
> or Snowflake without all this super complicated understanding of
> containers/ disk-space, mind numbing variables, rocket science tuning, hair
> splitting failure scenarios, etc. After that try to choose solutions like
> Athena, or Trino/ Presto, and then come to SPARK.
>
> Try out solutions like  "great expectations" if you are looking for data
> quality and not entirely sucked into the world of SPARK and want to keep
> your options open.
>
> Dont get me wrong, SPARK used to be great in 2016-2017, but there are
> superb alternatives now and the industry, in this recession, should focus
> on getting more value for every single dollar they spend.
>
> Best of luck.
>
> Regards,
> Gourav Sengupta
>
> On Tue, Dec 27, 2022 at 7:30 PM Mich Talebzadeh 
> wrote:
>
>> Well, you need to qualify your statement on data quality. Are you talking
>> about data lineage here?
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 27 Dec 2022 at 19:25, rajat kumar 
>> wrote:
>>
>>> Hi Folks
>>> Hoping you are doing well, I want to implement data quality to detect
>>> issues in data in advance. I have heard about few frameworks like GE/Deequ.
>>> Can anyone pls suggest which one is good and how do I get started on it?
>>>
>>> Regards
>>> Rajat
>>>
>>


Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Sean Owen
As Mich says, isn't this just max by population partitioned by country in a
window function?
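
A minimal PySpark sketch of that approach, using made-up data with the three
columns from the question (city, country, population):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("largest-city").getOrCreate()
cities = spark.createDataFrame(
    [("Paris", "France", 2100000), ("Lyon", "France", 520000),
     ("Berlin", "Germany", 3600000), ("Munich", "Germany", 1500000)],
    ["city", "country", "population"])

# row_number only needs an ordering within each country, not a global sort.
w = Window.partitionBy("country").orderBy(F.col("population").desc())
largest = (cities
           .withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn") == 1)
           .drop("rn"))
largest.show()

The ordering implied by the window is only within each country, not a global
sort of the whole DataFrame, and no join is needed.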

On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker 
wrote:

>
>  Hello,
>
>   Thank you for the response!
>
>   I can think of two ways to get the largest city by country, but both
> seem to be inefficient:
>
>   (1) I could group by country, sort each group by population, add the row
> number within each group, and then retain only cities with a row number
> equal to 1. But it seems wasteful to sort everything when I only want the
> largest of each country
>
>   (2) I could group by country, get the maximum city population for each
> country, join that with the original data frame, and then retain only
> cities with population equal to the maximum population in the country. But
> that seems also expensive because I need to join.
>
>   Am I missing something?
>
>   Thanks!
>
>  Best, Oliver
>
> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> In Spark you can use windowing functions to achieve this
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>  Hello,
>>>
>>>   How can I retain from each group only the row for which one value is
>>> the maximum of the group? For example, imagine a DataFrame containing all
>>> major cities in the world, with three columns: (1) City name (2) Country
>>> (3) population. How would I get a DataFrame that only contains the largest
>>> city in each country? Thanks!
>>>
>>>  Best, Oliver
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network , 
>>> Flannick
>>> Lab , Broad Institute
>>> 
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network , 
> Flannick
> Lab , Broad Institute
> 
>


Re: Unable to run Spark Job(3.3.2 SNAPSHOT) with Volcano scheduler in Kubernetes

2022-12-16 Thread Sean Owen
OK that's good. Hm, I seem to recall the build needs more mem in Java 11
and/or some envs. As a quick check, try replacing all "-Xss4m" with
"-Xss16m" or something larger, in the project build files. Just search and
replace.

On Fri, Dec 16, 2022 at 9:53 AM Gnana Kumar 
wrote:

> I have been following below steps.
>
> git clone --branch branch-3.3 https://github.com/apache/spark.git
> cd spark
> ./dev/make-distribution.sh --tgz --name with-volcano
> -Pkubernetes,volcano,hadoop-3
>
> How to increase stack size ? Please let me know.
>
> Thanks
> Gnana
>
> On Fri, Dec 16, 2022 at 8:45 PM Sean Owen  wrote:
>
>> You need to increase the stack size during compilation. The included mvn
>> wrapper in build does this. Are you using it?
>>
>> On Fri, Dec 16, 2022 at 9:13 AM Gnana Kumar 
>> wrote:
>>
>>> This is my latest error and fails to build SPARK CATALYST
>>>
>>> Exception in thread "main" java.lang.StackOverflowError
>>> at scala.reflect.internal.Trees.itransform(Trees.scala:1402)
>>> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
>>> at
>>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>> at
>>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>> at
>>> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
>>> at
>>> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57)
>>> at
>>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
>>> at
>>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
>>> at scala.reflect.internal.Trees.itransform(Trees.scala:1436)
>>> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
>>> at
>>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>> at
>>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>> at
>>> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
>>>
>>> On Fri, Dec 16, 2022 at 7:16 PM Gnana Kumar 
>>> wrote:
>>>
>>>> Any updates on this please ?
>>>>
>>>> On Sat, Dec 10, 2022 at 1:36 PM Gnana Kumar 
>>>> wrote:
>>>>
>>>>> Thanks Bjorn.
>>>>>
>>>>> I have created Google Cloud Computre Engine and tried multiple times
>>>>> but getting below error.
>>>>>
>>>>> I have installed JAVA 11 in my VM machine and is there any other
>>>>> dependency which needs to be installed in VM Machine ? Please let me know.
>>>>>
>>>>> at
>>>>> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57)
>>>>> at
>>>>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
>>>>> at
>>>>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
>>>>> at scala.reflect.internal.Trees.itransform(Trees.scala:1436)
>>>>> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
>>>>> at
>>>>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>>>> at
>>>>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>>>> at
>>>>> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
>>>>> at
>>>>> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57)
>>>>> at
>>>>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
>>>>> at
>>>>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
>>>>> at scala.reflect.internal.Trees.itransform(Trees.scala:1411)
>>>>> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
>>>>> at
>>>>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>>>> at
>>>>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>>>> at

Re: Unable to run Spark Job(3.3.2 SNAPSHOT) with Volcano scheduler in Kubernetes

2022-12-16 Thread Sean Owen
You need to increase the stack size during compilation. The included mvn
wrapper in build does this. Are you using it?

On Fri, Dec 16, 2022 at 9:13 AM Gnana Kumar 
wrote:

> This is my latest error and fails to build SPARK CATALYST
>
> Exception in thread "main" java.lang.StackOverflowError
> at scala.reflect.internal.Trees.itransform(Trees.scala:1402)
> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
> at
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> at
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> at scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
> at
> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57)
> at
> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
> at
> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
> at scala.reflect.internal.Trees.itransform(Trees.scala:1436)
> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
> at
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> at
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> at scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
>
> On Fri, Dec 16, 2022 at 7:16 PM Gnana Kumar 
> wrote:
>
>> Any updates on this please ?
>>
>> On Sat, Dec 10, 2022 at 1:36 PM Gnana Kumar 
>> wrote:
>>
>>> Thanks Bjorn.
>>>
>>> I have created Google Cloud Computre Engine and tried multiple times but
>>> getting below error.
>>>
>>> I have installed JAVA 11 in my VM machine and is there any other
>>> dependency which needs to be installed in VM Machine ? Please let me know.
>>>
>>> at
>>> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57)
>>> at
>>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
>>> at
>>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
>>> at scala.reflect.internal.Trees.itransform(Trees.scala:1436)
>>> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
>>> at
>>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>> at
>>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>> at
>>> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
>>> at
>>> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57)
>>> at
>>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
>>> at
>>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
>>> at scala.reflect.internal.Trees.itransform(Trees.scala:1411)
>>> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
>>> at
>>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>> at
>>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>> at
>>> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
>>> at
>>> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57)
>>> at
>>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
>>> at
>>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
>>> at scala.reflect.internal.Trees.itransform(Trees.scala:1430)
>>> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
>>> at
>>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>> at
>>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>> at
>>> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
>>> at
>>> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57)
>>> at
>>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
>>> at
>>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
>>> at scala.reflect.internal.Trees.itransform(Trees.scala:1409)
>>> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
>>> at
>>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>> at
>>> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>> at
>>> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
>>> at
>>> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57)
>>> at
>>> 

Re: [EXTERNAL] Re: [Spark vulnerability] replace jackson-mapper-asl

2022-12-15 Thread Sean Owen
Please read the CVE you mention. It is not a CVE about the library you are
referencing.
https://nvd.nist.gov/vuln/detail/CVE-2018-14721


On Thu, Dec 15, 2022 at 7:52 PM haibo.w...@morganstanley.com <
haibo.w...@morganstanley.com> wrote:

> Hi Owen
>
>
>
> As confirmed with our firm appsec team, given the library is still being
> used in spark3.3.1. Also I can see the dependency as below:
>
> https://github.com/apache/spark/blob/v3.3.1/pom.xml#L1784
>
>
>
> Something misunderstanding? appreciate if you could clarify more, thanks.
>
>
>
> Regards
>
> Harper
>
>
>
> *From:* Sean Owen 
> *Sent:* Wednesday, December 14, 2022 10:27 PM
> *To:* Wang, Harper (FRPPE) 
> *Cc:* user@spark.apache.org
> *Subject:* Re: [EXTERNAL] Re: [Spark vulnerability] replace
> jackson-mapper-asl
>
>
>
> The CVE you mention seems to affect jackson-databind, not
> jackson-mapper-asl.  3.3.1 already uses databind 2.13.x which is not
> affected.
>
>
>
> On Wed, Dec 14, 2022 at 8:20 AM haibo.w...@morganstanley.com <
> haibo.w...@morganstanley.com> wrote:
>
> Thanks Owen for prompt response
>
> sorry, forgot to mention, it’s latest spark version 3.3.1
>
> Both below spark-py image  or pypi are good to use for us, but both have
> same Jackson-mapper-asl dependencies.
>
>
>
>
> https://hub.docker.com/layers/apache/spark-py/3.3.1/images/sha256-0d4fd8bcb2ad63a35c9ba5be278a3a34c28fc15e898307e458d501a7e11d6d51?context=explore
>
> https://pypi.org/project/pyspark/
>
>
>
> Regards
>
> Harper
>
>
>
>
>
> *From:* Sean Owen 
> *Sent:* Wednesday, December 14, 2022 9:32 PM
> *To:* Wang, Harper (FRPPE) 
> *Cc:* user@spark.apache.org
> *Subject:* [EXTERNAL] Re: [Spark vulnerability] replace jackson-mapper-asl
>
>
>
> What Spark version are you referring to? If it's an unsupported version,
> no, no plans to update it.
>
> What image are you referring to?
>
>
>
> On Wed, Dec 14, 2022 at 7:14 AM haibo.w...@morganstanley.com <
> haibo.w...@morganstanley.com> wrote:
>
> Hi All
>
>
>
> Hope you are doing well.
>
>
>
> Writing this email for an vulnerable issue: CVE-2018-14721
>
> apache/spark-py:
> gav://org.codehaus.jackson:jackson-mapper-asl:1.9.13,CVE-2018-14721,1.8.10-cloudera.2,1.5.0
> <= Version <= 1.9.13
>
>
>
> We are trying to bring in above image into our firm, but due to the
> vulnerable issue, pyspark is not allowed, understand  the version was
> stopped maintaining in 2013, wondering any plan to replace the
> Jackson-mapper-asl or any workaround? thanks
>
>
>
> Regards
>
> Harper Wang
>
> *Morgan Stanley | Corporate & Funding Technology*Kerry Parkside |
> 1155 Fang Dian Road, Pudong New Area
> 201204 Shanghai
> haibo.w...@morganstanley.com
>
>
>
> --
>
> NOTICE: Morgan Stanley is not acting as a municipal advisor and the
> opinions or views contained herein are not intended to be, and do not
> constitute, advice within the meaning of Section 975 of the Dodd-Frank Wall
> Street Reform and Consumer Protection Act. By communicating with Morgan
> Stanley you acknowledge that you have read, understand and consent, (where
> applicable), to the Morgan Stanley General Disclaimers found at
> http://www.morganstanley.com/disclaimers/terms. The entire content of
> this email message and any files attached to it may be sensitive,
> confidential, subject to legal privilege and/or otherwise protected from
> disclosure.
>
>


Re: Query regarding Apache spark version 3.0.1

2022-12-15 Thread Sean Owen
Do you mean, when is branch 3.0.x EOL? It was EOL around the end of 2021.
But there were releases 3.0.2 and 3.0.3 beyond 3.0.1, so not clear what you
mean by support for 3.0.1.

On Thu, Dec 15, 2022 at 9:53 AM Pranav Kumar (EXT)
 wrote:

> Hi Team,
>
>
>
> Could you please help us to know when version 3.0.1 for Apache spark is
> going to be EOS? Till when we are going to get fixes for the version 3.0.1.
>
>
>
> Regards,
>
> Pranav
>
>
>


Re: [EXTERNAL] Re: [Spark vulnerability] replace jackson-mapper-asl

2022-12-14 Thread Sean Owen
The CVE you mention seems to affect jackson-databind, not
jackson-mapper-asl.  3.3.1 already uses databind 2.13.x which is not
affected.

On Wed, Dec 14, 2022 at 8:20 AM haibo.w...@morganstanley.com <
haibo.w...@morganstanley.com> wrote:

> Thanks Owen for prompt response
>
> sorry, forgot to mention, it’s latest spark version 3.3.1
>
> Both below spark-py image  or pypi are good to use for us, but both have
> same Jackson-mapper-asl dependencies.
>
>
>
>
> https://hub.docker.com/layers/apache/spark-py/3.3.1/images/sha256-0d4fd8bcb2ad63a35c9ba5be278a3a34c28fc15e898307e458d501a7e11d6d51?context=explore
>
> https://pypi.org/project/pyspark/
>
>
>
> Regards
>
> Harper
>
>
>
>
>
> *From:* Sean Owen 
> *Sent:* Wednesday, December 14, 2022 9:32 PM
> *To:* Wang, Harper (FRPPE) 
> *Cc:* user@spark.apache.org
> *Subject:* [EXTERNAL] Re: [Spark vulnerability] replace jackson-mapper-asl
>
>
>
> What Spark version are you referring to? If it's an unsupported version,
> no, no plans to update it.
>
> What image are you referring to?
>
>
>
> On Wed, Dec 14, 2022 at 7:14 AM haibo.w...@morganstanley.com <
> haibo.w...@morganstanley.com> wrote:
>
> Hi All
>
>
>
> Hope you are doing well.
>
>
>
> Writing this email for an vulnerable issue: CVE-2018-14721
>
> apache/spark-py:
> gav://org.codehaus.jackson:jackson-mapper-asl:1.9.13,CVE-2018-14721,1.8.10-cloudera.2,1.5.0
> <= Version <= 1.9.13
>
>
>
> We are trying to bring in above image into our firm, but due to the
> vulnerable issue, pyspark is not allowed, understand  the version was
> stopped maintaining in 2013, wondering any plan to replace the
> Jackson-mapper-asl or any workaround? thanks
>
>
>
> Regards
>
> Harper Wang
>
> *Morgan Stanley | Corporate & Funding Technology*Kerry Parkside |
> 1155 Fang Dian Road, Pudong New Area
> 201204 Shanghai
> haibo.w...@morganstanley.com
>
>
>
> --
>
> NOTICE: Morgan Stanley is not acting as a municipal advisor and the
> opinions or views contained herein are not intended to be, and do not
> constitute, advice within the meaning of Section 975 of the Dodd-Frank Wall
> Street Reform and Consumer Protection Act. By communicating with Morgan
> Stanley you acknowledge that you have read, understand and consent, (where
> applicable), to the Morgan Stanley General Disclaimers found at
> http://www.morganstanley.com/disclaimers/terms. The entire content of
> this email message and any files attached to it may be sensitive,
> confidential, subject to legal privilege and/or otherwise protected from
> disclosure.
>


Re: [Spark vulnerability] replace jackson-mapper-asl

2022-12-14 Thread Sean Owen
What Spark version are you referring to? If it's an unsupported version,
no, no plans to update it.
What image are you referring to?

On Wed, Dec 14, 2022 at 7:14 AM haibo.w...@morganstanley.com <
haibo.w...@morganstanley.com> wrote:

> Hi All
>
>
>
> Hope you are doing well.
>
>
>
> Writing this email for an vulnerable issue: CVE-2018-14721
>
> apache/spark-py:
> gav://org.codehaus.jackson:jackson-mapper-asl:1.9.13,CVE-2018-14721,1.8.10-cloudera.2,1.5.0
> <= Version <= 1.9.13
>
>
>
> We are trying to bring in above image into our firm, but due to the
> vulnerable issue, pyspark is not allowed, understand  the version was
> stopped maintaining in 2013, wondering any plan to replace the
> Jackson-mapper-asl or any workaround? thanks
>
>
>
> Regards
>
> Harper Wang
>
> *Morgan Stanley | Corporate & Funding Technology*Kerry Parkside |
> 1155 Fang Dian Road, Pudong New Area
> 201204 Shanghai
> haibo.w...@morganstanley.com
>
>
> --
> NOTICE: Morgan Stanley is not acting as a municipal advisor and the
> opinions or views contained herein are not intended to be, and do not
> constitute, advice within the meaning of Section 975 of the Dodd-Frank Wall
> Street Reform and Consumer Protection Act. By communicating with Morgan
> Stanley you acknowledge that you have read, understand and consent, (where
> applicable), to the Morgan Stanley General Disclaimers found at
> http://www.morganstanley.com/disclaimers/terms. The entire content of
> this email message and any files attached to it may be sensitive,
> confidential, subject to legal privilege and/or otherwise protected from
> disclosure.
>
>


Re: Create Jira account

2022-11-28 Thread Sean Owen
-user@

Send me your preferred email and username for the ASF JIRA and I'll create
it.

On Mon, Nov 28, 2022 at 10:55 AM Gerben van der Huizen <
gerbenvanderhui...@gmail.com> wrote:

> Hello,
>
> I would like to contribute to the Apache Spark project through Jira, but
> according to this blog post I need to request an account via email (
> https://infra.apache.org/jira-guidelines.html#who). Please let me know if
> you need any more details to create an account.
>
> Kind regards,
>
> Gerben van der Huizen
>


Re: Unable to use GPU with pyspark in windows

2022-11-23 Thread Sean Owen
Using a GPU is unrelated to Spark. You can run code that uses GPUs. This
error indicates that something failed when you ran your code (GPU OOM?) and
you need to investigate why.

On Wed, Nov 23, 2022 at 7:51 AM Vajiha Begum S A <
vajihabegu...@maestrowiz.com> wrote:

>   Hi Sean Owen,
> I'm using windows system with the NVIDIA Quadro K1200.
> GPU memory 20GB, Intel Core: 8 core
> installed - CUDAF 0.14 jar file, Rapid 4 Spark 2.12-22.10.0 jar file, CUDA
> Toolkit 11.8.0 windows version.
> Also installed- WSL 2.0 ( since I'm using windows system)
> I'm running only single server, Master is localhost
>
> I'm trying to run pyspark code through Python idle. [But it showing
> Winerror:10054, An existing connection was forcibly closed by the remote
> host]
>
> I'm getting doubt that I can run pyspark code using GPU in the Windows
> system?
> Or is the only ubuntu/linux system the only option to run pyspark with GPU
> set up.
> Kindly give suggestions. Reply me back.
>


Re: CVE-2022-33891 mitigation

2022-11-21 Thread Sean Owen
CCing Kostya for a better view, but I believe that this will not be an
issue if you're not using the ACLs in Spark, yes.
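
If it helps while an upgrade is pending, a small PySpark sketch (a sanity check
rather than a mitigation) that prints what the running application actually
resolves for those keys, instead of reading them out of a heap dump:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("acl-check").getOrCreate()
conf = spark.sparkContext.getConf()

# Both keys default to "false" when unset; per the note above, the CVE
# should not apply when the ACLs are left disabled.
for key in ("spark.acls.enable", "spark.history.ui.acls.enable"):
    print(key, "=", conf.get(key, "false"))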

On Mon, Nov 21, 2022 at 2:38 PM Andrew Pomponio 
wrote:

> I am using Spark 2.3.0 and trying to mitigate
> https://nvd.nist.gov/vuln/detail/CVE-2022-33891. The correct thing to do
> is to update. However, I am told this is not happening. Thus, I am trying
> to determine if the following are set:
>
>
> spark.acls.enable false
>
> spark.history.ui.acls.enable false
>
>
> These are 100% set in the config. I checked the config for weird
> whitespace issues in a hex editor. Nonetheless, the config does not show up
> in the UI. Thus, I took a heap dump. If I read the heap dump in text mode I
> can see this:
>
>
>
> V is abstract � ��spark.acls.enable1 � 0invalid end of optional part at
> position
>
>
>
> I am not able to find this in VisualVM or MAT to determine what that is
> set to. Any thoughts?
>
>
>
>
>
> *Andrew Pomponio | Associate Enterprise Architect, OpenLogic
> *
>
> Perforce Software
> 
>
> P: +1 612.517.2100
>
> Visit us on: LinkedIn
> 
>  | Twitter
> 
>  | Facebook
> 
>  | YouTube
> 
>
>
>
> *Use our new Community portal to submit/track support cases!
> *
>
>
>
> This e-mail may contain information that is privileged or confidential. If
> you are not the intended recipient, please delete the e-mail and any
> attachments and notify us immediately.
>
>


Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-18 Thread Sean Owen
Taking this off-list.

Start here:
https://github.com/apache/spark/blob/70ec696bce7012b25ed6d8acec5e2f3b3e127f11/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala#L144
Look at subclasses of JdbcDialect too, like TeradataDialect.
Note that you are using an old unsupported version, too; that's a link to
master.
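
On the count question quoted below: one way to avoid pulling per-row results
for a bare count is to push the aggregation into the database yourself through
a subquery table expression. This is only a sketch; the URL, credentials,
driver and query are placeholders, and Spark's own generated SQL is determined
by the dialect code linked above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-count").getOrCreate()

# The database computes the count and returns a single row; adjust the
# derived-table alias syntax to whatever your database expects.
count_query = "(SELECT COUNT(*) AS cnt FROM my_table) q"

row_count = (spark.read
             .jdbc(url="jdbc:teradata://host/DBS_PORT=1025",   # placeholder
                   table=count_query,
                   properties={"user": "...", "password": "...",
                               "driver": "com.teradata.jdbc.TeraDriver"})
             .collect()[0][0])
print(row_count)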

On Fri, Nov 18, 2022 at 5:50 AM Ramakrishna Rayudu <
ramakrishna560.ray...@gmail.com> wrote:

> Hi Sean,
>
> Can you please let me know what is query spark internally fires for
> getting count on dataframe.
>
> Long count=dataframe.count();
>
> Is this
>
> SELECT 1 FROM ( QUERY) SUB_TABL
>
> and suming up the all 1s in the response.
> Or directly
>
> SELECT COUNT(*) FROM (QUERY)
> SUB_TABL
>
> Can you please what is approch spark will follow.
>
>
> Thanks,
> Ramakrishna Rayudu
>
> On Fri, Nov 18, 2022, 8:13 AM Ramakrishna Rayudu <
> ramakrishna560.ray...@gmail.com> wrote:
>
>> Sure I will test with latest spark and let you the result.
>>
>> Thanks,
>> Rama
>>
>> On Thu, Nov 17, 2022, 11:16 PM Sean Owen  wrote:
>>
>>> Weird, does Teradata not support LIMIT n? looking at the Spark source
>>> code suggests it won't. The syntax is "SELECT TOP"? I wonder if that's why
>>> the generic query that seems to test existence loses the LIMIT.
>>> But, that "SELECT 1" test seems to be used for MySQL, Postgres, so I'm
>>> still not sure where it's coming from or if it's coming from Spark. You're
>>> using the teradata dialect I assume. Can you use the latest Spark to test?
>>>
>>> On Thu, Nov 17, 2022 at 11:31 AM Ramakrishna Rayudu <
>>> ramakrishna560.ray...@gmail.com> wrote:
>>>
>>>> Yes I am sure that we are not generating this kind of queries. Okay
>>>> then problem is  LIMIT is not coming up in query. Can you please suggest me
>>>> any direction.
>>>>
>>>> Thanks,
>>>> Rama
>>>>
>>>> On Thu, Nov 17, 2022, 10:56 PM Sean Owen  wrote:
>>>>
>>>>> Hm, the existence queries even in 2.4.x had LIMIT 1. Are you sure
>>>>> nothing else is generating or changing those queries?
>>>>>
>>>>> On Thu, Nov 17, 2022 at 11:20 AM Ramakrishna Rayudu <
>>>>> ramakrishna560.ray...@gmail.com> wrote:
>>>>>
>>>>>> We are using spark 2.4.4 version.
>>>>>> I can see two types of queries in DB logs.
>>>>>>
>>>>>> SELECT 1 FROM (INPUT_QUERY) SPARK_GEN_SUB_0
>>>>>>
>>>>>> SELECT * FROM (INPUT_QUERY) SPARK_GEN_SUB_0 WHERE 1=0
>>>>>>
>>>>>> When we see `SELECT *` which ending up with `Where 1=0`  but query
>>>>>> starts with `SELECT 1` there is no where condition.
>>>>>>
>>>>>> Thanks,
>>>>>> Rama
>>>>>>
>>>>>> On Thu, Nov 17, 2022, 10:39 PM Sean Owen  wrote:
>>>>>>
>>>>>>> Hm, actually that doesn't look like the queries that Spark uses to
>>>>>>> test existence, which will be "SELECT 1 ... LIMIT 1" or "SELECT * ... 
>>>>>>> WHERE
>>>>>>> 1=0" depending on the dialect. What version, and are you sure something
>>>>>>> else is not sending those queries?
>>>>>>>
>>>>>>> On Thu, Nov 17, 2022 at 11:02 AM Ramakrishna Rayudu <
>>>>>>> ramakrishna560.ray...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Sean,
>>>>>>>>
>>>>>>>> Thanks for your response I think it has the performance impact
>>>>>>>> because if the query return one million rows then in the response It's 
>>>>>>>> self
>>>>>>>> we will one million rows unnecessarily like below.
>>>>>>>>
>>>>>>>> 1
>>>>>>>> 1
>>>>>>>> 1
>>>>>>>> 1
>>>>>>>> .
>>>>>>>> .
>>>>>>>> 1
>>>>>>>>
>>>>>>>>
>>>>>>>> Its impact the performance. Can we any alternate solution for this.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Rama
>>>>>>>>
>>>>>>>>
>>>>>>

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
Weird, does Teradata not support LIMIT n? looking at the Spark source code
suggests it won't. The syntax is "SELECT TOP"? I wonder if that's why the
generic query that seems to test existence loses the LIMIT.
But, that "SELECT 1" test seems to be used for MySQL, Postgres, so I'm
still not sure where it's coming from or if it's coming from Spark. You're
using the teradata dialect I assume. Can you use the latest Spark to test?

On Thu, Nov 17, 2022 at 11:31 AM Ramakrishna Rayudu <
ramakrishna560.ray...@gmail.com> wrote:

> Yes I am sure that we are not generating this kind of queries. Okay then
> problem is  LIMIT is not coming up in query. Can you please suggest me any
> direction.
>
> Thanks,
> Rama
>
> On Thu, Nov 17, 2022, 10:56 PM Sean Owen  wrote:
>
>> Hm, the existence queries even in 2.4.x had LIMIT 1. Are you sure nothing
>> else is generating or changing those queries?
>>
>> On Thu, Nov 17, 2022 at 11:20 AM Ramakrishna Rayudu <
>> ramakrishna560.ray...@gmail.com> wrote:
>>
>>> We are using spark 2.4.4 version.
>>> I can see two types of queries in DB logs.
>>>
>>> SELECT 1 FROM (INPUT_QUERY) SPARK_GEN_SUB_0
>>>
>>> SELECT * FROM (INPUT_QUERY) SPARK_GEN_SUB_0 WHERE 1=0
>>>
>>> When we see `SELECT *` which ending up with `Where 1=0`  but query
>>> starts with `SELECT 1` there is no where condition.
>>>
>>> Thanks,
>>> Rama
>>>
>>> On Thu, Nov 17, 2022, 10:39 PM Sean Owen  wrote:
>>>
>>>> Hm, actually that doesn't look like the queries that Spark uses to test
>>>> existence, which will be "SELECT 1 ... LIMIT 1" or "SELECT * ... WHERE 1=0"
>>>> depending on the dialect. What version, and are you sure something else is
>>>> not sending those queries?
>>>>
>>>> On Thu, Nov 17, 2022 at 11:02 AM Ramakrishna Rayudu <
>>>> ramakrishna560.ray...@gmail.com> wrote:
>>>>
>>>>> Hi Sean,
>>>>>
>>>>> Thanks for your response I think it has the performance impact because
>>>>> if the query return one million rows then in the response It's self we 
>>>>> will
>>>>> one million rows unnecessarily like below.
>>>>>
>>>>> 1
>>>>> 1
>>>>> 1
>>>>> 1
>>>>> .
>>>>> .
>>>>> 1
>>>>>
>>>>>
>>>>> Its impact the performance. Can we any alternate solution for this.
>>>>>
>>>>> Thanks,
>>>>> Rama
>>>>>
>>>>>
>>>>> On Thu, Nov 17, 2022, 10:17 PM Sean Owen  wrote:
>>>>>
>>>>>> This is a query to check the existence of the table upfront.
>>>>>> It is nearly a no-op query; can it have a perf impact?
>>>>>>
>>>>>> On Thu, Nov 17, 2022 at 10:42 AM Ramakrishna Rayudu <
>>>>>> ramakrishna560.ray...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Team,
>>>>>>>
>>>>>>> I am facing one issue. Can you please help me on this.
>>>>>>>
>>>>>>> We are connecting Tera data from spark SQL with below API
>>>>>>>
>>>>>>> Dataset jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, 
>>>>>>> connectionProperties);
>>>>>>>
>>>>>>> when we execute above logic on large table with million rows every time 
>>>>>>> we are seeing below
>>>>>>>
>>>>>>> extra query is executing every time as this resulting performance hit 
>>>>>>> on DB.
>>>>>>>
>>>>>>> This below information we got from DBA. We dont have any logs on
>>>>>>> SPARK SQL.
>>>>>>>
>>>>>>> SELECT 1 FROM ONE_MILLION_ROWS_TABLE;
>>>>>>>
>>>>>>> 1
>>>>>>> 1
>>>>>>> 1
>>>>>>> 1
>>>>>>> 1
>>>>>>> 1
>>>>>>> 1
>>>>>>> 1
>>>>>>> 1
>>>>>>>
>>>>>>> Can you please clarify why this query is executing or is there any
>>>>>>> chance that this type of query is executing from our code it self while
>>>>>>> check for rows count from dataframe.
>>>>>>>
>>>>>>> Please provide me your inputs on this.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Rama
>>>>>>>
>>>>>>


Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
Hm, the existence queries even in 2.4.x had LIMIT 1. Are you sure nothing
else is generating or changing those queries?

On Thu, Nov 17, 2022 at 11:20 AM Ramakrishna Rayudu <
ramakrishna560.ray...@gmail.com> wrote:

> We are using spark 2.4.4 version.
> I can see two types of queries in DB logs.
>
> SELECT 1 FROM (INPUT_QUERY) SPARK_GEN_SUB_0
>
> SELECT * FROM (INPUT_QUERY) SPARK_GEN_SUB_0 WHERE 1=0
>
> When we see `SELECT *` which ending up with `Where 1=0`  but query starts
> with `SELECT 1` there is no where condition.
>
> Thanks,
> Rama
>
> On Thu, Nov 17, 2022, 10:39 PM Sean Owen  wrote:
>
>> Hm, actually that doesn't look like the queries that Spark uses to test
>> existence, which will be "SELECT 1 ... LIMIT 1" or "SELECT * ... WHERE 1=0"
>> depending on the dialect. What version, and are you sure something else is
>> not sending those queries?
>>
>> On Thu, Nov 17, 2022 at 11:02 AM Ramakrishna Rayudu <
>> ramakrishna560.ray...@gmail.com> wrote:
>>
>>> Hi Sean,
>>>
>>> Thanks for your response I think it has the performance impact because
>>> if the query return one million rows then in the response It's self we will
>>> one million rows unnecessarily like below.
>>>
>>> 1
>>> 1
>>> 1
>>> 1
>>> .
>>> .
>>> 1
>>>
>>>
>>> Its impact the performance. Can we any alternate solution for this.
>>>
>>> Thanks,
>>> Rama
>>>
>>>
>>> On Thu, Nov 17, 2022, 10:17 PM Sean Owen  wrote:
>>>
>>>> This is a query to check the existence of the table upfront.
>>>> It is nearly a no-op query; can it have a perf impact?
>>>>
>>>> On Thu, Nov 17, 2022 at 10:42 AM Ramakrishna Rayudu <
>>>> ramakrishna560.ray...@gmail.com> wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> I am facing one issue. Can you please help me on this.
>>>>>
>>>>> We are connecting Tera data from spark SQL with below API
>>>>>
>>>>> Dataset jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, 
>>>>> connectionProperties);
>>>>>
>>>>> when we execute above logic on large table with million rows every time 
>>>>> we are seeing below
>>>>>
>>>>> extra query is executing every time as this resulting performance hit on 
>>>>> DB.
>>>>>
>>>>> This below information we got from DBA. We dont have any logs on SPARK
>>>>> SQL.
>>>>>
>>>>> SELECT 1 FROM ONE_MILLION_ROWS_TABLE;
>>>>>
>>>>> 1
>>>>> 1
>>>>> 1
>>>>> 1
>>>>> 1
>>>>> 1
>>>>> 1
>>>>> 1
>>>>> 1
>>>>>
>>>>> Can you please clarify why this query is executing or is there any
>>>>> chance that this type of query is executing from our code it self while
>>>>> check for rows count from dataframe.
>>>>>
>>>>> Please provide me your inputs on this.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Rama
>>>>>
>>>>


Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
Hm, actually that doesn't look like the queries that Spark uses to test
existence, which will be "SELECT 1 ... LIMIT 1" or "SELECT * ... WHERE 1=0"
depending on the dialect. What version, and are you sure something else is
not sending those queries?

On Thu, Nov 17, 2022 at 11:02 AM Ramakrishna Rayudu <
ramakrishna560.ray...@gmail.com> wrote:

> Hi Sean,
>
> Thanks for your response I think it has the performance impact because if
> the query return one million rows then in the response It's self we will
> one million rows unnecessarily like below.
>
> 1
> 1
> 1
> 1
> .
> .
> 1
>
>
> Its impact the performance. Can we any alternate solution for this.
>
> Thanks,
> Rama
>
>
> On Thu, Nov 17, 2022, 10:17 PM Sean Owen  wrote:
>
>> This is a query to check the existence of the table upfront.
>> It is nearly a no-op query; can it have a perf impact?
>>
>> On Thu, Nov 17, 2022 at 10:42 AM Ramakrishna Rayudu <
>> ramakrishna560.ray...@gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>> I am facing one issue. Can you please help me on this.
>>>
>>> <https://stackoverflow.com/>
>>>
>>>1.
>>>
>>>
>>> <https://stackoverflow.com/posts/74477662/timeline>
>>>
>>> We are connecting Tera data from spark SQL with below API
>>>
>>> Dataset jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, 
>>> connectionProperties);
>>>
>>> When we execute the above logic on a large table with millions of rows, every
>>> time we see the extra query below executing, which results in a performance
>>> hit on the DB.
>>>
>>> We got the information below from the DBA. We don't have any logs on Spark
>>> SQL.
>>>
>>> SELECT 1 FROM ONE_MILLION_ROWS_TABLE;
>>>
>>> 1
>>> 1
>>> 1
>>> 1
>>> 1
>>> 1
>>> 1
>>> 1
>>> 1
>>>
>>> Can you please clarify why this query is executing, or is there any
>>> chance that this type of query is executed from our code itself while
>>> checking the row count of the dataframe?
>>>
>>> Please provide me your inputs on this.
>>>
>>>
>>> Thanks,
>>>
>>> Rama
>>>
>>


Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
This is a query to check the existence of the table upfront.
It is nearly a no-op query; can it have a perf impact?

On Thu, Nov 17, 2022 at 10:42 AM Ramakrishna Rayudu <
ramakrishna560.ray...@gmail.com> wrote:

> Hi Team,
>
> I am facing one issue. Can you please help me on this.
>
> 
>
>1.
>
>
> 
>
> We are connecting Tera data from spark SQL with below API
>
> Dataset jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, 
> connectionProperties);
>
> When we execute the above logic on a large table with millions of rows, every
> time we see the extra query below executing, which results in a performance
> hit on the DB.
>
> We got the information below from the DBA. We don't have any logs on Spark SQL.
>
> SELECT 1 FROM ONE_MILLION_ROWS_TABLE;
>
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
> 1
>
> Can you please clarify why this query is executing, or is there any chance
> that this type of query is executed from our code itself while checking the
> row count of the dataframe?
>
> Please provide me your inputs on this.
>
>
> Thanks,
>
> Rama
>


Re: [EXTERNAL] Re: Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-03 Thread Sean Owen
Er, wait, this is what stage-level scheduling is right? this has existed
since 3.1
https://issues.apache.org/jira/browse/SPARK-27495

On Thu, Nov 3, 2022 at 12:10 PM bo yang  wrote:

> Interesting discussion here; it looks like Spark does not support configuring
> a different number of executors in different stages. Would love to see the
> community come out with such a feature.
>
> On Thu, Nov 3, 2022 at 9:10 AM Shay Elbaz  wrote:
>
>> Thanks again Artemis, I really appreciate it. I have watched the video
>> but did not find an answer.
>>
>> Please bear with me just one more iteration 
>>
>> Maybe I'll be more specific:
>> Suppose I start the application with maxExecutors=500, executors.cores=2,
>> because that's the amount of resources needed for the ETL part. But for the
>> DL part I only need 20 GPUs. SLS API only allows to set the resources per
>> executor/task, so Spark would (try to) allocate up to 500 GPUs, assuming I
>> configure the profile with 1 GPU per executor.
>> So, the question is how do I limit the stage resources to 20 GPUs total?
>>
>> Thanks again,
>> Shay
>>
>> --
>> *From:* Artemis User 
>> *Sent:* Thursday, November 3, 2022 5:23 PM
>> *To:* user@spark.apache.org 
>> *Subject:* [EXTERNAL] Re: Re: Stage level scheduling - lower the number
>> of executors when using GPUs
>>
>>
>> *ATTENTION:* This email originated from outside of GM.
>>
>>   Shay,  You may find this video helpful (with some API code samples
>> that you are looking for).
>> https://www.youtube.com/watch?v=JNQu-226wUc=171s.  The issue here
>> isn't how to limit the number of executors but to request for the right
>> GPU-enabled executors dynamically.  Those executors used in pre-GPU stages
>> should be returned back to resource managers with dynamic resource
>> allocation enabled (and with the right DRA policies).  Hope this helps..
>>
>> Unfortunately there isn't a lot of detailed docs for this topic since GPU
>> acceleration is kind of new in Spark (not straightforward like in TF).   I
>> wish the Spark doc team could provide more details in the next release...
>>
>> On 11/3/22 2:37 AM, Shay Elbaz wrote:
>>
>> Thanks Artemis. We are *not* using Rapids, but rather using GPUs through
>> the Stage Level Scheduling feature with ResourceProfile. In Kubernetes
>> you have to turn on shuffle tracking for dynamic allocation, anyhow.
>> The question is how we can limit the *number of executors *when building
>> a new ResourceProfile, directly (API) or indirectly (some advanced
>> workaround).
>>
>> Thanks,
>> Shay
>>
>>
>> --
>> *From:* Artemis User  
>> *Sent:* Thursday, November 3, 2022 1:16 AM
>> *To:* user@spark.apache.org 
>> 
>> *Subject:* [EXTERNAL] Re: Stage level scheduling - lower the number of
>> executors when using GPUs
>>
>>
>> *ATTENTION:* This email originated from outside of GM.
>>
>>   Are you using Rapids for GPU support in Spark?  Couple of options you
>> may want to try:
>>
>>1. In addition to dynamic allocation turned on, you may also need to
>>turn on external shuffling service.
>>2. Sounds like you are using Kubernetes.  In that case, you may also
>>need to turn on shuffle tracking.
>>3. The "stages" are controlled by the APIs.  The APIs for dynamic
>>resource request (change of stage) do exist, but only for RDDs (e.g.
>>TaskResourceRequest and ExecutorResourceRequest).
>>
>>
>> On 11/2/22 11:30 AM, Shay Elbaz wrote:
>>
>> Hi,
>>
>> Our typical applications need less *executors* for a GPU stage than for
>> a CPU stage. We are using dynamic allocation with stage level scheduling,
>> and Spark tries to maximize the number of executors also during the GPU
>> stage, causing a bit of resources chaos in the cluster. This forces us to
>> use a lower value for 'maxExecutors' in the first place, at the cost of the
>> CPU stages performance. Or try to solve this in the Kubernets scheduler
>> level, which is not straightforward and doesn't feel like the right way to
>> go.
>>
>> Is there a way to effectively use less executors in Stage Level
>> Scheduling? The API does not seem to include such an option, but maybe
>> there is some more advanced workaround?
>>
>> Thanks,
>> Shay
>>
>>
>>
>>
>>
>>
>>
>>
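For readers landing on this thread, a minimal PySpark sketch of the stage-level scheduling API referenced above (Spark 3.1+). The resource amounts, partition counts, and run_inference function are illustrative assumptions; the API requests resources per executor/task, so the total number of executors used for the GPU stage is only influenced indirectly (e.g. via the number of partitions and the dynamic-allocation limits).

# Sketch, not a drop-in solution; assumes dynamic allocation and a GPU discovery
# script are already configured for the cluster.
from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, ResourceProfileBuilder

exec_reqs = ExecutorResourceRequests().cores(2).memory("4g").resource("gpu", 1)
task_reqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
gpu_profile = ResourceProfileBuilder().require(exec_reqs).require(task_reqs).build

etl_rdd = spark.sparkContext.parallelize(range(10000), 500)   # CPU-heavy ETL stage
dl_rdd = (etl_rdd.repartition(20)                 # fewer partitions -> fewer executors needed
                 .withResources(gpu_profile))     # GPU stage runs under the new profile
results = dl_rdd.map(run_inference).collect()     # run_inference is hypothetical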


Re: Ctrl - left and right not working in Spark Shell in Windows 10

2022-11-01 Thread Sean Owen
This won't be related to Spark, but rather your shell or terminal program.

On Tue, Nov 1, 2022 at 1:57 PM Salil Surendran 
wrote:

> I installed Spark on Windows 10. Everything works fine except for the Ctrl-left
> and Ctrl-right keys, which don't move by a word but just by a
> character. How do I fix this, or find out the correct bindings to
> move by a word in the Spark shell?
>
> --
> Thanks,
> Salil
> "The surest sign that intelligent life exists elsewhere in the universe is
> that none of it has tried to contact us."
>


Re: spark - local question

2022-10-31 Thread Sean Owen
Sure, as stable and available as your machine is. If you don't need fault
tolerance or scale beyond one machine, sure.

On Mon, Oct 31, 2022 at 8:43 AM 张健BJ  wrote:

> Dear developers:
> I have a question about the PySpark local
> mode. Can it be used in production, and will it cause unexpected problems?
> The scenario is as follows:
>
> Our team wants to develop an ETL component based on Python. Data can
> be transferred between various data sources.
>
> If there is no YARN environment, can we read data from Database A and write
> it to Database B in local mode? Will this function be guaranteed to be stable
> and available?
>
>
>
> Thanks,
> Look forward to your reply
>
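A minimal sketch of the local-mode ETL described above; the JDBC URLs, tables, and credentials are placeholders, and the relevant JDBC driver jars are assumed to be on the classpath. Everything runs in a single JVM on one machine, which is exactly the reliability boundary described in the reply.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")               # no YARN; use all local cores
         .appName("jdbc-to-jdbc-etl")
         .getOrCreate())

src = (spark.read.format("jdbc")
       .option("url", "jdbc:mysql://db-a:3306/source")      # placeholder
       .option("dbtable", "orders")
       .option("user", "u").option("password", "p")
       .load())

(src.write.format("jdbc")
    .option("url", "jdbc:postgresql://db-b:5432/target")    # placeholder
    .option("dbtable", "orders_copy")
    .option("user", "u").option("password", "p")
    .mode("append")
    .save())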


Re: Running 30 Spark applications at the same time is slower than one on average

2022-10-26 Thread Sean Owen
That just means G = GB mem, C = cores, but yeah the driver and executors
are very small, possibly related.

On Wed, Oct 26, 2022 at 12:34 PM Artemis User 
wrote:

> Are these Cloudera specific acronyms?  Not sure how Cloudera configures
> Spark differently, but obviously the number of nodes is too small,
> considering each app only uses a small number of cores and RAM.  So you may
> consider increasing the number of nodes. When all these apps jam on a few
> nodes, the cluster manager/scheduler and/or the network becomes
> overwhelmed...
>
> On 10/26/22 8:09 AM, Sean Owen wrote:
>
> Resource contention. Now all the CPU and I/O is competing and probably
> slows down
>
> On Wed, Oct 26, 2022, 5:37 AM eab...@163.com  wrote:
>
>> Hi All,
>>
>> I have a CDH 5.16.2 Hadoop cluster with 1+3 nodes (64C/128G, 1 NN/RM +
>> 3 DN/NM), and YARN with 192C/240G. I used the following test scenario:
>>
>> 1. Spark app resources: 2G driver memory / 2C driver vcores / 1 executor /
>> 2G executor memory / 2C executor vcores.
>> 2. One Spark app will use 5G/4C on YARN.
>> 3. First, running only one Spark app takes 40s.
>> 4. Then, I run 30 of the same Spark app at once, and each app takes 80s
>> on average.
>>
>> So I want to know why the runtime gap is so big, and how to optimize it.
>>
>> Thanks
>>
>>
>


Re: Running 30 Spark applications at the same time is slower than one on average

2022-10-26 Thread Sean Owen
Resource contention. Now all the CPU and I/O is competing and probably
slows down

On Wed, Oct 26, 2022, 5:37 AM eab...@163.com  wrote:

> Hi All,
>
> I have a CDH 5.16.2 Hadoop cluster with 1+3 nodes (64C/128G, 1 NN/RM +
> 3 DN/NM), and YARN with 192C/240G. I used the following test scenario:
>
> 1. Spark app resources: 2G driver memory / 2C driver vcores / 1 executor /
> 2G executor memory / 2C executor vcores.
> 2. One Spark app will use 5G/4C on YARN.
> 3. First, running only one Spark app takes 40s.
> 4. Then, I run 30 of the same Spark app at once, and each app takes 80s
> on average.
>
> So I want to know why the runtime gap is so big, and how to optimize it.
>
> Thanks
>
>


Re: As a Scala newbie starting to work with Spark does it make more sense to learn Scala 2 or Scala 3?

2022-10-11 Thread Sean Owen
See the pom.xml file
https://github.com/apache/spark/blob/master/pom.xml#L3590
2.13.8 at the moment; IIRC there was some Scala issue that prevented
updating to 2.13.9. Search issues/PRs.

On Tue, Oct 11, 2022 at 6:11 PM Henrik Park  wrote:

> Scala 2.13.9 was released. Do you know which Spark version will have it
> built in?
>
> thanks
>
> Sean Owen wrote:
> > I would imagine that Scala 2.12 support goes away, and Scala 3 support
> > is added, for maybe Spark 4.0, and maybe that happens in a year or so.
>
> --
> Simple Mail
> https://simplemail.co.in/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: As a Scala newbie starting to work with Spark does it make more sense to learn Scala 2 or Scala 3?

2022-10-11 Thread Sean Owen
For Spark, the issue is maintaining simultaneous support for multiple Scala
versions, which has historically been mutually incompatible across minor
versions.
Until Scala 2.12 support is reasonable to remove, it's hard to also support
Scala 3, as it would mean maintaining three versions of code.
I would imagine that Scala 2.12 support goes away, and Scala 3 support is
added, for maybe Spark 4.0, and maybe that happens in a year or so.

For end users, I don't think there are big differences, so I don't think
learning one or the other matters a lot. Scala 3 is a lot like 2.13.
But I don't think you'll be able to write Scala 3 Spark apps anytime soon.

On Tue, Oct 11, 2022 at 7:57 AM Никита Романов 
wrote:

> No one knows for sure except Apache, but I’d learn Scala 2 if I were you.
> Even if Spark one day migrates to Scala 3 (which is not given), it’ll take
> a while for the industry to adjust. It even takes a while to move from
> Spark 2 to Spark 3 (Scala 2.11 to Scala 2.12). I don’t think your knowledge
> of Scala 2 will be outdated any time soon.
>
> You can also compare it with Python 2 vs 3: although Python 3 dominates
> these days (almost 15 years after the release!), Python 2 is still used.
>
>
> Monday, October 10, 2022, 10:24 +03:00 from Oliver Plohmann <
> oli...@objectscape.org >:
>
> Hello,
>
> I was lucky and will be joining a project where Spark is being used in
> conjunction with Python. Scala will not be used at all. Everything will
> be Python. This means that I have free choice whether to start diving
> into Scala 2 or Scala 3. For future Spark jobs knowledge of Scala will
> be very precious (the job ads here for Spark always mention Java, Python
> and Scala).
>
> I was always interested in Scala and because it is a plus when applying
> for Spark jobs I will start learning and develop some spare time project
> with it. Question is now whether first to learn Scala 2 or start right
> away with learning Scala 3. That also boils down to the question whether
> Spark will ever be migrated to Scala 3. I have way too little
> understanding of Spark and Scala to be able to make some reasonable
> guess here.
>
> So that's why I'm asking here: Does anyone have some idea whether Spark
> will ever be migrated to Scala 3 or have some idea how long it will take
> till any migration work might be started?
>
> Thank you.
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
>
>
> --
>
> --
> Никита Романов
> Sent from Mail.ru Mail 
>


Re: [Spark Core][Release]Can we consider add SPARK-39725 into 3.3.1 or 3.3.2 release?

2022-10-04 Thread Sean Owen
I think it's fine to backport that to 3.3.x, regardless of whether it
clearly affects Spark or not.

On Tue, Oct 4, 2022 at 11:31 AM phoebe chen  wrote:

> Hi:
> (Not sure if this mailing group is good to use for such question, but just
> try my luck here, thanks)
>
> SPARK-39725  has
> fix for security issues CVE-2022-2047 and CVE2022-2048 (High), which was
> set to 3.4.0 release but that will happen Feb 2023. Is it possible to have
> it in any earlier release such as 3.3.1 or 3.3.2?
>
>
>


Re: Spark ML VarianceThresholdSelector Unexpected Results

2022-09-29 Thread Sean Owen
This is sample variance, not population (i.e. divide by n-1, not n). I
think that's justified as the data are notionally a sample from a
population.

On Thu, Sep 29, 2022 at 9:21 PM 姜鑫  wrote:

> Hi folks,
>
> Has anyone used VarianceThresholdSelector? See
> https://spark.apache.org/docs/latest/ml-features.html#variancethresholdselector
>  .
> In the doc, an example is given and says `The variance for the 6 features
> are 16.67, 0.67, 8.17, 10.17, 5.07, and 11.47 respectively`, but after
> calculating I found that the variances should be 13.89, 0.56, 6.81, 8.47,
> 4.22, 9.56, and only 3 columns should be selected. Is there something
> wrong with my calculation, or is this a bug?
>
>
> Regards,
> Xin
>
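A quick numeric check of the point above: the two definitions differ only by a factor of n/(n-1), and with the n = 6 rows of the documentation example, the population variances computed in the question map onto the documented values up to rounding.

# Sample variance divides by n-1; population variance divides by n.
n = 6
population = [13.89, 0.56, 6.81, 8.47, 4.22, 9.56]         # values computed in the question
sample = [round(v * n / (n - 1), 2) for v in population]   # rescale to the n-1 definition
print(sample)   # roughly [16.67, 0.67, 8.17, 10.16, 5.06, 11.47], matching the docs up to rounding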


Re: Updating Broadcast Variable in Spark Streaming 2.4.4

2022-09-28 Thread Sean Owen
I don't think that can work. Your BroadcastUpdater is copied to the task,
with a reference to an initial broadcast. When that is later updated on the
driver, this does not affect the broadcast inside the copy in the tasks.

On Wed, Sep 28, 2022 at 10:11 AM Dipl.-Inf. Rico Bergmann <
i...@ricobergmann.de> wrote:

> Hi folks!
>
>
> I'm trying to implement an update of a broadcast var in Spark Streaming.
> The idea is that whenever some configuration value has changed (this is
> periodically checked by the driver) the existing broadcast variable is
> unpersisted and then (re-)broadcasted.
>
> In a local test setup (using a local Spark) it works fine but on a real
> cluster it doesn't work. The broadcast variable never gets updated. What
> I can see after adding some log messages is that the BroadcastUpdater
> thread is only called twice and then never again. Anyone any idea why
> this happens?
>
> Code snippet:
>
> @RequiredArgsConstructor
> public class BroadcastUpdater implements Runnable {
>  private final transient JavaSparkContext sparkContext;
>  @Getter
>  private transient volatile Broadcast>
> broadcastVar;
>  private transient volatile Map configMap;
>
>  public void run() {
>  Map configMap = getConfigMap();
>  if (this.broadcastVar == null ||
> !configMap.equals(this.configMap)) {
>  this.configMap = configMap;
>  if (broadcastVar != null) {
>  broadcastVar.unpersist(true);
>  broadcastVar.destroy(true);
>  }
>  this.broadcastVar =
> this.sparkContext.broadcast(this.configMap);
>  }
>  }
>
>  private Map getConfigMap() {
>  //impl details
>  }
> }
>
> public class StreamingFunction implements Serializable {
>
>  private transient volatile BroadcastUpdater broadcastUpdater;
>  private transient ScheduledThreadPoolExecutor
> scheduledThreadPoolExecutor = new ScheduledThreadPoolExecutor(1);
>
>  protected JavaStreamingContext startStreaming(JavaStreamingContext
> context, ConsumerStrategy consumerStrategy) {
>  broadcastUpdater = new BroadcastUpdater(context.sparkContext());
> scheduledThreadPoolExecutor.scheduleWithFixedDelay(broadcastUpdater, 0,
> 3, TimeUnit.SECONDS);
>
>  final JavaInputDStream ChangeDataRecord>> inputStream = KafkaUtils.createDirectStream(context,
>  LocationStrategies.PreferConsistent(), consumerStrategy);
>
>  inputStream.foreachRDD(rdd -> {
>  Broadcast> broadcastVar =
> broadcastUpdater.getBroadcastVar();
>  rdd.foreachPartition(partition -> {
>  if (partition.hasNext()) {
>  Map configMap =
> broadcastVar.getValue();
>
>  // iterate
>  while (partition.hasNext()) {
>  //impl logic using broadcast variable
>  }
>  }
>  }
>  }
>  }
> }
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
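One common workaround, sketched below in PySpark for brevity rather than as a fix to the Java code above: the function passed to foreachRDD runs on the driver once per batch, so the config check and re-broadcast can happen there, and each batch's closures then capture the current broadcast. load_config, handle_partition, and input_stream are hypothetical placeholders.

# Driver-side re-broadcast per batch (DStream API); a sketch of the pattern only.
state = {"bc": None, "cfg": None}

def process(rdd):
    cfg = load_config()                           # runs on the driver every batch
    if state["bc"] is None or cfg != state["cfg"]:
        if state["bc"] is not None:
            state["bc"].unpersist()
        state["bc"] = rdd.context.broadcast(cfg)  # new broadcast for subsequent batches
        state["cfg"] = cfg
    bc = state["bc"]
    rdd.foreachPartition(lambda part: handle_partition(part, bc.value))

input_stream.foreachRDD(process)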


Re: 答复: [how to]RDD using JDBC data source in PySpark

2022-09-19 Thread Sean Owen
Just use the .format('jdbc') data source? This is built in, for all
languages. You can get an RDD out if you must.

On Mon, Sep 19, 2022, 5:28 AM javaca...@163.com  wrote:

> Thank you for the answer, Alton.
>
> But I see that it uses Scala to implement it.
> I know Java/Scala can get data from MySQL using JdbcRDD fairly well,
> but I want to do the same in PySpark.
>
> Would you give me more advice? Many thanks.
>
>
> --
> javaca...@163.com
>
>
> *From:* Xiao, Alton 
> *Sent:* 2022-09-19 18:04
> *To:* javaca...@163.com; user@spark.apache.org
> *Subject:* Re: [how to]RDD using JDBC data source in PySpark
>
> Hi javacaoyu:
>
> https://hevodata.com/learn/spark-mysql/#Spark-MySQL-Integration
>
> I think Spark has already integrated MySQL
>
>
>
> *From:* javaca...@163.com 
> *Date:* Monday, September 19, 2022, 17:53
> *To:* user@spark.apache.org 
> *Subject:* [how to]RDD using JDBC data source in PySpark
>
>
> Hi guys:
>
>
>
> Is there some way to let an RDD use a JDBC data source in PySpark?
>
> I want to get data from MySQL, but PySpark does not support
> JdbcRDD like Java/Scala.
>
> I searched the docs on the web site and found no answer.
>
> So I need your help. Thank you very much.
>
>
> --
>
> javaca...@163.com
>
>
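A minimal sketch of the suggestion above; the MySQL URL, driver class, and table are placeholders, and the MySQL JDBC driver jar is assumed to be supplied (e.g. via spark.jars.packages).

# JDBC is built into the DataFrame reader; an RDD of Rows can be taken from the result.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/mydb")    # placeholder
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .option("dbtable", "my_table")
      .option("user", "u").option("password", "p")
      .load())

rdd = df.rdd    # only if an RDD is really needed; the DataFrame is usually preferable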


Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sean Owen
Wait, how do you start reduce tasks before maps are finished? is the idea
that some reduce tasks don't depend on all the maps, or at least you can
get started?
You can already execute unrelated DAGs in parallel of course.

On Wed, Sep 7, 2022 at 5:49 PM Sungwoo Park  wrote:

> You are right -- Spark can't do this with its current architecture. My
> question was: if there was a new implementation supporting pipelined
> execution, what kind of Spark jobs would benefit (a lot) from it?
>
> Thanks,
>
> --- Sungwoo
>
> On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney 
> wrote:
>
>> I don't think Spark can do this with its current architecture. It has to
>> wait for the step to be done, speculative execution isn't possible. Others
>> probably know more about why that is.
>>
>> Thanks,
>> Russell Jurney @rjurney 
>> russell.jur...@gmail.com LI  FB
>>  datasyndrome.com
>>
>>
>> On Wed, Sep 7, 2022 at 7:42 AM Sungwoo Park  wrote:
>>
>>> Hello Spark users,
>>>
>>> I have a question on the architecture of Spark (which could lead to a
>>> research problem). In its current implementation, Spark finishes executing
>>> all the tasks in a stage before proceeding to child stages. For example,
>>> given a two-stage map-reduce DAG, Spark finishes executing all the map
>>> tasks before scheduling reduce tasks.
>>>
>>> We can think of another 'pipelined execution' strategy in which tasks in
>>> child stages can be scheduled and executed concurrently with tasks in
>>> parent stages. For example, for the two-stage map-reduce DAG, while map
>>> tasks are being executed, we could schedule and execute reduce tasks in
>>> advance if the cluster has enough resources. These reduce tasks can also
>>> pre-fetch the output of map tasks.
>>>
>>> Has anyone seen Spark jobs for which this 'pipelined execution' strategy
>>> would be desirable while the current implementation is not quite adequate?
>>> Since Spark tasks usually run for a short period of time, I guess the new
>>> strategy would not have a major performance improvement. However, there
>>> might be some category of Spark jobs for which this new strategy would be
>>> clearly a better choice.
>>>
>>> Thanks,
>>>
>>> --- Sungwoo
>>>
>>>


Re: Spark equivalent to hdfs groups

2022-09-07 Thread Sean Owen
No, because this is a storage concept, and Spark is not a storage system.
You would appeal to tools and interfaces that the storage system provides,
like hdfs. Where or how the hdfs binary is available depends on how and where
you deploy Spark; it would be available on a Hadoop cluster. It's just
not a Spark question.

On Wed, Sep 7, 2022 at 9:51 AM  wrote:

> Hi Sean,
> I'm talking about HDFS Groups.
> On Linux, you can type "hdfs groups " to get the list of the groups
> user1 belongs to.
> In Zeppelin/Spark, the hdfs executable is not accessible.
> As a result, I wondered if there was a class in Spark (eg. Security or
> ACL) which would let you access a particular user's groups.
>
>
>
> - Mail original -
> De: "Sean Owen" 
> À: phi...@free.fr
> Cc: "User" 
> Envoyé: Mercredi 7 Septembre 2022 16:41:01
> Objet: Re: Spark equivalent to hdfs groups
>
>
> Spark isn't a storage system or user management system; no there is no
> notion of groups (groups for what?)
>
>
> On Wed, Sep 7, 2022 at 8:36 AM < phi...@free.fr > wrote:
>
>
> Hello,
> is there a Spark equivalent to "hdfs groups "?
> Many thanks.
> Philippe
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark equivalent to hdfs groups

2022-09-07 Thread Sean Owen
Spark isn't a storage system or user management system; no there is no
notion of groups (groups for what?)

On Wed, Sep 7, 2022 at 8:36 AM  wrote:

> Hello,
> is there a Spark equivalent to "hdfs groups "?
> Many thanks.
> Philippe
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Error in Spark in Jupyter Notebook

2022-09-06 Thread Sean Owen
That just says a task failed - no real info there. YOu have to look at
Spark logs from the UI to see why.

On Tue, Sep 6, 2022 at 7:07 AM Mamata Shee 
wrote:

> Hello,
>
> I'm using Spark in a Jupyter Notebook, but when performing some queries I
> get the below error. Can you please tell me the actual reason
> for this, or give any suggestions to make it work?
>
> *Error:*
> [image: image.png]
>
> Thank you
>
> 
>
>


Re: Spark got incorrect scala version while using spark 3.2.1 and spark 3.2.2

2022-08-26 Thread Sean Owen
Spark is built with and ships with a copy of Scala. It doesn't use your
local version.

On Fri, Aug 26, 2022 at 2:55 AM  wrote:

> Hi all,
>
> I found a strange thing. I have run Spark 3.2.1 prebuilt in local mode. My
> OS Scala version is 2.13.7.
> But when I run spark-submit and then check the Spark UI, the web page shows
> that my Scala version is 2.13.5.
> I used spark-shell; it also showed that my Scala version is 2.13.5.
> Then I tried Spark 3.2.2; it also showed that my Scala version is 2.13.5.
> I checked the code; it seems that SparkEnv gets the Scala version from
> "scala.util.Properties.versionString".
> Not sure why it shows a different Scala version. Is it a bug or not?
>
> Thanks
>
> Liang
>


Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Sean Owen
Oh whoa I didn't realize we had this! I stand corrected

On Thu, Aug 25, 2022, 12:52 PM Takuya UESHIN  wrote:

> Hi Subash,
>
> Have you tried the Python/Pandas UDF Profiler introduced in Spark 3.3?
> -
> https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf
>
> Hope it can help you.
>
> Thanks.
>
> On Thu, Aug 25, 2022 at 10:18 AM Russell Jurney 
> wrote:
>
>> Subash, I’m here to help :)
>>
>> I started a test script to demonstrate a solution last night but got a
>> cold and haven’t finished it. Give me another day and I’ll get it to you.
>> My suggestion is that you run PySpark locally in pytest with a fixture to
>> generate and yield your SparkContext and SparkSession, and then write tests
>> that load some test data, perform some count operation and checkpoint to
>> ensure that data is loaded, start a timer, run your UDF on the DataFrame,
>> checkpoint again or write some output to disk to make sure it finishes and
>> then stop the timer and compute how long it takes. I’ll show you some code,
>> I have to do this for Graphlet AI’s RTL utils and other tools to figure out
>> how much overhead there is using Pandera and Spark together to validate
>> data: https://github.com/Graphlet-AI/graphlet
>>
>> I’ll respond by tomorrow evening with code in a gist! We’ll see if it
>> gets consistent, measurable and valid results! :)
>>
>> Russell Jurney
>>
>> On Thu, Aug 25, 2022 at 10:00 AM Sean Owen  wrote:
>>
>>> It's important to realize that while pandas UDFs and pandas on Spark are
>>> both related to pandas, they are not themselves directly related. The first
>>> lets you use pandas within Spark, the second lets you use pandas on Spark.
>>>
>>> Hard to say with this info but you want to look at whether you are doing
>>> something expensive in each UDF call and consider amortizing it with the
>>> scalar iterator UDF pattern. Maybe.
>>>
>>> A pandas UDF is not spark code itself so no there is no tool in spark to
>>> profile it. Conversely any approach to profiling pandas or python would
>>> work here .
>>>
>>> On Thu, Aug 25, 2022, 11:22 AM Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> May be I am jumping to conclusions and making stupid guesses, but have
>>>> you tried koalas now that it is natively integrated with pyspark??
>>>>
>>>> Regards
>>>> Gourav
>>>>
>>>> On Thu, 25 Aug 2022, 11:07 Subash Prabanantham, <
>>>> subashpraba...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I was wondering if we have any best practices on using pandas UDF ?
>>>>> Profiling UDF is not an easy task and our case requires some drilling down
>>>>> on the logic of the function.
>>>>>
>>>>>
>>>>> Our use case:
>>>>> We are using func(DataFrame) => DataFrame as the interface to the Pandas
>>>>> UDF. When running only the function locally, it runs fast, but when
>>>>> executed in the Spark environment the processing time is more than expected.
>>>>> We have one column where the value is large (BinaryType -> 600KB), and we are
>>>>> wondering whether this could make the Arrow computation slower?
>>>>>
>>>>> Is there any profiling or best way to debug the cost incurred using
>>>>> pandas UDF ?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Subash
>>>>>
>>>>> --
>>
>> Thanks,
>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>> <http://facebook.com/jurney> datasyndrome.com
>>
>
>
> --
> Takuya UESHIN
>
>
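To illustrate the 'scalar iterator' pattern mentioned above, a hedged sketch: expensive per-call setup is done once per task and reused across Arrow batches instead of being repeated for every batch. load_model and the column name are hypothetical placeholders.

# Iterator-of-Series pandas UDF: setup runs once per task, not once per batch.
from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def score(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = load_model()                    # expensive, amortized over all batches
    for batch in batches:
        yield batch.map(model.predict_one)

scored = df.withColumn("score", score("feature_col"))   # df and "feature_col" are assumptions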


Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Sean Owen
It's important to realize that while pandas UDFs and pandas on Spark are
both related to pandas, they are not themselves directly related. The first
lets you use pandas within Spark, the second lets you use pandas on Spark.

Hard to say with this info but you want to look at whether you are doing
something expensive in each UDF call and consider amortizing it with the
scalar iterator UDF pattern. Maybe.

A pandas UDF is not spark code itself so no there is no tool in spark to
profile it. Conversely any approach to profiling pandas or python would
work here .

On Thu, Aug 25, 2022, 11:22 AM Gourav Sengupta 
wrote:

> Hi,
>
> May be I am jumping to conclusions and making stupid guesses, but have you
> tried koalas now that it is natively integrated with pyspark??
>
> Regards
> Gourav
>
> On Thu, 25 Aug 2022, 11:07 Subash Prabanantham, 
> wrote:
>
>> Hi All,
>>
>> I was wondering if we have any best practices on using pandas UDF ?
>> Profiling UDF is not an easy task and our case requires some drilling down
>> on the logic of the function.
>>
>>
>> Our use case:
>> We are using func(DataFrame) => DataFrame as the interface to the Pandas UDF.
>> When running only the function locally, it runs fast, but when executed
>> in the Spark environment the processing time is more than expected. We have
>> one column where the value is large (BinaryType -> 600KB), and we are wondering
>> whether this could make the Arrow computation slower?
>>
>> Is there any profiling or best way to debug the cost incurred using
>> pandas UDF ?
>>
>>
>> Thanks,
>> Subash
>>
>>


Re: spark-3.2.2-bin-without-hadoop : NoClassDefFoundError: org/apache/log4j/spi/Filter when starting the master

2022-08-24 Thread Sean Owen
You have to provide your own Hadoop distro and all its dependencies. This
build is intended for use on a Hadoop cluster, really. If you're running
stand-alone, you should not be using it. Use a 'normal' distribution that
bundles Hadoop libs.

On Wed, Aug 24, 2022 at 9:35 AM FLORANCE Grégory
 wrote:

>   Hi,
>
>
>
>   I've downloaded the 3.2.2 "without Hadoop" Spark distribution
> in order to test it in a Hadoop-free context.
>
>   I tested the version with Hadoop and it worked well.
>
>
>
>   When I wanted to start the master, I encountered this basic
> error : Caused by: java.lang.NoClassDefFoundError:
> org/apache/log4j/spi/Filter
>
>
>
> $ ./start-master.sh
>
> starting org.apache.spark.deploy.master.Master, logging to
> /projets/kfk/home/kfkusrm1/spark-3.2.2-bin-without-hadoop/logs/spark-kfkusrm1-org.apache.spark.deploy.master.Master-1-pcld1313.angers.cnp.fr.out
>
> failed to launch: nice -n 0
> /projets/kfk/home/kfkusrm1/spark-3.2.2-bin-without-hadoop/bin/spark-class
> org.apache.spark.deploy.master.Master --host pcld1313.angers.cnp.fr
> --port 7077 --webui-port 8080
>
>   Spark Command: /soft/java/jdk-11.0.11+9/bin/java -cp
> /projets/kfk/home/kfkusrm1/spark-3.2.2-bin-without-hadoop/conf/:/projets/kfk/home/kfkusrm1/spark-3.2.2-bin-without-hadoop/jars/*
> -Xmx1g org.apache.spark.deploy.master.Master --host pcld1313.angers.cnp.fr
> --port 7077 --webui-port 8080
>
>   
>
>   Error: Unable to initialize main class
> org.apache.spark.deploy.master.Master
>
>   Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter
>
> full log in
> /projets/kfk/home/kfkusrm1/spark-3.2.2-bin-without-hadoop/logs/spark-kfkusrm1-org.apache.spark.deploy.master.Master-1-pcld1313.angers.cnp.fr.out
>
>
>
> *I didn't find any slf4j libraries* in the "without
> hadoop" distribution!
>
> But we can find it in the distribution with Hadoop :
>
> slf4j-log4j12-1.7.30.jar
>
> slf4j-api-1.7.30.jar
>
>
>
>   Is it a problem in the distribution available at
> https://www.apache.org/dyn/closer.lua/spark/spark-3.2.2/spark-3.2.2-bin-without-hadoop.tgz
> ?
>
>
>
>   Thanks for your help.
>
>
>
>   Regards
>
>
>
>   Grégory
>
>
>
>
>
>
> *Grégory FLORANCE*
> Service Architecture - YU2
>
> Direction de l’expérience client, des services numériques et de la donnée
>
>
> CNP Assurances
> 1 place François Mitterrand, 49100 ANGERS
> Tél : 02 41 96 39 64 ou 06 08 02 63 45
> gregory.flora...@cnp.fr
>
>
>
>
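For anyone who does want to run the 'Hadoop free' build, the documented approach is to point Spark at an existing Hadoop installation's classpath, which also supplies the slf4j/log4j classes missing above. A sketch, assuming a local Hadoop install whose hadoop command is on the PATH:

# conf/spark-env.sh (sketch)
export SPARK_DIST_CLASSPATH=$(hadoop classpath)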


Re: Spark with GPU

2022-08-13 Thread Sean Owen
This isn't a Spark question, but rather a question about whatever Spark
application you are talking about. RAPIDS?

On Sat, Aug 13, 2022 at 10:35 AM rajat kumar 
wrote:

> Thanks Sean.
>
> Also, I observed that lots of things are not supported on GPU by NVIDIA,
> e.g. nested types, decimal type, UDFs, etc.
> So, will it automatically use the CPU for those tasks which require
> nested types, or will it run on the GPU and fail?
>
> Thanks
> Rajat
>
> On Sat, Aug 13, 2022, 18:54 Sean Owen  wrote:
>
>> Spark does not use GPUs itself, but tasks you run on Spark can.
>> The only 'support' there is is for requesting GPUs as resources for
>> tasks, so it's just a question of resource management. That's in OSS.
>>
>> On Sat, Aug 13, 2022 at 8:16 AM rajat kumar 
>> wrote:
>>
>>> Hello,
>>>
>>> I have been hearing about GPU in spark3.
>>>
>>> For batch jobs, will it help to improve performance? Also, is GPU
>>> support available only on Databricks or on cloud based Spark clusters ?
>>>
>>> I am new , if anyone can share insight , it will help
>>>
>>> Thanks
>>> Rajat
>>>
>>
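The resource-request support referred to above is plain Spark configuration; a minimal sketch, where the amounts and the discovery-script path are assumptions (a sample getGpusResources.sh ships with Spark's examples):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.resource.gpu.amount", "1")
         .config("spark.task.resource.gpu.amount", "1")
         .config("spark.executor.resource.gpu.discoveryScript",
                 "/opt/spark/examples/src/main/scripts/getGpusResources.sh")  # assumed path
         .getOrCreate())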


Re: Spark with GPU

2022-08-13 Thread Sean Owen
Spark does not use GPUs itself, but tasks you run on Spark can.
The only 'support' there is is for requesting GPUs as resources for tasks,
so it's just a question of resource management. That's in OSS.

On Sat, Aug 13, 2022 at 8:16 AM rajat kumar 
wrote:

> Hello,
>
> I have been hearing about GPU in spark3.
>
> For batch jobs, will it help to improve performance? Also, is GPU
> support available only on Databricks or on cloud based Spark clusters ?
>
> I am new , if anyone can share insight , it will help
>
> Thanks
> Rajat
>


Re: Spark Scala API still not updated for 2.13 or it's a mistake?

2022-08-02 Thread Sean Owen
Oh ha we do provide a pre-built binary for 2.13, oops.
I can just remove that line from the docs, I think it's just out of date.

On Tue, Aug 2, 2022 at 11:01 AM Roman I  wrote:

> Ok, though the doc is misleading. The official site provides both 2.12 and
> 2.13 pre-built binaries.
> Thanks
>
> On 2 Aug 2022, at 18:52, Sean Owen  wrote:
>
> Spark 3.3.0 supports 2.13, though you need to build it for 2.13. The
> default binary distro uses 2.12.
>
> On Tue, Aug 2, 2022, 10:47 AM Roman I  wrote:
>
>>
>> For the Scala API, Spark 3.3.0 uses Scala 2.12. You will need to use a
>> compatible Scala version (2.12.x).
>>
>> https://spark.apache.org/docs/latest/
>>
> 
>
>
>


Re: Spark Scala API still not updated for 2.13 or it's a mistake?

2022-08-02 Thread Sean Owen
Spark 3.3.0 supports 2.13, though you need to build it for 2.13. The
default binary distro uses 2.12.

On Tue, Aug 2, 2022, 10:47 AM Roman I  wrote:

>
> For the Scala API, Spark 3.3.0 uses Scala 2.12. You will need to use a
> compatible Scala version (2.12.x).
>
> https://spark.apache.org/docs/latest/
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-02 Thread Sean Owen
That isn't the issue - the table does not exist anyway, but the storage
path does.

On Tue, Aug 2, 2022 at 6:48 AM Stelios Philippou  wrote:

> Hi Kumba.
>
> SQL Structure is a bit different for
> CREATE OR REPLACE TABLE
>
>
> You can only do the following
> CREATE TABLE IF NOT EXISTS
>
>
>
>
> https://spark.apache.org/docs/3.3.0/sql-ref-syntax-ddl-create-table-datasource.html
>
> On Tue, 2 Aug 2022 at 14:38, Sean Owen  wrote:
>
>> I don't think "CREATE OR REPLACE TABLE" exists (in SQL?); this isn't a
>> VIEW.
>> Delete the path first; that's simplest.
>>
>> On Tue, Aug 2, 2022 at 12:55 AM Kumba Janga  wrote:
>>
>>> Thanks Sean! That was a simple fix. I changed it to "Create or Replace
>>> Table" but now I am getting the following error. I am still researching
>>> solutions but so far no luck.
>>>
>>> ParseException:
>>> mismatched input '' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 
>>> 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 
>>> 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 
>>> 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 
>>> 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 
>>> 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 
>>> 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 
>>> 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', 
>>> DATABASES, 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 
>>> 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 
>>> 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 
>>> 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 
>>> 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 
>>> 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 
>>> 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 
>>> 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 
>>> 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 
>>> 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 
>>> 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 
>>> 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', 
>>> NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 
>>> 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 
>>> 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 
>>> 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 
>>> 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 
>>> 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 
>>> 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 
>>> 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 
>>> 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 
>>> 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 
>>> 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 
>>> 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TO', 
>>> 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 
>>> 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 
>>> 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 
>>> 'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', IDENTIFIER, 
>>> BACKQUOTED_IDENTIFIER}(line 1, pos 23)
>>>
>>> == SQL ==
>>> CREATE OR REPLACE TABLE
>>>
>>>
>>> On Mon, Aug 1, 2022 at 8:32 PM Sean Owen  wrote:
>>>
>>>> Pretty much what it says? you are creating a table over a path that
>>>> already has data in it. You can't do that without mode=overwrite at least,
>>>> if that's what you intend.
>>>>
>>>> On Mon, Aug 1, 2022 at 7:29 PM Kumba Janga  wrote:
>>>>
>>>>>
>>>>>
>>>>>- Component: Spark Delta, Spark SQL
>>>>>- Level: Beginner
>>>>>- Scenario: Debug, How-to
>>>>>
>>>>> *Python in Jupyter:*
>>>>>
>>>>> import pyspark
>>>>> import pyspark.sql.functions
>>>

Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-02 Thread Sean Owen
I don't think "CREATE OR REPLACE TABLE" exists (in SQL?); this isn't a VIEW.
Delete the path first; that's simplest.

On Tue, Aug 2, 2022 at 12:55 AM Kumba Janga  wrote:

> Thanks Sean! That was a simple fix. I changed it to "Create or Replace
> Table" but now I am getting the following error. I am still researching
> solutions but so far no luck.
>
> ParseException:
> mismatched input '' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 
> 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 
> 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 
> 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 
> 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 
> 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 
> 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 
> 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', 
> DATABASES, 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 
> 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 
> 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 
> 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 
> 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 
> 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 
> 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 
> 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 
> 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 
> 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 
> 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 
> 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', NOT, 
> 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 
> 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE', 
> 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'PLACING', 
> 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PURGE', 
> 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 
> 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 
> 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 
> 'ROWS', 'SCHEMA', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 
> 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 
> 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', 
> 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 
> 'TERMINATED', 'THEN', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 
> 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 
> 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 
> 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 
> 'WINDOW', 'WITH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 23)
>
> == SQL ==
> CREATE OR REPLACE TABLE
>
>
> On Mon, Aug 1, 2022 at 8:32 PM Sean Owen  wrote:
>
>> Pretty much what it says? you are creating a table over a path that
>> already has data in it. You can't do that without mode=overwrite at least,
>> if that's what you intend.
>>
>> On Mon, Aug 1, 2022 at 7:29 PM Kumba Janga  wrote:
>>
>>>
>>>
>>>- Component: Spark Delta, Spark SQL
>>>- Level: Beginner
>>>- Scenario: Debug, How-to
>>>
>>> *Python in Jupyter:*
>>>
>>> import pyspark
>>> import pyspark.sql.functions
>>>
>>> from pyspark.sql import SparkSession
>>> spark = (
>>> SparkSession
>>> .builder
>>> .appName("programming")
>>> .master("local")
>>> .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
>>> .config("spark.sql.extensions", 
>>> "io.delta.sql.DeltaSparkSessionExtension")
>>> .config("spark.sql.catalog.spark_catalog", 
>>> "org.apache.spark.sql.delta.catalog.DeltaCatalog")
>>> .config('spark.ui.port', '4050')
>>> .getOrCreate()
>>>
>>> )
>>> from delta import *
>>>
>>> string_20210609 = '''worked_date,worker_id,delete_flag,hours_worked
>>> 2021-06-09,1001,Y,7
>>> 2021-06-09,1002,Y,3.75
>>> 2021-06-09,1003,Y,7.5
>>> 2021-06-09,1004,Y,6.25'''
>>>
>>> rdd_20210609 = spark.sparkCon
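For the error at the top of this thread, a minimal sketch of the two ways out that were suggested: write with overwrite mode, or keep the CREATE TABLE but make it idempotent and point it at the already-populated path. The path, table name, and df are placeholders.

# Sketch only; assumes `df` is the DataFrame built from the test data above.
df.write.format("delta").mode("overwrite").save("/tmp/delta/programming")

spark.sql("""
  CREATE TABLE IF NOT EXISTS programming
  USING DELTA
  LOCATION '/tmp/delta/programming'
""")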
