Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-16 Thread Sean Owen
Yeah let's get that fix in, but it seems to be a minor test only issue so
should not block release.
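
For context, a minimal Scala sketch of the failure mode described in the quoted messages below (the names here are hypothetical and this is not the actual Spark Connect code): if plan ids come from a process-wide counter, an assertion that hard-codes id 0 only holds when the test's DataFrame is the very first one created in the JVM.

import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-in for a process-wide plan-id allocator.
object PlanIdAllocator {
  private val counter = new AtomicLong(0)
  def next(): Long = counter.getAndIncrement()
}

final case class TestRelation(id: Long = PlanIdAllocator.next())

object SuiteOrderDemo extends App {
  // Passes only if nothing else has allocated an id yet; when Maven runs
  // another suite first, the id shifts and a hard-coded 0 no longer matches.
  val sparkTestRelation = TestRelation()
  assert(sparkTestRelation.id == 0)
}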

On Fri, Feb 16, 2024, 9:30 AM yangjie01  wrote:

> Very sorry. When I was fixing `SPARK-45242` (
> https://github.com/apache/spark/pull/43594), I noticed that the
> `Affects Version` and `Fix Version` of SPARK-45242 were both 4.0, and I
> didn't realize that it had also been merged into branch-3.5, so I didn't
> advocate for SPARK-45357 to be backported to branch-3.5.
>
>
>
> As far as I know, the condition to trigger this test failure is: when
> using Maven to test the `connect` module, if  `sparkTestRelation` in
> `SparkConnectProtoSuite` is not the first `DataFrame` to be initialized,
> then the `id` of `sparkTestRelation` will no longer be 0. So, I think this
> is indeed related to the order in which Maven executes the test cases in
> the `connect` module.
>
>
>
> I have submitted a backport PR
> <https://github.com/apache/spark/pull/45141> to branch-3.5, and if
> necessary, we can merge it to fix this test issue.
>
>
>
> Jie Yang
>
>
>
> *From:* Jungtaek Lim 
> *Date:* Friday, February 16, 2024, 22:15
> *To:* Sean Owen , Rui Wang 
> *Cc:* dev 
> *Subject:* Re: [VOTE] Release Apache Spark 3.5.1 (RC2)
>
>
>
> I traced back relevant changes and got a sense of what happened.
>
>
>
> Yangjie figured out the issue via link
> <https://mailshield.baidu.com/check?q=8dOSfwXDFpe5HSp%2b%2bgCPsNQ52B7S7TAFG56Vj3tiFgMkCyOrQEGbg03AVWDX5bwwyIW7sZx3JZox3w8Jz1iw%2bPjaOZYmLWn2>.
> It's a tricky issue according to the comments from Yangjie - the test is
> dependent on ordering of execution for test suites. He said it does not
> fail in sbt, hence CI build couldn't catch it.
>
> He fixed it via link
> <https://mailshield.baidu.com/check?q=ojK3dg%2fDFf3xmQ8SPzsIou3EKaE1ZePctdB%2fUzhWmewnZb5chnQM1%2f8D1JDJnkxF>,
> but we missed that the offending commit was also ported back to 3.5 as
> well, hence the fix wasn't ported back to 3.5.
>
>
>
> Surprisingly, I can't reproduce this locally even with Maven. In my attempt
> to reproduce, SparkConnectProtoSuite was executed third:
> SparkConnectStreamingQueryCacheSuite, then ExecuteEventsManagerSuite,
> and then SparkConnectProtoSuite. Maybe it is very specific to the environment,
> not just Maven? My env: MBP with M1 Pro chip, macOS 14.3.1, OpenJDK 17.0.9. I
> used build/mvn (Maven 3.8.8).
>
>
>
> I'm not 100% sure this is something we should fail the release as it's a
> test only and sounds very environment dependent, but I'll respect your call
> on vote.
>
>
>
> Btw, looks like Rui also made a relevant fix via link
> <https://mailshield.baidu.com/check?q=TUbVzroxG%2fbi2P4qN0kbggzXuPzSN%2bKDoUFGhS9xMet8aXVw6EH0rMr1MKJqp2E2>
>  (not
> to fix the failing test but to fix other issues), but this also wasn't
> ported back to 3.5. @Rui Wang  Do you think this is
> a regression issue and warrants a new RC?
>
>
>
>
>
> On Fri, Feb 16, 2024 at 11:38 AM Sean Owen  wrote:
>
> Is anyone seeing this Spark Connect test failure? then again, I have some
> weird issue with this env that always fails 1 or 2 tests that nobody else
> can replicate.
>
>
>
> - Test observe *** FAILED ***
>   == FAIL: Plans do not match ===
>   !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 0
>    CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 44
>    +- LocalRelation , [id#0, name#0]
>    +- LocalRelation , [id#0, name#0]
> (PlanTest.scala:179)
>
>
>
> On Thu, Feb 15, 2024 at 1:34 PM Jungtaek Lim 
> wrote:
>
> DISCLAIMER: The RC for Apache Spark 3.5.1 starts with RC2, as I belatedly
> discovered a doc generation issue after tagging RC1.
>
>
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.5.1.
>
> The vote is open until February 18th 9AM (PST) and passes if a majority +1
> PMC votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.5.1-rc2 (commit
> fd86f85e181fc2dc0f50a096855acf83a6cc5d9c):
> https://github.com/apache/spark/tree/v3.5.1-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bin/

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-15 Thread Sean Owen
Is anyone seeing this Spark Connect test failure? then again, I have some
weird issue with this env that always fails 1 or 2 tests that nobody else
can replicate.

- Test observe *** FAILED ***
  == FAIL: Plans do not match ===
  !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 0
   CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 44
   +- LocalRelation , [id#0, name#0]
   +- LocalRelation , [id#0, name#0]
(PlanTest.scala:179)
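
For context, the DataFrame side of what this test builds looks roughly like the sketch below (illustrative only, not the actual suite code; `df` stands in for the test relation with `id` and `name` columns):

import org.apache.spark.sql.functions.{col, min, max, sum}

val observed = df.observe(
  "my_metric",
  min(col("id")).as("min_val"),
  max(col("id")).as("max_val"),
  sum(col("id")))
// The failure above is not about the metrics themselves: the two
// CollectMetrics nodes carry different internal ids (0 vs 44), which is
// what the suite-ordering discussion in the replies is about.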

On Thu, Feb 15, 2024 at 1:34 PM Jungtaek Lim 
wrote:

> DISCLAIMER: The RC for Apache Spark 3.5.1 starts with RC2, as I belatedly
> discovered a doc generation issue after tagging RC1.
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.5.1.
>
> The vote is open until February 18th 9AM (PST) and passes if a majority +1
> PMC votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.5.1-rc2 (commit
> fd86f85e181fc2dc0f50a096855acf83a6cc5d9c):
> https://github.com/apache/spark/tree/v3.5.1-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1452/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-docs/
>
> The list of bug fixes going into 3.5.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353495
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC via "pip install
> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bin/pyspark-3.5.1.tar.gz
> "
> and see if anything important breaks.
> In Java/Scala, you can add the staging repository to your project's
> resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.5.1?
> ===
>
> The current list of open tickets targeted at 3.5.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.5.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: Removing Kinesis in Spark 4

2024-01-20 Thread Sean Owen
I'm not aware of much usage. but that doesn't mean a lot.

FWIW, in the past month or so, the Kinesis docs page got about 700 views,
compared to about 1400 for Kafka
https://analytics.apache.org/index.php?module=CoreHome=index=yesterday=day=40#?idSite=40=range=2023-12-15,2024-01-20=General_Actions=Actions_SubmenuPageTitles

Those are "low" in general, compared to the views for streaming pages,
which got tens of thousands of views.

I do feel like it's unmaintained, and do feel like it might be a stretch to
leave it lying around until Spark 5.
It's not exactly unused though.

I would not object to removing it unless there is some voice of support
here.

On Sat, Jan 20, 2024 at 10:38 AM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> From the dev thread: What else could be removed in Spark 4?
> 
>
> On Aug 17, 2023, at 1:44 AM, Yang Jie  wrote:
>
> I would like to know how we should handle the two Kinesis-related modules
> in Spark 4.0. They have a very low frequency of code updates, and because
> the corresponding tests are not continuously executed in any GitHub Actions
> pipeline, I think they significantly lack quality assurance. On top of
> that, I am not certain if the test cases, which require AWS credentials in
> these modules, get verified during each Spark version release.
>
>
> Did we ever reach a decision about removing Kinesis in Spark 4?
>
> I was cleaning up some docs related to Kinesis and came across a reference
> to some Java API docs that I could not find. And looking around I came
> across both this email thread and this thread on JIRA about potentially
> removing Kinesis.
>
> But as far as I can tell we haven’t made a clear decision one way or the
> other.
>
> Nick
>
>


Re: Regression? - UIUtils::formatBatchTime - [SPARK-46611][CORE] Remove ThreadLocal by replace SimpleDateFormat with DateTimeFormatter

2024-01-08 Thread Sean Owen
Agreed, that looks wrong. From the code, it seems that "timezone" is only
used for testing, though apparently no test caught this. I'll submit a PR
to patch it in any event: https://github.com/apache/spark/pull/44619
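
A minimal sketch of the shape Martin suggests below (illustrative only; the formatter patterns and method signature here are placeholders, not the actual patch):

import java.time.Instant
import java.time.ZoneId
import java.time.format.DateTimeFormatter
import java.util.TimeZone

// Placeholders standing in for the two formatter fields referenced below.
val batchTimeFormat = DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss")
val batchTimeFormatWithMilliseconds = DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss.SSS")

def formatBatch(batchTime: Instant, timezone: TimeZone, showMillis: Boolean): String = {
  val base = if (showMillis) batchTimeFormatWithMilliseconds else batchTimeFormat
  val zone = if (timezone != null) timezone.toZoneId else ZoneId.systemDefault()
  // withZone returns a NEW formatter; keep it in a local val instead of
  // discarding the result, so no oldTimezones/finally bookkeeping is needed.
  val formatter = base.withZone(zone)
  formatter.format(batchTime)
}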

On Mon, Jan 8, 2024 at 1:33 AM Janda Martin  wrote:

> I think that
> [SPARK-46611][CORE] Remove ThreadLocal by replace SimpleDateFormat with
> DateTimeFormatter
>
> introduced a regression in UIUtils::formatBatchTime when a timezone is
> defined.
>
> DateTimeFormatter is thread-safe and immutable according to its Javadoc, so
> DateTimeFormatter::withZone returns a new instance when the zone is
> changed.
>
> The following code has no effect:
>
>   val oldTimezones = (batchTimeFormat.getZone,
>     batchTimeFormatWithMilliseconds.getZone)
>   if (timezone != null) {
>     val zoneId = timezone.toZoneId
>     batchTimeFormat.withZone(zoneId)
>     batchTimeFormatWithMilliseconds.withZone(zoneId)
>   }
>
> Suggested fix:
> introduce local variables for "batchTimeFormat" and
> "batchTimeFormatWithMilliseconds" and remove "oldTimezones" and "finally"
> block.
>
>   I hope that I'm right. I just read the code; I didn't run any tests.
>
>  Thank you
>Martin
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Should Spark 4.x use Java modules (those you define with module-info.java sources)?

2023-12-04 Thread Sean Owen
It already does. I think that's not the same idea?

On Mon, Dec 4, 2023, 8:12 PM Almog Tavor  wrote:

> I think Spark should start shading its problematic deps, similar to how
> it’s done in Flink.
>
> On Mon, 4 Dec 2023 at 2:57 Sean Owen  wrote:
>
>> I am not sure we can control that - the Scala _x.y suffix has particular
>> meaning in the Scala ecosystem for artifacts and thus the naming of .jar
>> files. And we need to work with the Scala ecosystem.
>>
>> What can't handle these files, Spring Boot? does it somehow assume the
>> .jar file name relates to Java modules?
>>
>> By the by, Spark 4 is already moving to the jakarta.* packages for
>> similar reasons.
>>
>> I don't think Spark does or can really leverage Java modules. It started
>> waaay before that, and I expect that it has some structural issues that are
>> incompatible with Java modules, like multiple places declaring code in the
>> same Java package.
>>
>> As in all things, if there's a change that doesn't harm anything else and
>> helps support for Java modules, sure, suggest it. If it has the conflicts I
>> think it will, probably not possible and not really a goal I think.
>>
>>
>> On Sun, Dec 3, 2023 at 11:30 AM Marc Le Bihan 
>> wrote:
>>
>>> Hello,
>>>
>>> Last month, I attempted to upgrade my
>>> Spring-Boot 2 Java project, which relies heavily on Spark 3.4.2, to
>>> Spring-Boot 3. It hasn't succeeded yet, but it was informative.
>>>
>>> Spring-Boot 2 → 3 means especially javax.* becoming jakarta.* :
>>> javax.activation, javax.ws.rs, javax.persistence, javax.validation,
>>> javax.servlet... all of these have to change their packages and
>>> dependencies.
>>> Apart from that, there was some trouble with ANTLR 4 versus ANTLR 3,
>>> and a few things with SLF4J and Log4j.
>>>
>>> It was not easy, and I guessed that going into modules could be a
>>> key. But when I get near the Spark submodules of my project, it fails with
>>> messages such as:
>>> package org.apache.spark.sql.types is declared in the unnamed
>>> module, but module fr.ecoemploi.outbound.spark.core does not read it
>>>
>>> But I can't handle the Spark dependencies easily, because they have
>>> an "invalid name" for Java. The problem is that it doesn't accept the "_" that
>>> is in the "_2.13" suffix of the jars.
>>> [WARNING] Can't extract module name from
>>> breeze-macros_2.13-2.1.0.jar: breeze.macros.2.13: Invalid module name: '2'
>>> is not a Java identifier
>>> [WARNING] Can't extract module name from
>>> spark-tags_2.13-3.4.2.jar: spark.tags.2.13: Invalid module name: '2' is not
>>> a Java identifier
>>> [WARNING] Can't extract module name from
>>> spark-unsafe_2.13-3.4.2.jar: spark.unsafe.2.13: Invalid module name: '2' is
>>> not a Java identifier
>>> [WARNING] Can't extract module name from
>>> spark-mllib_2.13-3.4.2.jar: spark.mllib.2.13: Invalid module name: '2' is
>>> not a Java identifier
>>> [... around 30 ...]
>>>
>>> I think that changing the naming pattern of the Spark jars for the
>>> 4.x could be a good idea,
>>> but beyond that, what about attempting to integrate Spark into
>>> modules, with its submodules defining module-info.java?
>>>
>>> Is it something that you think that [must | should | might | should
>>> not | must not] be done?
>>>
>>> Regards,
>>>
>>> Marc Le Bihan
>>>
>>


Re: Should Spark 4.x use Java modules (those you define with module-info.java sources)?

2023-12-03 Thread Sean Owen
I am not sure we can control that - the Scala _x.y suffix has particular
meaning in the Scala ecosystem for artifacts and thus the naming of .jar
files. And we need to work with the Scala ecosystem.

What can't handle these files, Spring Boot? does it somehow assume the .jar
file name relates to Java modules?

By the by, Spark 4 is already moving to the jakarta.* packages for similar
reasons.

I don't think Spark does or can really leverage Java modules. It started
waaay before that, and I expect that it has some structural issues that are
incompatible with Java modules, like multiple places declaring code in the
same Java package.

As in all things, if there's a change that doesn't harm anything else and
helps support for Java modules, sure, suggest it. If it has the conflicts I
think it will, probably not possible and not really a goal I think.


On Sun, Dec 3, 2023 at 11:30 AM Marc Le Bihan  wrote:

> Hello,
>
> Last month, I attempted to upgrade my Spring-Boot
> 2 Java project, which relies heavily on Spark 3.4.2, to Spring-Boot 3. It
> hasn't succeeded yet, but it was informative.
>
> Spring-Boot 2 → 3 means especially javax.* becoming jakarta.* :
> javax.activation, javax.ws.rs, javax.persistence, javax.validation,
> javax.servlet... all of these have to change their packages and
> dependencies.
> Apart from that, there was some trouble with ANTLR 4 versus ANTLR 3,
> and a few things with SLF4J and Log4j.
>
> It was not easy, and I guessed that going into modules could be a key.
> But when I get near the Spark submodules of my project, it fails with messages
> such as:
> package org.apache.spark.sql.types is declared in the unnamed
> module, but module fr.ecoemploi.outbound.spark.core does not read it
>
> But I can't handle the Spark dependencies easily, because they have an
> "invalid name" for Java. The problem is that it doesn't accept the "_" that is
> in the "_2.13" suffix of the jars.
> [WARNING] Can't extract module name from
> breeze-macros_2.13-2.1.0.jar: breeze.macros.2.13: Invalid module name: '2'
> is not a Java identifier
> [WARNING] Can't extract module name from
> spark-tags_2.13-3.4.2.jar: spark.tags.2.13: Invalid module name: '2' is not
> a Java identifier
> [WARNING] Can't extract module name from
> spark-unsafe_2.13-3.4.2.jar: spark.unsafe.2.13: Invalid module name: '2' is
> not a Java identifier
> [WARNING] Can't extract module name from
> spark-mllib_2.13-3.4.2.jar: spark.mllib.2.13: Invalid module name: '2' is
> not a Java identifier
> [... around 30 ...]
>
> I think that changing the naming pattern of the Spark jars for the 4.x
> could be a good idea,
> but beyond that, what about attempting to integrate Spark into
> modules, with its submodules defining module-info.java?
>
> Is it something that you think that [must | should | might | should
> not | must not] be done?
>
> Regards,
>
> Marc Le Bihan
>


Re: Spark Compatibility with Spring Boot 3.x

2023-10-05 Thread Sean Owen
I think we already updated this in Spark 4. However, for now you would also
have to include a JAR with the jakarta.* classes instead.
You are welcome to try Spark 4 now by building from master, but it's far
from release.

On Thu, Oct 5, 2023 at 11:53 AM Ahmed Albalawi
 wrote:

> Hello team,
>
> We are in the process of upgrading one of our apps to Spring Boot 3.x
> while using Spark, and we have encountered an issue with Spark
> compatibility, specifically with Jakarta Servlet. Spring Boot 3.x uses
> Jakarta Servlet while Spark uses Javax Servlet. Can we get some guidance on
> how to upgrade to Spring Boot 3.x while continuing to use Spark?
>
> The specific error is listed below:
>
> java.lang.NoClassDefFoundError: javax/servlet/Servlet
> at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:239)
> at org.apache.spark.SparkContext.(SparkContext.scala:503)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2888)
> at org.apache.spark.SparkContext.getOrCreate(SparkContext.scala)
>
> The error comes up when we try to run a mvn clean install, and the issue is 
> in our test cases. This issue happens specifically when we build our spark 
> session. The line of code it traces down to is as follows:
>
> *session = 
> SparkSession.builder().sparkContext(SparkContext.getOrCreate(sparkConf)).getOrCreate();*
>
> What we have tried:
>
> - We noticed, according to this post, there are no compatible versions of
> Spark using version 5 of the Jakarta Servlet API.
>
> - We've tried using the Maven Shade plugin to use jakarta instead of javax,
> but are running into some other issues with this.
> - We've also looked at the following approach to use Jakarta 4.x with Jersey
> 2.x and still have an issue with the servlet.
>
>
> Please let us know if there are any solutions to this issue. Thanks!
>
>
> --
> *Ahmed Albalawi*
>
> Senior Associate Software Engineer • EP2 Tech - CuRE
>
> 571-668-3911 •  1680 Capital One Dr.
> --
>
> The information contained in this e-mail may be confidential and/or
> proprietary to Capital One and/or its affiliates and may only be used
> solely in performance of work or services for Capital One. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed. If the reader of this message is not the intended
> recipient, you are hereby notified that any review, retransmission,
> dissemination, distribution, copying or other use of, or taking of any
> action in reliance upon this information is strictly prohibited. If you
> have received this communication in error, please contact the sender and
> delete the material from your computer.
>
>
>
>
>


Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Sean Owen
I think the announcement mentioned there were some issues with pypi and the
upload size this time. I am sure it's intended to be there when possible.

On Wed, Sep 20, 2023, 3:00 PM Kezhi Xiong  wrote:

> Hi,
>
> Are there any plans to upload PySpark 3.5.0 to PyPI (
> https://pypi.org/project/pyspark/)? It's still 3.4.1.
>
> Thanks,
> Kezhi
>
>
>


Re: Discriptency sample standard deviation pyspark and Excel

2023-09-20 Thread Sean Owen
This has turned into a big thread for a simple thing and has been answered
3 times over now.

Neither is better, they just calculate different things. That the 'default'
is sample stddev is just convention.
stddev_pop is the simple standard deviation of a set of numbers;
stddev_samp is used when the set of numbers is a sample from a notional
larger population, and you estimate the stddev of the population from the
sample.

They only differ in the denominator. Neither is more efficient at all or
more/less sensitive to outliers.
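
A small Scala illustration of the denominator difference (assumes an active SparkSession available as `spark`):

import org.apache.spark.sql.functions.{stddev_pop, stddev_samp}
import spark.implicits._

// mean = 2, sum of squared deviations = 2
// stddev_pop  = sqrt(2 / 3)  ≈ 0.816
// stddev_samp = sqrt(2 / 2)  = 1.0
val df = Seq(1.0, 2.0, 3.0).toDF("x")
df.select(stddev_pop($"x"), stddev_samp($"x")).show()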

On Wed, Sep 20, 2023 at 3:06 AM Mich Talebzadeh 
wrote:

> Spark uses the sample standard deviation stddev_samp by default, whereas
> *Hive* uses population standard deviation stddev_pop as default.
>
> My understanding is that spark uses sample standard deviation by default
> because
>
>- It is more commonly used.
>- It is more efficient to calculate.
>- It is less sensitive to outliers. (data points that differ
>significantly from other observations in a dataset. They can be caused by a
>variety of factors, such as measurement errors or edge events.)
>
> The sample standard deviation is less sensitive to outliers because it
> divides by N-1 instead of N. This means that a single outlier will have a
> smaller impact on the sample standard deviation than it would on the
> population standard deviation.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 19 Sept 2023 at 21:50, Sean Owen  wrote:
>
>> Pyspark follows SQL databases here. stddev is stddev_samp, and sample
>> standard deviation is the calculation with the Bessel correction, n-1 in
>> the denominator. stddev_pop is simply standard deviation, with n in the
>> denominator.
>>
>> On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe 
>> wrote:
>>
>>> Hi!
>>>
>>>
>>>
>>> I am applying the stddev function (so actually stddev_samp), however
>>> when comparing with the sample standard deviation in Excel the results do
>>> not match.
>>>
>>> I cannot find in your documentation any more specifics on how the sample
>>> standard deviation is calculated, so I cannot compare the difference with
>>> Excel, which uses [image: Excel's sample standard deviation formula].
>>>
>>> I am trying to avoid using Excel at all costs, but if the stddev_samp
>>> function is not calculating the standard deviation correctly I have a
>>> problem.
>>>
>>> I hope you can help me resolve this issue.
>>>
>>>
>>>
>>> Kindest regards,
>>>
>>>
>>>
>>> *Helene Bøe*
>>> *Graduate Project Engineer*
>>> Recycling Process & Support
>>>
>>> M: +47 980 00 887
>>> helene.b...@hydro.com
>>>
>>> Norsk Hydro ASA
>>> Drammensveien 264
>>> NO-0283 Oslo, Norway
>>> www.hydro.com
>>>
>>>
>>> NOTICE: This e-mail transmission, and any documents, files or previous
>>> e-mail messages attached to it, may contain confidential or privileged
>>> information. If you are not the intended recipient, or a person responsible
>>> for delivering it to the intended recipient, you are hereby notified that
>>> any disclosure, copying, distribution or use of any of the information
>>> contained in or attached to this message is STRICTLY PROHIBITED. If you
>>> have received this transmission in error, please immediately notify the
>>> sender and delete the e-mail and attached documents. Thank you.
>>>
>>


Re: Discriptency sample standard deviation pyspark and Excel

2023-09-19 Thread Sean Owen
Pyspark follows SQL databases here. stddev is stddev_samp, and sample
standard deviation is the calculation with the Bessel correction, n-1 in
the denominator. stddev_pop is simply standard deviation, with n in the
denominator.

On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe 
wrote:

> Hi!
>
>
>
> I am applying the stddev function (so actually stddev_samp), however when
> comparing with the sample standard deviation in Excel the results do not
> match.
>
> I cannot find in your documentation any more specifics on how the sample
> standard deviation is calculated, so I cannot compare the difference with
> Excel, which uses [image: Excel's sample standard deviation formula].
>
> I am trying to avoid using Excel at all costs, but if the stddev_samp
> function is not calculating the standard deviation correctly I have a
> problem.
>
> I hope you can help me resolve this issue.
>
>
>
> Kindest regards,
>
>
>
> *Helene Bøe*
> *Graduate Project Engineer*
> Recycling Process & Support
>
> M: +47 980 00 887
> helene.b...@hydro.com
> 
>
> Norsk Hydro ASA
> Drammensveien 264
> NO-0283 Oslo, Norway
> www.hydro.com
> 
>
>
> NOTICE: This e-mail transmission, and any documents, files or previous
> e-mail messages attached to it, may contain confidential or privileged
> information. If you are not the intended recipient, or a person responsible
> for delivering it to the intended recipient, you are hereby notified that
> any disclosure, copying, distribution or use of any of the information
> contained in or attached to this message is STRICTLY PROHIBITED. If you
> have received this transmission in error, please immediately notify the
> sender and delete the e-mail and attached documents. Thank you.
>


Re: getting emails in different order!

2023-09-18 Thread Sean Owen
I have seen this, and not sure if it's just the ASF mailer being weird, or
more likely, because emails are moderated and we inadvertently moderate
them out of order

On Mon, Sep 18, 2023 at 10:59 AM Mich Talebzadeh 
wrote:

> Hi,
>
> I use gmail to receive spark user group emails.
>
> On occasions, I get the latest emails first and later in the day I receive
> the original email.
>
> Has anyone else seen this behaviour recently?
>
> Thanks
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Sean Owen
I think it's the same, and always has been - yes you don't have a
guaranteed ordering unless an operation produces a specific ordering. Could
be the result of order by, yes; I believe you would be guaranteed that
reading input files results in data in the order they appear in the file,
etc. 1:1 operations like map() don't change ordering. But not the result of
a shuffle, for example. So yeah anything like limit or head might give
different results in the future (or simply on different cluster setups with
different parallelism, etc). The existence of operations like offset
doesn't contradict that. Maybe that's totally fine in some situations (ex:
I just want to display some sample rows) but otherwise yeah you've always
had to state your ordering for "first" or "nth" to have a guaranteed result.

On Mon, Sep 18, 2023 at 10:48 AM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> I’ve always considered DataFrames to be logically equivalent to SQL tables
> or queries.
>
> In SQL, the result order of any query is implementation-dependent without
> an explicit ORDER BY clause. Technically, you could run `SELECT * FROM
> table;` 10 times in a row and get 10 different orderings.
>
> I thought the same applied to DataFrames, but the docstring for the
> recently added method DataFrame.offset implies otherwise.
>
> This example will work fine in practice, of course. But if DataFrames are
> technically unordered without an explicit ordering clause, then in theory a
> future implementation change may result in “Bob" being the “first” row in
> the DataFrame, rather than “Tom”. That would make the example incorrect.
>
> Is that not the case?
>
> Nick
>
>


Re: Spark stand-alone mode

2023-09-15 Thread Sean Owen
Yes, should work fine, just set up according to the docs. There needs to be
network connectivity between whatever the driver node is and these 4 nodes.

On Thu, Sep 14, 2023 at 11:57 PM Ilango  wrote:

>
> Hi all,
>
> We have 4 HPC nodes and have installed Spark individually on all nodes.
>
> Spark is used in local mode (each driver/executor will have 8 cores and 65
> GB) in sparklyr/PySpark using RStudio/Posit Workbench. Slurm is used as
> the scheduler.
>
> As this is local mode, we are facing a performance issue (as there is only one
> executor) when it comes to dealing with large datasets.
>
> Can I convert these 4 nodes into a Spark standalone cluster? We don't have
> Hadoop, so YARN mode is out of scope.
>
> Shall I follow the official documentation for setting up a standalone
> cluster? Will it work? Is there anything else I need to be aware of?
> Can you please share your thoughts?
>
> Thanks,
> Elango
>


Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Sean Owen
I mean, have you checked if this is in your jar? Are you building an
assembly? Where do you expect elastic classes to be and are they there?
Need some basic debugging here

On Thu, Sep 7, 2023, 8:49 PM Dipayan Dev  wrote:

> Hi Sean,
>
> Removed the provided thing, but still the same issue.
>
> <dependency>
>     <groupId>org.elasticsearch</groupId>
>     <artifactId>elasticsearch-spark-30_${scala.compat.version}</artifactId>
>     <version>7.12.1</version>
> </dependency>
>
>
> On Fri, Sep 8, 2023 at 4:41 AM Sean Owen  wrote:
>
>> By marking it provided, you are not including this dependency with your
>> app. If it is also not somehow already provided by your spark cluster (this
>> is what it means), then yeah this is not anywhere on the class path at
>> runtime. Remove the provided scope.
>>
>> On Thu, Sep 7, 2023, 4:09 PM Dipayan Dev  wrote:
>>
>>> Hi,
>>>
>>> Can you please elaborate your last response? I don’t have any external
>>> dependencies added, and just updated the Spark version as mentioned below.
>>>
>>> Can someone help me with this?
>>>
>>> On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers  wrote:
>>>
>>>> could the provided scope be the issue?
>>>>
>>>> On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev 
>>>> wrote:
>>>>
>>>>> Using the following dependency for Spark 3 in POM file (My Scala
>>>>> version is 2.12.14)
>>>>> <dependency>
>>>>>     <groupId>org.elasticsearch</groupId>
>>>>>     <artifactId>elasticsearch-spark-30_2.12</artifactId>
>>>>>     <version>7.12.0</version>
>>>>>     <scope>provided</scope>
>>>>> </dependency>
>>>>>
>>>>>
>>>>> The code throws an error at this line:
>>>>> df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name")
>>>>> The same code is working with Spark 2.4.0 and the following dependency
>>>>> <dependency>
>>>>>     <groupId>org.elasticsearch</groupId>
>>>>>     <artifactId>elasticsearch-spark-20_2.12</artifactId>
>>>>>     <version>7.12.0</version>
>>>>> </dependency>
>>>>>
>>>>>
>>>>> On Mon, 28 Aug 2023 at 12:17 AM, Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> What’s the version of the ES connector you are using?
>>>>>>
>>>>>> On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> We're using Spark 2.4.x to write dataframe into the Elasticsearch
>>>>>>> index.
>>>>>>> As we're upgrading to Spark 3.3.0, it is throwing an error:
>>>>>>> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
>>>>>>> at
>>>>>>> java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
>>>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>>>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>>>>>>>
>>>>>>> Looking at a few responses from Stackoverflow
>>>>>>> <https://stackoverflow.com/a/66452149>, it seems this is not yet
>>>>>>> supported by Elasticsearch-hadoop.
>>>>>>>
>>>>>>> Does anyone have experience with this? Or faced/resolved this issue
>>>>>>> in Spark 3?
>>>>>>>
>>>>>>> Thanks in advance!
>>>>>>>
>>>>>>> Regards
>>>>>>> Dipayan
>>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>
>>>> CONFIDENTIALITY NOTICE: This electronic communication and any files
>>>> transmitted with it are confidential, privileged and intended solely for
>>>> the use of the individual or entity to whom they are addressed. If you are
>>>> not the intended recipient, you are hereby notified that any disclosure,
>>>> copying, distribution (electronic or otherwise) or forwarding of, or the
>>>> taking of any action in reliance on the contents of this transmission is
>>>> strictly prohibited. Please notify the sender immediately by e-mail if you
>>>> have received this email by mistake and delete this email from your system.
>>>>
>>>> Is it necessary to print this email? If you care about the environment
>>>> like we do, please refrain from printing emails. It helps to keep the
>>>> environment forested and litter-free.
>>>
>>>


Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Sean Owen
By marking it provided, you are not including this dependency with your
app. If it is also not somehow already provided by your spark cluster (this
is what it means), then yeah this is not anywhere on the class path at
runtime. Remove the provided scope.
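
The same idea expressed in sbt terms (illustrative coordinates only; the original report uses Maven, where the equivalent is the <scope>provided</scope> element visible in the quoted POM snippet below):

// Either: rely on the cluster to provide the connector at runtime
libraryDependencies += "org.example" %% "some-spark-connector" % "1.0.0" % "provided"
// Or: ship it with the application (which is what removing the provided scope does)
libraryDependencies += "org.example" %% "some-spark-connector" % "1.0.0"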

On Thu, Sep 7, 2023, 4:09 PM Dipayan Dev  wrote:

> Hi,
>
> Can you please elaborate your last response? I don’t have any external
> dependencies added, and just updated the Spark version as mentioned below.
>
> Can someone help me with this?
>
> On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers  wrote:
>
>> could the provided scope be the issue?
>>
>> On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev 
>> wrote:
>>
>>> Using the following dependency for Spark 3 in POM file (My Scala version
>>> is 2.12.14)
>>> <dependency>
>>>     <groupId>org.elasticsearch</groupId>
>>>     <artifactId>elasticsearch-spark-30_2.12</artifactId>
>>>     <version>7.12.0</version>
>>>     <scope>provided</scope>
>>> </dependency>
>>>
>>>
>>> The code throws an error at this line:
>>> df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name")
>>> The same code is working with Spark 2.4.0 and the following dependency
>>> <dependency>
>>>     <groupId>org.elasticsearch</groupId>
>>>     <artifactId>elasticsearch-spark-20_2.12</artifactId>
>>>     <version>7.12.0</version>
>>> </dependency>
>>>
>>>
>>> On Mon, 28 Aug 2023 at 12:17 AM, Holden Karau 
>>> wrote:
>>>
 What’s the version of the ES connector you are using?

 On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev 
 wrote:

> Hi All,
>
> We're using Spark 2.4.x to write dataframe into the Elasticsearch
> index.
> As we're upgrading to Spark 3.3.0, it is throwing an error:
> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
> at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>
> Looking at a few responses from Stackoverflow, it seems this is not yet
> supported by Elasticsearch-hadoop.
>
> Does anyone have experience with this? Or faced/resolved this issue in
> Spark 3?
>
> Thanks in advance!
>
> Regards
> Dipayan
>
 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>> CONFIDENTIALITY NOTICE: This electronic communication and any files
>> transmitted with it are confidential, privileged and intended solely for
>> the use of the individual or entity to whom they are addressed. If you are
>> not the intended recipient, you are hereby notified that any disclosure,
>> copying, distribution (electronic or otherwise) or forwarding of, or the
>> taking of any action in reliance on the contents of this transmission is
>> strictly prohibited. Please notify the sender immediately by e-mail if you
>> have received this email by mistake and delete this email from your system.
>>
>> Is it necessary to print this email? If you care about the environment
>> like we do, please refrain from printing emails. It helps to keep the
>> environment forested and litter-free.
>
>


Re: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Sean Owen
It's a dependency of some other HTTP library. Use mvn dependency:tree to
see where it comes from. It may be more straightforward to upgrade the
library that brings it in, assuming a later version brings in a later okio.
You can also manage up the version directly with a new entry in
<dependencyManagement>.

However, does this affect Spark? all else equal it doesn't hurt to upgrade,
but wondering if there is even a theory that it needs to be updated.


On Thu, Aug 31, 2023 at 7:42 AM Agrawal, Sanket 
wrote:

> I don’t see an entry in pom.xml while building spark. I think it is being
> downloaded as part of some other dependency.
>
>
>
> *From:* Sean Owen 
> *Sent:* Thursday, August 31, 2023 5:10 PM
> *To:* Agrawal, Sanket 
> *Cc:* user@spark.apache.org
> *Subject:* [EXT] Re: Okio Vulnerability in Spark 3.4.1
>
>
>
> Does the vulnerability affect Spark?
>
> In any event, have you tried updating Okio in the Spark build? I don't
> believe you could just replace the JAR, as other libraries probably rely on
> it and compiled against the current version.
>
>
>
> On Thu, Aug 31, 2023 at 6:02 AM Agrawal, Sanket <
> sankeagra...@deloitte.com.invalid> wrote:
>
> Hi All,
>
>
>
> Amazon Inspector has detected a vulnerability in the okio-1.15.0.jar JAR in
> Spark 3.4.1. It suggests upgrading the jar to version 3.4.0. But when we
> try this version of the jar, the Spark application fails with the below
> error:
>
>
>
> py4j.protocol.Py4JJavaError: An error occurred while calling
> None.org.apache.spark.api.java.JavaSparkContext.
>
> : java.lang.NoClassDefFoundError: okio/BufferedSource
>
> at okhttp3.internal.Util.(Util.java:62)
>
> at okhttp3.OkHttpClient.(OkHttpClient.java:127)
>
> at okhttp3.OkHttpClient$Builder.(OkHttpClient.java:475)
>
> at
> io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory.newOkHttpClientBuilder(OkHttpClientFactory.java:41)
>
> at
> io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory.newBuilder(OkHttpClientFactory.java:56)
>
> at
> io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory.newBuilder(OkHttpClientFactory.java:68)
>
> at
> io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory.newBuilder(OkHttpClientFactory.java:30)
>
> at
> io.fabric8.kubernetes.client.KubernetesClientBuilder.getHttpClient(KubernetesClientBuilder.java:88)
>
> at
> io.fabric8.kubernetes.client.KubernetesClientBuilder.build(KubernetesClientBuilder.java:78)
>
> at
> org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:120)
>
> at
> org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:111)
>
> at
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:3037)
>
> at org.apache.spark.SparkContext.(SparkContext.scala:568)
>
> at
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
>
> at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>
> at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
> Source)
>
> at
> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> Source)
>
> at java.base/java.lang.reflect.Constructor.newInstance(Unknown
> Source)
>
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
>
> at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
>
> at py4j.Gateway.invoke(Gateway.java:238)
>
> at
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
>
> at
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
>
> at
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>
> at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>
> at java.base/java.lang.Thread.run(Unknown Source)
>
> Caused by: java.lang.ClassNotFoundException: okio.BufferedSource
>
> at
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(Unknown Source)
>
> at
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(Unknown
> Source)
>
> at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
>
>   

Re: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Sean Owen
Does the vulnerability affect Spark?
In any event, have you tried updating Okio in the Spark build? I don't
believe you could just replace the JAR, as other libraries probably rely on
it and compiled against the current version.

On Thu, Aug 31, 2023 at 6:02 AM Agrawal, Sanket
 wrote:

> Hi All,
>
>
>
> Amazon Inspector has detected a vulnerability in the okio-1.15.0.jar JAR in
> Spark 3.4.1. It suggests upgrading the jar to version 3.4.0. But when we
> try this version of the jar, the Spark application fails with the below
> error:
>
>
>
> py4j.protocol.Py4JJavaError: An error occurred while calling
> None.org.apache.spark.api.java.JavaSparkContext.
>
> : java.lang.NoClassDefFoundError: okio/BufferedSource
>
> at okhttp3.internal.Util.(Util.java:62)
>
> at okhttp3.OkHttpClient.(OkHttpClient.java:127)
>
> at okhttp3.OkHttpClient$Builder.(OkHttpClient.java:475)
>
> at
> io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory.newOkHttpClientBuilder(OkHttpClientFactory.java:41)
>
> at
> io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory.newBuilder(OkHttpClientFactory.java:56)
>
> at
> io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory.newBuilder(OkHttpClientFactory.java:68)
>
> at
> io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory.newBuilder(OkHttpClientFactory.java:30)
>
> at
> io.fabric8.kubernetes.client.KubernetesClientBuilder.getHttpClient(KubernetesClientBuilder.java:88)
>
> at
> io.fabric8.kubernetes.client.KubernetesClientBuilder.build(KubernetesClientBuilder.java:78)
>
> at
> org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:120)
>
> at
> org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:111)
>
> at
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:3037)
>
> at org.apache.spark.SparkContext.(SparkContext.scala:568)
>
> at
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
>
> at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>
> at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
> Source)
>
> at
> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> Source)
>
> at java.base/java.lang.reflect.Constructor.newInstance(Unknown
> Source)
>
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
>
> at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
>
> at py4j.Gateway.invoke(Gateway.java:238)
>
> at
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
>
> at
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
>
> at
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>
> at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>
> at java.base/java.lang.Thread.run(Unknown Source)
>
> Caused by: java.lang.ClassNotFoundException: okio.BufferedSource
>
> at
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(Unknown Source)
>
> at
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(Unknown
> Source)
>
> at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
>
> ... 26 more
>
>
>
> Replaced the existing jar with the JAR file at
> https://repo1.maven.org/maven2/com/squareup/okio/okio/3.4.0/okio-3.4.0.jar
>
>
>
>
>
> PFB, the vulnerability details:
>
> Link: https://nvd.nist.gov/vuln/detail/CVE-2023-3635
>
>
>
> Any guidance here would be of great help.
>
>
>
> Thanks,
>
> Sanket A.
>
> This message (including any attachments) contains confidential information
> intended for a specific individual and purpose, and is protected by law. If
> you are not the intended recipient, you should delete this message and any
> disclosure, copying, or distribution of this message, or the taking of any
> action based on it, by you is strictly prohibited.
>
> Deloitte refers to a Deloitte member firm, one of its related entities, or
> Deloitte Touche Tohmatsu Limited ("DTTL"). Each Deloitte member firm is a
> separate legal entity and a member of DTTL. DTTL does not provide services
> to clients. Please see www.deloitte.com/about to learn more.
>
> v.E.1
>


Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-31 Thread Sean Owen
I think you're talking past Hyukjin here.

I think the response is: none of that is managed by Pyspark now, and this
proposal does not change that. Your current interpreter and environment is
used to execute the stored procedure, which is just Python code. It's on
you to bring an environment that runs the code correctly. This is just the
same as how running any python code works now.

I think you have exactly the same problems with UDFs now, and that's all a
real problem, just not something Spark has ever tried to solve for you.
Think of this as exactly like: I have a bit of python code I import as a
function and share across many python workloads. Just, now that chunk is
stored as a 'stored procedure'.

I agree this raises the same problem in new ways - now, you are storing and
sharing a chunk of code across many workloads. There is more potential for
compatibility and environment problems, as all of that is simply punted to
the end workloads. But, it's not different from importing common code and
the world doesn't fall apart.

On Wed, Aug 30, 2023 at 11:16 PM Alexander Shorin  wrote:

>
> Which Python version will run that stored procedure?
>>
>> All Python versions supported in PySpark
>>
>
> Where in the stored procedure is the exact Python version that will run
> the code defined? That was the question.
>
>
>> How to manage external dependencies?
>>
>> Existing way we have
>> https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
>> .
>> In fact, this will use the external dependencies within your Python
>> interpreter so you can use all existing conda or venvs.
>>
> The current proposal doesn't solve this issue at all (the stored code doesn't provide
> any manifest about its dependencies or what is required to run it). So it
> feels like it's better to stay with UDFs, since they are under our control and
> their behaviour is predictable. Did I miss something?
>
> How to test it via a common CI process?
>>
>> Existing way of PySpark unittests, see
>> https://github.com/apache/spark/tree/master/python/pyspark/tests
>>
> Sorry, but this wouldn't work, since the stored procedure requires some
> specific definition and this code will not be stored as regular Python
> code. Do you have any examples of how to test stored Python procedures as a
> unit, e.g. without Spark?
>
> How to manage versions and do upgrades? Migrations?
>>
>> This is a new feature so no migration is needed. We will keep the
>> compatibility according to the sember we follow.
>>
> The question was not about Spark, but about stored procedures themselves. Any
> guidelines that will not copy the flaws of other systems?
>
> Current Python UDF solution handles these problems in a good way since
>> they delegate them to project level.
>>
>> Current UDF solution cannot handle stored procedures because UDF is on
>> the worker side. This is Driver side.
>>
> How so? Currently it works and we have never faced such an issue. Maybe you
> should have the same Python code on the driver side as well? But such a trivial
> idea doesn't require a new feature in Spark, since you already have to ship
> that code somehow.
>
> --
> ,,,^..^,,,
>


Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-30 Thread Sean Owen
It worked fine after I ran it again; I included "package test" instead of
"test" (I had previously run "install"). +1

On Wed, Aug 30, 2023 at 6:06 AM yangjie01  wrote:

> Hi, Sean
>
>
>
> I have performed testing with Java 17 and Scala 2.13 using maven (`mvn
> clean install` and `mvn package test`), and have not encountered the issue
> you mentioned.
>
>
>
> The tests for the connect module depend on the `spark-protobuf` module
> completing the `package` phase; was it successful? Or could you provide the test
> command for me to verify?
>
>
>
> Thanks,
>
> Jie Yang
>
>
>
> *From:* Dipayan Dev 
> *Date:* Wednesday, August 30, 2023, 17:01
> *To:* Sean Owen 
> *Cc:* Yuanjian Li , Spark dev list <
> dev@spark.apache.org>
> *Subject:* Re: [VOTE] Release Apache Spark 3.5.0 (RC3)
>
>
>
> Can we fix this bug in Spark 3.5.0?
>
> https://issues.apache.org/jira/browse/SPARK-44884
>
>
>
>
> On Wed, Aug 30, 2023 at 11:51 AM Sean Owen  wrote:
>
> It looks good except that I'm getting errors running the Spark Connect
> tests at the end (Java 17, Scala 2.13). It looks like I missed something
> necessary to build; is anyone getting this?
>
>
>
> [ERROR] [Error]
> /tmp/spark-3.5.0/connector/connect/server/target/generated-test-sources/protobuf/java/org/apache/spark/sql/protobuf/protos/TestProto.java:9:46:
>  error: package org.sparkproject.spark_protobuf.protobuf does not exist
>
>
>
> On Tue, Aug 29, 2023 at 11:25 AM Yuanjian Li 
> wrote:
>
> Please vote on releasing the following candidate(RC3) as Apache Spark
> version 3.5.0.
>
>
>
> The vote is open until 11:59pm Pacific time *Aug 31st* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
>
>
> [ ] +1 Release this package as Apache Spark 3.5.0
>
> [ ] -1 Do not release this package because ...
>
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
>
>
> The tag to be voted on is v3.5.0-rc3 (commit
> 9f137aa4dc43398aafa0c3e035ed3174182d7d6c):
>
> https://github.com/apache/spark/tree/v3.5.0-rc3
>
>
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-bin/
>
>
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
>
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1447
>
>
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-docs/
>
>
>
> The list of bug fixes going into 3.5.0 can be found at the following URL:
>
> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>
>
>
> This release is using the release script of the tag v3.5.0-rc3.
>
>
>
> FAQ
>
>
>
> =
>
> How can I help test this release?
>
> =
>
> If you are a Spark user, you can help us test this release by taking
>
> an existing Spark workload and running on this release candidate, then
>
> reporting any regressions.
>
>
>
> If you're working in PySpark you can set up a virtual env and install
>
> the current RC and see if anything important breaks, in the Java/Scala
>
> you can add the staging repository to your projects resolvers and test
>
> with the RC (make sure to clean up the artifact cache before/after so
>
> you don't end up building with an out of date RC going forward).

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-29 Thread Sean Owen
It looks good except that I'm getting errors running the Spark Connect
tests at the end (Java 17, Scala 2.13). It looks like I missed something
necessary to build; is anyone getting this?

[ERROR] [Error]
/tmp/spark-3.5.0/connector/connect/server/target/generated-test-sources/protobuf/java/org/apache/spark/sql/protobuf/protos/TestProto.java:9:46:
 error: package org.sparkproject.spark_protobuf.protobuf does not exist

On Tue, Aug 29, 2023 at 11:25 AM Yuanjian Li  wrote:

> Please vote on releasing the following candidate(RC3) as Apache Spark
> version 3.5.0.
>
> The vote is open until 11:59pm Pacific time Aug 31st and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.0
>
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.5.0-rc3 (commit
> 9f137aa4dc43398aafa0c3e035ed3174182d7d6c):
>
> https://github.com/apache/spark/tree/v3.5.0-rc3
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1447
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-docs/
>
> The list of bug fixes going into 3.5.0 can be found at the following URL:
>
> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>
> This release is using the release script of the tag v3.5.0-rc3.
>
>
> FAQ
>
> =
>
> How can I help test this release?
>
> =
>
> If you are a Spark user, you can help us test this release by taking
>
> an existing Spark workload and running on this release candidate, then
>
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
>
> the current RC and see if anything important breaks, in the Java/Scala
>
> you can add the staging repository to your projects resolvers and test
>
> with the RC (make sure to clean up the artifact cache before/after so
>
> you don't end up building with an out of date RC going forward).
>
> ===
>
> What should happen to JIRA tickets still targeting 3.5.0?
>
> ===
>
> The current list of open tickets targeted at 3.5.0 can be found at:
>
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.5.0
>
> Committers should look at those and triage. Extremely important bug
>
> fixes, documentation, and API tweaks that impact compatibility should
>
> be worked on immediately. Everything else please retarget to an
>
> appropriate release.
>
> ==
>
> But my bug isn't fixed?
>
> ==
>
> In order to make timely releases, we will typically not hold the
>
> release unless the bug in question is a regression from the previous
>
> release. That being said, if there is something which is a regression
>
> that has not been correctly targeted please ping me or a committer to
>
> help target the issue.
>
> Thanks,
>
> Yuanjian Li
>


Re: error trying to save to database (Phoenix)

2023-08-21 Thread Sean Owen
It is. But you have a third party library in here which seems to require a
different version.
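
A quick way to confirm the mismatch is to print the Scala binary version your
Spark runtime was actually built with and compare it against the suffix of the
Phoenix connector jar you are using (e.g. _2.11 vs _2.12). A minimal sketch,
assuming a plain (non-Connect) PySpark session where the Py4J gateway can
reach scala.util.Properties:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Spark's own version string.
print("Spark version:", spark.version)
# Scala version of the JVM-side Spark build, reached through the Py4J gateway.
print("Scala version:",
      spark.sparkContext._jvm.scala.util.Properties.versionString())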

On Mon, Aug 21, 2023, 7:04 PM Kal Stevens  wrote:

> OK, it was my impression that scala was packaged with Spark to avoid a
> mismatch
> https://spark.apache.org/downloads.html
>
> It looks like Spark 3.4.1 (my version) uses Scala 2.12
> How do I specify the scala version?
>
> On Mon, Aug 21, 2023 at 4:47 PM Sean Owen  wrote:
>
>> That's a mismatch in the version of scala that your library uses vs spark
>> uses.
>>
>> On Mon, Aug 21, 2023, 6:46 PM Kal Stevens  wrote:
>>
>>> I am having a hard time figuring out what I am doing wrong here.
>>> I am not sure if I have an incompatible version of something installed
>>> or something else.
>>> I can not find anything relevant in google to figure out what I am doing
>>> wrong
>>> I am using *spark 3.4.1*, and *python3.10*
>>>
>>> This is my code to save my dataframe
>>> urls = []
>>> pull_sitemap_xml(robot, urls)
>>> df = spark.createDataFrame(data=urls, schema=schema)
>>> df.write.format("org.apache.phoenix.spark") \
>>> .mode("overwrite") \
>>> .option("table", "property") \
>>> .option("zkUrl", "192.168.1.162:2181") \
>>> .save()
>>>
>>> urls is an array of maps, containing a "url" and a "last_mod" field.
>>>
>>> Here is the error that I am getting
>>>
>>> Traceback (most recent call last):
>>>
>>>   File "/home/kal/real-estate/pullhttp/pull_properties.py", line 65, in
>>> main
>>>
>>> .save()
>>>
>>>   File
>>> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
>>> line 1396, in save
>>>
>>> self._jwrite.save()
>>>
>>>   File
>>> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
>>> line 1322, in __call__
>>>
>>> return_value = get_return_value(
>>>
>>>   File
>>> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py",
>>> line 169, in deco
>>>
>>> return f(*a, **kw)
>>>
>>>   File
>>> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
>>> line 326, in get_return_value
>>>
>>> raise Py4JJavaError(
>>>
>>> py4j.protocol.Py4JJavaError: An error occurred while calling o636.save.
>>>
>>> : java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps
>>> scala.Predef$.refArrayOps(java.lang.Object[])'
>>>
>>> at
>>> org.apache.phoenix.spark.DataFrameFunctions.getFieldArray(DataFrameFunctions.scala:76)
>>>
>>> at
>>> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:35)
>>>
>>> at
>>> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:28)
>>>
>>> at
>>> org.apache.phoenix.spark.DefaultSource.createRelation(DefaultSource.scala:47)
>>>
>>> at
>>> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
>>>
>>> at
>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>>>
>>> at
>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>>>
>>


Re: error trying to save to database (Phoenix)

2023-08-21 Thread Sean Owen
That's a mismatch in the version of scala that your library uses vs spark
uses.

On Mon, Aug 21, 2023, 6:46 PM Kal Stevens  wrote:

> I am having a hard time figuring out what I am doing wrong here.
> I am not sure if I have an incompatible version of something installed or
> something else.
> I can not find anything relevant in google to figure out what I am doing
> wrong
> I am using *spark 3.4.1*, and *python3.10*
>
> This is my code to save my dataframe
> urls = []
> pull_sitemap_xml(robot, urls)
> df = spark.createDataFrame(data=urls, schema=schema)
> df.write.format("org.apache.phoenix.spark") \
> .mode("overwrite") \
> .option("table", "property") \
> .option("zkUrl", "192.168.1.162:2181") \
> .save()
>
> urls is an array of maps, containing a "url" and a "last_mod" field.
>
> Here is the error that I am getting
>
> Traceback (most recent call last):
>
>   File "/home/kal/real-estate/pullhttp/pull_properties.py", line 65, in
> main
>
> .save()
>
>   File
> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
> line 1396, in save
>
> self._jwrite.save()
>
>   File
> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
> line 1322, in __call__
>
> return_value = get_return_value(
>
>   File
> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py",
> line 169, in deco
>
> return f(*a, **kw)
>
>   File
> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
> line 326, in get_return_value
>
> raise Py4JJavaError(
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o636.save.
>
> : java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps
> scala.Predef$.refArrayOps(java.lang.Object[])'
>
> at
> org.apache.phoenix.spark.DataFrameFunctions.getFieldArray(DataFrameFunctions.scala:76)
>
> at
> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:35)
>
> at
> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:28)
>
> at
> org.apache.phoenix.spark.DefaultSource.createRelation(DefaultSource.scala:47)
>
> at
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>


Re: [VOTE] Release Apache Spark 3.5.0 (RC2)

2023-08-19 Thread Sean Owen
+1 this looks better to me. Works with Scala 2.13 / Java 17 for me.

On Sat, Aug 19, 2023 at 3:23 AM Yuanjian Li  wrote:

> Please vote on releasing the following candidate(RC2) as Apache Spark
> version 3.5.0.
>
> The vote is open until 11:59pm Pacific time Aug 23rd and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.0
>
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.5.0-rc2 (commit
> 010c4a6a05ff290bec80c12a00cd1bdaed849242):
>
> https://github.com/apache/spark/tree/v3.5.0-rc2
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1446
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc2-docs/
>
> The list of bug fixes going into 3.5.0 can be found at the following URL:
>
> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>
> This release is using the release script of the tag v3.5.0-rc2.
>
>
> FAQ
>
> =
>
> How can I help test this release?
>
> =
>
> If you are a Spark user, you can help us test this release by taking
>
> an existing Spark workload and running on this release candidate, then
>
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
>
> the current RC and see if anything important breaks, in the Java/Scala
>
> you can add the staging repository to your projects resolvers and test
>
> with the RC (make sure to clean up the artifact cache before/after so
>
> you don't end up building with an out of date RC going forward).
>
> ===
>
> What should happen to JIRA tickets still targeting 3.5.0?
>
> ===
>
> The current list of open tickets targeted at 3.5.0 can be found at:
>
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.5.0
>
> Committers should look at those and triage. Extremely important bug
>
> fixes, documentation, and API tweaks that impact compatibility should
>
> be worked on immediately. Everything else please retarget to an
>
> appropriate release.
>
> ==
>
> But my bug isn't fixed?
>
> ==
>
> In order to make timely releases, we will typically not hold the
>
> release unless the bug in question is a regression from the previous
>
> release. That being said, if there is something which is a regression
>
> that has not been correctly targeted please ping me or a committer to
>
> help target the issue.
>
> Thanks,
>
> Yuanjian Li
>


Re: Spark Vulnerabilities

2023-08-14 Thread Sean Owen
Yeah, we generally don't respond to "look at the output of my static
analyzer".
Some of these are already addressed in a later version.
Some don't affect Spark.
Some are possibly an issue but hard to change without breaking lots of
things - they are really issues with upstream dependencies.

But for any you find that seem possibly relevant, that are directly
fixable, yes please open a PR with the change and your reasoning.

On Mon, Aug 14, 2023 at 7:42 AM Bjørn Jørgensen 
wrote:

> I have added links to the github PR. Or comment for those that I have not
> seen before.
>
> Apache Spark has very many dependencies, some can easily be upgraded while
> others are very hard to fix.
>
> Please feel free to open a PR if you wanna help.
>
> man. 14. aug. 2023 kl. 14:06 skrev Sankavi Nagalingam
> :
>
>> Hi Team,
>>
>>
>>
>> We could see there are many dependent vulnerabilities present in the
>> latest spark-core:3.4.1.jar. PFA
>>
>> Could you please let us know when will be the fix version available for
>> the users.
>>
>>
>>
>> Thanks,
>>
>> Sankavi
>>
>>
>>
>> The information in this e-mail and any attachments is confidential and
>> may be legally privileged. It is intended solely for the addressee or
>> addressees. Any use or disclosure of the contents of this
>> e-mail/attachments by a not intended recipient is unauthorized and may be
>> unlawful. If you have received this e-mail in error please notify the
>> sender. Please note that any views or opinions presented in this e-mail are
>> solely those of the author and do not necessarily represent those of
>> TEMENOS. We recommend that you check this e-mail and any attachments
>> against viruses. TEMENOS accepts no liability for any damage caused by any
>> malicious code or virus transmitted by this e-mail.
>>
>>
>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>


Re: Question about ARRAY_INSERT between Spark and Databricks

2023-08-13 Thread Sean Owen
There shouldn't be any difference here. In fact, I get the results you list
for 'spark' from Databricks. It's possible the difference is a bug fix
along the way that is in the Spark version you are using locally but not in
the DBR you are using. But, yeah, it seems to work as you say.

If you're asking about the Spark semantics being 1-indexed vs 0-indexed,
there are some comments here:
https://github.com/apache/spark/pull/38867#discussion_r1097054656
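
For completeness, the same calls can be made through the DataFrame API. A
minimal sketch, assuming Spark 3.4+ where pyspark.sql.functions.array_insert
exists (the output follows whatever semantics your Spark build implements):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(["a", "b", "c"],)], ["arr"])

# Equivalent to: SELECT array_insert(array('a', 'b', 'c'), -1, 'z')
df.select(F.array_insert("arr", F.lit(-1), F.lit("z")).alias("out")).show(truncate=False)

# Equivalent to: SELECT array_insert(array('a', 'b', 'c'), -5, 'z')
df.select(F.array_insert("arr", F.lit(-5), F.lit("z")).alias("out")).show(truncate=False)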


On Sun, Aug 13, 2023 at 7:28 AM Ran Tao  wrote:

> Hi, devs.
>
> I found that the  ARRAY_INSERT[1] function (from spark 3.4.0) has
> different semantics with databricks[2].
>
> e.g.
>
> // spark
> SELECT array_insert(array('a', 'b', 'c'), -1, 'z');
>  ["a","b","z","c"]
>
> // databricks
> SELECT array_insert(array('a', 'b', 'c'), -1, 'z');
>  ["a","b","c","z"]
>
> // spark
> SELECT array_insert(array('a', 'b', 'c'), -5, 'z');
> ["z",null,null,"a","b","c"]
>
> // databricks
> SELECT array_insert(array('a', 'b', 'c'), -5, 'z');
>  ["z",NULL,"a","b","c"]
>
> It looks like that inserting negative index is more reasonable in
> Databricks.
>
> Of cause, I read the source code of spark, and I can understand the logic
> of spark, but my question is whether spark is designed like this on purpose?
>
>
> [1] https://spark.apache.org/docs/latest/api/sql/index.html#array_insert
> [2]
> https://docs.databricks.com/en/sql/language-manual/functions/array_insert.html
>
>
> Best Regards,
> Ran Tao
> https://github.com/chucheng92
>


What else could be removed in Spark 4?

2023-08-07 Thread Sean Owen
While we're noodling on the topic, what else might be worth removing in
Spark 4?

For example, looks like we're finally hitting problems supporting Java 8
through 21 all at once, related to Scala 2.13.x updates. It would be
reasonable to require Java 11, or even 17, as a baseline for the multi-year
lifecycle of Spark 4.

Dare I ask: drop Scala 2.12? Supporting 2.12 / 2.13 / 3.0 might get hard
otherwise.

There was a good discussion about whether old deprecated methods should be
removed. They can't be removed at other times, but that doesn't mean they all
*should* be. createExternalTable was brought up as a first example. What
deprecated methods are worth removing?

There's Mesos support, long since deprecated, which seems like something to
prune.

Are there old Hive/Hadoop version combos we should just stop supporting?


Re: [VOTE] Release Apache Spark 3.5.0 (RC1)

2023-08-06 Thread Sean Owen
Let's keep testing 3.5.0 of course while that change is going in. (See
https://github.com/apache/spark/pull/42364#issuecomment-1666878287 )

Otherwise testing is pretty much as usual, except I get this test failure
in Connect, which is new. Anyone else? This is Java 8, Scala 2.13, Debian
12.

- from_protobuf_messageClassName_options *** FAILED ***
  org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS]
Could not load Protobuf class with name
org.apache.spark.connect.proto.StorageLevel.
org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf
Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar
with Protobuf classes needs to be shaded (com.google.protobuf.* -->
org.sparkproject.spark_protobuf.protobuf.*).
  at
org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3554)
  at
org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:198)
  at
org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:156)
  at
org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58)
  at
org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57)
  at
org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43)
  at
org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42)
  at
org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194)
  at
org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:73)
  at scala.collection.immutable.List.map(List.scala:246)

On Sat, Aug 5, 2023 at 5:42 PM Sean Owen  wrote:

> I'm still testing other combinations, but it looks like tests fail on Java
> 17 after building with Java 8, which should be a normal supported
> configuration.
> This is described at https://github.com/apache/spark/pull/41943 and looks
> like it is resolved by moving back to Scala 2.13.8 for now.
> Unless I'm missing something we need to fix this for 3.5 or it's not clear
> the build will run on Java 17.
>
> On Fri, Aug 4, 2023 at 5:45 PM Yuanjian Li  wrote:
>
>> Please vote on releasing the following candidate(RC1) as Apache Spark
>> version 3.5.0.
>>
>> The vote is open until 11:59pm Pacific time Aug 9th and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.5.0
>>
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.5.0-rc1 (commit
>> 7e862c01fc9a1d3b47764df8b6a4b5c4cafb0807):
>>
>> https://github.com/apache/spark/tree/v3.5.0-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>>
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>>
>> https://repository.apache.org/content/repositories/orgapachespark-1444
>>
>> The documentation corresponding to this release can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc1-docs/
>>
>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>>
>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>
>> This release is using the release script of the tag v3.5.0-rc1.
>>
>>
>> FAQ
>>
>> =
>>
>> How can I help test this release?
>>
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>>
>> an existing Spark workload and running on this release candidate, then
>>
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>>
>> the current RC and see if anything important breaks, in the Java/Scala
>>
>> you can add the staging repository to your projects resolvers and test
>>
>> with the RC (make sure to clean up the artifact cache before/after so
>>
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>>
>> What should happen to JIRA tickets still targeting 3.5.0?
>>
>> ===
>>
>> The current list of open tickets targeted at 3.5.0 can 

Re: [VOTE] Release Apache Spark 3.5.0 (RC1)

2023-08-05 Thread Sean Owen
I'm still testing other combinations, but it looks like tests fail on Java
17 after building with Java 8, which should be a normal supported
configuration.
This is described at https://github.com/apache/spark/pull/41943 and looks
like it is resolved by moving back to Scala 2.13.8 for now.
Unless I'm missing something we need to fix this for 3.5 or it's not clear
the build will run on Java 17.

On Fri, Aug 4, 2023 at 5:45 PM Yuanjian Li  wrote:

> Please vote on releasing the following candidate(RC1) as Apache Spark
> version 3.5.0.
>
> The vote is open until 11:59pm Pacific time Aug 9th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.0
>
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.5.0-rc1 (commit
> 7e862c01fc9a1d3b47764df8b6a4b5c4cafb0807):
>
> https://github.com/apache/spark/tree/v3.5.0-rc1
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1444
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc1-docs/
>
> The list of bug fixes going into 3.5.0 can be found at the following URL:
>
> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>
> This release is using the release script of the tag v3.5.0-rc1.
>
>
> FAQ
>
> =
>
> How can I help test this release?
>
> =
>
> If you are a Spark user, you can help us test this release by taking
>
> an existing Spark workload and running on this release candidate, then
>
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
>
> the current RC and see if anything important breaks, in the Java/Scala
>
> you can add the staging repository to your projects resolvers and test
>
> with the RC (make sure to clean up the artifact cache before/after so
>
> you don't end up building with an out of date RC going forward).
>
> ===
>
> What should happen to JIRA tickets still targeting 3.5.0?
>
> ===
>
> The current list of open tickets targeted at 3.5.0 can be found at:
>
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.5.0
>
> Committers should look at those and triage. Extremely important bug
>
> fixes, documentation, and API tweaks that impact compatibility should
>
> be worked on immediately. Everything else please retarget to an
>
> appropriate release.
>
> ==
>
> But my bug isn't fixed?
>
> ==
>
> In order to make timely releases, we will typically not hold the
>
> release unless the bug in question is a regression from the previous
>
> release. That being said, if there is something which is a regression
>
> that has not been correctly targeted please ping me or a committer to
>
> help target the issue.
>
> Thanks,
>
> Yuanjian Li
>
>


Re: conver panda image column to spark dataframe

2023-08-03 Thread Sean Owen
pp4 has one row, I'm guessing - containing an array of 10 images. You want
10 rows of 1 image each.
But, just don't do this. Pass the bytes of the image as an array,
along with width/height/channels, and reshape it on use. It's just easier.
That is how the Spark image representation works anyway.
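
To make that concrete, here is a minimal sketch of the suggested layout
(hypothetical column names, with random data standing in for the tfds
images): each row carries the flattened pixel bytes plus the dimensions
needed to reshape them on use.

import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import BinaryType, IntegerType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("data", BinaryType(), False),      # raw uint8 pixel bytes
    StructField("height", IntegerType(), False),
    StructField("width", IntegerType(), False),
    StructField("channels", IntegerType(), False),
])

# One row per image: flatten to bytes and remember the shape.
images = [np.random.randint(0, 255, (500, 333, 3), dtype=np.uint8) for _ in range(3)]
rows = [(img.tobytes(), *img.shape) for img in images]
df = spark.createDataFrame(rows, schema)

# Reshape back on use, e.g. after collect() or inside a pandas UDF.
first = df.first()
restored = np.frombuffer(first["data"], dtype=np.uint8).reshape(
    first["height"], first["width"], first["channels"])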

On Thu, Aug 3, 2023 at 8:43 PM second_co...@yahoo.com.INVALID
 wrote:

> Hello Adrian,
>
>   here is the snippet
>
> import tensorflow_datasets as tfds
>
> (ds_train, ds_test), ds_info = tfds.load(
> dataset_name, data_dir='',  split=["train",
> "test"], with_info=True, as_supervised=True
> )
>
> schema = StructType([
> StructField("image",
> ArrayType(ArrayType(ArrayType(IntegerType(, nullable=False),
> StructField("label", IntegerType(), nullable=False)
> ])
> pp4 =
> spark.createDataFrame(pd.DataFrame(tfds.as_dataframe(ds_train.take(4),
> ds_info)), schema)
>
>
>
> raised error
>
> , TypeError: field image: ArrayType(ArrayType(ArrayType(IntegerType(), True), 
> True), True) can not accept object array([[[14, 14, 14],
> [14, 14, 14],
> [14, 14, 14],
> ...,
> [19, 17, 20],
> [19, 17, 20],
> [19, 17, 20]],
>
>
>
>
>
> On Thursday, August 3, 2023 at 11:34:08 PM GMT+8, Adrian Pop-Tifrea <
> poptifreaadr...@gmail.com> wrote:
>
>
> Hello,
>
> can you also please show us how you created the pandas dataframe? I mean,
> how you added the actual data into the dataframe. It would help us for
> reproducing the error.
>
> Thank you,
> Pop-Tifrea Adrian
>
> On Mon, Jul 31, 2023 at 5:03 AM second_co...@yahoo.com <
> second_co...@yahoo.com> wrote:
>
> i changed to
>
> ArrayType(ArrayType(ArrayType(IntegerType()))), still get same error
>
> Thank you for responding
>
> On Thursday, July 27, 2023 at 06:58:09 PM GMT+8, Adrian Pop-Tifrea <
> poptifreaadr...@gmail.com> wrote:
>
>
> Hello,
>
> when you said your pandas Dataframe has 10 rows, does that mean it
> contains 10 images? Because if that's the case, then you'd want ro only use
> 3 layers of ArrayType when you define the schema.
>
> Best regards,
> Adrian
>
>
>
> On Thu, Jul 27, 2023, 11:04 second_co...@yahoo.com.INVALID
>  wrote:
>
> i have panda dataframe with column 'image' using numpy.ndarray. shape is (500,
> 333, 3) per image. my panda dataframe has 10 rows, thus, shape is (10,
> 500, 333, 3)
>
> when using spark.createDataframe(panda_dataframe, schema), i need to
> specify the schema,
>
> schema = StructType([
> StructField("image",
> ArrayType(ArrayType(ArrayType(ArrayType(IntegerType(), nullable=False)
> ])
>
>
> i get error
>
> raise TypeError(
> , TypeError: field image: 
> ArrayType(ArrayType(ArrayType(ArrayType(IntegerType(), True), True), True), 
> True) can not accept object array([[[14, 14, 14],
>
> ...
>
> Can advise how to set schema for image with numpy.ndarray ?
>
>
>
>


Re: Interested in contributing to SPARK-24815

2023-08-03 Thread Sean Owen
Formally, an ICLA is required, and you can read more here:
https://www.apache.org/licenses/contributor-agreements.html

In practice, it's unrealistic to collect and verify an ICLA for every PR
contributed by 1000s of people. We have not gated on that.
But, contributions are in all cases governed by the same terms, even
without a signed ICLA. That's the verbiage you're referring to.
A CLA is a good idea, for sure, if there are any questions about the terms
of your contribution.

Here there does seem to be a question - retaining Twilio copyright headers
in source code. That is generally not what would happen for your everyday
contributions to an ASF project, as the copyright header (and CLAs) already
describe the relevant questions of rights: it has been licensed to the ASF.
(There are other situations where retaining a distinct copyright header is
required, typically when adding code licensed under another OSS license,
but I don't think they apply here)

I would say you should review and execute a CCLA for Twilio (assuming you
agree with the terms) to avoid doubt.


On Thu, Aug 3, 2023 at 6:34 PM Rinat Shangeeta 
wrote:

> (Adding my manager Eugene Kim who will cover me as I plan to be out of the
> office soon)
>
> Hi Kent and Sean,
>
> Nice to meet you. I am working on the OSS legal aspects with Pavan who is
> planning to make the contribution request to the Spark project. I saw that
> Sean mentioned in his email that the contributions would be governed under
> the ASF CCLA. In the Spark contribution guidelines
> <https://spark.apache.org/contributing.html>, there is no mention of
> having to sign a CCLA. In fact, this is what I found in the contribution
> guidelines:
>
> Contributing code changes
>
> Please review the preceding section before proposing a code change. This
> section documents how to do so.
>
> When you contribute code, you affirm that the contribution is your
> original work and that you license the work to the project under the
> project’s open source license. Whether or not you state this explicitly,
> by submitting any copyrighted material via pull request, email, or other
> means you agree to license the material under the project’s open source
> license and warrant that you have the legal authority to do so.
>
> Can you please point us to an authoritative source about the process?
>
> Also, is there a way to find out if a signed CCLA already exists for
> Twilio from your end? Thanks and appreciate your help!
>
>
> Best,
> Rinat
>
> *Rinat Shangeeta*
> Sr. Patent/Open Source Counsel
> [image: Twilio] <https://www.twilio.com/?utm_source=email_signature>
>
>
> On Wed, Jul 26, 2023 at 2:27 PM Pavan Kotikalapudi <
> pkotikalap...@twilio.com> wrote:
>
>> Thanks for the response with all the information Sean and Kent.
>>
>> Is there a way to figure out if my employer (Twilio) part of CCLA?
>>
>> cc'ing: @Rinat Shangeeta  our Open Source Counsel
>> at twilio
>>
>> Thank you,
>>
>> Pavan
>>
>> On Tue, Jul 25, 2023 at 10:48 PM Kent Yao  wrote:
>>
>>> Hi Pavan,
>>>
>>> Refer to the ASF Source Header and Copyright Notice Policy[1], code
>>> directly submitted to ASF should include the Apache license header
>>> without any additional copyright notice.
>>>
>>>
>>> Kent Yao
>>>
>>> [1]
>>> https://urldefense.com/v3/__https://www.apache.org/legal/src-headers.html*headers__;Iw!!NCc8flgU!c_mZKzBbSjJtYRjillV20gRzzzDOgW2ooH6ctfrqaJA8Eu4D5yfA7OlQnGm5JpdAZIU_doYmrsufzUc$
>>>
>>> Sean Owen  于2023年7月25日周二 07:22写道:
>>>
>>> >
>>> > When contributing to an ASF project, it's governed by the terms of the
>>> ASF ICLA:
>>> https://urldefense.com/v3/__https://www.apache.org/licenses/icla.pdf__;!!NCc8flgU!c_mZKzBbSjJtYRjillV20gRzzzDOgW2ooH6ctfrqaJA8Eu4D5yfA7OlQnGm5JpdAZIU_doYmZDPppZg$
>>> or CCLA:
>>> https://urldefense.com/v3/__https://www.apache.org/licenses/cla-corporate.pdf__;!!NCc8flgU!c_mZKzBbSjJtYRjillV20gRzzzDOgW2ooH6ctfrqaJA8Eu4D5yfA7OlQnGm5JpdAZIU_doYmUNwE-5A$
>>> >
>>> > I don't believe ASF projects ever retain an original author copyright
>>> statement, but rather source files have a statement like:
>>> >
>>> > ...
>>> >  * Licensed to the Apache Software Foundation (ASF) under one or more
>>> >  * contributor license agreements.  See the NOTICE file distributed
>>> with
>>> >  * this work for additional information regarding copyright ownership.
>>> > ...
>>> >
>>> > While it's conceivable that such a statement could live in a NOTICE
>>> file, I don't bel

Re: [VOTE] SPIP: XML data source support

2023-07-28 Thread Sean Owen
+1 I think that porting the package 'as is' into Spark is probably
worthwhile.
That's relatively easy; the code is already pretty battle-tested, not that
big, and even originally came from Spark code, so it is more or less similar
already.

One thing it never got was DSv2 support, which means XML reading would
still be somewhat behind other formats. (I was not able to implement it.)
This isn't a necessary goal right now, but would be possibly part of the
logic of moving it into the Spark code base.

On Fri, Jul 28, 2023 at 5:38 PM Sandip Agarwala
 wrote:

> Dear Spark community,
>
> I would like to start the vote for "SPIP: XML data source support".
>
> XML is a widely used data format. An external spark-xml package (
> https://github.com/databricks/spark-xml) is available to read and write
> XML data in spark. Making spark-xml built-in will provide a better user
> experience for Spark SQL and structured streaming. The proposal is to
> inline code from the spark-xml package.
>
> SPIP link:
>
> https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit?usp=sharing
>
> JIRA:
> https://issues.apache.org/jira/browse/SPARK-44265
>
> Discussion Thread:
> https://lists.apache.org/thread/q32hxgsp738wom03mgpg9ykj9nr2n1fh
>
> Please vote on the SPIP for the next 72 hours:
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because __.
>
> Thanks, Sandip
>


Re: spark context list_packages()

2023-07-27 Thread Sean Owen
There is no such method in Spark. I think that's some EMR-specific
modification.
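
As a rough substitute in vanilla Spark, you can list what is installed in the
driver's Python environment with the standard library. A minimal sketch
(plain Python via importlib.metadata, not a SparkContext API, and it only
reflects the driver, not the executors):

from importlib.metadata import distributions

installed = sorted(f"{d.metadata['Name']}=={d.version}" for d in distributions())
for pkg in installed:
    print(pkg)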

On Wed, Jul 26, 2023 at 11:06 PM second_co...@yahoo.com.INVALID
 wrote:

> I ran the following code
>
> spark.sparkContext.list_packages()
>
> on spark 3.4.1 and i get below error
>
> An error was encountered:
> AttributeError
> [Traceback (most recent call last):
> ,   File "/tmp/spark-3d66c08a-08a3-4d4e-9fdf-45853f65e03d/shell_wrapper.py", 
> line 113, in exec
> self._exec_then_eval(code)
> ,   File "/tmp/spark-3d66c08a-08a3-4d4e-9fdf-45853f65e03d/shell_wrapper.py", 
> line 106, in _exec_then_eval
> exec(compile(last, '', 'single'), self.globals)
> ,   File "", line 1, in 
> , AttributeError: 'SparkContext' object has no attribute 'list_packages'
> ]
>
>
> Is list_packages and install_pypi_package available for vanilla spark or
> only available for AWS services?
>
>
> Thank you
>


Re: Spark 3.0.0 EOL

2023-07-26 Thread Sean Owen
There aren't "LTS" releases, though you might expect the last 3.x release
will see maintenance releases longer. See end of
https://spark.apache.org/versioning-policy.html

On Wed, Jul 26, 2023 at 3:56 AM Manu Zhang  wrote:

> Will Apache Spark 3.5 be a LTS version?
>
> Thanks,
> Manu
>
> On Mon, Jul 24, 2023 at 4:26 PM Dongjoon Hyun 
> wrote:
>
>> As Hyukjin replied, Apache Spark 3.0.0 is already in EOL status.
>>
>> To Pralabh, FYI, in the community,
>>
>> - Apache Spark 3.2 also reached the EOL already.
>>   https://lists.apache.org/thread/n4mdfwr5ksgpmrz0jpqp335qpvormos1
>>
>> If you are considering Apache Spark 4, here is the other 3.x timeline,
>>
>> - Apache Spark 3.3 => December, 2023.
>> - Apache Spark 3.4 => October, 2024
>> - Upcoming Apache Spark 3.5 => 18 months from the release
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Mon, Jul 24, 2023 at 12:21 AM Hyukjin Kwon 
>> wrote:
>>
>>>
>>> It's already EOL
>>>
>>> On Mon, Jul 24, 2023 at 4:17 PM Pralabh Kumar 
>>> wrote:
>>>
 Hi Dev Team

 If possible , can you please provide the Spark 3.0.0 EOL timelines .

 Regards
 Pralabh Kumar







Re: Interested in contributing to SPARK-24815

2023-07-24 Thread Sean Owen
When contributing to an ASF project, it's governed by the terms of the ASF
ICLA: https://www.apache.org/licenses/icla.pdf or CCLA:
https://www.apache.org/licenses/cla-corporate.pdf

I don't believe ASF projects ever retain an original author copyright
statement, but rather source files have a statement like:

...
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
...

While it's conceivable that such a statement could live in a NOTICE file, I
don't believe that's been done for any of the thousands of other
contributors. That's really more for noting the license of
non-Apache-licensed code. Code directly contributed to the project is
assumed to have been licensed per above already.

It might be wise to review the CCLA with Twilio and consider establishing
that to govern contributions.

On Mon, Jul 24, 2023 at 6:10 PM Pavan Kotikalapudi
 wrote:

> Hi Spark Dev,
>
> My name is Pavan Kotikalapudi, I work at Twilio.
>
> I am looking to contribute to this spark issue
> https://issues.apache.org/jira/browse/SPARK-24815.
>
> There is a clause from the company's OSS saying
>
> - The proposed contribution is about 100 lines of code modification in the
> Spark project, involving two files - this is considered a large
> contribution. An appropriate Twilio copyright notice needs to be added for
> the portion of code that is newly added.
>
> Please let me know if that is acceptable?
>
> Thank you,
>
> Pavan
>
>


Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
No, a pandas on Spark DF is distributed.

On Tue, Jun 20, 2023, 1:45 PM Mich Talebzadeh 
wrote:

> Thanks but if you create a Spark DF from Pandas DF that Spark DF is not
> distributed and remains on the driver. I recall a while back we had this
> conversation. I don't think anything has changed.
>
> Happy to be corrected
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 20 Jun 2023 at 20:09, Bjørn Jørgensen 
> wrote:
>
>> Pandas API on spark is an API so that users can use spark as they use
>> pandas. This was known as koalas.
>>
>> Is this limitation still valid for Pandas?
>> For pandas, yes. But what I did show wos pandas API on spark so its spark.
>>
>>  Additionally when we convert from Panda DF to Spark DF, what process is
>> involved under the bonnet?
>> I gess pyarrow and drop the index column.
>>
>> Have a look at
>> https://github.com/apache/spark/tree/master/python/pyspark/pandas
>>
>> tir. 20. juni 2023 kl. 19:05 skrev Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>> Whenever someone mentions Pandas I automatically think of it as an excel
>>> sheet for Python.
>>>
>>> OK my point below needs some qualification
>>>
>>> Why Spark here. Generally, parallel architecture comes into play when
>>> the data size is significantly large which cannot be handled on a single
>>> machine, hence, the use of Spark becomes meaningful. In cases where (the
>>> generated) data size is going to be very large (which is often norm rather
>>> than the exception these days), the data cannot be processed and stored in
>>> Pandas data frames as these data frames store data in RAM. Then, the whole
>>> dataset from a storage like HDFS or cloud storage cannot be collected,
>>> because it will take significant time and space and probably won't fit in a
>>> single machine RAM. (in this the driver memory)
>>>
>>> Is this limitation still valid for Pandas? Additionally when we convert
>>> from Panda DF to Spark DF, what process is involved under the bonnet?
>>>
>>> Thanks
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 20 Jun 2023 at 13:07, Bjørn Jørgensen 
>>> wrote:
>>>
 This is pandas API on spark

 from pyspark import pandas as ps
 df = ps.read_excel("testexcel.xlsx")
 [image: image.png]
 this will convert it to pyspark
 [image: image.png]

 tir. 20. juni 2023 kl. 13:42 skrev John Paul Jayme
 :

> Good day,
>
>
>
> I have a task to read excel files in databricks but I cannot seem to
> proceed. I am referencing the API documents -  read_excel
> 
> , but there is an error sparksession object has no attribute
> 'read_excel'. Can you advise?
>
>
>
> *JOHN PAUL JAYME*
> Data Engineer
>
> m. +639055716384  w. www.tdcx.com
>
>
>
> *Winner of over 350 Industry Awards*
>
> [image: Linkedin]  [image:
> Facebook]  [image: Twitter]
>  [image: Youtube]
>  [image: Instagram]
> 
>
>
>
> This is a confidential email that may be privileged or legally
> protected. You are not authorized to copy or disclose the contents of this
> email. If you are not the intended addressee, please inform the sender and
> delete this email.
>
>
>
>
>


 --
 Bjørn Jørgensen
 Vestre Aspehaug 4, 6010 Ålesund
 Norge

 +47 480 94 297

>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>


Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
It is indeed not part of SparkSession. See the link you cite. It is part of
the pyspark pandas API

On Tue, Jun 20, 2023, 5:42 AM John Paul Jayme 
wrote:

> Good day,
>
>
>
> I have a task to read excel files in databricks but I cannot seem to
> proceed. I am referencing the API documents -  read_excel
> 
> , but there is an error sparksession object has no attribute
> 'read_excel'. Can you advise?
>
>
>
> *JOHN PAUL JAYME*
> Data Engineer
>
> m. +639055716384  w. www.tdcx.com
>
>
>
> *Winner of over 350 Industry Awards*
>
> [image: Linkedin]  [image:
> Facebook]  [image: Twitter]
>  [image: Youtube]
>  [image: Instagram]
> 
>
>
>
> This is a confidential email that may be privileged or legally protected.
> You are not authorized to copy or disclose the contents of this email. If
> you are not the intended addressee, please inform the sender and delete
> this email.
>
>
>
>
>


Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-16 Thread Sean Owen
On Fri, Jun 16, 2023 at 3:58 PM Dongjoon Hyun 
wrote:

> I started the thread about already publicly visible version issues
> according to the ASF PMC communication guideline. It's no confidential,
> personal, or security-related stuff. Are you insisting this is confidential?
>

Discussion about a particular company should be on private@ - this is IMHO
like "personnel matters", in the doc you link. The principle is that
discussing whether an entity is doing something right or wrong is better in
private, because, hey, if the conclusion is "nothing's wrong here" then you
avoid disseminating any implication to the contrary.

I agreed with you, there's some value in discussing the general issue on
dev@. (I even said who the company was, though, it was I think clear before)

But, your thread title here is: "Apache Spark PMC asks Databricks to
differentiate its Spark version string"
(You separately claim this vote is about whether the PMC has a role here,
but, that's plainly not how this thread begins.)

Given that this has stopped being about ASF policy, and seems to be about
taking some action related to a company, I find it inappropriate again for
dev@, for exactly the reason I gave above. We have a PMC member repeating
this claim over and over, without support. This is why we don't do this in
public.



> May I ask which relevant context you are insisting not to receive
> specifically? I gave the specific examples (UI/logs/screenshot), and got
> the specific legal advice from `legal-discuss@` and replied why the
> version should be different.
>

It is the thread I linked in my reply:
https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb
This has already been discussed at length, and you're aware of it, but,
didn't mention it. I think that's critical; your text contains no problem
statement at all by itself.

Since we're here, fine: I vote -1, simply because this states no reason for
the action at all.
If we assume the thread ^^^ above is the extent of the logic, then, -1 for
the following reasons:
- Relevant ASF policy seems to say this is fine, as argued at
https://lists.apache.org/thread/p15tc772j9qwyvn852sh8ksmzrol9cof
- There is no argument any of this has caused a problem for the community
anyway; there is just nothing to 'fix'

I would again ask we not simply repeat the same thread again.


Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-16 Thread Sean Owen
As we noted in the last thread, this discussion should have been on private@
to begin with, but, the ship has sailed.

You are suggesting that non-PMC members vote on whether the PMC has to do
something? No, that's not how anything works here.
It's certainly the PMC that decides what to put in the board report, or
take action on behalf of the project.

This doesn't make sense here. Frankly, repeating this publicly without
relevant context, and avoiding the response you already got, is
inappropriate.

You may call a PMC vote on whether there's even an issue here, sure. If you
pursue it, you should explain specifically what the issue is w.r.t. policy,
and argue against the response you've already received.
We put valid issues in the board report, for sure. We do not include
invalid issues in the board report. That part needs no decision from anyone.


On Fri, Jun 16, 2023 at 3:08 PM Dongjoon Hyun 
wrote:

> No, this is a vote on dev@ intentionally as a part of our previous
> thread, "ASF policy violation and Scala version issues" (
> https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb)
>
> > did you mean this for the PMC list?
>
> I clearly started the thread with the following.
> > - Apache Spark PMC should include this incident report and the result in
> the next Apache Spark Quarterly Report (August).
>
> However, there is a perspective that this is none of Apache Spark PMC's
> role here.
>
> That's the rationale of this vote.
>
> This vote is whether this is Apache Spark PMC's role or not.
>
> Dongjoon.
>


Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-16 Thread Sean Owen
What does a vote on dev@ mean? did you mean this for the PMC list?

Dongjoon - this offers no rationale about "why". The more relevant thread
begins here:
https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb but it
likewise never got to connecting a specific observation to policy. Could
you explain your logic more concretely? otherwise this is still going
nowhere.


On Fri, Jun 16, 2023 at 2:53 PM Dongjoon Hyun  wrote:

> Please vote on the following statement. The vote is open until June 23th
> 1AM (PST) and passes if a majority +1 PMC votes are cast, with a minimum of
> 3 +1 votes.
>
> Apache Spark PMC asks Databricks to differentiate its Spark
> version string to avoid confusions because Apache Spark PMC
> is responsible for ensuring to follow ASF requirements[1] and
> respects ASF's legal advice [2, 3],
>
> [ ] +1 Yes
> [ ] -1 No because ...
>
> 
> 1. https://www.apache.org/foundation/governance/pmcs#organization
> 2. https://lists.apache.org/thread/mzhggd0rpz8t4d7vdsbhkp38mvd3lty4
> 3. https://www.apache.org/foundation/marks/downstream.html#source
>


Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Sean Owen
You sure it is not just that it's displaying in your local TZ? Check the
actual value as a long for example. That is likely the same time.

On Thu, Jun 8, 2023, 5:50 PM karan alang  wrote:

> ref :
> https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-from-mongodb-correctly
>
> Hello All,
> I've data stored in MongoDB collection and the timestamp column is not
> being read by Apache Spark correctly. I'm running Apache Spark on GCP
> Dataproc.
>
> Here is sample data :
>
> -
>
> In Mongo :
>
> +----------+----------------------+
> |timeslot  |timeslot_date         |
> +----------+----------------------+
> |1683527400|{2023-05-08T06:30:00Z}|
> +----------+----------------------+
>
>
> When I use pyspark to read this :
>
> +----------+-------------------+
> |timeslot  |timeslot_date      |
> +----------+-------------------+
> |1683527400|2023-05-07 23:30:00|
> +----------+-------------------+
>
> -
>
> My understanding is, data in Mongo is in UTC format i.e. 2023-05-08T06:30:00Z 
> is in UTC format. I'm in PST timezone. I'm not clear why spark is reading it 
> a different timezone format (neither PST nor UTC) Note - it is not reading it 
> as PST timezone, if it was doing that it would advance the time by 7 hours, 
> instead it is doing the opposite.
>
> Where is the default timezone format taken from, when Spark is reading data 
> from MongoDB ?
>
> Any ideas on this ?
>
> tia!
>
>
>
>
>


Re: JDK version support policy?

2023-06-08 Thread Sean Owen
Noted, but for that you'd simply run your app on Java 17. If Spark works,
and your app's dependencies work on Java 17 because you compile it for 17
(and jakarta.* classes for example) then there's no issue.

On Thu, Jun 8, 2023 at 3:13 AM Martin Andersson 
wrote:

> There are some reasons to drop Java 11 as well. Java 17 included a large
> change, breaking backwards compatibility with their transition from Java
> EE to Jakarta EE
> <https://blogs.oracle.com/javamagazine/post/transition-from-java-ee-to-jakarta-ee>.
> This means that any users using Spark 4.0 together with Spring 6.x or any
> recent version of servlet containers such as Tomcat or Jetty will
> experience issues. (For security reasons it's beneficial to float your
> dependencies to the latest version of these libraries/frameworks)
>
> I'm not explicitly saying Java 11 should be dropped in Spark 4, just
> thought I'd bring this issue to your attention.
>
> Best Regards, Martin
> --
> *From:* Jungtaek Lim 
> *Sent:* Wednesday, June 7, 2023 23:19
> *To:* Sean Owen 
> *Cc:* Dongjoon Hyun ; Holden Karau <
> hol...@pigscanfly.ca>; dev 
> *Subject:* Re: JDK version support policy?
>
>
> EXTERNAL SENDER. Do not click links or open attachments unless you
> recognize the sender and know the content is safe. DO NOT provide your
> username or password.
>
> +1 to drop Java 8 but +1 to set the lowest support version to Java 11.
>
> Considering the phase for only security updates, 11 LTS would not be EOLed
> in very long time. Unless that’s coupled with other deps which require
> bumping JDK version (hope someone can bring up lists), it doesn’t seem to
> buy much. And given the strong backward compatibility JDK provides, that’s
> less likely.
>
> Purely from the project’s source code view, does anyone know how much
> benefits we can leverage for picking up 17 rather than 11? I lost the
> track, but some of their proposals are more likely catching up with other
> languages, which don’t make us be happy since Scala provides them for years.
>
> 2023년 6월 8일 (목) 오전 2:35, Sean Owen 님이 작성:
>
> I also generally perceive that, after Java 9, there is much less breaking
> change. So working on Java 11 probably means it works on 20, or can be
> easily made to without pain. Like I think the tweaks for Java 17 were quite
> small.
>
> Targeting Java >11 excludes Java 11 users and probably wouldn't buy much.
> Keeping the support probably doesn't interfere with working on much newer
> JVMs either.
>
> On Wed, Jun 7, 2023, 12:29 PM Holden Karau  wrote:
>
> So JDK 11 is still supported in open JDK until 2026, I'm not sure if we're
> going to see enough folks moving to JRE17 by the Spark 4 release unless we
> have a strong benefit from dropping 11 support I'd be inclined to keep it.
>
> On Tue, Jun 6, 2023 at 9:08 PM Dongjoon Hyun  wrote:
>
> I'm also +1 on dropping both Java 8 and 11 in Apache Spark 4.0, too.
>
> Dongjoon.
>
> On 2023/06/07 02:42:19 yangjie01 wrote:
> > +1 on dropping Java 8 in Spark 4.0, and I even hope Spark 4.0 can only
> support Java 17 and the upcoming Java 21.
> >
> > 发件人: Denny Lee 
> > 日期: 2023年6月7日 星期三 07:10
> > 收件人: Sean Owen 
> > 抄送: David Li , "dev@spark.apache.org" <
> dev@spark.apache.org>
> > 主题: Re: JDK version support policy?
> >
> > +1 on dropping Java 8 in Spark 4.0, saying this as a fan of the
> fast-paced (positive) updates to Arrow, eh?!
> >
> > On Tue, Jun 6, 2023 at 4:02 PM Sean Owen  sro...@gmail.com>> wrote:
> > I haven't followed this discussion closely, but I think we could/should
> drop Java 8 in Spark 4.0, which is up next after 3.5?
> >
> > On Tue, Jun 6, 2023 at 2:44 PM David Li  lidav...@apache.org>> wrote:
> > Hello Spark developers,
> >
> > I'm from the Apache Arrow project. We've discussed Java version support
> [1], and crucially, whether to continue supporting Java 8 or not. As Spark
> is a big user of Arrow in Java, I was curious what Spark's policy here was.
> >
> > If Spark intends to stay on Java 8, for instance, we may also want to
> stay on Java 8 or otherwise provide some supported version of Arrow for
> Java 8.
> >
> > We've seen dependencies dropping or planning to drop support. gRPC may
> drop Java 8 at any time [2], possibly this September [3], which may affect
> Spark (due to Spark Connect). And today we saw that Arrow had issues
> running tests with Mockito on Java 20, but we couldn't update Mockito since
> it had dropped Java 8 support. (We pinned the JDK version in that CI
> pipeline for now.)
> >
> > So at least, I am curious if Arrow could start the long proces

Re: JDK version support policy?

2023-06-07 Thread Sean Owen
I also generally perceive that, after Java 9, there is much less breaking
change. So working on Java 11 probably means it works on 20, or can be
easily made to without pain. Like I think the tweaks for Java 17 were quite
small.

Targeting Java >11 excludes Java 11 users and probably wouldn't buy much.
Keeping the support probably doesn't interfere with working on much newer
JVMs either.

On Wed, Jun 7, 2023, 12:29 PM Holden Karau  wrote:

> So JDK 11 is still supported in open JDK until 2026, I'm not sure if we're
> going to see enough folks moving to JRE17 by the Spark 4 release unless we
> have a strong benefit from dropping 11 support I'd be inclined to keep it.
>
> On Tue, Jun 6, 2023 at 9:08 PM Dongjoon Hyun  wrote:
>
>> I'm also +1 on dropping both Java 8 and 11 in Apache Spark 4.0, too.
>>
>> Dongjoon.
>>
>> On 2023/06/07 02:42:19 yangjie01 wrote:
>> > +1 on dropping Java 8 in Spark 4.0, and I even hope Spark 4.0 can only
>> support Java 17 and the upcoming Java 21.
>> >
>> > 发件人: Denny Lee 
>> > 日期: 2023年6月7日 星期三 07:10
>> > 收件人: Sean Owen 
>> > 抄送: David Li , "dev@spark.apache.org" <
>> dev@spark.apache.org>
>> > 主题: Re: JDK version support policy?
>> >
>> > +1 on dropping Java 8 in Spark 4.0, saying this as a fan of the
>> fast-paced (positive) updates to Arrow, eh?!
>> >
>> > On Tue, Jun 6, 2023 at 4:02 PM Sean Owen > sro...@gmail.com>> wrote:
>> > I haven't followed this discussion closely, but I think we could/should
>> drop Java 8 in Spark 4.0, which is up next after 3.5?
>> >
>> > On Tue, Jun 6, 2023 at 2:44 PM David Li > lidav...@apache.org>> wrote:
>> > Hello Spark developers,
>> >
>> > I'm from the Apache Arrow project. We've discussed Java version support
>> [1], and crucially, whether to continue supporting Java 8 or not. As Spark
>> is a big user of Arrow in Java, I was curious what Spark's policy here was.
>> >
>> > If Spark intends to stay on Java 8, for instance, we may also want to
>> stay on Java 8 or otherwise provide some supported version of Arrow for
>> Java 8.
>> >
>> > We've seen dependencies dropping or planning to drop support. gRPC may
>> drop Java 8 at any time [2], possibly this September [3], which may affect
>> Spark (due to Spark Connect). And today we saw that Arrow had issues
>> running tests with Mockito on Java 20, but we couldn't update Mockito since
>> it had dropped Java 8 support. (We pinned the JDK version in that CI
>> pipeline for now.)
>> >
>> > So at least, I am curious if Arrow could start the long process of
>> migrating Java versions without impacting Spark, or if we should continue
>> to cooperate. Arrow Java doesn't see quite so much activity these days, so
>> it's not quite critical, but it's possible that these dependency issues
>> will start to affect us more soon. And looking forward, Java is working on
>> APIs that should also allow us to ditch the --add-opens flag requirement
>> too.
>> >
>> > [1]: https://lists.apache.org/thread/phpgpydtt3yrgnncdyv4qdq1gf02s0yj<
>> https://mailshield.baidu.com/check?q=Nz%2bGj2hdKguk92URjA7sg0PfbSN%2fXUIMgrHTmW45gOOKEr3Shre45B7TRzhEpb%2baVsnyuRL%2fl%2f0cu7IVGHunSGDVnxM%3d
>> >
>> > [2]:
>> https://github.com/grpc/proposal/blob/master/P5-jdk-version-support.md<
>> https://mailshield.baidu.com/check?q=s89S3eo8GCJkV7Mpx7aG1SXId7uCRYGjQMA6DeLuX9duS86LhIODZMJfeFdGMWdFzJ8S7minyHoC7mCrzHagbJXCXYTBH%2fpZBpfTbw%3d%3d
>> >
>> > [3]: https://github.com/grpc/grpc-java/issues/9386<
>> https://mailshield.baidu.com/check?q=R0HtWZIkY5eIxpz8jtqHLzd0ugNbcaXIKW2LbUUxpIn0t9Y9yAhuHPuZ4buryfNwRnnJTA%3d%3d
>> >
>> >
>>
>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: ASF policy violation and Scala version issues

2023-06-07 Thread Sean Owen
Hi Dongjoon, I think this conversation is not advancing anymore. I
personally consider the matter closed unless you can find other support or
respond with more specifics. While this perhaps should be on private@, I
think it's not wrong as an instructive discussion on dev@.

I don't believe you've made a clear argument about the problem, or how it
relates specifically to policy. Nevertheless I will show you my logic.

You are asserting that a vendor cannot call a product Apache Spark 3.4.0 if
it omits a patch updating a Scala maintenance version. This difference has
no known impact on usage, as far as I can tell.

Let's see what policy requires:

1/ All source code changes must meet at least one of the acceptable changes
criteria set out below:
- The change has been accepted by the relevant Apache project community for
inclusion in a future release. Note that the process used to accept changes
and how that acceptance is documented varies between projects.
- A change is a fix for an undisclosed security issue; and the fix is not
publicly disclosed as a security fix; and the Apache project has been
notified of the both issue and the proposed fix; and the PMC has rejected
neither the vulnerability report nor the proposed fix.
- A change is a fix for a bug; and the Apache project has been notified of
both the bug and the proposed fix; and the PMC has rejected neither the bug
report nor the proposed fix.
- Minor changes (e.g. alterations to the start-up and shutdown scripts,
configuration files, file layout etc.) to integrate with the target
platform providing the Apache project has not objected to those changes.

The change you cite meets the 4th point, minor change, made for integration
reasons. There is no known technical objection; this was after all at one
point the state of Apache Spark.


2/ A version number must be used that both clearly differentiates it from
an Apache Software Foundation release and clearly identifies the Apache
Software Foundation version on which the software is based.

Keep in mind the product here is not "Apache Spark", but the "Databricks
Runtime 13.1 (including Apache Spark 3.4.0)". That is, there is far more
than a version number differentiating this product from Apache Spark. There
is no standalone distribution of Apache Spark anywhere here. I believe that
easily matches the intent.


3/ The documentation must clearly identify the Apache Software Foundation
version on which the software is based.

Clearly, yes.


4/ The end user expects that the distribution channel will back-port fixes.
It is not necessary to back-port all fixes. Selection of fixes to back-port
must be consistent with the update policy of that distribution channel.

I think this is safe to say too. Indeed this explicitly contemplates not
back-porting a change.


Backing up, you can see from this document that the spirit of it is: don't
include changes in your own Apache Foo x.y that aren't wanted by the
project, and still call it Apache Foo x.y. I don't believe your case
matches this spirit either.

I do think it's not crazy to suggest: hey vendor, would you call this
"Apache Spark + patches" or ".vendor123"? But that's at best a suggestion,
and I think it does nothing in particular for users. You've made the
suggestion, and I do not see that some police action from the PMC must follow.


I think you're simply objecting to a vendor choice, but that is not
on-topic here unless you can specifically rebut the reasoning above and
show it's connected.


On Wed, Jun 7, 2023 at 11:02 AM Dongjoon Hyun  wrote:

> Sean, it seems that you are confused here. We are not talking about your
> upper system (the notebook environment). We are talking about the
> submodule, "Apache Spark 3.4.0-databricks". Whatever you call it, both of
> us knows "Apache Spark 3.4.0-databricks" is different from "Apache Spark
> 3.4.0". You should not use "3.4.0" at your subsystem.
>
> > This also is aimed at distributions of "Apache Foo", not products that
> > "include Apache Foo", which are clearly not Apache Foo.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: ASF policy violation and Scala version issues

2023-06-07 Thread Sean Owen
(With consent, shall we move this to the PMC list?)

No, I don't think that's what this policy says.

First, could you please be more specific here? why do you think a certain
release is at odds with this?
Because so far you've mentioned, I think, not taking a Scala maintenance
release update.

But this says things like:

The source code on which the software is based must either be identical to
an Apache Software Foundation source code release or all of the following
must also be true:
  ...
  - The end user expects that the distribution channel will back-port
fixes. It is not necessary to back-port all fixes. Selection of fixes to
back-port must be consistent with the update policy of that distribution
channel.

That describes what you're talking about.

This also is aimed at distributions of "Apache Foo", not products that
"include Apache Foo", which are clearly not Apache Foo.
The spirit of it is, more generally: don't keep new features and fixes to
yourself. That does not seem to apply here.

On Tue, Jun 6, 2023 at 11:34 PM Dongjoon Hyun 
wrote:

> Hi, All and Matei (as the Chair of Spark PMC).
>
> For the ASF policy violation part, here is a legal recommendation
> documentation (draft) from `legal-discuss@`.
>
> https://www.apache.org/foundation/marks/downstream.html#source
>
> > A version number must be used that both clearly differentiates it from
> an Apache Software Foundation release and clearly identifies the Apache
> Software Foundation version on which the software is based.
>
> In short, Databricks should not claim its product like "Apache Spark
> 3.4.0". The version number should clearly differentiate it from Apache
> Spark 3.4.0. I hope we can conclude this together in this way and move our
> focus forward to the other remaining issues.
>
> To Matei, could you do the legal follow-up officially with Databricks with
> the above info?
>
> If there is a person to do this, I believe you are the best person to
> drive this.
>
> Thank you in advance.
>
> Dongjoon.
>
>
> On Tue, Jun 6, 2023 at 2:49 PM Dongjoon Hyun  wrote:
>
>> It goes to "legal-discuss@".
>>
>> https://lists.apache.org/thread/mzhggd0rpz8t4d7vdsbhkp38mvd3lty4
>>
>> I hope we can conclude the legal part clearly and shortly in one way or
>> another which we will follow with confidence.
>>
>> Dongjoon
>>
>> On 2023/06/06 20:06:42 Dongjoon Hyun wrote:
>> > Thank you, Sean, Mich, Holden, again.
>> >
>> > For this specific part, let's ask the ASF board via bo...@apache.org to
>> > find a right answer because it's a controversial legal issue here.
>> >
>> > > I think you'd just prefer Databricks make a different choice, which is
>> > legitimate, but, an issue to take up with Databricks, not here.
>> >
>> > Dongjoon.
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: JDK version support policy?

2023-06-06 Thread Sean Owen
I haven't followed this discussion closely, but I think we could/should
drop Java 8 in Spark 4.0, which is up next after 3.5?

On Tue, Jun 6, 2023 at 2:44 PM David Li  wrote:

> Hello Spark developers,
>
> I'm from the Apache Arrow project. We've discussed Java version support
> [1], and crucially, whether to continue supporting Java 8 or not. As Spark
> is a big user of Arrow in Java, I was curious what Spark's policy here was.
>
> If Spark intends to stay on Java 8, for instance, we may also want to stay
> on Java 8 or otherwise provide some supported version of Arrow for Java 8.
>
> We've seen dependencies dropping or planning to drop support. gRPC may
> drop Java 8 at any time [2], possibly this September [3], which may affect
> Spark (due to Spark Connect). And today we saw that Arrow had issues
> running tests with Mockito on Java 20, but we couldn't update Mockito since
> it had dropped Java 8 support. (We pinned the JDK version in that CI
> pipeline for now.)
>
> So at least, I am curious if Arrow could start the long process of
> migrating Java versions without impacting Spark, or if we should continue
> to cooperate. Arrow Java doesn't see quite so much activity these days, so
> it's not quite critical, but it's possible that these dependency issues
> will start to affect us more soon. And looking forward, Java is working on
> APIs that should also allow us to ditch the --add-opens flag requirement
> too.
>
> [1]: https://lists.apache.org/thread/phpgpydtt3yrgnncdyv4qdq1gf02s0yj
> [2]:
> https://github.com/grpc/proposal/blob/master/P5-jdk-version-support.md
> [3]: https://github.com/grpc/grpc-java/issues/9386
>


Re: ASF policy violation and Scala version issues

2023-06-05 Thread Sean Owen
I think the issue is whether a distribution of Spark is so materially
different from OSS that it causes problems for the larger community of
users. There's a legitimate question of whether such a thing can be called
"Apache Spark + changes", as describing it that way becomes meaningfully
inaccurate. And if it's inaccurate, then it's a trademark usage issue, and
a matter for the PMC to act on. I certainly recall this type of problem
from the early days of Hadoop - the project itself had 2 or 3 live branches
in development (was it 0.20.x vs 0.23.x vs 1.x? YARN vs no YARN?) picked up
by different vendors and it was unclear what "Apache Hadoop" meant in a
vendor distro. Or frankly, upstream.

In comparison, variation in Scala maintenance release seems trivial. I'm
not clear from the thread what actual issue this causes to users. Is there
more to it - does this go hand in hand with JDK version and Ammonite, or
are those separate? What's an example of the practical user issue. Like, I
compile vs Spark 3.4.0 and because of Scala version differences it doesn't
run on some vendor distro? That's not great, but seems like a vendor
problem. Unless you tell me we are getting tons of bug reports to OSS Spark
as a result or something.

Is the implication that something in OSS Spark is being blocked to prefer
some set of vendor choices? because the changes you're pointing to seem to
be going into Apache Spark, actually. It'd be more useful to be specific
and name names at this point, seems fine.

The rest of this is just a discussion about Databricks choices. (If it's
not clear, I'm at Databricks but do not work on the Spark distro). We can
discuss but it seems off-topic _if_ it can't be connected to a problem for
OSS Spark. Anyway:

If it helps, _some_ important patches are described at
https://docs.databricks.com/release-notes/runtime/maintenance-updates.html
; I don't think this is exactly hidden.

Out of curiosity, how would you describe this software in the UI instead?
"3.4.0" is shorthand, because this is a little dropdown menu; the terminal
output is likewise not a place to list all patches. You would propose
requesting calling this "3.4.0 + patches"? That's the best I can think of,
but I don't think it addresses what you're getting at anyway. I think you'd
just prefer Databricks make a different choice, which is legitimate, but,
an issue to take up with Databricks, not here.


On Mon, Jun 5, 2023 at 6:58 PM Dongjoon Hyun 
wrote:

> Hi, Sean.
>
> "+ patches" or "powered by Apache Spark 3.4.0" is not a problem as you
> mentioned. For the record, I also didn't bring up any old story here.
>
> > "Apache Spark 3.4.0 + patches"
>
> However, "including Apache Spark 3.4.0" still causes confusion even in a
> different way because of those missing patches, SPARK-40436 (Upgrade Scala
> to 2.12.17) and SPARK-39414 (Upgrade Scala to 2.12.16). Technically,
> Databricks Runtime doesn't include Apache Spark 3.4.0 while it claims it to
> the users.
>
> [image: image.png]
>
> It's a sad story from the Apache Spark Scala perspective because the users
> cannot even try to use the correct Scala 2.12.17 version in the runtime.
>
> All items I've shared are connected via a single theme, hurting Apache
> Spark Scala users.
> From (1) building Spark, (2) creating a fragmented Scala Spark runtime
> environment and (3) hidden user-facing documentation.
>
> Of course, I don't think those are designed in an organized way
> intentionally. It just happens at the same time.
>
> Based on your comments, let me ask you two questions. (1) When Databricks
> builds its internal Spark from its private code repository, is it a company
> policy to always expose "Apache 3.4.0" to the users like the following by
> ignoring all changes (whatever they are). And, (2) Do you insist that it is
> normative and clear to the users and the community?
>
> > - The runtime logs "23/06/05 04:23:27 INFO SparkContext: Running Spark
> version 3.4.0"
> > - UI shows Apache Spark logo and `3.4.0`.
>
>>
>>


Re: ASF policy violation and Scala version issues

2023-06-05 Thread Sean Owen
On Mon, Jun 5, 2023 at 12:01 PM Dongjoon Hyun 
wrote:

> 1. For the naming, yes, but the company should use different version
> numbers instead of the exact "3.4.0". As I shared the screenshot in my
> previous email, the company exposes "Apache Spark 3.4.0" exactly because
> they build their distribution without changing their version number at all.
>

I don't believe this is supported by guidance on the underlying issue here,
which is trademark. There is nothing wrong with nominative use, and I think
that's what this is. A thing can be "Apache Spark 3.4.0 + patches" and be
described that way.
Calling it "Apache Spark 3.4.0.vendor123" is argubaly more confusing IMHO,
as there is no such Apache Spark version.



> 2. According to
> https://mvnrepository.com/artifact/org.apache.spark/spark-core,
> all the other companies followed  "Semantic Versioning" or added
> additional version numbers at their distributions, didn't they? AFAIK,
> nobody claims to take over the exact, "3.4.0" version string, in source
> code level like this company.
>

Here you're talking about software artifact numbering, for companies that
were also releasing their own maintenance branch of OSS. That pretty much
requires some sub-versioning scheme. I think that's fine too, although as
above I think this is arguably _worse_ w.r.t. reuse of the Apache name and
namespace.
I'm not aware of any policy on this, and don't find this problematic
myself. Doesn't mean it's right, but does mean implicitly this has never
before been viewed as an issue?

The one I'm aware of was releasing a product "including Apache Spark 2.0"
before it existed, which does seem to potentially cause confusion, and that
was addressed.

Can you describe what policy is violated? we can disagree about what we'd
prefer or not, but the question is, what if anything is disallowed? I'm not
seeing that.


> 3. This company not only causes the 'Scala Version Segmentation'
> environment in a subtle way, but also defames Apache Spark 3.4.0 by
> removing many bug fixes of SPARK-40436 (Upgrade Scala to 2.12.17) and
> SPARK-39414 (Upgrade Scala to 2.12.16) for some unknown reason. Apparently,
> this looks like not a superior version of Apache Spark 3.4.0. For me, it's
> the inferior version. If a company disagrees with Scala 2.12.17 for some
> internal reason, they are able to stick to 2.12.15, of course. However,
> Apache Spark PMC should not allow them to lie to the customers that "Apache
> Spark 3.4.0" uses Scala 2.12.15 by default. That's the reason why I
> initiated this email because I'm considering this as a serious blocker to
> make Apache Spark Scala improvement.
> - https://github.com/scala/scala/releases/tag/v2.12.17 (21 Merged PRs)
> - https://github.com/scala/scala/releases/tag/v2.12.16 (68 Merged PRs)
>

To be clear, this seems unrelated to your first two points above?

I'm having trouble following what you are arguing here. You are saying a
vendor release based on "Apache Spark 3.4.0" is not the same in some
material way that you don't like. That's a fine position to take, but I
think the product is still substantially describable as "Apache Spark
3.4.0 + patches". You can take up the issue with the vendor.

But more importantly, I am not seeing how that constrains anything in
Apache Spark? those updates were merged to OSS. But even taking up the
point you describe, why is the scala maintenance version even such a
material issue that is so severe it warrants PMC action?

Could you connect the dots a little more?


>


Re: ASF policy violation and Scala version issues

2023-06-05 Thread Sean Owen
1/ Regarding naming - I believe releasing "Apache Foo X.Y + patches" is
acceptable, if it is substantially Apache Foo X.Y. This is common practice
for downstream vendors. It's fair nominative use. The principle here is
consumer confusion. Is anyone substantially misled? Here I don't think so.
I know that we have in the past decided it would not be OK, for example, to
release a product with "Apache Spark 4.0" now as there is no such release,
even building from master. A vendor should elaborate the changes
somewhere, ideally. I'm sure this one is about Databricks but I'm also sure
Cloudera, Hortonworks, etc had Spark releases with patches, too.

2a/ That issue seems to be about just flipping which code sample is shown
by default. It seemed widely agreed that this would slightly help more users
than it harms. I agree with the change and don't see a need to escalate.
The question of further Python parity is a big one but is separate.

2b/ If a single dependency blocks important updates, yeah it's fair to
remove it, IMHO. I wouldn't remove in 3.5 unless the other updates are
critical, and it's not clear they are. In 4.0 yes.

2c/ Scala 2.13 is already supported in 3.x, and does not require 4.0. This
was about what the default non-Scala release convenience binaries use.
Sticking to 2.12 in 3.x doesn't seem like an issue, even desirable.

2d/ Same as 2b

3/ I don't think 1/ is an incident. Yes to moving towards 4.0 after 3.5,
IMHO, and to removing Ammonite in 4.0 if there is no resolution forthcoming

On Mon, Jun 5, 2023 at 2:46 AM Dongjoon Hyun 
wrote:

> Hi, All and Matei (as the Chair of Apache Spark PMC).
>
> Sorry for a long email, I want to share two topics and corresponding
> action items.
> You can go to "Section 3: Action Items" directly for the conclusion.
>
>
> ### 1. ASF Policy Violation ###
>
> ASF has a rule for "MAY I CALL MY MODIFIED CODE 'APACHE'?"
>
> https://www.apache.org/foundation/license-faq.html#Name-changes
>
> For example, when we call `Apache Spark 3.4.0`, it's supposed to be the
> same with one of our official distributions.
>
> https://downloads.apache.org/spark/spark-3.4.0/
>
> Specifically, in terms of the Scala version, we believe it should have
> Scala 2.12.17 because of 'SPARK-40436 Upgrade Scala to 2.12.17'.
>
> There is a company claiming something non-Apache like "Apache Spark 3.4.0
> minus SPARK-40436" with the name "Apache Spark 3.4.0."
>
> - The company website shows "X.Y (includes Apache Spark 3.4.0, Scala
> 2.12)"
> - The runtime logs "23/06/05 04:23:27 INFO SparkContext: Running Spark
> version 3.4.0"
> - UI shows Apache Spark logo and `3.4.0`.
> - However, Scala Version is '2.12.15'
>
> [image: Screenshot 2023-06-04 at 9.37.16 PM.png][image: Screenshot
> 2023-06-04 at 10.14.45 PM.png]
>
> Lastly, this is not a single instance. For example, the same company also
> claims "Apache Spark 3.3.2" with a mismatched Scala version.
>
>
> ### 2. Scala Issues ###
>
> In addition to (1), although we proceeded with good intentions and great
> care
> including dev mailing list discussion, there are several concerning areas
> which
> need more attention and our love.
>
> a) Scala Spark users will experience UX inconvenience from Spark 3.5.
>
> SPARK-42493 Make Python the first tab for code examples
>
> For the record, we discussed it here.
> - https://lists.apache.org/thread/1p8s09ysrh4jqsfd47qdtrl7rm4rrs05
>   "[DISCUSS] Show Python code examples first in Spark documentation"
>
> b) Scala version upgrade is blocked by the Ammonite library dev cycle
> currently.
>
> Although we discussed it here and it had good intentions,
> the current master branch cannot use the latest Scala.
>
> - https://lists.apache.org/thread/4nk5ddtmlobdt8g3z8xbqjclzkhlsdfk
> "Ammonite as REPL for Spark Connect"
>  SPARK-42884 Add Ammonite REPL integration
>
> Specifically, the following are blocked and I'm monitoring the
> Ammonite repository.
> - SPARK-40497 Upgrade Scala to 2.13.11
> - SPARK-43832 Upgrade Scala to 2.12.18
> - According to https://github.com/com-lihaoyi/Ammonite/issues ,
>   Scala 3.3.0 LTS support also looks infeasible.
>
> Although we may be able to wait for a while, there are two fundamental
> solutions
> to unblock this situation in a long-term maintenance perspective.
> - Replace it with a Scala-shell based implementation
> - Move `connector/connect/client/jvm/pom.xml` outside from Spark repo.
>Maybe, we can put it into the new repo like Rust and Go client.
>
> c) Scala 2.13 and above needs Apache Spark 4.0.
>
> In "Apache Spark 3.5.0 Expectations?" and "Apache Spark 4.0
> Timeframe?" threads,
> we discussed Spark 3.5.0 scope and decided to revert
> "SPARK-43836 Make Scala 2.13 as default in Spark 3.5".
> Apache Spark 4.0.0 is the only way to support Scala 2.13 or higher.
>
> - https://lists.apache.org/thread/3x6dh17bmy20n3frtt3crgxjydnxh2o0
> ("Apache Spark 3.5.0 

Re: Apache Spark 3.5.0 Expectations (?)

2023-05-29 Thread Sean Owen
It does seem risky; there are still likely libs out there that don't cross
compile for 2.13. I would make it the default at 4.0, myself.

On Mon, May 29, 2023 at 7:16 PM Hyukjin Kwon  wrote:

> While I support going forward with a higher version, actually using Scala
> 2.13 by default is a big deal especially in a way that:
>
>- Users would likely download the built-in version assuming that it’s
>backward binary compatible.
>- PyPI doesn't allow specifying the Scala version, meaning that users
>wouldn’t have a way to 'pip install pyspark' based on Scala 2.12.
>
> I wonder if it’s safer to do it in Spark 4 (which I believe will be
> discussed soon).
>
>
> On Mon, 29 May 2023 at 13:21, Jia Fan  wrote:
>
>> Thanks Dongjoon!
>> There are some ticket I want to share.
>> SPARK-39420 Support ANALYZE TABLE on v2 tables
>> SPARK-42750 Support INSERT INTO by name
>> SPARK-43521 Support CREATE TABLE LIKE FILE
>>
>> On Mon, May 29, 2023 at 08:42, Dongjoon Hyun  wrote:
>>
>>> Hi, All.
>>>
>>> Apache Spark 3.5.0 is scheduled for August (1st Release Candidate) and
>>> currently a few notable things are under discussions in the mailing list.
>>>
>>> I believe it's a good time to share a short summary list (containing
>>> both completed and in-progress items) to give a highlight in advance and to
>>> collect your targets too.
>>>
>>> Please share your expectations or working items if you want to
>>> prioritize them more in the community in Apache Spark 3.5.0 timeframe.
>>>
>>> (Sorted by ID)
>>> SPARK-40497 Upgrade Scala 2.13.11
>>> SPARK-42452 Remove hadoop-2 profile from Apache Spark 3.5.0
>>> SPARK-42913 Upgrade to Hadoop 3.3.5 (aws-java-sdk-bundle: 1.12.262 ->
>>> 1.12.316)
>>> SPARK-43024 Upgrade Pandas to 2.0.0
>>> SPARK-43200 Remove Hadoop 2 reference in docs
>>> SPARK-43347 Remove Python 3.7 Support
>>> SPARK-43348 Support Python 3.8 in PyPy3
>>> SPARK-43351 Add Spark Connect Go prototype code and example
>>> SPARK-43379 Deprecate old Java 8 versions prior to 8u371
>>> SPARK-43394 Upgrade to Maven 3.8.8
>>> SPARK-43436 Upgrade to RocksDbjni 8.1.1.1
>>> SPARK-43446 Upgrade to Apache Arrow 12.0.0
>>> SPARK-43447 Support R 4.3.0
>>> SPARK-43489 Remove protobuf 2.5.0
>>> SPARK-43519 Bump Parquet to 1.13.1
>>> SPARK-43581 Upgrade kubernetes-client to 6.6.2
>>> SPARK-43588 Upgrade to ASM 9.5
>>> SPARK-43600 Update K8s doc to recommend K8s 1.24+
>>> SPARK-43738 Upgrade to DropWizard Metrics 4.2.18
>>> SPARK-43831 Build and Run Spark on Java 21
>>> SPARK-43832 Upgrade to Scala 2.12.18
>>> SPARK-43836 Make Scala 2.13 as default in Spark 3.5
>>> SPARK-43842 Upgrade gcs-connector to 2.2.14
>>> SPARK-43844 Update to ORC 1.9.0
>>> UMBRELLA: Add SQL functions into Scala, Python and R API
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>> PS. The above is not a list of release blockers. Instead, it could be a
>>> nice-to-have from someone's perspective.
>>>
>>


Re: JDK version support information

2023-05-29 Thread Sean Owen
Per docs, it is Java 8. It's possible Java 11 partly works with 2.x but not
supported. But then again 2.x is not supported either.

On Mon, May 29, 2023, 6:43 AM Poorna Murali  wrote:

> We are currently using JDK 11 and spark 2.4.5.1 is working fine with that.
> So, we wanted to check the maximum JDK version supported for 2.4.5.1.
>
> On Mon, 29 May, 2023, 5:03 pm Aironman DirtDiver, 
> wrote:
>
>> Spark version 2.4.5.1 is based on Apache Spark 2.4.5. According to the
>> official Spark documentation for version 2.4.5, the maximum supported JDK
>> (Java Development Kit) version is JDK 8 (Java 8).
>>
>> Spark 2.4.5 is not compatible with JDK versions higher than Java 8.
>> Therefore, you should use JDK 8 to ensure compatibility and avoid any
>> potential issues when running Spark 2.4.5.
>>
>> El lun, 29 may 2023 a las 13:28, Poorna Murali ()
>> escribió:
>>
>>> Hi,
>>>
>>> We are using spark version 2.4.5.1. We would like to know the maximum
>>> JDK version supported for the same.
>>>
>>> Thanks,
>>> Poorna
>>>
>>
>>
>> --
>> Alonso Isidoro Roman
>> [image: https://]about.me/alonso.isidoro.roman
>>
>> 
>>
>


Re: [MLlib] how-to find implementation of Decision Tree Regressor fit function

2023-05-25 Thread Sean Owen
Are you looking for
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala


On Thu, May 25, 2023 at 6:54 AM Max 
wrote:

> Good day, I'm working on an implementation of Joint Probability Trees
> (JPT) using the Spark framework. For this to be as efficient as possible, I
> tried to find the implementation of the code for the fit function of
> Decision Tree Regressors. Unfortunately, I had little success in finding
> the specifics on how to handle the RDDs in the documentation or on
> GitHub. Hence, I am asking for a pointer on where these documents are. For
> Context, here are the links to JPT GitHub and the article:
> https://github.com/joint-probability-trees
> https://arxiv.org/abs/2302.07167 Thanks in advance. Sincerely, Maximilian
> Neumann
>


Re: Tensorflow on Spark CPU

2023-04-30 Thread Sean Owen
There is a large overhead to distributing this type of workload. I imagine
that for a small problem, the overhead dominates. You do not really need to
distribute a problem of this size, so more workers are probably just worse.

On Sun, Apr 30, 2023 at 1:46 AM second_co...@yahoo.com <
second_co...@yahoo.com> wrote:

> I re-tested with the cifar10 example and below is the result. Can you advise
> why a lower num_slots is faster compared with more slots?
>
> num_slots=20
>
> 231 seconds
>
>
> num_slots=5
>
> 52 seconds
>
>
> num_slot=1
>
> 34 seconds
>
> the code is at below
> https://gist.github.com/cometta/240bbc549155e22f80f6ba670c9a2e32
>
> Do you have an example of tensorflow+big dataset that I can test?
>
>
>
>
>
>
>
> On Saturday, April 29, 2023 at 08:44:04 PM GMT+8, Sean Owen <
> sro...@gmail.com> wrote:
>
>
> You don't want to use CPUs with Tensorflow.
> If it's not scaling, you may have a problem that is far too small to
> distribute.
>
> On Sat, Apr 29, 2023 at 7:30 AM second_co...@yahoo.com.INVALID
>  wrote:
>
> Has anyone successfully run native TensorFlow on Spark? I tested the example at
> https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor
> on Kubernetes CPUs, running it on multiple workers' CPUs. I do not see
> any speed-up in training time by setting the number of slots from 1 to 10. The
> time taken to train is still the same. Has anyone tested TensorFlow training on
> Spark distributed workers with CPUs? Can you share your working example?
>
>
>
>
>
>


Re: Tensorflow on Spark CPU

2023-04-29 Thread Sean Owen
You don't want to use CPUs with Tensorflow.
If it's not scaling, you may have a problem that is far too small to
distribute.

On Sat, Apr 29, 2023 at 7:30 AM second_co...@yahoo.com.INVALID
 wrote:

> Has anyone successfully run native TensorFlow on Spark? I tested the example at
> https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor
> on Kubernetes CPUs, running it on multiple workers' CPUs. I do not see
> any speed-up in training time by setting the number of slots from 1 to 10. The
> time taken to train is still the same. Has anyone tested TensorFlow training on
> Spark distributed workers with CPUs? Can you share your working example?
>
>
>
>
>
>


Re: Spark 3.4.0 with Hadoop2.7 cannot be downloaded

2023-04-20 Thread Sean Owen
We just removed it now, yes.

On Thu, Apr 20, 2023 at 9:08 AM Emil Ejbyfeldt
 wrote:

> Hi,
>
> I think this is expected as it was dropped from the release process in
> https://issues.apache.org/jira/browse/SPARK-40651
>
> Also I don't see a Hadoop2.7 option when selecting Spark 3.4.0 on
> https://spark.apache.org/downloads.html
> Not really sure why you could be seeing that.
>
> Best,
> Emil
>
>
> On 20/04/2023 08:23, Enrico Minack wrote:
> > Hi,
> >
> > selecting Spark 3.4.0 with Hadoop2.7 at
> > https://spark.apache.org/downloads.html leads to
> >
> >
> https://www.apache.org/dyn/closer.lua/spark/spark-3.4.0/spark-3.4.0-bin-hadoop2.tgz
> >
> > saying:
> >
> > The requested file or directory is *not* on the mirrors.
> >
> > The object is in not in our archive https://archive.apache.org/dist/
> >
> > Is this expected?
> >
> > Enrico
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 3.2.4 (RC1)

2023-04-10 Thread Sean Owen
+1 from me

On Sun, Apr 9, 2023 at 7:19 PM Dongjoon Hyun  wrote:

> I'll start with my +1.
>
> I verified the checksum, signatures of the artifacts, and documentations.
> Also, ran the tests with YARN and K8s modules.
>
> Dongjoon.
>
> On 2023/04/09 23:46:10 Dongjoon Hyun wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 3.2.4.
> >
> > The vote is open until April 13th 1AM (PST) and passes if a majority +1
> PMC
> > votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.2.4
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see https://spark.apache.org/
> >
> > The tag to be voted on is v3.2.4-rc1 (commit
> > 0ae10ac18298d1792828f1d59b652ef17462d76e)
> > https://github.com/apache/spark/tree/v3.2.4-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.2.4-rc1-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1442/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.2.4-rc1-docs/
> >
> > The list of bug fixes going into 3.2.4 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12352607
> >
> > This release is using the release script of the tag v3.2.4-rc1.
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with a out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 3.2.4?
> > ===
> >
> > The current list of open tickets targeted at 3.2.4 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> > Version/s" = 3.2.4
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 3.4.0 (RC7)

2023-04-08 Thread Sean Owen
+1 form me, same result as last time.

On Fri, Apr 7, 2023 at 6:30 PM Xinrong Meng 
wrote:

> Please vote on releasing the following candidate(RC7) as Apache Spark
> version 3.4.0.
>
> The vote is open until 11:59pm Pacific time *April 12th* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.4.0-rc7 (commit
> 87a5442f7ed96b11051d8a9333476d080054e5a0):
> https://github.com/apache/spark/tree/v3.4.0-rc7
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1441
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-docs/
>
> The list of bug fixes going into 3.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>
> This release is using the release script of the tag v3.4.0-rc7.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.0?
> ===
> The current list of open tickets targeted at 3.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Xinrong Meng
>


Re: Looping through a series of telephone numbers

2023-04-02 Thread Sean Owen
That won't work, you can't use Spark within Spark like that.
If it were exact matches, the best solution would be to load both datasets
and join on telephone number.
For this case, I think your best bet is a UDF that contains the telephone
numbers as a list and decides whether a given number matches something in
the set. Then use that to filter, then work with the data set.
There are probably clever fast ways of efficiently determining if a string
is a prefix of a group of strings in Python you could use too.
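For illustration, here is a minimal PySpark sketch of that broadcast-list-plus-UDF approach. The file name "tels.csv" and the column name "phone" are assumptions for the example, not from the original thread; adjust them to the real schema.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

# Collect the ~40,000 target numbers once on the driver and broadcast them.
tels = [row[0] for row in spark.read.csv("tels.csv").collect() if row[0]]
tels_bc = spark.sparkContext.broadcast(tels)

@F.udf(BooleanType())
def matches_any(phone):
    # Same semantics as like("%tel%"): the column value contains the number.
    return phone is not None and any(t in phone for t in tels_bc.value)

ds = spark.read.parquet("/path/to/telephonedirectory")
hits = ds.filter(matches_any(F.col("phone")))

If plain containment checks turn out to be too slow, a set- or trie-based lookup inside the UDF would speed this up further.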

On Sun, Apr 2, 2023 at 3:17 AM Philippe de Rochambeau 
wrote:

> Many thanks, Mich.
> Is « foreach » the best construct to look up items in a dataset such as
> the below « telephonedirectory » data set?
>
> val telrdd = spark.sparkContext.parallelize(Seq("tel1", "tel2", "tel3", …))
> // the telephone sequence was read from a CSV file
>
> val ds = spark.read.parquet("/path/to/telephonedirectory")
>
> rdd.foreach(tel => {
>   longAcc.select("*").rlike("+" + tel)
> })
>
>
>
>
> On 1 Apr 2023, at 22:36, Mich Talebzadeh wrote:
>
> This may help
>
> Spark rlike() Working with Regex Matching Examples
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau 
> wrote:
>
>> Hello,
>> I’m looking for an efficient way in Spark to search for a series of
>> telephone numbers, contained in a CSV file, in a data set column.
>>
>> In pseudo code,
>>
>> for tel in [tel1, tel2, …. tel40,000]
>> search for tel in dataset using .like("%tel%")
>> end for
>>
>> I’m using the like function because the telephone numbers in the data set
>> may contain prefixes, such as "+"; e.g., "+331222".
>>
>> Any suggestions would be welcome.
>>
>> Many thanks.
>>
>> Philippe
>>
>>
>>
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: [VOTE] Release Apache Spark 3.4.0 (RC5)

2023-03-30 Thread Sean Owen
+1 same result from me as last time.

On Thu, Mar 30, 2023 at 3:21 AM Xinrong Meng 
wrote:

> Please vote on releasing the following candidate(RC5) as Apache Spark
> version 3.4.0.
>
> The vote is open until 11:59pm Pacific time *April 4th* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v3.4.0-rc5* (commit
> f39ad617d32a671e120464e4a75986241d72c487):
> https://github.com/apache/spark/tree/v3.4.0-rc5
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1439
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-docs/
>
> The list of bug fixes going into 3.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>
> This release is using the release script of the tag v3.4.0-rc5.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.0?
> ===
> The current list of open tickets targeted at 3.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Xinrong Meng
>
>


Re: What is the range of the PageRank value of graphx

2023-03-28 Thread Sean Owen
>From the docs:

 * Note that this is not the "normalized" PageRank and as a consequence
pages that have no
 * inlinks will have a PageRank of alpha. In particular, the pageranks may
have some values
 * greater than 1.

On Tue, Mar 28, 2023 at 9:11 AM lee  wrote:

> When I calculate pagerank using HugeGraph, each pagerank value is less
> than 1, and the total of pageranks is 1. However, the PageRank value of
> graphx is often greater than 1, so what is the range of the PageRank value
> of graphx?
>
>
>
>
>
>
> 李杰
> leedd1...@163.com
>  
> 
>
>


Re: Question related to asynchronously map transformation using java spark structured streaming

2023-03-26 Thread Sean Owen
What do you mean by asynchronously here?

On Sun, Mar 26, 2023, 10:22 AM Emmanouil Kritharakis <
kritharakismano...@gmail.com> wrote:

> Hello again,
>
> Do we have any news for the above question?
> I would really appreciate it.
>
> Thank you,
>
> --
>
> Emmanouil (Manos) Kritharakis
>
> Ph.D. candidate in the Department of Computer Science
> 
>
> Boston University
>
>
> On Tue, Mar 14, 2023 at 12:04 PM Emmanouil Kritharakis <
> kritharakismano...@gmail.com> wrote:
>
>> Hello,
>>
>> I hope this email finds you well!
>>
>> I have a simple dataflow in which I read from a kafka topic, perform a
>> map transformation and then I write the result to another topic. Based on
>> your documentation here
>> ,
>> I need to work with Dataset data structures. Even though my solution works,
>> I need to utilize map transformation asynchronously. So my question is how
>> can I asynchronously call map transformation with Dataset data structures
>> in a java structured streaming environment? Can you please share a working
>> example?
>>
>> I am looking forward to hearing from you as soon as possible. Thanks in
>> advance!
>>
>> Kind regards
>>
>> --
>>
>> Emmanouil (Manos) Kritharakis
>>
>> Ph.D. candidate in the Department of Computer Science
>> 
>>
>> Boston University
>>
>


Re: Kind help request

2023-03-25 Thread Sean Owen
It is telling you that the UI can't bind to any port. I presume that's
because of container restrictions?
If you don't want the UI at all, just set spark.ui.enabled to false
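For what it's worth, a minimal sketch of that setting in PySpark (the app name is made up); in this case the property would need to be passed through however GATK launches its embedded Spark:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("no-ui-example")             # illustrative name only
    .config("spark.ui.enabled", "false")  # never start the SparkUI, so no port binding is attempted
    .getOrCreate()
)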

On Sat, Mar 25, 2023 at 8:28 AM Lorenzo Ferrando <
lorenzo.ferra...@edu.unige.it> wrote:

> Dear Spark team,
>
> I am Lorenzo from University of Genoa. I am currently using (ubuntu 18.04)
> the nextflow/sarek pipeline to analyse genomic data through a singularity
> container. One of the steps of the pipeline uses GATK4, which uses
> Spark. However, after some time I get the following error:
>
>
> 23:27:48.112 INFO  NativeLibraryLoader - Loading libgkl_compression.so from 
> jar:file:/gatk/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
> 23:27:48.523 INFO  ApplyBQSRSpark - 
> 
> 23:27:48.524 INFO  ApplyBQSRSpark - The Genome Analysis Toolkit (GATK) 
> v4.2.6.1
> 23:27:48.524 INFO  ApplyBQSRSpark - For support and documentation go to 
> https://software.broadinstitute.org/gatk/
> 23:27:48.525 INFO  ApplyBQSRSpark - Executing as ferrandl@alucard on Linux 
> v5.4.0-91-generic amd64
> 23:27:48.525 INFO  ApplyBQSRSpark - Java runtime: OpenJDK 64-Bit Server VM 
> v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
> 23:27:48.526 INFO  ApplyBQSRSpark - Start Date/Time: March 24, 2023 11:27:47 
> PM GMT
> 23:27:48.526 INFO  ApplyBQSRSpark - 
> 
> 23:27:48.526 INFO  ApplyBQSRSpark - 
> 
> 23:27:48.527 INFO  ApplyBQSRSpark - HTSJDK Version: 2.24.1
> 23:27:48.527 INFO  ApplyBQSRSpark - Picard Version: 2.27.1
> 23:27:48.527 INFO  ApplyBQSRSpark - Built for Spark Version: 2.4.5
> 23:27:48.527 INFO  ApplyBQSRSpark - HTSJDK Defaults.COMPRESSION_LEVEL : 2
> 23:27:48.527 INFO  ApplyBQSRSpark - HTSJDK 
> Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
> 23:27:48.527 INFO  ApplyBQSRSpark - HTSJDK 
> Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
> 23:27:48.527 INFO  ApplyBQSRSpark - HTSJDK 
> Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
> 23:27:48.527 INFO  ApplyBQSRSpark - Deflater: IntelDeflater
> 23:27:48.528 INFO  ApplyBQSRSpark - Inflater: IntelInflater
> 23:27:48.528 INFO  ApplyBQSRSpark - GCS max retries/reopens: 20
> 23:27:48.528 INFO  ApplyBQSRSpark - Requester pays: disabled
> 23:27:48.528 WARN  ApplyBQSRSpark -
>
>
>
>Warning: ApplyBQSRSpark is a BETA tool and is not yet ready for use in 
> production
>
>
>
>
> 23:27:48.528 INFO  ApplyBQSRSpark - Initializing engine
> 23:27:48.528 INFO  ApplyBQSRSpark - Done initializing engine
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 23/03/24 23:27:49 INFO SparkContext: Running Spark version 2.4.5
> 23/03/24 23:27:49 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 23/03/24 23:27:50 INFO SparkContext: Submitted application: ApplyBQSRSpark
> 23/03/24 23:27:50 INFO SecurityManager: Changing view acls to: ferrandl
> 23/03/24 23:27:50 INFO SecurityManager: Changing modify acls to: ferrandl
> 23/03/24 23:27:50 INFO SecurityManager: Changing view acls groups to:
> 23/03/24 23:27:50 INFO SecurityManager: Changing modify acls groups to:
> 23/03/24 23:27:50 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(ferrandl); 
> groups with view permissions: Set(); users  with modify permissions: 
> Set(ferrandl); groups with modify permissions: Set()
> 23/03/24 23:27:50 INFO Utils: Successfully started service 'sparkDriver' on 
> port 46757.
> 23/03/24 23:27:50 INFO SparkEnv: Registering MapOutputTracker
> 23/03/24 23:27:50 INFO SparkEnv: Registering BlockManagerMaster
> 23/03/24 23:27:50 INFO BlockManagerMasterEndpoint: Using 
> org.apache.spark.storage.DefaultTopologyMapper for getting topology 
> information
> 23/03/24 23:27:50 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint 
> up
> 23/03/24 23:27:50 INFO DiskBlockManager: Created local directory at 
> /home/ferrandl/projects/ribas_reanalysis/sarek/work/27/89b7451fcac6fd31461885b5774752/blockmgr-e76f7d59-da0b-4e62-8a99-3cdb23f11ae6
> 23/03/24 23:27:50 INFO MemoryStore: MemoryStore started with capacity 2004.6 
> MB
> 23/03/24 23:27:50 INFO SparkEnv: Registering OutputCommitCoordinator
> 23/03/24 23:27:51 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> 23/03/24 23:27:51 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
> Attempting port 4042.
> 23/03/24 23:27:51 WARN Utils: Service 'SparkUI' could not bind on port 4042. 
> Attempting port 4043.
> 23/03/24 23:27:51 WARN Utils: Service 'SparkUI' could not bind on port 4043. 
> Attempting port 4044.
> 23/03/24 23:27:51 WARN Utils: Service 

Re: Question related to parallelism using structed streaming parallelism

2023-03-21 Thread Sean Owen
Yes more specifically, you can't ask for executors once the app starts,
in SparkConf like that. You set this when you launch it against a Spark
cluster in spark-submit or otherwise.

On Tue, Mar 21, 2023 at 4:23 AM Mich Talebzadeh 
wrote:

> Hi Emmanouil,
>
> This means that your job is running on the driver as a single JVM, hence
> active(1)
>
>


Re: Understanding executor memory behavior

2023-03-16 Thread Sean Owen
All else equal it is better to have the same resources in fewer executors.
More tasks are local to other tasks which helps perf. There is more
possibility of 'borrowing' extra mem and CPU in a task.
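As a rough illustration, the two shapes from the question would be configured like this in PySpark (values taken from the question; these must be set before the application starts, for example at submit time, and cannot be changed on a running session):

from pyspark.sql import SparkSession

# Option A: many small executors (8 GB, 1 core each)
small_executors = (
    SparkSession.builder
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "1")
)

# Option B: fewer large executors (40 GB, 5 cores each). Memory per core is the
# same, but tasks within one executor JVM can share headroom and cached data.
large_executors = (
    SparkSession.builder
    .config("spark.executor.memory", "40g")
    .config("spark.executor.cores", "5")
)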

On Thu, Mar 16, 2023, 2:14 PM Nikhil Goyal  wrote:

> Hi folks,
> I am trying to understand what would be the difference in running 8G 1
> core executor vs 40G 5 core executors. I see that on yarn it can cause bin
> fitting issues but other than that are there any pros and cons on using
> either?
>
> Thanks
> Nikhil
>


Re: logging pickle files on local run of spark.ml Pipeline model

2023-03-15 Thread Sean Owen
Pickle won't work. But the others should. I think you are specifying an
invalid path in both cases but hard to say without more detail
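For reference, a self-contained sketch of saving a fitted PipelineModel with its own writer rather than pickle; the toy columns and the /tmp path are assumptions for illustration:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)], ["x", "label"])

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["x"], outputCol="features"),
    LinearRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)

# Writes a directory of metadata plus parquet files; this is why a single pickle fails.
model.write().overwrite().save("file:///tmp/my_pipeline_model")
reloaded = PipelineModel.load("file:///tmp/my_pipeline_model")

If MLflow is preferred, mlflow.spark.log_model generally expects a short run-relative artifact_path such as "model" rather than the experiment's full artifact_location, which might be related to the first error listed.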

On Wed, Mar 15, 2023, 9:13 AM Mnisi, Caleb 
wrote:

> Good Day
>
>
>
> I am having trouble saving a spark.ml Pipeline model to a pickle file,
> when running locally on my PC.
>
> I’ve tried a few ways to save the model:
>
>1. mlflow.spark.log_model(artifact_path=experiment.artifact_location,
>spark_model= model, registered_model_name="myModel")
>   1. with error that the spark model is multiple files
>2. pickle.dump(model, file): with error - TypeError: cannot pickle
>'_thread.RLock' object
>3. model.save(‘path’): with Java errors:
>   1. at
>   
> org.apache.hadoop.mapred.OutputCommitter.commitJob(OutputCommitter.java:291)
>   2. at
>   
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:182)
>   3. at
>   
> org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:99)
>   ... 67 more
>
>
>
> Your assistance on this would be much appreciated.
>
> Regards,
>
>
>
> *Caleb Mnisi*
>
> Consultant | Deloitte Analytics | Cognitive Advantage
>
> Deloitte & Touche
>
> 5th floor, 5 Magwa Crescent, Waterfall City, 2090
>
> M: +27 72 170 8779
>
> *cmn...@deloitte.co.za * | www2.deloitte.com/za
> 
>
>
>
>
>
> Please consider the environment before printing.
>
>
> *Disclaimer:* This email is subject to important restrictions,
> qualifications and disclaimers ("the Disclaimer") that must be accessed and
> read by visiting our website and viewing the webpage at the following
> address: http://www.deloitte.com/za/disclaimer. The Disclaimer forms part
> of the content of this email. If you cannot access the Disclaimer, please
> obtain a copy thereof from us by sending an email to
> zaitserviced...@deloitte.co.za. Deloitte refers to a Deloitte member
> firm, one of its related entities, or Deloitte Touche Tohmatsu Limited
> (“DTTL”). Each Deloitte member firm is a separate legal entity and a member
> of DTTL. DTTL does not provide services to clients. Please see
> www.deloitte.com/about to learn more.
>


Re: Question related to parallelism using structed streaming parallelism

2023-03-14 Thread Sean Owen
That's incorrect, it's spark.default.parallelism, but as the name suggests,
that is merely a default. You control partitioning directly with
.repartition()
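A rough PySpark sketch of that for the Kafka-map-Kafka dataflow described in the thread; the topic names, broker address and the partition count of 16 are made-up placeholders, and it assumes the spark-sql-kafka connector package is on the classpath:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-map-example").getOrCreate()

src = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "input-topic")
    .load()
)

transformed = (
    src.repartition(16)  # 16 is an arbitrary example; tune to the cluster
    .withColumn("value", F.upper(F.col("value").cast("string")))  # stand-in for the map step
)

query = (
    transformed.selectExpr("CAST(value AS STRING) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "output-topic")
    .option("checkpointLocation", "/tmp/checkpoints/kafka-map-example")
    .start()
)
query.awaitTermination()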

On Tue, Mar 14, 2023 at 11:37 AM Mich Talebzadeh 
wrote:

> Check this link
>
>
> https://sparkbyexamples.com/spark/difference-between-spark-sql-shuffle-partitions-and-spark-default-parallelism/
>
> You can set it
>
> spark.conf.set("sparkDefaultParallelism", value])
>
>
> Have a look at Streaming statistics in Spark GUI, especially *Processing
> Tim*e, defined by Spark GUI as Time taken to process all jobs of a batch.
>  *The **Scheduling Dela*y and *the **Total Dela*y are additional
> indicators of health.
>
>
> then decide how to set the value.
>
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 14 Mar 2023 at 16:04, Emmanouil Kritharakis <
> kritharakismano...@gmail.com> wrote:
>
>> Yes I need to check the performance of my streaming job in terms of
>> latency and throughput. Is there any working example of how to increase the
>> parallelism with spark structured streaming  using Dataset data structures?
>> Thanks in advance.
>>
>> Kind regards,
>>
>> --
>>
>> Emmanouil (Manos) Kritharakis
>>
>> Ph.D. candidate in the Department of Computer Science
>> 
>>
>> Boston University
>>
>>
>> On Tue, Mar 14, 2023 at 12:01 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> What benefits are you going for with increasing parallelism? Better throughput?
>>>
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 14 Mar 2023 at 15:58, Emmanouil Kritharakis <
>>> kritharakismano...@gmail.com> wrote:
>>>
 Hello,

 I hope this email finds you well!

 I have a simple dataflow in which I read from a kafka topic, perform a
 map transformation and then I write the result to another topic. Based on
 your documentation here
 ,
 I need to work with Dataset data structures. Even though my solution works,
 I need to increase the parallelism. The spark documentation includes a lot
 of parameters that I can change based on specific data structures like
 *spark.default.parallelism* or *spark.sql.shuffle.partitions*. The
 former is the default number of partitions in RDDs returned by
 transformations like join, reduceByKey while the later is not recommended
 for structured streaming as it is described in documentation: "Note: For
 structured streaming, this configuration cannot be changed between query
 restarts from the same checkpoint location".

 So my question is how can I increase the parallelism for a simple
 dataflow based on datasets with a map transformation only?

 I am looking forward to hearing from you as soon as possible. Thanks in
 advance!

 Kind regards,

 --

 Emmanouil (Manos) Kritharakis

 Ph.D. candidate in the Department of Computer Science
 

 Boston University

>>>


Re: Question related to parallelism using structed streaming parallelism

2023-03-14 Thread Sean Owen
Are you just looking for DataFrame.repartition()?

On Tue, Mar 14, 2023 at 10:57 AM Emmanouil Kritharakis <
kritharakismano...@gmail.com> wrote:

> Hello,
>
> I hope this email finds you well!
>
> I have a simple dataflow in which I read from a kafka topic, perform a map
> transformation and then I write the result to another topic. Based on your
> documentation here
> ,
> I need to work with Dataset data structures. Even though my solution works,
> I need to increase the parallelism. The spark documentation includes a lot
> of parameters that I can change based on specific data structures like
> *spark.default.parallelism* or *spark.sql.shuffle.partitions*. The former
> is the default number of partitions in RDDs returned by transformations
> like join, reduceByKey while the later is not recommended for structured
> streaming as it is described in documentation: "Note: For structured
> streaming, this configuration cannot be changed between query restarts from
> the same checkpoint location".
>
> So my question is how can I increase the parallelism for a simple dataflow
> based on datasets with a map transformation only?
>
> I am looking forward to hearing from you as soon as possible. Thanks in
> advance!
>
> Kind regards,
>
> --
>
> Emmanouil (Manos) Kritharakis
>
> Ph.D. candidate in the Department of Computer Science
> 
>
> Boston University
>


Re: Spark 3.3.2 not running with Antlr4 runtime latest version

2023-03-14 Thread Sean Owen
You want Antlr 3 and Spark is on 4? no I don't think Spark would downgrade.
You can shade your app's dependencies maybe.

On Tue, Mar 14, 2023 at 8:21 AM Sahu, Karuna
 wrote:

> Hi Team
>
>
>
> We are upgrading a legacy application using Spring boot , Spark and
> Hibernate. While upgrading Hibernate to 6.1.6.Final version there is a
> mismatch for antlr4 runtime jar with Hibernate and latest Spark version.
> Details for the issue are posted on StackOverflow as well:
>
> Issue in running Spark 3.3.2 with Antlr 4.10.1 - Stack Overflow
> 
>
>
>
> Please let us know if upgrades for this is being planned for latest Spark
> version.
>
>
>
> Thanks
>
> Karuna
>
> --
>
> This message is for the designated recipient only and may contain
> privileged, proprietary, or otherwise confidential information. If you have
> received it in error, please notify the sender immediately and delete the
> original. Any other use of the e-mail by you is prohibited. Where allowed
> by local law, electronic communications with Accenture and its affiliates,
> including e-mail and instant messaging (including content), may be scanned
> by our systems for the purposes of information security and assessment of
> internal compliance with Accenture policy. Your privacy is important to us.
> Accenture uses your personal data only in compliance with data protection
> laws. For further information on how Accenture processes your personal
> data, please see our privacy statement at
> https://www.accenture.com/us-en/privacy-policy.
>
> __
>
> www.accenture.com
>


Re: [VOTE] Release Apache Spark 3.4.0 (RC3)

2023-03-09 Thread Sean Owen
If the issue were just tags, then you can simply delete the tag and re-tag
the right commit. That doesn't change a commit log.
But is the issue that the relevant commits aren't in branch-3.4? Like I
don't see the usual release commits in
https://github.com/apache/spark/commits/branch-3.4
Yeah OK that needs a re-do.

We can still test this release.
It works for me, except that I still get the weird infinite-compile-loop
issue that doesn't seem to be related to Spark. The Spark Connect parts
seem to work.

On Thu, Mar 9, 2023 at 3:25 PM Dongjoon Hyun 
wrote:

> No, we cannot with the commit log in its current state, because it's already
> screwed up, as Emil wrote.
> Did you check the branch-3.2 commit log, Sean?
>
> Dongjoon.
>
>
> On Thu, Mar 9, 2023 at 11:42 AM Sean Owen  wrote:
>
>> We can just push the tags onto the branches as needed right? No need to
>> roll a new release
>>
>> On Thu, Mar 9, 2023, 1:36 PM Dongjoon Hyun 
>> wrote:
>>
>>> Yes, I also confirmed that the v3.4.0-rc3 tag is invalid.
>>>
>>> I guess we need RC4.
>>>
>>> Dongjoon.
>>>
>>> On Thu, Mar 9, 2023 at 7:13 AM Emil Ejbyfeldt
>>>  wrote:
>>>
>>>> It might being caused by the v3.4.0-rc3 tag not being part of the 3.4
>>>> branch branch-3.4:
>>>>
>>>> $ git log --pretty='format:%d %h' --graph origin/branch-3.4  v3.4.0-rc3
>>>> | head -n 10
>>>> *  (HEAD, origin/branch-3.4) e38e619946
>>>> *  f3e69a1fe2
>>>> *  74cf1a32b0
>>>> *  0191a5bde0
>>>> *  afced91348
>>>> | *  (tag: v3.4.0-rc3) b9be9ce15a
>>>> |/
>>>> *  006e838ede
>>>> *  fc29b07a31
>>>> *  8655dfe66d
>>>>
>>>>
>>>> Best,
>>>> Emil
>>>>
>>>> On 09/03/2023 15:50, yangjie01 wrote:
>>>> > Hi, all
>>>> >
>>>> > I can't git check out the tag of v3.4.0-rc3. At the same time, there
>>>> is
>>>> > the following information on the Github page.
>>>> >
>>>> > Does anyone else have the same problem?
>>>> >
>>>> > Yang Jie
>>>> >
>>>> > *发件人**: *Xinrong Meng 
>>>> > *日期**: *2023年3月9日星期四20:05
>>>> > *收件人**: *dev 
>>>> > *主题**: *[VOTE] Release Apache Spark 3.4.0 (RC3)
>>>> >
>>>> > Please vote on releasing the following candidate(RC3) as Apache Spark
>>>> > version 3.4.0.
>>>> >
>>>> > The vote is open until 11:59pm Pacific time *March 14th* and passes
>>>> if a
>>>> > majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>> >
>>>> > [ ] +1 Release this package as Apache Spark 3.4.0
>>>> > [ ] -1 Do not release this package because ...
>>>> >
>>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>>> > <
>>>> https://mailshield.baidu.com/check?q=eJcUboQ1HRRomPZKEwRzpl69wA8DbI%2fNIiRNsQ%3d%3d
>>>> >
>>>> >
>>>> > The tag to be voted on is *v3.4.0-rc3* (commit
>>>> > b9be9ce15a82b18cca080ee365d308c0820a29a9):
>>>> > https://github.com/apache/spark/tree/v3.4.0-rc3
>>>> > <
>>>> https://mailshield.baidu.com/check?q=ScnsHLDD3dexVfW9cjs3GovMbG2LLAZqBLq9cA8V%2fTOpCQ1LdeNWoD0%2fy7eVo%2b3de8Rk%2bQ%3d%3d
>>>> >
>>>> >
>>>> > The release files, including signatures, digests, etc. can be found
>>>> at:
>>>> > https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc3-bin/
>>>> > <
>>>> https://mailshield.baidu.com/check?q=U%2fLs35p0l%2bUUTclb%2blAPSYb%2bALxMfer1Jc%2b3i965Bjh2CxHpG45RFLW0NqSwMx00Ci3MRMz%2b7mTmcKUIa27Pww%3d%3d
>>>> >
>>>> >
>>>> > Signatures used for Spark RCs can be found in this file:
>>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>> > <
>>>> https://mailshield.baidu.com/check?q=E6fHbSXEWw02TTJBpc3bfA9mi7ea0YiWcNHkm%2fDJxwlaWinGnMdaoO1PahHhgj00vKwcbElpuHA%3d
>>>> >
>>>> >
>>>> > The staging repository for this release can be found at:
>>>> >
>>>> https://repository.apache.org/content/repositories/orgapachespark-1437
>>>> > <
>>>> https://mailshield.baidu.com/check?q=otrdG4krOioiB1q4MH%2fIEA444B80s7LLO8D2IdosERiNzIymKGZ2D1jV4O0JA9%2fRVfJje3xu6%2b33PB24x0R5V8ArX6BnzcYSkG5cHg%3

Re: How to share a dataset file across nodes

2023-03-09 Thread Sean Owen
Put the file on HDFS, if you have a Hadoop cluster?
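For illustration only (this sketch is not from the original reply; the path and
options are made up): once the csv has been copied into HDFS, every node can
read it through an hdfs:// path.

# assumes an existing SparkSession named `spark`; the HDFS path is illustrative
df = spark.read.option("header", True).csv("hdfs:///user/me/data.csv")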

On Thu, Mar 9, 2023 at 3:02 PM sam smith  wrote:

> Hello,
>
> I use YARN client mode to submit my driver program to Hadoop. The dataset
> I load is from the local file system, and when I invoke load("file://path")
> Spark complains that the csv file cannot be found, which I totally
> understand, since the dataset is not on any of the workers or the
> applicationMaster but only where the driver program resides.
> I tried to share the file using the configurations:
>
>> *spark.yarn.dist.files* OR *spark.files *
>
> but neither is working.
> My question is how to share the csv dataset across the nodes at the
> specified path?
>
> Thanks.
>


Re: [VOTE] Release Apache Spark 3.4.0 (RC3)

2023-03-09 Thread Sean Owen
We can just push the tags onto the branches as needed right? No need to
roll a new release

On Thu, Mar 9, 2023, 1:36 PM Dongjoon Hyun  wrote:

> Yes, I also confirmed that the v3.4.0-rc3 tag is invalid.
>
> I guess we need RC4.
>
> Dongjoon.
>
> On Thu, Mar 9, 2023 at 7:13 AM Emil Ejbyfeldt
>  wrote:
>
>> It might be caused by the v3.4.0-rc3 tag not being part of the 3.4
>> branch, branch-3.4:
>>
>> $ git log --pretty='format:%d %h' --graph origin/branch-3.4  v3.4.0-rc3
>> | head -n 10
>> *  (HEAD, origin/branch-3.4) e38e619946
>> *  f3e69a1fe2
>> *  74cf1a32b0
>> *  0191a5bde0
>> *  afced91348
>> | *  (tag: v3.4.0-rc3) b9be9ce15a
>> |/
>> *  006e838ede
>> *  fc29b07a31
>> *  8655dfe66d
>>
>>
>> Best,
>> Emil
>>
>> On 09/03/2023 15:50, yangjie01 wrote:
>> > Hi, all
>> >
>> > I can't git check out the tag of v3.4.0-rc3. At the same time, there is
>> > the following information on the Github page.
>> >
>> > Does anyone else have the same problem?
>> >
>> > Yang Jie
>> >
>> > *发件人**: *Xinrong Meng 
>> > *日期**: *2023年3月9日星期四20:05
>> > *收件人**: *dev 
>> > *主题**: *[VOTE] Release Apache Spark 3.4.0 (RC3)
>> >
>> > Please vote on releasing the following candidate(RC3) as Apache Spark
>> > version 3.4.0.
>> >
>> > The vote is open until 11:59pm Pacific time *March 14th* and passes if
>> a
>> > majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 3.4.0
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> > <
>> https://mailshield.baidu.com/check?q=eJcUboQ1HRRomPZKEwRzpl69wA8DbI%2fNIiRNsQ%3d%3d
>> >
>> >
>> > The tag to be voted on is *v3.4.0-rc3* (commit
>> > b9be9ce15a82b18cca080ee365d308c0820a29a9):
>> > https://github.com/apache/spark/tree/v3.4.0-rc3
>> > <
>> https://mailshield.baidu.com/check?q=ScnsHLDD3dexVfW9cjs3GovMbG2LLAZqBLq9cA8V%2fTOpCQ1LdeNWoD0%2fy7eVo%2b3de8Rk%2bQ%3d%3d
>> >
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc3-bin/
>> > <
>> https://mailshield.baidu.com/check?q=U%2fLs35p0l%2bUUTclb%2blAPSYb%2bALxMfer1Jc%2b3i965Bjh2CxHpG45RFLW0NqSwMx00Ci3MRMz%2b7mTmcKUIa27Pww%3d%3d
>> >
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> > <
>> https://mailshield.baidu.com/check?q=E6fHbSXEWw02TTJBpc3bfA9mi7ea0YiWcNHkm%2fDJxwlaWinGnMdaoO1PahHhgj00vKwcbElpuHA%3d
>> >
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1437
>> > <
>> https://mailshield.baidu.com/check?q=otrdG4krOioiB1q4MH%2fIEA444B80s7LLO8D2IdosERiNzIymKGZ2D1jV4O0JA9%2fRVfJje3xu6%2b33PB24x0R5V8ArX6BnzcYSkG5cHg%3d%3d
>> >
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc3-docs/
>> > <
>> https://mailshield.baidu.com/check?q=wA4vz1x6jiz0lcn1hQ0AhAiPk3gdFJbs7dSHwusppbgB4ph846QORuIJQzNRr8GzerMucW3FL7ADPE3radzpmm3er3g%3d
>> >
>> >
>> > The list of bug fixes going into 3.4.0 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12351465
>> > <
>> https://mailshield.baidu.com/check?q=hdSxPMAr37WGNHJRNA4Mh1JkSlqjUL%2bM8BgEclwc23ePHCBzkAjvhgnZa0N7SPRWAcgfoLXjX43CxJXmKnDj0LIElJs%3d
>> >
>> >
>> > This release is using the release script of the tag v3.4.0-rc3.
>> >
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks; in Java/Scala
>> > you can add the staging repository to your project's resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out-of-date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 3.4.0?
>> > ===
>> > The current list of open tickets targeted at 3.4.0 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK
>> > <
>> https://mailshield.baidu.com/check?q=4UUpJqq41y71Gnuj0qTUYo6hTjqzT7oytN6x%2fvgC5XUtQUC8MfJ77tj7K70O%2f1QMmNoa1A%3d%3d>
>>  and
>> search for "Target Version/s" = 3.4.0
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> > In order to make 

Re: 回复:Re: Build SPARK from source with SBT failed

2023-03-07 Thread Sean Owen
No, it's that JAVA_HOME wasn't set to .../Home. The error shows it is simply
not finding javac. Zulu supports M1.

On Tue, Mar 7, 2023 at 9:05 AM Artemis User  wrote:

> Looks like the Maven build did find javac but just can't run it, so it's not
> a path problem but a compatibility problem.  Are you doing this on a Mac
> with M1/M2?  I don't think that Zulu JDK supports Apple silicon.  Your
> best option would be to use Homebrew to install the dev tools (including
> OpenJDK) on the Mac.  On Ubuntu, it still seems to be a compatibility problem.
> Try using apt to install your dev tools rather than doing it manually; if you
> install the JDK manually, it doesn't install hardware-optimized JVM libraries.
>
> On 3/7/23 8:21 AM, ckgppl_...@sina.cn wrote:
>
> No. I haven't installed Apple Developer Tools. I have installed Zulu
> OpenJDK 11.0.17 manually.
> So I need to install Apple Developer Tools?
> - 原始邮件 -
> 发件人:Sean Owen  
> 收件人:ckgppl_...@sina.cn
> 抄送人:user  
> 主题:Re: Build SPARK from source with SBT failed
> 日期:2023年03月07日 20点58分
>
> This says you don't have the java compiler installed. Did you install the
> Apple Developer Tools package?
>
> On Tue, Mar 7, 2023 at 1:42 AM  wrote:
>
> Hello,
>
> I have tried to build the Spark source code with SBT in my local dev
> environment (macOS 13.2.1), but it reported the following error:
> [error] java.io.IOException: Cannot run program
> "/Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/bin/javac" (in
> directory "/Users/username/spark-remotemaster"): error=2, No such file or
> directory
>
> [error] at
> java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
>
> [error] at
> java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
>
> [error] at
> scala.sys.process.ProcessBuilderImpl$Simple.run(ProcessBuilderImpl.scala:75)
> [error] at
> scala.sys.process.ProcessBuilderImpl$AbstractBuilder.run(ProcessBuilderImpl.scala:106)
>
> I need to export JAVA_HOME to make it run successfully, but if I use Maven
> then I don't need to export JAVA_HOME. I have also tried to build Spark
> with SBT in an Ubuntu x86_64 environment, and it reported a similar error.
>
> The official Spark
> documentation doesn't mention exporting JAVA_HOME. So I think
> this is a bug which needs a documentation or script change. Please correct
> me if I am wrong.
>
> Thanks
>
> Liang
>
>
>


Re: Pandas UDFs vs Inbuilt pyspark functions

2023-03-07 Thread Sean Owen
It's hard to evaluate without knowing what you're doing. Generally, using a
built-in function will be fastest. pandas UDFs can be faster than normal
UDFs if you can take advantage of processing multiple rows at once.
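As a rough illustration of that trade-off (this example is not from the thread;
the DataFrame `df`, the column `value`, and the doubling transform are made up):

import pandas as pd
from pyspark.sql.functions import col, pandas_udf

# Built-in expression: evaluated inside the JVM by Catalyst, usually the fastest option
df_builtin = df.withColumn("doubled", col("value") * 2)

# pandas UDF: rows are shipped to Python in vectorized Arrow batches, so it is
# typically much faster than a plain row-at-a-time Python UDF, but still slower
# than the built-in expression above
@pandas_udf("double")
def double_it(v: pd.Series) -> pd.Series:
    return v * 2

df_pandas = df.withColumn("doubled", double_it(col("value")))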

On Tue, Mar 7, 2023 at 6:47 AM neha garde  wrote:

> Hello All,
>
> I need help deciding on what is better: pandas UDFs or in-built functions.
> I have to perform a transformation where I managed to compare the two for
> a few thousand records,
> and pandas_udf in fact performed better.
> Given the complexity of the transformation, I also found pandas_udf makes
> it more readable.
> I also found a lot of comparisons made between normal UDFs and pandas_udfs.
>
> What I want to know is whether pandas_udfs will behave like
> normal PySpark in-built functions.
> How do pandas_udfs work internally, and will they be equally performant on
> bigger sets of data?
> I did go through a few documents but wasn't able to get a clear idea.
> I am mainly looking at this from a performance perspective.
>
> Thanks in advance
>
>
> Regards,
> Neha R.Garde.
>


Re: Build SPARK from source with SBT failed

2023-03-07 Thread Sean Owen
This says you don't have the java compiler installed. Did you install the
Apple Developer Tools package?

On Tue, Mar 7, 2023 at 1:42 AM  wrote:

> Hello,
>
> I have tried to build the Spark source code with SBT in my local dev
> environment (macOS 13.2.1), but it reported the following error:
> [error] java.io.IOException: Cannot run program
> "/Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/bin/javac" (in
> directory "/Users/username/spark-remotemaster"): error=2, No such file or
> directory
>
> [error] at
> java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
>
> [error] at
> java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
>
> [error] at
> scala.sys.process.ProcessBuilderImpl$Simple.run(ProcessBuilderImpl.scala:75)
> [error] at
> scala.sys.process.ProcessBuilderImpl$AbstractBuilder.run(ProcessBuilderImpl.scala:106)
>
> I need to export JAVA_HOME to make it run successfully, but if I use Maven
> then I don't need to export JAVA_HOME. I have also tried to build Spark
> with SBT in an Ubuntu x86_64 environment, and it reported a similar error.
>
> The official Spark
> documentation doesn't mention exporting JAVA_HOME. So I think
> this is a bug which needs a documentation or script change. Please correct
> me if I am wrong.
>
> Thanks
>
> Liang
>
>


Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Sean Owen
I don't quite get it - aren't you applying these to the same stream and batches?
Worst case, why not apply these as one function?
Otherwise, how do you mean to associate one call with another?
Globals don't help here. They aren't global beyond the driver, and which
one would correspond to which batch?
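For the "one function" option, a minimal sketch (purely illustrative, reusing the
function names from the message below, and only applicable if both callbacks
really are attached to the same query):

def sendToBoth(df, batchId):
    # both handlers now run against the same micro-batch and see the same batchId
    sendToSink(df, batchId)
    sendToControl(df, batchId)

# streamingDF.writeStream.foreachBatch(sendToBoth).start()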

On Sat, Mar 4, 2023 at 3:02 PM Mich Talebzadeh 
wrote:

> Thanks. they are different batchIds
>
> From sendToControl, newtopic batchId is 76
> From sendToSink, md, batchId is 563
>
> As a matter of interest, why does a global variable not work?
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 4 Mar 2023 at 20:13, Sean Owen  wrote:
>
>> It's the same batch ID already, no?
>> Or why not simply put the logic of both in one function? or write one
>> function that calls both?
>>
>> On Sat, Mar 4, 2023 at 2:07 PM Mich Talebzadeh 
>> wrote:
>>
>>>
>>> This is probably pretty straightforward, but somehow it does not look
>>> that way.
>>>
>>>
>>>
>>> On Spark Structured Streaming,  "foreachBatch" performs custom write
>>> logic on each micro-batch through a call function. Example,
>>>
>>> foreachBatch(sendToSink) expects 2 parameters, first: micro-batch as
>>> DataFrame or Dataset and second: unique id for each batch
>>>
>>>
>>>
>>> In my case I simultaneously read two topics through two separate
>>> functions
>>>
>>>
>>>
>>>1. foreachBatch(sendToSink). \
>>>2. foreachBatch(sendToControl). \
>>>
>>> This is  the code
>>>
>>> def sendToSink(df, batchId):
>>>     if(len(df.take(1))) > 0:
>>>         print(f"""From sendToSink, md, batchId is {batchId}, at {datetime.now()} """)
>>>         #df.show(100,False)
>>>         df.persist()
>>>         # write to BigQuery batch table
>>>         #s.writeTableToBQ(df, "append", config['MDVariables']['targetDataset'], config['MDVariables']['targetTable'])
>>>         df.unpersist()
>>>         #print(f"""wrote to DB""")
>>>     else:
>>>         print("DataFrame md is empty")
>>>
>>> def sendToControl(dfnewtopic, batchId2):
>>>     if(len(dfnewtopic.take(1))) > 0:
>>>         print(f"""From sendToControl, newtopic batchId is {batchId2}""")
>>>         dfnewtopic.show(100,False)
>>>         queue = dfnewtopic.first()[2]
>>>         status = dfnewtopic.first()[3]
>>>         print(f"""testing queue is {queue}, and status is {status}""")
>>>         if((queue == config['MDVariables']['topic']) & (status == 'false')):
>>>             spark_session = s.spark_session(config['common']['appName'])
>>>             active = spark_session.streams.active
>>>             for e in active:
>>>                 name = e.name
>>>                 if(name == config['MDVariables']['topic']):
>>>                     print(f"""\n==> Request terminating streaming process for topic {name} at {datetime.now()}\n """)
>>>                     e.stop()
>>>     else:
>>>         print("DataFrame newtopic is empty")
>>>
>>>
>>> The problem I have is to share batchID from the first function in the
>>> second function sendToControl(dfnewtopic, batchId2) so I can print it
>>> out.
>>>
>>>
>>> Defining a global did not work.. So it sounds like I am missing
>>> something rudimentary here!
>>>
>>>
>>> Thanks
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>


Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Sean Owen
It's the same batch ID already, no?
Or why not simply put the logic of both in one function? or write one
function that calls both?

On Sat, Mar 4, 2023 at 2:07 PM Mich Talebzadeh 
wrote:

>
> This is probably pretty straightforward, but somehow it does not look
> that way.
>
>
>
> On Spark Structured Streaming,  "foreachBatch" performs custom write logic
> on each micro-batch through a call function. Example,
>
> foreachBatch(sendToSink) expects 2 parameters, first: micro-batch as
> DataFrame or Dataset and second: unique id for each batch
>
>
>
> In my case I simultaneously read two topics through two separate functions
>
>
>
>1. foreachBatch(sendToSink). \
>2. foreachBatch(sendToControl). \
>
> This is  the code
>
> def sendToSink(df, batchId):
>     if(len(df.take(1))) > 0:
>         print(f"""From sendToSink, md, batchId is {batchId}, at {datetime.now()} """)
>         #df.show(100,False)
>         df.persist()
>         # write to BigQuery batch table
>         #s.writeTableToBQ(df, "append", config['MDVariables']['targetDataset'], config['MDVariables']['targetTable'])
>         df.unpersist()
>         #print(f"""wrote to DB""")
>     else:
>         print("DataFrame md is empty")
>
> def sendToControl(dfnewtopic, batchId2):
>     if(len(dfnewtopic.take(1))) > 0:
>         print(f"""From sendToControl, newtopic batchId is {batchId2}""")
>         dfnewtopic.show(100,False)
>         queue = dfnewtopic.first()[2]
>         status = dfnewtopic.first()[3]
>         print(f"""testing queue is {queue}, and status is {status}""")
>         if((queue == config['MDVariables']['topic']) & (status == 'false')):
>             spark_session = s.spark_session(config['common']['appName'])
>             active = spark_session.streams.active
>             for e in active:
>                 name = e.name
>                 if(name == config['MDVariables']['topic']):
>                     print(f"""\n==> Request terminating streaming process for topic {name} at {datetime.now()}\n """)
>                     e.stop()
>     else:
>         print("DataFrame newtopic is empty")
>
>
> The problem I have is to share batchID from the first function in the
> second function sendToControl(dfnewtopic, batchId2) so I can print it
> out.
>
>
> Defining a global did not work.. So it sounds like I am missing something
> rudimentary here!
>
>
> Thanks
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: [VOTE] Release Apache Spark 3.4.0 (RC2)

2023-03-03 Thread Sean Owen
Oh OK, I thought this RC was meant to fix that.

On Fri, Mar 3, 2023 at 12:35 AM Jonathan Kelly 
wrote:

> I see that one too but have not investigated it myself. In the RC1 thread,
> it was mentioned that this occurs when running the tests via Maven but not
> via SBT. Does the test class path get set up differently when running via
> SBT vs. Maven?
>
> On Thu, Mar 2, 2023 at 5:37 PM Sean Owen  wrote:
>
>> Thanks, that's good to know. The workaround (deleting the thriftserver
>> target dir) works for me. Who knows?
>>
>> But I'm also still seeing:
>>
>> - simple udf *** FAILED ***
>>   io.grpc.StatusRuntimeException: INTERNAL:
>> org.apache.spark.sql.ClientE2ETestSuite
>>   at io.grpc.Status.asRuntimeException(Status.java:535)
>>   at
>> io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660)
>>   at org.apache.spark.sql.connect.client.SparkResult.org
>> $apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:61)
>>   at
>> org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:106)
>>   at
>> org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:123)
>>   at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2426)
>>   at org.apache.spark.sql.Dataset.withResult(Dataset.scala:2747)
>>   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2425)
>>   at
>> org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$8(ClientE2ETestSuite.scala:85)
>>   at
>> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>>
>> On Thu, Mar 2, 2023 at 4:38 PM Jonathan Kelly 
>> wrote:
>>
>>> Yes, this issue has driven me quite crazy as well! I hit this issue for
>>> a long time when compiling the master branch and running tests. Strangely,
>>> it would only occur, as you say, when running the tests and not during an
>>> initial build that skips running the tests. (However, I have seen instances
>>> where it does occur even in the initial build with tests skipped, but only
>>> on AWS CodeBuild, not when building locally or on Amazon Linux.)
>>>
>>> I thought for a long time that I was alone in this bizarre issue, but I
>>> eventually found sbt#6183 <https://github.com/sbt/sbt/issues/6183> and
>>> SPARK-41063 <https://issues.apache.org/jira/browse/SPARK-41063>, but
>>> both are unfortunately still open.
>>>
>>> I found at one point that the issue magically disappeared once
>>> [SPARK-41408] <https://issues.apache.org/jira/browse/SPARK-41408>[BUILD]
>>> Upgrade scala-maven-plugin to 4.8.0
>>> <https://github.com/apache/spark/commit/a3a755d36136295473a4873a6df33c295c29213e>
>>>  was
>>> merged, but then it cropped back up again at some point after that, and I
>>> used git bisect to find that the issue appeared again when [SPARK-27561]
>>> <https://issues.apache.org/jira/browse/SPARK-27561>[SQL] Support
>>> implicit lateral column alias resolution on Project
>>> <https://github.com/apache/spark/commit/7e9b88bfceb86d3b32e82a86b672aab3c74def8c>
>>>  was
>>> merged. This commit didn't even directly affect anything in
>>> hive-thriftserver, but it does make some pretty big changes to pretty core
>>> classes in sql/catalyst, so it's not too surprising that this could trigger
>>> an issue that seems to have to do with "very complicated inheritance
>>> hierarchies involving both Java and Scala", which is a phrase mentioned on
>>> sbt#6183 <https://github.com/sbt/sbt/issues/6183>.
>>>
>>> One thing that I did find to help was to
>>> delete sql/hive-thriftserver/target between building Spark and running the
>>> tests. This helps in my builds where the issue only occurs during the
>>> testing phase and not during the initial build phase, but of course it
>>> doesn't help in my builds where the issue occurs during that first build
>>> phase.
>>>
>>> ~ Jonathan Kelly
>>>
>>> On Thu, Mar 2, 2023 at 1:47 PM Sean Owen  wrote:
>>>
>>>> Has anyone seen this behavior -- I've never seen it before. The Hive
>>>> thriftserver module for me just goes into an infinite loop when running
>>>> tests:
>>>>
>>>> ...
>>>> [INFO] done compiling
>>>> [INFO] compiling 22 Scala sources and 24 Java sources to
>>>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/target/scala-2.12/classes
>>>> ...
>>>> [INFO] done compiling
>&g

Re: [VOTE] Release Apache Spark 3.4.0 (RC2)

2023-03-02 Thread Sean Owen
Thanks, that's good to know. The workaround (deleting the thriftserver
target dir) works for me. Who knows?

But I'm also still seeing:

- simple udf *** FAILED ***
  io.grpc.StatusRuntimeException: INTERNAL:
org.apache.spark.sql.ClientE2ETestSuite
  at io.grpc.Status.asRuntimeException(Status.java:535)
  at
io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660)
  at org.apache.spark.sql.connect.client.SparkResult.org
$apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:61)
  at
org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:106)
  at
org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:123)
  at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2426)
  at org.apache.spark.sql.Dataset.withResult(Dataset.scala:2747)
  at org.apache.spark.sql.Dataset.collect(Dataset.scala:2425)
  at
org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$8(ClientE2ETestSuite.scala:85)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)

On Thu, Mar 2, 2023 at 4:38 PM Jonathan Kelly 
wrote:

> Yes, this issue has driven me quite crazy as well! I hit this issue for a
> long time when compiling the master branch and running tests. Strangely, it
> would only occur, as you say, when running the tests and not during an
> initial build that skips running the tests. (However, I have seen instances
> where it does occur even in the initial build with tests skipped, but only
> on AWS CodeBuild, not when building locally or on Amazon Linux.)
>
> I thought for a long time that I was alone in this bizarre issue, but I
> eventually found sbt#6183 <https://github.com/sbt/sbt/issues/6183> and
> SPARK-41063 <https://issues.apache.org/jira/browse/SPARK-41063>, but both
> are unfortunately still open.
>
> I found at one point that the issue magically disappeared once
> [SPARK-41408] <https://issues.apache.org/jira/browse/SPARK-41408>[BUILD]
> Upgrade scala-maven-plugin to 4.8.0
> <https://github.com/apache/spark/commit/a3a755d36136295473a4873a6df33c295c29213e>
>  was
> merged, but then it cropped back up again at some point after that, and I
> used git bisect to find that the issue appeared again when [SPARK-27561]
> <https://issues.apache.org/jira/browse/SPARK-27561>[SQL] Support implicit
> lateral column alias resolution on Project
> <https://github.com/apache/spark/commit/7e9b88bfceb86d3b32e82a86b672aab3c74def8c>
>  was
> merged. This commit didn't even directly affect anything in
> hive-thriftserver, but it does make some pretty big changes to pretty core
> classes in sql/catalyst, so it's not too surprising that this could trigger
> an issue that seems to have to do with "very complicated inheritance
> hierarchies involving both Java and Scala", which is a phrase mentioned on
> sbt#6183 <https://github.com/sbt/sbt/issues/6183>.
>
> One thing that I did find to help was to
> delete sql/hive-thriftserver/target between building Spark and running the
> tests. This helps in my builds where the issue only occurs during the
> testing phase and not during the initial build phase, but of course it
> doesn't help in my builds where the issue occurs during that first build
> phase.
>
> ~ Jonathan Kelly
>
> On Thu, Mar 2, 2023 at 1:47 PM Sean Owen  wrote:
>
>> Has anyone seen this behavior -- I've never seen it before. The Hive
>> thriftserver module for me just goes into an infinite loop when running
>> tests:
>>
>> ...
>> [INFO] done compiling
>> [INFO] compiling 22 Scala sources and 24 Java sources to
>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/target/scala-2.12/classes
>> ...
>> [INFO] done compiling
>> [INFO] compiling 22 Scala sources and 9 Java sources to
>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/target/scala-2.12/classes
>> ...
>> [WARNING] [Warn]
>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:25:29:
>>  [deprecation] GnuParser in org.apache.commons.cli has been deprecated
>> [WARNING] [Warn]
>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HiveAuthFactory.java:333:18:
>>  [deprecation] authorize(UserGroupInformation,String,Configuration) in
>> ProxyUsers has been deprecated
>> [WARNING] [Warn]
>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java:110:16:
>>  [deprecation] HIVE_SERVER2_THRIFT_HTTP_COOKIE_IS_SECURE in ConfVars has
>> been deprecated
>> [WARNING] [Warn]
>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/Thri

Re: [VOTE] Release Apache Spark 3.4.0 (RC2)

2023-03-02 Thread Sean Owen
Has anyone seen this behavior -- I've never seen it before. The Hive
thriftserver module for me just goes into an infinite loop when running
tests:

...
[INFO] done compiling
[INFO] compiling 22 Scala sources and 24 Java sources to
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/target/scala-2.12/classes
...
[INFO] done compiling
[INFO] compiling 22 Scala sources and 9 Java sources to
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/target/scala-2.12/classes
...
[WARNING] [Warn]
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:25:29:
 [deprecation] GnuParser in org.apache.commons.cli has been deprecated
[WARNING] [Warn]
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HiveAuthFactory.java:333:18:
 [deprecation] authorize(UserGroupInformation,String,Configuration) in
ProxyUsers has been deprecated
[WARNING] [Warn]
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java:110:16:
 [deprecation] HIVE_SERVER2_THRIFT_HTTP_COOKIE_IS_SECURE in ConfVars has
been deprecated
[WARNING] [Warn]
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java:553:53:
 [deprecation] HttpUtils in javax.servlet.http has been deprecated
[WARNING] [Warn]
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:185:24:
 [deprecation] OptionBuilder in org.apache.commons.cli has been deprecated
[WARNING] [Warn]
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:187:10:
 [static] static method should be qualified by type name, OptionBuilder,
instead of by an expression
[WARNING] [Warn]
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:197:26:
 [deprecation] GnuParser in org.apache.commons.cli has been deprecated
...

... repeated over and over.

On Thu, Mar 2, 2023 at 6:04 AM Xinrong Meng 
wrote:

> Please vote on releasing the following candidate(RC2) as Apache Spark
> version 3.4.0.
>
> The vote is open until 11:59pm Pacific time *March 7th* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v3.4.0-rc2* (commit
> 759511bb59b206ac5ff18f377c239a2f38bf5db6):
> https://github.com/apache/spark/tree/v3.4.0-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1436
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc2-docs/
>
> The list of bug fixes going into 3.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>
> This release is using the release script of the tag v3.4.0-rc2.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.0?
> ===
> The current list of open tickets targeted at 3.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Xinrong 

Re: [Question] LimitedInputStream license issue in Spark source.

2023-03-01 Thread Sean Owen
Right, it contains ALv2 licensed code attributed to two authors - some is
from Guava, some is from Apache Spark contributors.
I thought this is how we should handle this. It's not feasible to go line
by line and say what came from where.

On Wed, Mar 1, 2023 at 1:33 AM Dongjoon Hyun 
wrote:

> May I ask why you think that way? Could you elaborate a little more
> about your concerns if you mean it from a legal perspective?
>
> > The ASF header states "Licensed to the Apache Software Foundation (ASF)
> under one or more contributor license agreements.”
> > I ‘m not sure this is true with this file even though both Spark and
> this file are under the ALv2 license.
>
> On Tue, Feb 28, 2023 at 11:26 PM Justin Mclean 
> wrote:
>
>> Hi,
>>
>> The issue is not the original header it is the addition of the ASF
>> header. The ASF header states "Licensed to the Apache Software Foundation
>> (ASF) under one or more contributor license agreements.” I ‘m not sure this
>> is true with this file even though both Spark and this file are under the
>> ALv2 license.
>>
>> Kind Regards,
>> Justin
>
>


Re: [Question] LimitedInputStream license issue in Spark source.

2023-03-01 Thread Sean Owen
Right, it contains ALv2 licensed code attributed to two authors - some is
from Guava, some is from Apache Spark contributors.
I thought this is how we should handle this. It's not feasible to go line
by line and say what came from where.

On Wed, Mar 1, 2023 at 1:33 AM Dongjoon Hyun 
wrote:

> May I ask why you think that way? Could you elaborate a little more
> about your concerns if you mean it from a legal perspective?
>
> > The ASF header states "Licensed to the Apache Software Foundation (ASF)
> under one or more contributor license agreements.”
> > I ‘m not sure this is true with this file even though both Spark and
> this file are under the ALv2 license.
>
> On Tue, Feb 28, 2023 at 11:26 PM Justin Mclean 
> wrote:
>
>> Hi,
>>
>> The issue is not the original header it is the addition of the ASF
>> header. The ASF header states "Licensed to the Apache Software Foundation
>> (ASF) under one or more contributor license agreements.” I ‘m not sure this
>> is true with this file even though both Spark and this file are under the
>> ALv2 license.
>>
>> Kind Regards,
>> Justin
>
>


Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That's pretty impressive. I'm not sure it's quite right - not clear that
the intent is taking a minimum of absolute values (is it? that'd be wild).
But I think it might have pointed in the right direction. I'm not quite
sure why that error pops out, but I think 'max' is the wrong function.
That's an aggregate function. "greatest" is the function that returns the
max of several cols. Try that?
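A minimal sketch of that on the DataFrame from the question below (column names
start, end and position come from the thread; lit(0) supplies the constant third
argument):

from pyspark.sql.functions import col, greatest, lit

# greatest() compares values row-wise across its column arguments,
# unlike the aggregate max()
distances = joined.withColumn(
    "distance",
    greatest(col("start") - col("position"), col("position") - col("end"), lit(0)))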

On Thu, Feb 23, 2023 at 2:25 PM Bjørn Jørgensen 
wrote:

> I'm trying to learn how to use ChatGPT for coding.
>
> So after a little chat I got this.
>
>
> The code you provided seems to calculate the distance between a gene and a
> variant by finding the maximum value between the difference of the variant
> position and the gene start position, the difference of the gene end
> position and the variant position, and 0. However, the result of this
> calculation may not always be correct.
>
> Here's an alternative approach to calculating the distance between a gene
> and a variant:
>
>
> from pyspark.sql.functions import abs, col, least
>
> # Calculate the absolute difference between the variant position and the
> gene start and end positions
> distances = joined.withColumn("distance_start", abs(col("position") -
> col("start"))) \
>  .withColumn("distance_end", abs(col("position") -
> col("end")))
>
> # Take the minimum distance between the variant and the gene
> distances = distances.withColumn("distance", least(col("distance_start"),
> col("distance_end")))
>
>
> In this approach, we first calculate the absolute difference between the
> variant position and the gene start and end positions using the abs
> function. Then, we use the least function to take the minimum distance
> between the variant and the gene. This should give you a more accurate
> distance calculation.
>
>
>
>
> Yes, that's correct. The original code you provided calculates the
> distance between a gene and a variant as the maximum of three terms: the
> difference between the variant position and the gene start position, the
> difference between the gene end position and the variant position, and 0.
>
> While this approach might work in some cases, it has a few issues. First,
> it doesn't take into account the possibility that the variant might be
> inside the gene (i.e., between the start and end positions). Second, it
> assumes that the gene is always located upstream or downstream of the
> variant, which may not be the case.
>
> The alternative approach I provided earlier calculates the distance as the
> minimum of two absolute differences: the absolute difference between the
> variant position and the gene start position, and the absolute difference
> between the variant position and the gene end position. This approach
> should work in all cases, including when the variant is inside the gene or
> when the gene is located upstream or downstream of the variant.
>
>
>
> tor. 23. feb. 2023 kl. 20:48 skrev Russell Jurney <
> russell.jur...@gmail.com>:
>
>> Usually, the solution to these problems is to do less per line, break it
>> out and perform each minute operation as a field, then combine those into a
>> final answer. Can you do that here?
>>
>> Thanks,
>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>> <http://facebook.com/jurney> datasyndrome.com Book a time on Calendly
>> <https://calendly.com/rjurney_personal/30min>
>>
>>
>> On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>> Here is the complete error:
>>>
>>> ```
>>> Traceback (most recent call last):
>>>   File "nearest-gene.py", line 74, in 
>>> main()
>>>   File "nearest-gene.py", line 62, in main
>>> distances = joined.withColumn("distance", max(col("start") -
>>> col("position"), col("position") - col("end"), 0))
>>>   File
>>> "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_01/pyspark.zip/pyspark/sql/column.py",
>>> line 907, in __nonzero__
>>> ValueError: Cannot convert column into bool: please use '&' for 'and',
>>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>> ```
>>>
>>> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen  wrote:
>>>
>>>> That error sounds like it's from pandas not spark. 

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That error sounds like it's from pandas not spark. Are you sure it's this
line?

On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   I'm trying to calculate the distance between a gene (with start and end)
> and a variant (with position), so I joined gene and variant data by
> chromosome and then tried to calculate the distance like this:
>
> ```
> distances = joined.withColumn("distance", max(col("start") -
> col("position"), col("position") - col("end"), 0))
> ```
>
>   Basically, the distance is the maximum of three terms.
>
>   This line causes an obscure error:
>
> ```
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
> ```
>
>   How can I do this? Thanks!
>
>  Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network , 
> Flannick
> Lab , Broad Institute
> 
>


Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Sean Owen
FWIW I agree with this.

On Wed, Feb 22, 2023 at 2:59 PM Allan Folting  wrote:

> Hi all,
>
> I would like to propose that we show Python code examples first in the
> Spark documentation where we have multiple programming language examples.
> An example is on the Quick Start page:
> https://spark.apache.org/docs/latest/quick-start.html
>
> I propose this change because Python has become more popular than the
> other languages supported in Apache Spark. There are a lot more users of
> Spark in Python than Scala today and Python attracts a broader set of new
> users.
> For Python usage data, see https://www.tiobe.com/tiobe-index/ and
> https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava.
>
> Also, this change aligns with Python already being the first tab on our
> home page:
> https://spark.apache.org/
>
> Anyone who wants to use another language can still just click on the other
> tabs.
>
> I created a draft PR for the Spark SQL, DataFrames and Datasets Guide page
> as a first step:
> https://github.com/apache/spark/pull/40087
>
>
> I would appreciate it if you could share your thoughts on this proposal.
>
>
> Thanks a lot,
> Allan Folting
>


Re: [DISCUSS] Make release cadence predictable

2023-02-15 Thread Sean Owen
I don't think there is a delay per se, because there is no hard release
date to begin with, to delay with respect to. It's been driven by, "feels
like enough stuff has gone in" and "someone is willing to roll a release",
and that happens more like every 8-9 months. This would be a shift not only
in expectation - lower the threshold for 'enough stuff has gone in' to
probably match a 6 month cadence - but also a shift in policy to a release
train-like process. If something isn't ready then it just waits another 6
months.

You're right, the problem is kind of: what if something is in progress in a
half-baked state? You don't really want to release half a thing, nor do you
want to develop it quite separately from the master branch.
It is worth asking what prompts this, too. Just, we want to release earlier
and more often?

On Wed, Feb 15, 2023 at 1:19 PM Maciej  wrote:

> Hi,
>
> Sorry for a silly question, but do we know what exactly caused these
> delays? Are these avoidable?
>
> It is not a systematic observation, but my general impression is that we
> rarely delay for sake of individual features, unless there is some soft
> consensus about their importance. Arguably, these could be postponed,
> assuming we can adhere to the schedule.
>
> And then, we're left with large, multi-task features. A lot can be done
> with proper timing and design, but in our current process there is no way
> to guarantee that each of these can be delivered within a given time window.
> How are we going to handle these? Delivering half-baked things is hardly a
> satisfying solution, and a more rigid schedule can only increase pressure on
> maintainers. Do we plan to introduce something like feature branches for
> these, to isolate the upcoming release in case of delay?
>
> On 2/14/23 19:53, Dongjoon Hyun wrote:
>
> +1 for Hyukjin and Sean's opinion.
>
> Thank you for initiating this discussion.
>
> If we have a fixed, predefined regular 6-month cadence, I believe we can more
> easily persuade incomplete features to wait for the next release.
>
> In addition, I want to add a requirement for the first RC1 date, because RC1
> has always done a great job for us.
>
> I guess `branch-cut + 1M (no later than 1 month)` could be a reasonable
> deadline.
>
> Thanks,
> Dongjoon.
>
>
> On Tue, Feb 14, 2023 at 6:33 AM Sean Owen  wrote:
>
>> I'm fine with shifting to a stricter cadence-based schedule. Sometimes,
>> it'll mean some significant change misses a release rather than delays it.
>> If people are OK with that discipline, sure.
>> A hard 6-month cycle would mean the minor releases are more frequent and
>> have less change in them. That's probably OK. We could also decide to
>> choose a longer cadence like 9 months, but I don't know if that's better.
>> I assume maintenance releases would still be as-needed, and major
>> releases would also work differently - probably no 4.0 until next year at
>> the earliest.
>>
>> On Tue, Feb 14, 2023 at 3:01 AM Hyukjin Kwon  wrote:
>>
>>> Hi all,
>>>
>>> *TL;DR*: Branch cut for every 6 months (January and July).
>>>
>>> I would like to discuss/propose to make our release cadence predictable.
>>> In our documentation, we mention as follows:
>>>
>>> In general, feature (“minor”) releases occur about every 6 months. Hence,
>>> Spark 2.3.0 would generally be released about 6 months after 2.2.0.
>>>
>>> However, the reality is slightly different. Here is the time it took for
>>> the recent releases:
>>>
>>>- Spark 3.3.0 took 8 months
>>>- Spark 3.2.0 took 7 months
>>>- Spark 3.1 took 9 months
>>>
>>> Here are problems caused by such delay:
>>>
>>>- The whole related schedules are affected in all downstream
>>>projects, vendors, etc.
>>>- It makes the release date unpredictable to the end users.
>>>- Developers as well as the release managers have to rush because of
>>>the delay, which prevents us from focusing on having a proper
>>>regression-free release.
>>>
>>> My proposal is to cut the branch every 6 months (January and July, which
>>> avoids the public holidays / vacation period in general) so the release can
>>> happen twice
>>> every year regardless of the actual release date.
>>> I believe it both makes the release cadence predictable, and relaxes the
>>> burden about making releases.
>>>
>>> WDYT?
>>>
>>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
>


Re: [DISCUSS] Make release cadence predictable

2023-02-14 Thread Sean Owen
I'm fine with shifting to a stricter cadence-based schedule. Sometimes,
it'll mean some significant change misses a release rather than delays it.
If people are OK with that discipline, sure.
A hard 6-month cycle would mean the minor releases are more frequent and
have less change in them. That's probably OK. We could also decide to
choose a longer cadence like 9 months, but I don't know if that's better.
I assume maintenance releases would still be as-needed, and major releases
would also work differently - probably no 4.0 until next year at the
earliest.

On Tue, Feb 14, 2023 at 3:01 AM Hyukjin Kwon  wrote:

> Hi all,
>
> *TL;DR*: Branch cut for every 6 months (January and July).
>
> I would like to discuss/propose to make our release cadence predictable.
> In our documentation, we mention as follows:
>
> In general, feature (“minor”) releases occur about every 6 months. Hence,
> Spark 2.3.0 would generally be released about 6 months after 2.2.0.
>
> However, the reality is slightly different. Here is the time it took for
> the recent releases:
>
>- Spark 3.3.0 took 8 months
>- Spark 3.2.0 took 7 months
>- Spark 3.1 took 9 months
>
> Here are problems caused by such delay:
>
>- The whole related schedules are affected in all downstream projects,
>vendors, etc.
>- It makes the release date unpredictable to the end users.
>- Developers as well as the release managers have to rush because of
>the delay, which prevents us from focusing on having a proper
>regression-free release.
>
> My proposal is to cut the branch every 6 months (January and July, which avoids
> the public holidays / vacation period in general) so the release can happen
> twice
> every year regardless of the actual release date.
> I believe it both makes the release cadence predictable, and relaxes the
> burden about making releases.
>
> WDYT?
>


Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-13 Thread Sean Owen
Agree, just, if it's such a tiny change, and it actually fixes the issue,
maybe worth getting that into 3.3.x. I don't feel strongly.

On Mon, Feb 13, 2023 at 11:19 AM L. C. Hsieh  wrote:

> If it is not supported in Spark 3.3.x, it looks like an improvement for
> Spark 3.4.
> For such cases we usually do not backport. I think this is also why
> the PR was not backported when it was merged.
>
> I'm okay if there is consensus to backport it.
>
>


Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-13 Thread Sean Owen
Does that change affect the result for Spark 3.3.x?
It looks like we do not support Python 3.11 in Spark 3.3.x, which is one
answer to whether this should be changed now.
But if that's the only change that matters for Python 3.11 and makes it
work, sure, I think we should back-port it. It doesn't necessarily block a
release, but if that's the case, it seems OK to me to include in a next RC.

On Mon, Feb 13, 2023 at 10:53 AM Bjørn Jørgensen 
wrote:

> There is a fix for python 3.11 https://github.com/apache/spark/pull/38987
> We should have this in more branches.
>
> man. 13. feb. 2023 kl. 09:39 skrev Bjørn Jørgensen <
> bjornjorgen...@gmail.com>:
>
>> On manjaro it is Python 3.10.9
>>
>> On ubuntu it is Python 3.11.1
>>
>> man. 13. feb. 2023 kl. 03:24 skrev yangjie01 :
>>
>>> Which Python version do you use for testing? When I use the latest
>>> Python 3.11, I can reproduce similar test failures (43 tests in the sql module
>>> fail), but when I use Python 3.10, they succeed.
>>>
>>>
>>>
>>> YangJie
>>>
>>>
>>>
>>> *发件人**: *Bjørn Jørgensen 
>>> *日期**: *2023年2月13日 星期一 05:09
>>> *收件人**: *Sean Owen 
>>> *抄送**: *"L. C. Hsieh" , Spark dev list <
>>> dev@spark.apache.org>
>>> *主题**: *Re: [VOTE] Release Spark 3.3.2 (RC1)
>>>
>>>
>>>
>>> Tried it one more time and the same result.
>>>
>>>
>>>
>>> On another box with Manjaro
>>>
>>> 
>>> [INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
>>> [INFO]
>>> [INFO] Spark Project Parent POM ... SUCCESS
>>> [01:50 min]
>>> [INFO] Spark Project Tags . SUCCESS [
>>> 17.359 s]
>>> [INFO] Spark Project Sketch ... SUCCESS [
>>> 12.517 s]
>>> [INFO] Spark Project Local DB . SUCCESS [
>>> 14.463 s]
>>> [INFO] Spark Project Networking ... SUCCESS
>>> [01:07 min]
>>> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>>>  9.013 s]
>>> [INFO] Spark Project Unsafe ... SUCCESS [
>>>  8.184 s]
>>> [INFO] Spark Project Launcher . SUCCESS [
>>> 10.454 s]
>>> [INFO] Spark Project Core . SUCCESS
>>> [23:58 min]
>>> [INFO] Spark Project ML Local Library . SUCCESS [
>>> 21.218 s]
>>> [INFO] Spark Project GraphX ... SUCCESS
>>> [01:24 min]
>>> [INFO] Spark Project Streaming  SUCCESS
>>> [04:57 min]
>>> [INFO] Spark Project Catalyst . SUCCESS
>>> [08:00 min]
>>> [INFO] Spark Project SQL .. SUCCESS [
>>>  01:02 h]
>>> [INFO] Spark Project ML Library ... SUCCESS
>>> [14:38 min]
>>> [INFO] Spark Project Tools  SUCCESS [
>>>  4.394 s]
>>> [INFO] Spark Project Hive . SUCCESS
>>> [53:43 min]
>>> [INFO] Spark Project REPL . SUCCESS
>>> [01:16 min]
>>> [INFO] Spark Project Assembly . SUCCESS [
>>>  2.186 s]
>>> [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [
>>> 16.150 s]
>>> [INFO] Spark Integration for Kafka 0.10 ... SUCCESS
>>> [01:34 min]
>>> [INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS
>>> [32:55 min]
>>> [INFO] Spark Project Examples . SUCCESS [
>>> 23.800 s]
>>> [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [
>>>  7.301 s]
>>> [INFO] Spark Avro . SUCCESS
>>> [01:19 min]
>>> [INFO]
>>> 
>>> [INFO] BUILD SUCCESS
>>> [INFO]
>>> 
>>> [INFO] Total time:  03:31 h
>>> [INFO] Finished at: 2023-02-12T21:54:20+01:00
>>> [INFO]
>>> 
>>> [bjorn@amd7g spark-3.3.2]$  java -version
>>> openjdk v

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread Sean Owen
It doesn't work because it's an aggregate function. You have to groupBy()
(group by nothing) to make that work, but you can't assign that as a
column. Folks, those approaches don't make sense semantically in SQL or
Spark or anything.
They just mean using threads to collect() the distinct values for each column
in parallel in your program. You don't have to, but you could.
What else are we looking for here? The answer has been given a number of
times, I think.
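For example, a minimal PySpark sketch of that threads-plus-collect approach
(illustrative only; it assumes an existing DataFrame `df`, and the pool size is
arbitrary):

from concurrent.futures import ThreadPoolExecutor

from pyspark.sql.functions import col, collect_set

df = df.cache()  # avoid re-reading the source once per column

def distinct_values(name):
    # one small Spark job per column; collect_set() aggregates into a single array
    return name, df.select(collect_set(col(name)).alias(name)).collect()[0][0]

with ThreadPoolExecutor(max_workers=4) as pool:
    distinct_per_column = dict(pool.map(distinct_values, df.columns))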


On Sun, Feb 12, 2023 at 2:28 PM sam smith 
wrote:

> OK, what do you mean by " do your outer for loop in parallel "?
> btw this didn't work:
> for (String columnName : df.columns()) {
> df= df.withColumn(columnName,
> collect_set(col(columnName)).as(columnName));
> }
>
>
> Le dim. 12 févr. 2023 à 20:36, Enrico Minack  a
> écrit :
>
>> That is unfortunate, but 3.4.0 is around the corner, really!
>>
>> Well, then based on your code, I'd suggest two improvements:
>> - cache your dataframe after reading, this way, you don't read the entire
>> file for each column
>> - do your outer for loop in parallel, then you have N parallel Spark jobs
>> (only helps if your Spark cluster is not fully occupied by a single column)
>>
>> Your withColumn-approach does not work because withColumn expects a
>> column as the second argument, but df.select(columnName).distinct() is a
>> DataFrame and .col is a column in *that* DataFrame, it is not a column
>> of the dataframe that you call withColumn on.
>>
>> It should read:
>>
>> Scala:
>> df.select(df.columns.map(column => collect_set(col(column)).as(column)):
>> _*).show()
>>
>> Java:
>> for (String columnName : df.columns()) {
>> df= df.withColumn(columnName,
>> collect_set(col(columnName)).as(columnName));
>> }
>>
>> Then you have a single DataFrame that computes all columns in a single
>> Spark job.
>>
>> But this reads all distinct values into a single partition, which has the
>> same downside as collect, so this is as bad as using collect.
>>
>> Cheers,
>> Enrico
>>
>>
>> Am 12.02.23 um 18:05 schrieb sam smith:
>>
>> @Enrico Minack  Thanks for "unpivot" but I am
>> using version 3.3.0 (you are taking it way too far as usual :) )
>> @Sean Owen  Pls then show me how it can be improved by
>> code.
>>
>> Also, why such an approach (using withColumn() ) doesn't work:
>>
>> for (String columnName : df.columns()) {
>> df= df.withColumn(columnName,
>> df.select(columnName).distinct().col(columnName));
>> }
>>
>> Le sam. 11 févr. 2023 à 13:11, Enrico Minack  a
>> écrit :
>>
>>> You could do the entire thing in DataFrame world and write the result to
>>> disk. All you need is unpivot (to be released in Spark 3.4.0, soon).
>>>
>>> Note this is Scala but should be straightforward to translate into Java:
>>>
>>> import org.apache.spark.sql.functions.collect_set
>>>
>>> val df = Seq((1, 10, 123), (2, 20, 124), (3, 20, 123), (4, 10,
>>> 123)).toDF("a", "b", "c")
>>>
>>> df.unpivot(Array.empty, "column", "value")
>>>   .groupBy("column")
>>>   .agg(collect_set("value").as("distinct_values"))
>>>
>>> The unpivot operation turns
>>> +---+---+---+
>>> |  a|  b|  c|
>>> +---+---+---+
>>> |  1| 10|123|
>>> |  2| 20|124|
>>> |  3| 20|123|
>>> |  4| 10|123|
>>> +---+---+---+
>>>
>>> into
>>>
>>> +--+-+
>>> |column|value|
>>> +--+-+
>>> | a|1|
>>> | b|   10|
>>> | c|  123|
>>> | a|2|
>>> | b|   20|
>>> | c|  124|
>>> | a|3|
>>> | b|   20|
>>> | c|  123|
>>> | a|4|
>>> | b|   10|
>>> | c|  123|
>>> +--+-+
>>>
>>> The groupBy("column").agg(collect_set("value").as("distinct_values"))
>>> collects distinct values per column:
>>> +--+---+
>>>
>>> |column|distinct_values|
>>> +--+---+
>>> | c| [123, 124]|
>>> | b|   [20, 10]|
>>> | a|   [1, 2, 3, 4]|
>>> +--+---+
>>>
>>> Note that unpivot only works if all columns have a "common" type. Then
>>> all columns are cast to that common type. If you 

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread Sean Owen
That's the answer, except you can never select a result set into a column,
right? You just collect() each of those results. Or, what do you want? I'm
not clear.

On Sun, Feb 12, 2023 at 10:59 AM sam smith 
wrote:

> @Enrico Minack  Thanks for "unpivot" but I am using
> version 3.3.0 (you are taking it way too far as usual :) )
> @Sean Owen  Pls then show me how it can be improved by
> code.
>
> Also, why such an approach (using withColumn() ) doesn't work:
>
> for (String columnName : df.columns()) {
> df= df.withColumn(columnName,
> df.select(columnName).distinct().col(columnName));
> }
>
> Le sam. 11 févr. 2023 à 13:11, Enrico Minack  a
> écrit :
>
>> You could do the entire thing in DataFrame world and write the result to
>> disk. All you need is unpivot (to be released in Spark 3.4.0, soon).
>>
>> Note this is Scala but should be straightforward to translate into Java:
>>
>> import org.apache.spark.sql.functions.collect_set
>>
>> val df = Seq((1, 10, 123), (2, 20, 124), (3, 20, 123), (4, 10,
>> 123)).toDF("a", "b", "c")
>>
>> df.unpivot(Array.empty, "column", "value")
>>   .groupBy("column")
>>   .agg(collect_set("value").as("distinct_values"))
>>
>> The unpivot operation turns
>> +---+---+---+
>> |  a|  b|  c|
>> +---+---+---+
>> |  1| 10|123|
>> |  2| 20|124|
>> |  3| 20|123|
>> |  4| 10|123|
>> +---+---+---+
>>
>> into
>>
>> +------+-----+
>> |column|value|
>> +------+-----+
>> |     a|    1|
>> |     b|   10|
>> |     c|  123|
>> |     a|    2|
>> |     b|   20|
>> |     c|  124|
>> |     a|    3|
>> |     b|   20|
>> |     c|  123|
>> |     a|    4|
>> |     b|   10|
>> |     c|  123|
>> +------+-----+
>>
>> The groupBy("column").agg(collect_set("value").as("distinct_values"))
>> collects distinct values per column:
>> +------+---------------+
>> |column|distinct_values|
>> +------+---------------+
>> |     c|     [123, 124]|
>> |     b|       [20, 10]|
>> |     a|   [1, 2, 3, 4]|
>> +------+---------------+
>>
>> Note that unpivot only works if all columns have a "common" type. Then
>> all columns are cast to that common type. If you have incompatible types
>> like Integer and String, you would have to cast them all to String first:
>>
>> import org.apache.spark.sql.types.StringType
>>
>> df.select(df.columns.map(col(_).cast(StringType)): _*).unpivot(...)
>>
>> If you want to preserve the type of the values and have multiple value
>> types, you cannot put everything into a DataFrame with one
>> distinct_values column. You could still have multiple DataFrames, one
>> per data type, and write those, or collect the DataFrame's values into Maps:
>>
>> import scala.collection.immutable
>>
>> import org.apache.spark.sql.DataFrame
>> import org.apache.spark.sql.functions.collect_set
>>
>> // if all you columns have the same type
>> def distinctValuesPerColumnOneType(df: DataFrame): immutable.Map[String,
>> immutable.Seq[Any]] = {
>>   df.unpivot(Array.empty, "column", "value")
>> .groupBy("column")
>> .agg(collect_set("value").as("distinct_values"))
>> .collect()
>> .map(row => row.getString(0) -> row.getSeq[Any](1).toList)
>> .toMap
>> }
>>
>>
>> // if your columns have different types
>> def distinctValuesPerColumn(df: DataFrame): immutable.Map[String,
>> immutable.Seq[Any]] = {
>>   df.schema.fields
>> .groupBy(_.dataType)
>> .mapValues(_.map(_.name))
>> .par
>> .map { case (dataType, columns) => df.select(columns.map(col): _*) }
>> .map(distinctValuesPerColumnOneType)
>> .flatten
>> .toList
>> .toMap
>> }
>>
>> val df = Seq((1, 10, "one"), (2, 20, "two"), (3, 20, "one"), (4, 10,
>> "one")).toDF("a", "b", "c")
>> distinctValuesPerColumn(df)
>>
>> The result is: (list values are of original type)
>> Map(b -> List(20, 10), a -> List(1, 2, 3, 4), c -> List(one, two))
>>
>> Hope this helps,
>> Enrico
>>
>>
>> Am 10.02.23 um 22:56 schrieb sam smith:
>>
>> Hi Apotolos,
>> Can you suggest a better approach while keeping values within a dataframe?
>>
>>

Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-11 Thread Sean Owen
+1 The tests and all results were the same as ever for me (Java 11, Scala
2.13, Ubuntu 22.04)
I also didn't see that issue ... maybe it's somehow locale-related? That could
still be a bug.

On Sat, Feb 11, 2023 at 8:49 PM L. C. Hsieh  wrote:

> Thank you for testing it.
>
> I was going to run it again but still didn't see any errors.
>
> I also checked CI (and looked again now) on branch-3.3 before cutting RC.
>
> BTW, I didn't find an actual test failure (i.e. "- test_name ***
> FAILED ***") in the log file.
>
> Maybe it is due to the dev env? What dev env you're using to run the test?
>
>
> On Sat, Feb 11, 2023 at 8:58 AM Bjørn Jørgensen
>  wrote:
> >
> >
> > ./build/mvn clean package
> >
> > Run completed in 1 hour, 18 minutes, 29 seconds.
> > Total number of tests run: 11652
> > Suites: completed 516, aborted 0
> > Tests: succeeded 11609, failed 43, canceled 8, ignored 57, pending 0
> > *** 43 TESTS FAILED ***
> > [INFO]
> 
> > [INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
> > [INFO]
> > [INFO] Spark Project Parent POM ... SUCCESS [
> 3.418 s]
> > [INFO] Spark Project Tags . SUCCESS [
> 17.845 s]
> > [INFO] Spark Project Sketch ... SUCCESS [
> 20.791 s]
> > [INFO] Spark Project Local DB . SUCCESS [
> 16.527 s]
> > [INFO] Spark Project Networking ... SUCCESS
> [01:03 min]
> > [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
> 9.914 s]
> > [INFO] Spark Project Unsafe ... SUCCESS [
> 12.007 s]
> > [INFO] Spark Project Launcher . SUCCESS [
> 7.620 s]
> > [INFO] Spark Project Core . SUCCESS
> [40:04 min]
> > [INFO] Spark Project ML Local Library . SUCCESS [
> 29.997 s]
> > [INFO] Spark Project GraphX ... SUCCESS
> [02:33 min]
> > [INFO] Spark Project Streaming  SUCCESS
> [05:51 min]
> > [INFO] Spark Project Catalyst . SUCCESS
> [13:29 min]
> > [INFO] Spark Project SQL .. FAILURE [
> 01:25 h]
> > [INFO] Spark Project ML Library ... SKIPPED
> > [INFO] Spark Project Tools  SKIPPED
> > [INFO] Spark Project Hive . SKIPPED
> > [INFO] Spark Project REPL . SKIPPED
> > [INFO] Spark Project Assembly . SKIPPED
> > [INFO] Kafka 0.10+ Token Provider for Streaming ... SKIPPED
> > [INFO] Spark Integration for Kafka 0.10 ... SKIPPED
> > [INFO] Kafka 0.10+ Source for Structured Streaming  SKIPPED
> > [INFO] Spark Project Examples . SKIPPED
> > [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
> > [INFO] Spark Avro . SKIPPED
> > [INFO]
> 
> > [INFO] BUILD FAILURE
> > [INFO]
> 
> > [INFO] Total time:  02:30 h
> > [INFO] Finished at: 2023-02-11T17:32:45+01:00
> >
> > lør. 11. feb. 2023 kl. 06:01 skrev L. C. Hsieh :
> >>
> >> Please vote on releasing the following candidate as Apache Spark
> version 3.3.2.
> >>
> >> The vote is open until Feb 15th 9AM (PST) and passes if a majority +1
> >> PMC votes are cast, with a minimum of 3 +1 votes.
> >>
> >> [ ] +1 Release this package as Apache Spark 3.3.2
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see https://spark.apache.org/
> >>
> >> The tag to be voted on is v3.3.2-rc1 (commit
> >> 5103e00c4ce5fcc4264ca9c4df12295d42557af6):
> >> https://github.com/apache/spark/tree/v3.3.2-rc1
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-bin/
> >>
> >> Signatures used for Spark RCs can be found in this file:
> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1433/
> >>
> >> The documentation corresponding to this release can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-docs/
> >>
> >> The list of bug fixes going into 3.3.2 can be found at the following
> URL:
> >> https://issues.apache.org/jira/projects/SPARK/versions/12352299
> >>
> >> This release is using the release script of the tag v3.3.2-rc1.
> >>
> >> FAQ
> >>
> >> =
> >> How can I help test this release?
> >> =
> >>
> >> If you are a Spark user, you can help us test this release by taking
> >> an existing Spark 

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread Sean Owen
Why would csv or a temp table change anything here? You don't need
windowing for distinct values either
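
For instance, a minimal sketch (file path and column name are placeholders),
with no temp view or window function involved:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
listing_df = spark.read.option("header", "true").option("inferSchema", "true").csv("/path/to/file.csv")

# distinct values of a single column, no windowing or temp table required
listing_df.select("some_column").distinct().show()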

On Fri, Feb 10, 2023, 6:01 PM Mich Talebzadeh 
wrote:

> on top of my head, create a dataframe reading CSV file.
>
> This is python
>
>  listing_df =
> spark.read.format("com.databricks.spark.csv").option("inferSchema",
> "true").option("header", "true").load(csv_file)
>  listing_df.printSchema()
>  listing_df.createOrReplaceTempView("temp")
>
> ## do your distinct columns using windowing functions on temp table with
> SQL
>
>  HTH
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 10 Feb 2023 at 21:59, sam smith 
> wrote:
>
>> I am not sure I understand "just need to do the cols one at a
>> time". Plus I think Apostolos is right, this needs a dataframe approach, not
>> a list approach.
>>
>> Le ven. 10 févr. 2023 à 22:47, Sean Owen  a écrit :
>>
>>> For each column, select only that call and get distinct values. Similar
>>> to what you do here. Just need to do the cols one at a time. Your current
>>> code doesn't do what you want.
>>>
>>> On Fri, Feb 10, 2023, 3:46 PM sam smith 
>>> wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> "You need to select the distinct values of each col one at a time", how
>>>> ?
>>>>
>>>> Le ven. 10 févr. 2023 à 22:40, Sean Owen  a écrit :
>>>>
>>>>> That gives you all distinct tuples of those col values. You need to
>>>>> select the distinct values of each col one at a time. Sure just collect()
>>>>> the result as you do here.
>>>>>
>>>>> On Fri, Feb 10, 2023, 3:34 PM sam smith 
>>>>> wrote:
>>>>>
>>>>>> I want to get the distinct values of each column in a List (is it
>>>>>> good practice to use List here?), that contains as first element the 
>>>>>> column
>>>>>> name, and the other element its distinct values so that for a dataset we
>>>>>> get a list of lists, I do it this way (in my opinion not so fast):
>>>>>>
>>>>>> List<List<String>> finalList = new ArrayList<List<String>>();
>>>>>> Dataset<Row> df = spark.read().format("csv").option("header",
>>>>>> "true").load("/pathToCSV");
>>>>>> String[] columnNames = df.columns();
>>>>>> for (int i = 0; i < columnNames.length; i++) {
>>>>>> List<String> columnList = new ArrayList<String>();
>>>>>> columnList.add(columnNames[i]);
>>>>>> List<Row> columnValues =
>>>>>> df.filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull()).select(columnNames[i]).distinct().collectAsList();
>>>>>> for (int j = 0; j < columnValues.size(); j++)
>>>>>> columnList.add(columnValues.get(j).apply(0).toString());
>>>>>> finalList.add(columnList);
>>>>>> }
>>>>>>
>>>>>>
>>>>>> How to improve this?
>>>>>>
>>>>>> Also, can I get the results in JSON format?
>>>>>>
>>>>>


Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread Sean Owen
That gives you all distinct tuples of those col values. You need to select
the distinct values of each col one at a time. Sure just collect() the
result as you do here.
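
To illustrate the difference being described, a small PySpark sketch (toy data
and column names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10, 123), (2, 20, 124), (3, 20, 123)], ["a", "b", "c"])

# selecting several columns at once gives distinct *tuples* of those values
df.select("a", "b", "c").distinct().show()

# selecting one column at a time gives the distinct values of each column
for c in df.columns:
    values = [row[0] for row in df.select(c).distinct().collect()]
    print(c, values)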

On Fri, Feb 10, 2023, 3:34 PM sam smith  wrote:

> I want to get the distinct values of each column in a List (is it good
> practice to use List here?), that contains as first element the column
> name, and the other element its distinct values so that for a dataset we
> get a list of lists, I do it this way (in my opinion not so fast):
>
> List<List<String>> finalList = new ArrayList<List<String>>();
> Dataset<Row> df = spark.read().format("csv").option("header",
> "true").load("/pathToCSV");
> String[] columnNames = df.columns();
> for (int i = 0; i < columnNames.length; i++) {
> List<String> columnList = new ArrayList<String>();
> columnList.add(columnNames[i]);
> List<Row> columnValues =
> df.filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull()).select(columnNames[i]).distinct().collectAsList();
> for (int j = 0; j < columnValues.size(); j++)
> columnList.add(columnValues.get(j).apply(0).toString());
> finalList.add(columnList);
> }
>
>
> How to improve this?
>
> Also, can I get the results in JSON format?
>


Re: Building Spark to run PySpark Tests?

2023-01-19 Thread Sean Owen
It's not clear what error you're facing from this info (ConnectionError
could mean lots of things), so it would be hard to give a general answer. How
much memory do you have on your Mac?
-Xmx2g sounds low, but also probably doesn't matter much.
Spark builds work on my Mac, FWIW.

On Thu, Jan 19, 2023 at 10:15 AM Adam Chhina  wrote:

> Hmm, would there be a list of common env issues that would interfere with
> builds? Looking up the error message, it seemed like often the issue was
> OOM by the JVM process. I’m not sure if that’s what’s happening here, since
> during the build and setting up the tests the config should have allocated
> enough memory?
>
> I’ve been just trying to follow the build docs, and so far I’m running as
> such:
>
> > git clone --branch v3.2.3 https://github.com/apache/spark.git
> > cd spark
> > export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g” // was
> unset, but set to be safe
> > export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES // I saw in the
> developer tools that some pyspark tests were having issues on macOS
> > export JAVA_HOME=`/usr/libexec/java_home -v 11`
> > ./build/mvn -DskipTests clean package -Phive
> > ./python/run-tests --python-executables --testnames
> ‘pyspark.tests.test_broadcast'
>
> > java -version
>
> openjdk version "11.0.17" 2022-10-18
>
> OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>
> OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>
>
> > OS
>
> Ventura 13.1 (22C65)
>
>
> Best,
>
>
> Adam Chhina
>
> On Jan 18, 2023, at 6:50 PM, Sean Owen  wrote:
>
> Release _branches_ are tested as commits arrive to the branch, yes. That's
> what you see at https://github.com/apache/spark/actions
> Released versions are fixed, they don't change, and were also manually
> tested before release, so no they are not re-tested; there is no need.
>
> You presumably have some local env issue, because the source of Spark
> 3.2.3 was passing CI/CD at time of release as well as manual tests of the
> PMC.
>
>
> On Wed, Jan 18, 2023 at 5:24 PM Adam Chhina  wrote:
>
>> Hi Sean,
>>
>> That’s fair in regards to 3.3.x being the current release branch. I’m not
>> familiar with the testing schedule, but I had assumed all currently
>> supported release versions would have some nightly/weekly tests ran; is
>> that not the case? I only ask, as when I when I’m seeing these test
>> failures, I assumed these were either known/unknown from some recurring
>> testing pipeline.
>>
>> Also, unfortunately using v3.2.3 also had the same test failures.
>>
>> > git clone --branch v3.2.3 https://github.com/apache/spark.git
>>
>> I’ve posted the traceback below for one of the ran tests. At the end it
>> mentioned to check the logs - `see logs`. However I wasn’t sure whether
>> that just meant the traceback or some more detailed logs elsewhere? I
>> wasn’t able to see any files that looked relevant running `find . -name
>> “*logs*”` afterwards. Sorry if I’m missing something obvious.
>>
>> ```
>> test_broadcast_no_encryption (pyspark.tests.test_broadcast.BroadcastTest)
>> ... ERROR
>> test_broadcast_value_against_gc
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_value_driver_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_value_driver_no_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_with_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>>
>> ==
>> ERROR: test_broadcast_with_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest)
>> --
>> Traceback (most recent call last):
>>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 67, in
>> test_broadcast_with_encryption
>> self._test_multiple_broadcasts(("spark.io.encryption.enabled",
>> "true"))
>>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 58, in
>> _test_multiple_broadcasts
>> conf = SparkConf()
>>   File "$path/spark/python/pyspark/conf.py", line 120, in __init__
>> self._jconf = _jvm.SparkConf(loadDefaults)
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line
>> 1709, in __getattr__
>> answer = self._gateway_client.send_command(
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gatew

Re: Can you create an apache jira account for me? Thanks very much!

2023-01-19 Thread Sean Owen
I can help offline. Send me your preferred JIRA user name.

On Thu, Jan 19, 2023 at 7:12 AM Wei Yan  wrote:

> When I tried to sign up through this site:
> https://issues.apache.org/jira/secure/Signup!default.jspa
> I got an error message:"Sorry, you can't sign up to this Jira site at the
> moment as it's private."
> and I got a suggestion:"If you think you should be able to sign up then
> you should let the Jira administrator know".
> So I think I need some help.
>
>
>


Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Sean Owen
Release _branches_ are tested as commits arrive to the branch, yes. That's
what you see at https://github.com/apache/spark/actions
Released versions are fixed, they don't change, and were also manually
tested before release, so no they are not re-tested; there is no need.

You presumably have some local env issue, because the source of Spark 3.2.3
was passing CI/CD at time of release as well as manual tests of the PMC.


On Wed, Jan 18, 2023 at 5:24 PM Adam Chhina  wrote:

> Hi Sean,
>
> That’s fair in regards to 3.3.x being the current release branch. I’m not
> familiar with the testing schedule, but I had assumed all currently
> supported release versions would have some nightly/weekly tests ran; is
> that not the case? I only ask, as when I when I’m seeing these test
> failures, I assumed these were either known/unknown from some recurring
> testing pipeline.
>
> Also, unfortunately using v3.2.3 also had the same test failures.
>
> > git clone --branch v3.2.3 https://github.com/apache/spark.git
>
> I’ve posted the traceback below for one of the ran tests. At the end it
> mentioned to check the logs - `see logs`. However I wasn’t sure whether
> that just meant the traceback or some more detailed logs elsewhere? I
> wasn’t able to see any files that looked relevant running `find . -name
> “*logs*”` afterwards. Sorry if I’m missing something obvious.
>
> ```
> test_broadcast_no_encryption (pyspark.tests.test_broadcast.BroadcastTest)
> ... ERROR
> test_broadcast_value_against_gc
> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
> test_broadcast_value_driver_encryption
> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
> test_broadcast_value_driver_no_encryption
> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
> test_broadcast_with_encryption
> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>
> ==
> ERROR: test_broadcast_with_encryption
> (pyspark.tests.test_broadcast.BroadcastTest)
> --
> Traceback (most recent call last):
>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 67, in
> test_broadcast_with_encryption
> self._test_multiple_broadcasts(("spark.io.encryption.enabled", "true"))
>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 58, in
> _test_multiple_broadcasts
> conf = SparkConf()
>   File "$path/spark/python/pyspark/conf.py", line 120, in __init__
> self._jconf = _jvm.SparkConf(loadDefaults)
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line
> 1709, in __getattr__
> answer = self._gateway_client.send_command(
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line
> 1036, in send_command
> connection = self._get_connection()
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
> 284, in _get_connection
> connection = self._create_new_connection()
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
> 291, in _create_new_connection
> connection.connect_to_java_server()
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
> 438, in connect_to_java_server
> self.socket.connect((self.java_address, self.java_port))
> ConnectionRefusedError: [Errno 61] Connection refused
>
> ------
> Ran 7 tests in 12.950s
>
> FAILED (errors=7)
> sys:1: ResourceWarning: unclosed file <_io.BufferedWriter name=4>
>
> Had test failures in pyspark.tests.test_broadcast with
> /usr/local/bin/python3; see logs.
> ```
>
> Best,
>
> Adam Chhina
>
> On Jan 18, 2023, at 5:03 PM, Sean Owen  wrote:
>
> That isn't the released version either, but rather the head of the 3.2
> branch (which is beyond 3.2.3).
> You may want to check out the v3.2.3 tag instead:
> https://github.com/apache/spark/tree/v3.2.3
> ... instead of 3.2.1.
> But note of course the 3.3.x is the current release branch anyway.
>
> Hard to say what the error is without seeing more of the error log.
>
> That final warning is fine, just means you are using Java 11+.
>
>
> On Wed, Jan 18, 2023 at 3:59 PM Adam Chhina  wrote:
>
>> Oh, whoops, didn’t realize that wasn’t the release version, thanks!
>>
>> > git clone --branch branch-3.2 https://github.com/apache/spark.git
>>
>> Ah, so the old failing tests are passing now, but I am seeing failures in
>> `pyspark.tests.test_broadcast` such as  `

Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Sean Owen
That isn't the released version either, but rather the head of the 3.2
branch (which is beyond 3.2.3).
You may want to check out the v3.2.3 tag instead:
https://github.com/apache/spark/tree/v3.2.3
... instead of 3.2.1.
But note of course the 3.3.x is the current release branch anyway.

Hard to say what the error is without seeing more of the error log.

That final warning is fine, just means you are using Java 11+.


On Wed, Jan 18, 2023 at 3:59 PM Adam Chhina  wrote:

> Oh, whoops, didn’t realize that wasn’t the release version, thanks!
>
> > git clone --branch branch-3.2 https://github.com/apache/spark.git
>
> Ah, so the old failing tests are passing now, but I am seeing failures in 
> `pyspark.tests.test_broadcast`
> such as  `test_broadcast_value_against_gc`, with a majority of them
> failing due to `ConnectionRefusedError: [Errno 61] Connection refused`.
> Maybe these tests are not meant to be run locally, and only in the pipeline?
>
> Also, I see this warning that mentions to notify the maintainers here:
>
> ```
>
> Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
>
> WARNING: An illegal reflective access operation has occurred
>
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
> (file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor
> java.nio.DirectByteBuffer(long,int)
> ```
>
> FWIW, not sure if this matters, but python executable used for running
> these tests is `Python 3.10.9` under `/user/local/bin/python3`.
>
> Best,
>
> Adam Chhina
>
> On Jan 18, 2023, at 3:05 PM, Bjørn Jørgensen 
> wrote:
>
> Replace
> > > git clone g...@github.com:apache/spark.git
> > > git checkout -b spark-321 v3.2.1
>
> with
> git clone --branch branch-3.2 https://github.com/apache/spark.git
> This will give you branch 3.2 as today, what I suppose you call upstream
>
> https://github.com/apache/spark/commits/branch-3.2
> and right now all tests in github action are passed :)
>
>
> ons. 18. jan. 2023 kl. 18:07 skrev Sean Owen :
>
>> Never seen those, but it's probably a difference in pandas, numpy
>> versions. You can see the current CICD test results in GitHub Actions. But,
>> you want to use release versions, not an RC. 3.2.1 is not the latest
>> version, and it's possible the tests were actually failing in the RC.
>>
>> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina  wrote:
>>
>>> Bump,
>>>
>>> Just trying to see where I can find what tests are known failing for a
>>> particular release, to ensure I’m building upstream correctly following the
>>> build docs. I figured this would be the best place to ask as it pertains to
>>> building and testing upstream (also more than happy to provide a PR for any
>>> docs if required afterwards), however if there would be a more appropriate
>>> place, please let me know.
>>>
>>> Best,
>>>
>>> Adam Chhina
>>>
>>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina 
>>> wrote:
>>> >
>>> > As part of an upgrade I was looking to run upstream PySpark unit tests
>>> on `v3.2.1-rc2` before I applied some downstream patches and tested those.
>>> However, I'm running into some issues with failing unit tests, which I'm
>>> not sure are failing upstream or due to some step I missed in the build.
>>> >
>>> > The current failing tests (at least so far, since I believe the python
>>> script exits on test failure):
>>> > ```
>>> > ==
>>> > FAIL: test_train_prediction
>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>>> > Test that error on test data improves as model is trained.
>>> > --
>>> > Traceback (most recent call last):
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 474, in test_train_prediction
>>> > eventually(condition, timeout=180.0)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>> 86, in eventually
>>> > lastValue = condition()
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 469, in condition
>>> > self.assertGreater(errors[1] - errors[-1], 2)
>>> > AssertionError: 1.8960983527735014 not greater than 2
>>> >
>>> > ===

Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Sean Owen
Never seen those, but it's probably a difference in pandas, numpy versions.
You can see the current CICD test results in GitHub Actions. But, you want
to use release versions, not an RC. 3.2.1 is not the latest version, and
it's possible the tests were actually failing in the RC.

On Wed, Jan 18, 2023, 10:57 AM Adam Chhina  wrote:

> Bump,
>
> Just trying to see where I can find what tests are known failing for a
> particular release, to ensure I’m building upstream correctly following the
> build docs. I figured this would be the best place to ask as it pertains to
> building and testing upstream (also more than happy to provide a PR for any
> docs if required afterwards), however if there would be a more appropriate
> place, please let me know.
>
> Best,
>
> Adam Chhina
>
> > On Dec 27, 2022, at 11:37 AM, Adam Chhina  wrote:
> >
> > As part of an upgrade I was looking to run upstream PySpark unit tests
> on `v3.2.1-rc2` before I applied some downstream patches and tested those.
> However, I'm running into some issues with failing unit tests, which I'm
> not sure are failing upstream or due to some step I missed in the build.
> >
> > The current failing tests (at least so far, since I believe the python
> script exits on test failure):
> > ```
> > ==
> > FAIL: test_train_prediction
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> > Test that error on test data improves as model is trained.
> > --
> > Traceback (most recent call last):
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 474, in test_train_prediction
> > eventually(condition, timeout=180.0)
> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86,
> in eventually
> > lastValue = condition()
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 469, in condition
> > self.assertGreater(errors[1] - errors[-1], 2)
> > AssertionError: 1.8960983527735014 not greater than 2
> >
> > ==
> > FAIL: test_parameter_accuracy
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> > Test that the final value of weights is close to the desired value.
> > --
> > Traceback (most recent call last):
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 229, in test_parameter_accuracy
> > eventually(condition, timeout=60.0, catch_assertions=True)
> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91,
> in eventually
> > raise lastValue
> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82,
> in eventually
> > lastValue = condition()
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 226, in condition
> > self.assertAlmostEqual(rel, 0.1, 1)
> > AssertionError: 0.23052813480829393 != 0.1 within 1 places
> (0.13052813480829392 difference)
> >
> > ==
> > FAIL: test_training_and_prediction
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> > Test that the model improves on toy data with no. of batches
> > --
> > Traceback (most recent call last):
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 334, in test_training_and_prediction
> > eventually(condition, timeout=180.0)
> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 93,
> in eventually
> > raise AssertionError(
> > AssertionError: Test failed due to timeout after 180 sec, with last
> condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74,
> 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78,
> 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64,
> 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
> >
> > --
> > Ran 13 tests in 661.536s
> >
> > FAILED (failures=3, skipped=1)
> >
> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms with
> /usr/local/bin/python3; see logs.
> > ```
> >
> > Here's how I'm currently building Spark, I was using the
> [building-spark](https://spark.apache.org/docs/3..1/building-spark.html)
> docs as a reference.
> > ```
> > > git clone g...@github.com:apache/spark.git
> > > git checkout -b spark-321 v3.2.1
> > > ./build/mvn -DskipTests clean package -Phive
> > > export JAVA_HOME=$(path/to/jdk/11)
> > > ./python/run-tests
> > ```
> >
> > Current Java version
> > ```
> > java 

Re: [PySPark] How to check if value of one column is in array of another column

2023-01-17 Thread Sean Owen
I think you want array_contains:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_contains.html
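
A small sketch of how that can be applied to the columns described (the sample
rows are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# toy stand-in for the JSON data described: gene is a string, nearest an array of strings
genes_joined = spark.createDataFrame(
    [("BRCA1", ["BRCA1", "TP53"]), ("EGFR", ["KRAS"])],
    ["gene", "nearest"],
)

# array_contains(array, value) where the value is another column; the SQL
# expression form works across Spark versions
genes_joined.withColumn("gene_in_nearest", expr("array_contains(nearest, gene)")).show()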

On Tue, Jan 17, 2023 at 4:18 PM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   I have data originally stored as JSON. Column gene contains a string,
> column nearest an array of strings. How can I check whether the value of
> gene is an element of the array of nearest?
>
>   I tried: genes_joined.gene.isin(genes_joined.nearest)
>
>   But I get an error that says:
>
> pyspark.sql.utils.AnalysisException: cannot resolve '(gene IN (nearest))'
> due to data type mismatch: Arguments must be same type but were: string !=
> array;
>
>   How do I do this? Thanks!
>
>  Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute
>
>


Re: pyspark.sql.dataframe.DataFrame versus pyspark.pandas.frame.DataFrame

2023-01-13 Thread Sean Owen
One is a normal Pyspark DataFrame, the other is a pandas work-alike wrapper
on a Pyspark DataFrame. They're the same thing with different APIs.
Neither has a 'storage format'.

spark-excel might be fine, and it's used with Spark DataFrames. Because the
pandas-on-Spark API emulates pandas, it also has a read_excel function that
could work.
You can try both and see which works for you.
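
A rough sketch of how the two relate and how either route could be tried
(paths are placeholders; the spark-excel options are taken from your snippet):

import pyspark.pandas as ps
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pandas-on-Spark: pandas-style API, backed by a Spark DataFrame underneath
psdf = ps.read_excel("/path/big_excel.xls")
sdf = psdf.to_spark()           # convert to a plain Spark DataFrame
psdf_again = sdf.pandas_api()   # and back again (Spark 3.2+)

# third-party spark-excel route (needs the com.crealytics:spark-excel package on the classpath)
sdf2 = (spark.read.format("com.crealytics.spark.excel")
        .option("header", "true")
        .load("/path/big_excel.xls"))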

On Thu, Jan 12, 2023 at 9:56 PM second_co...@yahoo.com.INVALID
 wrote:

>
> Good day,
>
> May I know what the difference is between pyspark.sql.dataframe.DataFrame
> and pyspark.pandas.frame.DataFrame? Are both stored in Spark dataframe
> format?
>
> I'm looking for a way to load a huge Excel file (4-10GB), and I wonder whether
> I should use the third-party library spark-excel or just native pyspark.pandas.
> I prefer to use a Spark dataframe so that it uses the parallelization
> feature of Spark in the executors instead of running on the driver.
>
> Can help to advice ?
>
>
> Detail
> ---
>
> df = spark.read \
>     .format("com.crealytics.spark.excel") \
>     .option("header", "true") \
>     .load("/path/big_excel.xls")
> print(type(df))  # output: pyspark.sql.dataframe.DataFrame
>
> import pyspark.pandas as ps
> from pyspark.sql import DataFrame
>
> path = "/path/big-excel.xls"
> df = ps.read_excel(path)
>
> # output pyspark.pandas.frame.DataFrame
>
>
> Thank you.
>
>
>


Re: [pyspark/sparksql]: How to overcome redundant/repetitive code? Is a for loop over an sql statement with a variable a bad idea?

2023-01-06 Thread Sean Owen
Right, nothing wrong with a for loop here. Seems like just the right thing.
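
For what it's worth, a minimal sketch of that loop (the argument handling,
table-name pattern, and write target are assumptions based on the pseudocode
below):

import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# e.g. spark-submit withloops.py model1,model2,model3
models = sys.argv[1].split(",")

qry = "SELECT * FROM {table} WHERE model = '{model}'"
for m in models:
    df = spark.sql(qry.format(table=f"{m}_table", model=m))
    # the write target is a placeholder -- point it wherever each model's output should go
    df.write.mode("overwrite").saveAsTable(f"{m}_output")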

On Fri, Jan 6, 2023, 3:20 PM Joris Billen 
wrote:

> Hello Community,
> I am working in pyspark with sparksql and have a very similar very complex
> list of dataframes that Ill have to execute several times for all the
> “models” I have.
> Suppose the code is exactly the same for all models, only the table it
> reads from and some values in the where statements will have the modelname
> in it.
> My question is how to prevent repetitive code.
> So instead of doing something like this (this is pseudocode, in reality it
> makes use of lots of complex dataframes) which also would require me to
> change the code every time I change it in the future:
>
> dfmodel1 = sqlContext.sql("SELECT  FROM model1_table WHERE model = 'model1'").write()
> dfmodel2 = sqlContext.sql("SELECT  FROM model2_table WHERE model = 'model2'").write()
> dfmodel3 = sqlContext.sql("SELECT  FROM model3_table WHERE model = 'model3'").write()
>
>
> For loops in spark sound like a bad idea (but that is mainly in terms of
> data, maybe nothing against looping over sql statements). Is it allowed
> to do something like this?
>
>
> spark-submit withloops.py ["model1","model2","model3"]
>
> code withloops.py
> models = sys.arg[1]
> qry = """SELECT  FROM {} WHERE model = '{}'"""
> for i in models:
>   FROM_TABLE = table_model
>   sqlContext.sql(qry.format(i, table_model)).write()
>
>
>
> I was trying to look up refactoring in pyspark to prevent redundant
> code but didn't find any relevant links.
>
>
>
> Thanks for input!
>

