Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-04 Thread Sean Owen
Deprecation doesn't stop any of that though, if you want to encourage people to do something with GraphX. We can un-deprecate things. We don't have to remove deprecated things. But, why would we not encourage people to work on GraphFrames if interested in this domain? Nobody has been willing to c

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-04 Thread Sean Owen
e can > and should deprecate. > > On Fri, Oct 4, 2024 at 3:10 PM Sean Owen wrote: > > > > I could flip this argument around. More strongly, not being deprecated > means "won't be removed" and likewise implies support and development. I > don't think e

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-04 Thread Sean Owen
I could flip this argument around. More strongly, *not* being deprecated means "won't be removed" and likewise implies support and development. I don't think either of the latter have been true for years. What suggests this will change? A todo list is not going to do anything, IMHO. I'm also conce

Re: [VOTE] Officialy Deprecate GraphX in Spark 4

2024-09-30 Thread Sean Owen
For reasons in the previous thread, yes +1 to deprecation On Mon, Sep 30, 2024 at 1:02 PM Holden Karau wrote: > I think it has been de-facto deprecated, we haven’t updated it > meaningfully in several years. I think removing the API would be excessive > but deprecating it would give us the flexi

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-09-30 Thread Sean Owen
I support deprecating GraphX because: - GraphFrames supersedes it, really - No maintainers and no reason to believe there will be - we can take the last 5+ years as thorough evidence - Low (but not trivial) docs hits compared to other modules: https://analytics.apache.org/index.php

Re: [VOTE] Deprecate SparkR

2024-08-21 Thread Sean Owen
+1 On Wed, Aug 21, 2024, 11:40 AM Shivaram Venkataraman < shivaram.venkatara...@gmail.com> wrote: > Hi all > > Based on the previous discussion thread [1], I hereby call a vote to > deprecate the SparkR module in Apache Spark with the upcoming Spark 4 > release and remove it in the next major rel

Re: [VOTE] Archive Spark Documentations in Apache Archives

2024-08-12 Thread Sean Owen
He did already; see the preceding thread here on dev@. You can figure the size that moves out of the repo from the docs sizes: 9.9M ./0.6.0 10M ./0.6.1 10M ./0.6.2 15M ./0.7.0 16M ./0.7.2 16M ./0.7.3 20M ./0.8.0 20M ./0.8.1 38M ./0.9.0 38M ./0.9.1 38M ./0.9.2 36M ./1.0.0 38M ./1.0.1

Re: [VOTE] Archive Spark Documentations in Apache Archives

2024-08-12 Thread Sean Owen
+1 with the following clarifications, for my benefit: Once we upload to release, and it's copied by archive, we delete from release right? I know we are meant to keep the files in release minimal as they're mirrored to all ASF mirrors. But if we're uploading some batches and deleting them after, t

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Sean Owen
t, as with any advice, quote "one test result is worth one-thousand > expert opinions (Werner Von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". > > > On Thu, 8 Aug 2024 at 22:02, Sean Owen

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Sean Owen
That seems a little bit too much to me. I could see people still on a recent version that just want to see docs or compare/contrast docs for changes. Removing the versions that seem to have ~0 traffic would remove, it seems, like 80% of the .html files (and replace them with a compressed archive

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Sean Owen
joon staff 103M Aug 8 10:22 3.5.1.tgz > > Specifically, shall we keep HTML files for only the latest version of live > releases, e.g. 3.4.3, 3.5.1, and 4.0.0-preview1? > > In other words, all 0.x ~ 3.4.2 and 3.5.1 will be tarball files in the > current status. > > Dongjoon.

Re: [DISCUSS] Using Github Issues for Spark-Connect-Go _only_ issues.

2024-08-08 Thread Sean Owen
the self-serve > instructions - https://infra.apache.org/request-bug-tracker.html > > Please keep the feedback coming. > > On Thu, Aug 8, 2024 at 2:43 PM Sean Owen wrote: > >> This is still part of the Apache Spark project, conceptually? >> IIRC Apache projects still need to

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Sean Owen
I agree with 'archiving', but what does that mean? delete from the repo and site? While I really doubt people are looking for docs for, say, 0.5.0, it'd be a big jump to totally remove it. What if we made a compressed tarball of old docs and put that in the repo, linked to it, and removed the docs

Re: [DISCUSS] Using Github Issues for Spark-Connect-Go _only_ issues.

2024-08-08 Thread Sean Owen
This is still part of the Apache Spark project, conceptually? IIRC Apache projects still need to use JIRA, so we can't do this. On Thu, Aug 8, 2024 at 5:08 AM Mich Talebzadeh wrote: > Hi Martin, > > If I understood it correctly, your proposal suggests centralizing issue > tracking for the Spark

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-29 Thread Sean Owen
Also from ASF community perspective - I think all are agreed this was merged too fast. But, I'm missing where this is somehow due to the needs of a single vendor. Where is this related to file systems or keys? did I miss it from another discussion or PR, or is this actually about a different issue

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-16 Thread Sean Owen
related to the order in which Maven executes the test cases in > the `connect` module. > > > > I have submitted a backport PR > <https://github.com/apache/spark/pull/45141> to branch-3.5, and if > necessary, we can merge it to fix this test issue. > > > > Jie Yan

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-15 Thread Sean Owen
Is anyone seeing this Spark Connect test failure? then again, I have some weird issue with this env that always fails 1 or 2 tests that nobody else can replicate. - Test observe *** FAILED *** == FAIL: Plans do not match === !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_

Re: Removing Kinesis in Spark 4

2024-01-20 Thread Sean Owen
I'm not aware of much usage. but that doesn't mean a lot. FWIW, in the past month or so, the Kinesis docs page got about 700 views, compared to about 1400 for Kafka https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?idSite=40&period=range&date=

Re: Regression? - UIUtils::formatBatchTime - [SPARK-46611][CORE] Remove ThreadLocal by replace SimpleDateFormat with DateTimeFormatter

2024-01-08 Thread Sean Owen
Agreed, that looks wrong. From the code, it seems that "timezone" is only used for testing, though apparently no test caught this. I'll submit a PR to patch it in any event: https://github.com/apache/spark/pull/44619 On Mon, Jan 8, 2024 at 1:33 AM Janda Martin wrote: > I think that > [SPARK-466

Re: Should Spark 4.x use Java modules (those you define with module-info.java sources)?

2023-12-04 Thread Sean Owen
It already does. I think that's not the same idea? On Mon, Dec 4, 2023, 8:12 PM Almog Tavor wrote: > I think Spark should start shading it’s problematic deps similar to how > it’s done in Flink > > On Mon, 4 Dec 2023 at 2:57 Sean Owen wrote: > >> I am not sure we ca

Re: Should Spark 4.x use Java modules (those you define with module-info.java sources)?

2023-12-03 Thread Sean Owen
I am not sure we can control that - the Scala _x.y suffix has particular meaning in the Scala ecosystem for artifacts and thus the naming of .jar files. And we need to work with the Scala ecosystem. What can't handle these files, Spring Boot? does it somehow assume the .jar file name relates to Ja

Re: Spark Compatibility with Spring Boot 3.x

2023-10-05 Thread Sean Owen
I think we already updated this in Spark 4. However for now you would have to also include a JAR with the jakarta.* classes instead. You are welcome to try Spark 4 now by building from master, but it's far from release. On Thu, Oct 5, 2023 at 11:53 AM Ahmed Albalawi wrote: > Hello team, > > We a

Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Sean Owen
I think the announcement mentioned there were some issues with pypi and the upload size this time. I am sure it's intended to be there when possible. On Wed, Sep 20, 2023, 3:00 PM Kezhi Xiong wrote: > Hi, > > Are there any plans to upload PySpark 3.5.0 to PyPI ( > https://pypi.org/project/pyspar

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-20 Thread Sean Owen
isk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damag

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-19 Thread Sean Owen
Pyspark follows SQL databases here. stddev is stddev_samp, and sample standard deviation is the calculation with the Bessel correction, n-1 in the denominator. stddev_pop is simply standard deviation, with n in the denominator. On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe wrote: > Hi! > > > > I am
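
The difference between the two denominators in the reply above can be sketched in plain Python. This is only an illustration on made-up numbers (the `values` list is hypothetical and not from the thread), and it uses the standard library rather than PySpark itself:

```python
import math
import statistics

# Hypothetical data, chosen only to make the arithmetic easy to follow.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(values)
mean = sum(values) / n
sum_sq = sum((x - mean) ** 2 for x in values)

# Sample standard deviation: Bessel correction, n - 1 in the denominator.
sample_sd = math.sqrt(sum_sq / (n - 1))

# Population standard deviation: n in the denominator.
population_sd = math.sqrt(sum_sq / n)

print("sample:", sample_sd)
print("population:", population_sd)
```

On any dataset with more than one distinct value the sample figure is the larger of the two, so a mismatch between two tools is worth checking against both formulas before suspecting a bug.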

Re: getting emails in different order!

2023-09-18 Thread Sean Owen
I have seen this, and not sure if it's just the ASF mailer being weird, or more likely, because emails are moderated and we inadvertently moderate them out of order On Mon, Sep 18, 2023 at 10:59 AM Mich Talebzadeh wrote: > Hi, > > I use gmail to receive spark user group emails. > > On occasions,

Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Sean Owen
I think it's the same, and always has been - yes you don't have a guaranteed ordering unless an operation produces a specific ordering. Could be the result of order by, yes; I believe you would be guaranteed that reading input files results in data in the order they appear in the file, etc. 1:1 ope

Re: Spark stand-alone mode

2023-09-15 Thread Sean Owen
Yes, should work fine, just set up according to the docs. There needs to be network connectivity between whatever the driver node is and these 4 nodes. On Thu, Sep 14, 2023 at 11:57 PM Ilango wrote: > > Hi all, > > We have 4 HPC nodes and installed spark individually in all nodes. > > Spark is u

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Sean Owen
ame issue. > > <dependency> > <groupId>org.elasticsearch</groupId> > <artifactId>elasticsearch-spark-30_${scala.compat.version}</artifactId> > <version>7.12.1</version> > </dependency> > > > On Fri, Sep 8, 2023 at 4:41 AM Sean Owen wrote: > >> By marking it provided, you are not including this dependency with your >> app. If it is also

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Sean Owen
By marking it provided, you are not including this dependency with your app. If it is also not somehow already provided by your spark cluster (this is what it means), then yeah this is not anywhere on the class path at runtime. Remove the provided scope. On Thu, Sep 7, 2023, 4:09 PM Dipayan Dev w

Re: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Sean Owen
as part of some other dependency. > > > > *From:* Sean Owen > *Sent:* Thursday, August 31, 2023 5:10 PM > *To:* Agrawal, Sanket > *Cc:* user@spark.apache.org > *Subject:* [EXT] Re: Okio Vulnerability in Spark 3.4.1 > > > > Does the vulnerability affect Spark?

Re: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Sean Owen
Does the vulnerability affect Spark? In any event, have you tried updating Okio in the Spark build? I don't believe you could just replace the JAR, as other libraries probably rely on it and compiled against the current version. On Thu, Aug 31, 2023 at 6:02 AM Agrawal, Sanket wrote: > Hi All, >

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-31 Thread Sean Owen
I think you're talking past Hyukjin here. I think the response is: none of that is managed by Pyspark now, and this proposal does not change that. Your current interpreter and environment is used to execute the stored procedure, which is just Python code. It's on you to bring an environment that r

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-30 Thread Sean Owen
to verify? > > > > Thanks, > > Jie Yang > > > > *From:* Dipayan Dev > *Date:* Wednesday, August 30, 2023 17:01 > *To:* Sean Owen > *Cc:* Yuanjian Li , Spark dev list < > dev@spark.apache.org> > *Subject:* Re: [VOTE] Release Apache Spark 3.5.0 (RC3) > >

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-29 Thread Sean Owen
It looks good except that I'm getting errors running the Spark Connect tests at the end (Java 17, Scala 2.13) It looks like I missed something necessary to build; is anyone getting this? [ERROR] [Error] /tmp/spark-3.5.0/connector/connect/server/target/generated-test-sources/protobuf/java/org/apach

Re: error trying to save to database (Phoenix)

2023-08-21 Thread Sean Owen
ooks like spark 3.4.1 (my version) uses scala Scala 2.12 > How do I specify the scala version? > > On Mon, Aug 21, 2023 at 4:47 PM Sean Owen wrote: > >> That's a mismatch in the version of scala that your library uses vs spark >> uses. >> >> On Mon, Aug 21, 202

Re: error trying to save to database (Phoenix)

2023-08-21 Thread Sean Owen
That's a mismatch in the version of scala that your library uses vs spark uses. On Mon, Aug 21, 2023, 6:46 PM Kal Stevens wrote: > I am having a hard time figuring out what I am doing wrong here. > I am not sure if I have an incompatible version of something installed or > something else. > I ca

Re: [VOTE] Release Apache Spark 3.5.0 (RC2)

2023-08-19 Thread Sean Owen
+1 this looks better to me. Works with Scala 2.13 / Java 17 for me. On Sat, Aug 19, 2023 at 3:23 AM Yuanjian Li wrote: > Please vote on releasing the following candidate(RC2) as Apache Spark > version 3.5.0. > > The vote is open until 11:59pm Pacific time Aug 23th and passes if a > majority +1 P

Re: Spark Vulnerabilities

2023-08-14 Thread Sean Owen
Yeah, we generally don't respond to "look at the output of my static analyzer". Some of these are already addressed in a later version. Some don't affect Spark. Some are possibly an issue but hard to change without breaking lots of things - they are really issues with upstream dependencies. But fo

Re: Question about ARRAY_INSERT between Spark and Databricks

2023-08-13 Thread Sean Owen
There shouldn't be any difference here. In fact, I get the results you list for 'spark' from Databricks. It's possible the difference is a bug fix along the way that is in the Spark version you are using locally but not in the DBR you are using. But, yeah seems to work as you say. If you're askin

What else could be removed in Spark 4?

2023-08-07 Thread Sean Owen
While we're noodling on the topic, what else might be worth removing in Spark 4? For example, looks like we're finally hitting problems supporting Java 8 through 21 all at once, related to Scala 2.13.x updates. It would be reasonable to require Java 11, or even 17, as a baseline for the multi-year

Re: [VOTE] Release Apache Spark 3.5.0 (RC1)

2023-08-06 Thread Sean Owen
Sat, Aug 5, 2023 at 5:42 PM Sean Owen wrote: > I'm still testing other combinations, but it looks like tests fail on Java > 17 after building with Java 8, which should be a normal supported > configuration. > This is described at https://github.com/apache/spark/pull/41943 and lo

Re: [VOTE] Release Apache Spark 3.5.0 (RC1)

2023-08-05 Thread Sean Owen
I'm still testing other combinations, but it looks like tests fail on Java 17 after building with Java 8, which should be a normal supported configuration. This is described at https://github.com/apache/spark/pull/41943 and looks like it is resolved by moving back to Scala 2.13.8 for now. Unless I'

Re: conver panda image column to spark dataframe

2023-08-03 Thread Sean Owen
pp4 has one row, I'm guessing - containing an array of 10 images. You want 10 rows of 1 image each. But, just don't do this. Pass the bytes of the image as an array, along with width/height/channels, and reshape it on use. It's just easier. That is how the Spark image representation works anyway O
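
A minimal sketch of the "flat bytes plus width/height/channels, reshape on use" idea from the reply above, in plain Python. The image, its size, and the row-major, channels-last layout are assumptions made for illustration; the helper names `pixel` and `reshape` are hypothetical, and in practice an array library's reshape would normally do this work:

```python
# Hypothetical 2x3 image with 3 channels, stored as one flat bytes value.
# Assumed layout: row-major, channels last.
height, width, channels = 2, 3, 3
data = bytes(range(height * width * channels))


def pixel(data, width, channels, row, col):
    """Return the channel values of one pixel from the flat byte buffer."""
    start = (row * width + col) * channels
    return tuple(data[start:start + channels])


def reshape(data, height, width, channels):
    """Rebuild nested rows -> columns -> channels lists from the flat buffer."""
    return [
        [list(pixel(data, width, channels, row, col)) for col in range(width)]
        for row in range(height)
    ]


image = reshape(data, height, width, channels)
print(image[1][2])
```

Keeping the bytes flat means each row of the DataFrame carries one binary column and three small integers, and the shape is only applied where the pixels are actually used.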

Re: Interested in contributing to SPARK-24815

2023-08-03 Thread Sean Owen
>> >>> Refer to the ASF Source Header and Copyright Notice Policy[1], code >>> directly submitted to ASF should include the Apache license header >>> without any additional copyright notice. >>> >>> >>> Kent Yao >>> >>> [

Re: [VOTE] SPIP: XML data source support

2023-07-28 Thread Sean Owen
+1 I think that porting the package 'as is' into Spark is probably worthwhile. That's relatively easy; the code is already pretty battle-tested and not that big and even originally came from Spark code, so is more or less similar already. One thing it never got was DSv2 support, which means XML re

Re: spark context list_packages()

2023-07-27 Thread Sean Owen
There is no such method in Spark. I think that's some EMR-specific modification. On Wed, Jul 26, 2023 at 11:06 PM second_co...@yahoo.com.INVALID wrote: > I ran the following code > > spark.sparkContext.list_packages() > > on spark 3.4.1 and i get below error > > An error was encountered: > Attri

Re: Spark 3.0.0 EOL

2023-07-26 Thread Sean Owen
There aren't "LTS" releases, though you might expect the last 3.x release will see maintenance releases longer. See end of https://spark.apache.org/versioning-policy.html On Wed, Jul 26, 2023 at 3:56 AM Manu Zhang wrote: > Will Apache Spark 3.5 be a LTS version? > > Thanks, > Manu > > On Mon, Ju

Re: Interested in contributing to SPARK-24815

2023-07-24 Thread Sean Owen
When contributing to an ASF project, it's governed by the terms of the ASF ICLA: https://www.apache.org/licenses/icla.pdf or CCLA: https://www.apache.org/licenses/cla-corporate.pdf I don't believe ASF projects ever retain an original author copyright statement, but rather source files have a state

Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
No, a pandas on Spark DF is distributed. On Tue, Jun 20, 2023, 1:45 PM Mich Talebzadeh wrote: > Thanks but if you create a Spark DF from Pandas DF that Spark DF is not > distributed and remains on the driver. I recall a while back we had this > conversation. I don't think anything has changed. >

Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
It is indeed not part of SparkSession. See the link you cite. It is part of the pyspark pandas API On Tue, Jun 20, 2023, 5:42 AM John Paul Jayme wrote: > Good day, > > > > I have a task to read excel files in databricks but I cannot seem to > proceed. I am referencing the API documents - read_e

Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-16 Thread Sean Owen
On Fri, Jun 16, 2023 at 3:58 PM Dongjoon Hyun wrote: > I started the thread about already publicly visible version issues > according to the ASF PMC communication guideline. It's no confidential, > personal, or security-related stuff. Are you insisting this is confidential? > Discussion about a

Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-16 Thread Sean Owen
As we noted in the last thread, this discussion should have been on private@ to begin with, but, the ship has sailed. You are suggesting that non-PMC members vote on whether the PMC has to do something? No, that's not how anything works here. It's certainly the PMC that decides what to put in the

Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-16 Thread Sean Owen
What does a vote on dev@ mean? did you mean this for the PMC list? Dongjoon - this offers no rationale about "why". The more relevant thread begins here: https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb but it likewise never got to connecting a specific observation to policy. Could

Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Sean Owen
You sure it is not just that it's displaying in your local TZ? Check the actual value as a long for example. That is likely the same time. On Thu, Jun 8, 2023, 5:50 PM karan alang wrote: > ref : > https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-from-mongodb-co
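
The point about display versus stored value can be illustrated in plain Python. This is a sketch only: the epoch value and the UTC-7 offset are made up, and it does not use Spark or MongoDB:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stored instant, as seconds since the Unix epoch
# (midnight UTC on 2023-06-08).
epoch_seconds = 1686182400

# The same instant rendered in UTC and in an assumed local zone (UTC-7).
utc_view = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
local_view = utc_view.astimezone(timezone(timedelta(hours=-7)))

print(utc_view.isoformat())
print(local_view.isoformat())

# The printed strings differ, but the underlying long value is identical.
print(int(utc_view.timestamp()), int(local_view.timestamp()))
```

If the long values match, the data was read correctly and only the rendering time zone differs.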

Re: JDK version support policy?

2023-06-08 Thread Sean Owen
ava 11 should be dropped in Spark 4, just > thought I'd bring this issue to your attention. > > Best Regards, Martin > -- > *From:* Jungtaek Lim > *Sent:* Wednesday, June 7, 2023 23:19 > *To:* Sean Owen > *Cc:* Dongjoon Hyun ; Holden Karau

Re: JDK version support policy?

2023-06-07 Thread Sean Owen
> >> On 2023/06/07 02:42:19 yangjie01 wrote: >> > +1 on dropping Java 8 in Spark 4.0, and I even hope Spark 4.0 can only >> support Java 17 and the upcoming Java 21. >> > >> > From: Denny Lee >> > Date: Wednesday, June 7, 2023 07:10 >> > To: Sean

Re: ASF policy violation and Scala version issues

2023-06-07 Thread Sean Owen
Hi Dongjoon, I think this conversation is not advancing anymore. I personally consider the matter closed unless you can find other support or respond with more specifics. While this perhaps should be on private@, I think it's not wrong as an instructive discussion on dev@. I don't believe you've m

Re: ASF policy violation and Scala version issues

2023-06-07 Thread Sean Owen
(With consent, shall we move this to the PMC list?) No, I don't think that's what this policy says. First, could you please be more specific here? why do you think a certain release is at odds with this? Because so far you've mentioned, I think, not taking a Scala maintenance release update. But

Re: JDK version support policy?

2023-06-06 Thread Sean Owen
I haven't followed this discussion closely, but I think we could/should drop Java 8 in Spark 4.0, which is up next after 3.5? On Tue, Jun 6, 2023 at 2:44 PM David Li wrote: > Hello Spark developers, > > I'm from the Apache Arrow project. We've discussed Java version support > [1], and crucially,

Re: ASF policy violation and Scala version issues

2023-06-05 Thread Sean Owen
I think the issue is whether a distribution of Spark is so materially different from OSS that it causes problems for the larger community of users. There's a legitimate question of whether such a thing can be called "Apache Spark + changes", as describing it that way becomes meaningfully inaccurate

Re: ASF policy violation and Scala version issues

2023-06-05 Thread Sean Owen
On Mon, Jun 5, 2023 at 12:01 PM Dongjoon Hyun wrote: > 1. For the naming, yes, but the company should use different version > numbers instead of the exact "3.4.0". As I shared the screenshot in my > previous email, the company exposes "Apache Spark 3.4.0" exactly because > they build their distri

Re: ASF policy violation and Scala version issues

2023-06-05 Thread Sean Owen
1/ Regarding naming - I believe releasing "Apache Foo X.Y + patches" is acceptable, if it is substantially Apache Foo X.Y. This is common practice for downstream vendors. It's fair nominative use. The principle here is consumer confusion. Is anyone substantially misled? Here I don't think so. I kno

Re: Apache Spark 3.5.0 Expectations (?)

2023-05-29 Thread Sean Owen
It does seem risky; there are still likely libs out there that don't cross compile for 2.13. I would make it the default at 4.0, myself. On Mon, May 29, 2023 at 7:16 PM Hyukjin Kwon wrote: > While I support going forward with a higher version, actually using Scala > 2.13 by default is a big deal

Re: JDK version support information

2023-05-29 Thread Sean Owen
Per docs, it is Java 8. It's possible Java 11 partly works with 2.x but not supported. But then again 2.x is not supported either. On Mon, May 29, 2023, 6:43 AM Poorna Murali wrote: > We are currently using JDK 11 and spark 2.4.5.1 is working fine with that. > So, we wanted to check the maximum

Re: [MLlib] how-to find implementation of Decision Tree Regressor fit function

2023-05-25 Thread Sean Owen
Are you looking for https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala On Thu, May 25, 2023 at 6:54 AM Max wrote: > Good day, I'm working on an Implantation from Joint Probability Trees > (JPT) using the Spark framework. For this to

Re: Tensorflow on Spark CPU

2023-04-30 Thread Sean Owen
nds > > the code is at below > https://gist.github.com/cometta/240bbc549155e22f80f6ba670c9a2e32 > > Do you have an example of tensorflow+big dataset that I can test? > > > > > > > > On Saturday, April 29, 2023 at 08:44:04 PM GMT+8, Sean Owen < > sro...@gmail.

Re: Tensorflow on Spark CPU

2023-04-29 Thread Sean Owen
You don't want to use CPUs with Tensorflow. If it's not scaling, you may have a problem that is far too small to distribute. On Sat, Apr 29, 2023 at 7:30 AM second_co...@yahoo.com.INVALID wrote: > Anyone successfully run native tensorflow on Spark ? i tested example at > https://github.com/tenso

Re: Spark 3.4.0 with Hadoop2.7 cannot be downloaded

2023-04-20 Thread Sean Owen
We just removed it now, yes. On Thu, Apr 20, 2023 at 9:08 AM Emil Ejbyfeldt wrote: > Hi, > > I think this is expected as it was dropped from the release process in > https://issues.apache.org/jira/browse/SPARK-40651 > > Also I don't see a Hadoop2.7 option when selecting Spark 3.4.0 on > https://

Re: [VOTE] Release Apache Spark 3.2.4 (RC1)

2023-04-10 Thread Sean Owen
+1 from me On Sun, Apr 9, 2023 at 7:19 PM Dongjoon Hyun wrote: > I'll start with my +1. > > I verified the checksum, signatures of the artifacts, and documentations. > Also, ran the tests with YARN and K8s modules. > > Dongjoon. > > On 2023/04/09 23:46:10 Dongjoon Hyun wrote: > > Please vote on

Re: [VOTE] Release Apache Spark 3.4.0 (RC7)

2023-04-08 Thread Sean Owen
+1 from me, same result as last time. On Fri, Apr 7, 2023 at 6:30 PM Xinrong Meng wrote: > Please vote on releasing the following candidate(RC7) as Apache Spark > version 3.4.0. > > The vote is open until 11:59pm Pacific time *April 12th* and passes if a > majority +1 PMC votes are cast, with a

Re: Looping through a series of telephone numbers

2023-04-02 Thread Sean Owen
That won't work, you can't use Spark within Spark like that. If it were exact matches, the best solution would be to load both datasets and join on telephone number. For this case, I think your best bet is a UDF that contains the telephone numbers as a list and decides whether a given number matche

Re: [VOTE] Release Apache Spark 3.4.0 (RC5)

2023-03-30 Thread Sean Owen
+1 same result from me as last time. On Thu, Mar 30, 2023 at 3:21 AM Xinrong Meng wrote: > Please vote on releasing the following candidate(RC5) as Apache Spark > version 3.4.0. > > The vote is open until 11:59pm Pacific time *April 4th* and passes if a > majority +1 PMC votes are cast, with a m

Re: What is the range of the PageRank value of graphx

2023-03-28 Thread Sean Owen
From the docs: Note that this is not the "normalized" PageRank and as a consequence pages that have no inlinks will have a PageRank of alpha. In particular, the pageranks may have some values greater than 1. On Tue, Mar 28, 2023 at 9:11 AM lee wrote: > When I calculate pagerank using
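
To see how an unnormalized PageRank can exceed 1, here is a small plain-Python sketch. It assumes the update rule PR[i] = alpha + (1 - alpha) * sum(PR[j] / outDeg[j]) over the in-neighbours j, with alpha = 0.15; the graph, the function name `unnormalized_pagerank`, and the fixed iteration count are hypothetical, and this is not the GraphX implementation:

```python
def unnormalized_pagerank(edges, alpha=0.15, iterations=20):
    """Iterate PR[i] = alpha + (1 - alpha) * sum(PR[j] / out_degree[j])."""
    nodes = sorted({node for edge in edges for node in edge})
    out_degree = {node: 0 for node in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    ranks = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        incoming = {node: 0.0 for node in nodes}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {node: alpha + (1.0 - alpha) * incoming[node] for node in nodes}
    return ranks


# Hypothetical graph: ten source pages, each with a single link to "hub".
edges = [(f"s{i}", "hub") for i in range(10)]
ranks = unnormalized_pagerank(edges)

print(ranks["s0"])   # a page with no inlinks
print(ranks["hub"])  # a page with many inlinks
```

The source pages have no inlinks, so they settle at alpha, while the hub collects ten contributions and ends up above 1. Dividing each value by the sum of all ranks would give a normalized distribution if one is needed.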

Re: Question related to asynchronously map transformation using java spark structured streaming

2023-03-26 Thread Sean Owen
What do you mean by asynchronously here? On Sun, Mar 26, 2023, 10:22 AM Emmanouil Kritharakis < kritharakismano...@gmail.com> wrote: > Hello again, > > Do we have any news for the above question? > I would really appreciate it. > > Thank you, > > --

Re: Kind help request

2023-03-25 Thread Sean Owen
It is telling you that the UI can't bind to any port. I presume that's because of container restrictions? If you don't want the UI at all, just set spark.ui.enabled to false On Sat, Mar 25, 2023 at 8:28 AM Lorenzo Ferrando < lorenzo.ferra...@edu.unige.it> wrote: > Dear Spark team, > > I am Lorenz

Re: Question related to parallelism using structed streaming parallelism

2023-03-21 Thread Sean Owen
Yes more specifically, you can't ask for executors once the app starts, in SparkConf like that. You set this when you launch it against a Spark cluster in spark-submit or otherwise. On Tue, Mar 21, 2023 at 4:23 AM Mich Talebzadeh wrote: > Hi Emmanouil, > > This means that your job is running on

Re: Understanding executor memory behavior

2023-03-16 Thread Sean Owen
All else equal it is better to have the same resources in fewer executors. More tasks are local to other tasks which helps perf. There is more possibility of 'borrowing' extra mem and CPU in a task. On Thu, Mar 16, 2023, 2:14 PM Nikhil Goyal wrote: > Hi folks, > I am trying to understand what wo

Re: logging pickle files on local run of spark.ml Pipeline model

2023-03-15 Thread Sean Owen
Pickle won't work. But the others should. I think you are specifying an invalid path in both cases but hard to say without more detail On Wed, Mar 15, 2023, 9:13 AM Mnisi, Caleb wrote: > Good Day > > > > I am having trouble saving a spark.ml Pipeline model to a pickle file, > when running locall

Re: Question related to parallelism using structed streaming parallelism

2023-03-14 Thread Sean Owen
That's incorrect, it's spark.default.parallelism, but as the name suggests, that is merely a default. You control partitioning directly with .repartition() On Tue, Mar 14, 2023 at 11:37 AM Mich Talebzadeh wrote: > Check this link > > > https://sparkbyexamples.com/spark/difference-between-spark-s

Re: Question related to parallelism using structed streaming parallelism

2023-03-14 Thread Sean Owen
Are you just looking for DataFrame.repartition()? On Tue, Mar 14, 2023 at 10:57 AM Emmanouil Kritharakis < kritharakismano...@gmail.com> wrote: > Hello, > > I hope this email finds you well! > > I have a simple dataflow in which I read from a kafka topic, perform a map > transformation and then I

Re: Spark 3.3.2 not running with Antlr4 runtime latest version

2023-03-14 Thread Sean Owen
You want Antlr 3 and Spark is on 4? no I don't think Spark would downgrade. You can shade your app's dependencies maybe. On Tue, Mar 14, 2023 at 8:21 AM Sahu, Karuna wrote: > Hi Team > > > > We are upgrading a legacy application using Spring boot , Spark and > Hibernate. While upgrading Hibernat

Re: [VOTE] Release Apache Spark 3.4.0 (RC3)

2023-03-09 Thread Sean Owen
We cannot in the AS-IS commit log status because it's screwed already > as Emil wrote. > Did you check the branch-3.2 commit log, Sean? > > Dongjoon. > > > On Thu, Mar 9, 2023 at 11:42 AM Sean Owen wrote: > >> We can just push the tags onto the branches as needed r

Re: How to share a dataset file across nodes

2023-03-09 Thread Sean Owen
Put the file on HDFS, if you have a Hadoop cluster? On Thu, Mar 9, 2023 at 3:02 PM sam smith wrote: > Hello, > > I use Yarn client mode to submit my driver program to Hadoop, the dataset > I load is from the local file system, when i invoke load("file://path") > Spark complains about the csv fil

Re: [VOTE] Release Apache Spark 3.4.0 (RC3)

2023-03-09 Thread Sean Owen
We can just push the tags onto the branches as needed right? No need to roll a new release On Thu, Mar 9, 2023, 1:36 PM Dongjoon Hyun wrote: > Yes, I also confirmed that the v3.4.0-rc3 tag is invalid. > > I guess we need RC4. > > Dongjoon. > > On Thu, Mar 9, 2023 at 7:13 AM Emil Ejbyfeldt > wro

Re: Reply: Re: Build SPARK from source with SBT failed

2023-03-07 Thread Sean Owen
> OpenJDK 11.0.17 manually. > So I need to install Apple Developer Tools? > - Original Message - > From: Sean Owen > To: ckgppl_...@sina.cn > Cc: user > Subject: Re: Build SPARK from source with SBT failed > Date: 2023-03-07 20:58 > > This says you don't have the java compil

Re: Pandas UDFs vs Inbuilt pyspark functions

2023-03-07 Thread Sean Owen
It's hard to evaluate without knowing what you're doing. Generally, using a built-in function will be fastest. pandas UDFs can be faster than normal UDFs if you can take advantage of processing multiple rows at once. On Tue, Mar 7, 2023 at 6:47 AM neha garde wrote: > Hello All, > > I need help d

Re: Build SPARK from source with SBT failed

2023-03-07 Thread Sean Owen
This says you don't have the java compiler installed. Did you install the Apple Developer Tools package? On Tue, Mar 7, 2023 at 1:42 AM wrote: > Hello, > > I have tried to build SPARK source codes with SBT in my local dev > environment (MacOS 13.2.1). But it reported following error: > [error] j

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Sean Owen
other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Sat, 4 Mar 2023 at 20:13, Sean Owen wrote:

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Sean Owen
It's the same batch ID already, no? Or why not simply put the logic of both in one function? Or write one function that calls both? On Sat, Mar 4, 2023 at 2:07 PM Mich Talebzadeh wrote: > > This is probably pretty straightforward but somehow it does not look > that way > > > > On Spark Structu
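
A minimal sketch of the "one function that calls both" suggestion, in plain Python so it runs without a cluster. The helper names `send_to_sink` and `send_to_control` are hypothetical and not taken from the thread, and a plain list stands in for the micro-batch DataFrame; only the `(batch_df, batch_id)` argument shape follows what `foreachBatch` passes to its callback.

```python
def send_to_sink(batch_df, batch_id):
    """Hypothetical first step: tag every row with the batch ID."""
    return [(batch_id, row) for row in batch_df]


def send_to_control(batch_df, batch_id):
    """Hypothetical second step: summarise the same batch."""
    return {"batch_id": batch_id, "rows": len(batch_df)}


def process_batch(batch_df, batch_id):
    """Single entry point: both steps see the same batch and the same ID."""
    written = send_to_sink(batch_df, batch_id)
    summary = send_to_control(batch_df, batch_id)
    return written, summary


# A plain list stands in for the micro-batch DataFrame in this sketch.
written, summary = process_batch(["a", "b"], 7)
```

In a Structured Streaming job the wrapper would presumably be the callback handed to `foreachBatch`, so no variable has to be shared between two separate callbacks.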

Re: [VOTE] Release Apache Spark 3.4.0 (RC2)

2023-03-03 Thread Sean Owen
path get set up differently when running via > SBT vs. Maven? > > On Thu, Mar 2, 2023 at 5:37 PM Sean Owen wrote: > >> Thanks, that's good to know. The workaround (deleting the thriftserver >> target dir) works for me. Who knows? >> >> But I

Re: [VOTE] Release Apache Spark 3.4.0 (RC2)

2023-03-02 Thread Sean Owen
//github.com/sbt/sbt/issues/6183>. > > One thing that I did find to help was to > delete sql/hive-thriftserver/target between building Spark and running the > tests. This helps in my builds where the issue only occurs during the > testing phase and not during the initial build

Re: [VOTE] Release Apache Spark 3.4.0 (RC2)

2023-03-02 Thread Sean Owen
Has anyone seen this behavior -- I've never seen it before. The Hive thriftserver module for me just goes into an infinite loop when running tests: ... [INFO] done compiling [INFO] compiling 22 Scala sources and 24 Java sources to /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/target/scala-2.

Re: [Question] LimitedInputStream license issue in Spark source.

2023-03-01 Thread Sean Owen
Right, it contains ALv2 licensed code attributed to two authors - some is from Guava, some is from Apache Spark contributors. I thought this is how we should handle this. It's not feasible to go line by line and say what came from where. On Wed, Mar 1, 2023 at 1:33 AM Dongjoon Hyun wrote: > May

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
>>> File "nearest-gene.py", line 62, in main >>> distances = joined.withColumn("distance", max(col("start") - >>> col("position"), col("position") - col("end"), 0)) >>> File >>> "/mnt
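
For reference, the quantity the quoted expression is after can be written for plain integers as below. This is only a sketch of the arithmetic: the function name `interval_distance` is hypothetical, and it works on Python numbers, not on Spark columns. On columns, the row-wise counterpart of Python's built-in `max` is `pyspark.sql.functions.greatest`, with the literal wrapped in `lit(0)`; whether that is what resolves the traceback above is not confirmed by this excerpt.

```python
def interval_distance(start, end, position):
    """Distance from a position to the interval [start, end]; 0 if inside."""
    # start - position is positive only when the position lies before the interval,
    # position - end is positive only when it lies after; otherwise the floor of 0 wins.
    return max(start - position, position - end, 0)


inside = interval_distance(100, 200, 150)
before = interval_distance(100, 200, 50)
after = interval_distance(100, 200, 260)
```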

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That error sounds like it's from pandas not spark. Are you sure it's this line? On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > > Hello, > > I'm trying to calculate the distance between a gene (with start and end) > and a variant (with position), so

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Sean Owen
FWIW I agree with this. On Wed, Feb 22, 2023 at 2:59 PM Allan Folting wrote: > Hi all, > > I would like to propose that we show Python code examples first in the > Spark documentation where we have multiple programming language examples. > An example is on the Quick Start page: > https://spark.a

Re: [DISCUSS] Make release cadence predictable

2023-02-15 Thread Sean Owen
can persuade > the incomplete features to wait for next releases more easily. > > In addition, I want to add the first RC1 date requirement because RC1 > always did a great job for us. > > I guess `branch-cut + 1M (no later than 1month)` could be the reasonable > deadline. >

Re: [DISCUSS] Make release cadence predictable

2023-02-14 Thread Sean Owen
I'm fine with shifting to a stricter cadence-based schedule. Sometimes, it'll mean some significant change misses a release rather than delays it. If people are OK with that discipline, sure. A hard 6-month cycle would mean the minor releases are more frequent and have less change in them. That's p
