Re: Spark 4.0 Query Analyzer Bug Report

2024-02-20 Thread Holden Karau
Do you mean Spark 3.4? 4.0 is very much not released yet. Also it would help if you could share your query & more of the logs leading up to the error. On Tue, Feb 20, 2024 at 3:07 PM Sharma, Anup wrote: > Hi Spark team, > > > > We ran into a dataframe issue after upgrading from spark 3.1 to 4.

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Holden Karau
This looks really cool :) Out of interest what are the differences in the approach between this and Gluten? On Tue, Feb 13, 2024 at 12:42 PM Chao Sun wrote: > Hi all, > > We are very happy to announce that Project Comet, a plugin to > accelerate Spark query execution via leveraging DataFusion

Re: Spark-Connect: Param `--packages` does not take effect for executors.

2023-12-04 Thread Holden Karau
So I think this sounds like a bug to me, in the help options for both regular spark-submit and ./sbin/start-connect-server.sh we say: " --packages Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths.

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-27 Thread Holden Karau
So I don’t think we make any particular guarantees around class path isolation there, so even if it does work it’s something you’d need to pay attention to on upgrades. Class path isolation is tricky to get right. On Mon, Nov 27, 2023 at 2:58 PM Faiz Halde wrote: > Hello, > > We are using spark

Re: Write Spark Connection client application in Go

2023-09-12 Thread Holden Karau
That’s so cool! Great work y’all :) On Tue, Sep 12, 2023 at 8:14 PM bo yang wrote: > Hi Spark Friends, > > Anyone interested in using Golang to write Spark application? We created a > Spark > Connect Go Client library . > Would love to hear

Re: Elasticsearch support for Spark 3.x

2023-08-27 Thread Holden Karau
What’s the version of the ES connector you are using? On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev wrote: > Hi All, > > We're using Spark 2.4.x to write dataframe into the Elasticsearch index. > As we're upgrading to Spark 3.3.0, it throwing out error > Caused by:

Re: Dynamic allocation does not deallocate executors

2023-08-08 Thread Holden Karau
ny > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage

Re: Dynamic allocation does not deallocate executors

2023-08-07 Thread Holden Karau
I think you need to set "spark.dynamicAllocation.shuffleTracking.enabled" to false. On Mon, Aug 7, 2023 at 2:50 AM Mich Talebzadeh wrote: > Yes I have seen cases where the driver gone but a couple of executors > hanging on. Sounds like a code issue. > > HTH > > Mich Talebzadeh, > Solutions
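
A minimal sketch of where the setting suggested above would go, assuming the job is launched with spark-submit. The two configuration keys are real Spark settings; the helper name `conf_flags` and the chosen values are only illustrative, and whether disabling shuffle tracking is right depends on the cluster.

```python
# Hypothetical helper: turn a dict of Spark settings into spark-submit --conf flags.
def conf_flags(settings):
    flags = []
    for key, value in settings.items():
        flags.extend(["--conf", f"{key}={value}"])
    return flags


# Assumed values for illustration: dynamic allocation on, shuffle tracking off.
settings = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "false",
}

# Print the command line that would carry these settings.
print(" ".join(["spark-submit"] + conf_flags(settings)))
```

The same keys could instead be placed in spark-defaults.conf; the command-line form is used here only because it is easy to inspect.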

Re: SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-18 Thread Holden Karau
Is there someone focused on streaming work these days who would want to shepherd this? On Sat, Feb 18, 2023 at 5:02 PM Dongjoon Hyun wrote: > Thank you for considering me, but may I ask what makes you think to put me > there, Mich? I'm curious about your reason. > > > I have put dongjoon.hyun

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Holden Karau
> > On Tue, Dec 6, 2022 at 9:22 AM Holden Karau wrote: > >> There is the splittable gzip Hadoop input format, maybe someone could >> extend that to use support bgzip? >> >> On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker < >> oliv...@broadinstitute.org

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Holden Karau
There is the splittable gzip Hadoop input format, maybe someone could extend that to support bgzip? On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > > Hello Chris, > > Yes, you can use gunzip/gzip to uncompress a file created by bgzip, but > to

Re: Dataproc serverless for Spark

2022-11-28 Thread Holden Karau
This sounds like a great question for the Google DataProc folks (I know there was some interesting work being done around it but I left before it was finished so I don't want to provide a possibly incorrect answer). If you're a GCP customer try reaching out to their support for details. On Mon,

Re: Dynamic Scaling without Kubernetes

2022-10-26 Thread Holden Karau
So Spark can dynamically scale on YARN, but standalone mode becomes a bit complicated — where do you envision Spark gets the extra resources from? On Wed, Oct 26, 2022 at 12:18 PM Artemis User wrote: > Has anyone tried to make a Spark cluster dynamically scalable, i.e., > adding a new worker

Re: Jupyter notebook on Dataproc versus GKE

2022-09-06 Thread Holden Karau
rise from relying on this email's technical content is explicitly >>> disclaimed. The author will in no case be liable for any monetary damages >>> arising from such loss, damage or destruction. >>> >>> >>> >>> >>> On Mon, 5 Sept 2022 at 20:5

Re: Jupyter notebook on Dataproc versus GKE

2022-09-05 Thread Holden Karau
f data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Mon, 5 Sept 2022 at 12:47, Hold

Re: Jupyter notebook on Dataproc versus GKE

2022-09-05 Thread Holden Karau
I’ve run Jupyter w/Spark on K8s, haven’t tried it with Dataproc personally. The Spark K8s pod scheduler is now more pluggable, so Yunikorn and Volcano can be used with less effort. On Mon, Sep 5, 2022 at 7:44 AM Mich Talebzadeh wrote: > > Hi, > > > Has anyone got experience of running Jupyter

Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread Holden Karau
Could we make it do the same sort of history server fallback approach? On Tue, May 17, 2022 at 10:41 PM bo yang wrote: > It is like Web Application Proxy in YARN ( > https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html), > to provide easy access for Spark

Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread Holden Karau
Oh that’s rad  On Tue, May 17, 2022 at 7:47 AM bo yang wrote: > Hi Spark Folks, > > I built a web reverse proxy to access Spark UI on Kubernetes (working > together with https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). > Want to share here in case other people have similar need.

Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Holden Karau
You can also put the GS access jar with your Spark jars — that’s what the class not found exception is pointing you towards. On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh wrote: > BTW I also answered you in in stackoverflow : > > >

Re: Spark 3.1.2 full thread dumps

2022-02-04 Thread Holden Karau
We don’t block scaling up after node failure in classic Spark if that’s the question. On Fri, Feb 4, 2022 at 6:30 PM Mich Talebzadeh wrote: > From what I can see in auto scaling setup, you will always need a min of > two worker nodes as primary. It also states and I quote "Scaling primary >

Re: Log4j 1.2.17 spark CVE

2021-12-12 Thread Holden Karau
My understanding is it only applies to log4j 2+ so we don’t need to do anything. On Sun, Dec 12, 2021 at 8:46 PM Pralabh Kumar wrote: > Hi developers, users > > Spark is built using log4j 1.2.17 . Is there a plan to upgrade based on > recent CVE detected ? > > > Regards > Pralabh kumar > --

Re: Choice of IDE for Spark

2021-10-01 Thread Holden Karau
Personally I like Jupyter notebooks for my interactive work and then once I’ve done my exploration I switch back to emacs with either scala-metals or Python mode. I think the main takeaway is: do what feels best for you, there is no one true way to develop in Spark. On Fri, Oct 1, 2021 at 1:28

Drop-In Virtual Office Hour round 2 :)

2021-09-28 Thread Holden Karau
Hi Folks, I'm going to do another drop-in virtual office hour and I've made a public google calendar to track them so hopefully it's easier for folks to add events https://calendar.google.com/calendar/?cid=cXBubTY3Z2VzcmNjbnEzOWIzb3RyOWI1am9AZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ or ics feed at

Re: Drop-In Virtual Office Half-Hour

2021-09-20 Thread Holden Karau
Hey folks I'm doing my drop-in half-hour now - http://meet.google.com/ccd-mkbd-gfv :) On Mon, Sep 13, 2021 at 4:12 PM Holden Karau wrote: > Hi Folks, > > I'm going to experiment with a drop-in virtual half-hour office hour type > thing next Monday, if you've got any burning Spark or

Re: Drop-In Virtual Office Half-Hour

2021-09-17 Thread Holden Karau
info Video call link: https://meet.google.com/ccd-mkbd-gfv On Mon, Sep 13, 2021 at 4:12 PM Holden Karau wrote: > Hi Folks, > > I'm going to experiment with a drop-in virtual half-hour office hour type > thing next Monday, if you've got any burning Spark or general OSS questions >

Re: Drop-In Virtual Office Half-Hour

2021-09-13 Thread Holden Karau
, Sep 13, 2021 at 5:11 PM Holden Karau wrote: > Ah thanks for pointing that out. I changed the visibility on it to public > so it should work now. > > On Mon, Sep 13, 2021 at 4:26 PM Gourav Sengupta > wrote: > >> Hi Holden, >> >> This is such a wonde

Re: Drop-In Virtual Office Half-Hour

2021-09-13 Thread Holden Karau
Regards, > Gourav > > On Tue, Sep 14, 2021 at 12:13 AM Holden Karau > wrote: > >> Hi Folks, >> >> I'm going to experiment with a drop-in virtual half-hour office hour type >> thing next Monday, if you've got any burning Spark or general OSS questions >>

Drop-In Virtual Office Half-Hour

2021-09-13 Thread Holden Karau
Hi Folks, I'm going to experiment with a drop-in virtual half-hour office hour type thing next Monday, if you've got any burning Spark or general OSS questions you haven't had the time to ask anyone else I hope you'll swing by and join me. If no one comes with questions I'll tour some of the

Re: Spark on Kubernetes scheduler variety

2021-07-08 Thread Holden Karau
Holden Karau wrote: > That's awesome, I'm just starting to get context around Volcano but maybe > we can schedule an initial meeting for all of us interested in pursuing > this to get on the same page. > > On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma wrote: > >> Hi team, >

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Holden Karau
own risk. Any and all responsibility for >> any loss, damage or destruction of data or any other property which may >> arise from relying on this email's technical content is explicitly >> disclaimed. The author will in no case be liable for any monetary damages >> arising f

Re: CVEs

2021-06-21 Thread Holden Karau
If you get to a point where you find something you think is highly likely a valid vulnerability the best path forward is likely reaching out to private@ to figure out how to do a security release. On Mon, Jun 21, 2021 at 4:42 PM Eric Richardson wrote: > Thanks for the quick reply. Yes, since it

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Holden Karau
Scala and Python have their advantages and disadvantages with Spark. In my experience, when performance is super important you’ll end up needing to do some of your work in the JVM, but in many situations what matters most is what your team and company are familiar with and the ecosystem of tooling

Re: REST Structured Steaming Sink

2020-07-01 Thread Holden Karau
itly tune. > . foreachWriter is typically used for such use cases, not foreachBatch. > It's also pretty hard to guarantee exactly-once, rate limiting, etc. > > Best, > Burak > > On Wed, Jul 1, 2020 at 5:54 PM Holden Karau wrote: > >> I think adding something like t

Re: REST Structured Steaming Sink

2020-07-01 Thread Holden Karau
I think adding something like this (if it doesn't already exist) could help make structured streaming easier to use, foreachBatch is not the best API. On Wed, Jul 1, 2020 at 2:21 PM Jungtaek Lim wrote: > I guess the method, query parameter, header, and the payload would be all > different for

[ANNOUNCE] Apache Spark 2.4.6 released

2020-06-10 Thread Holden Karau
We are happy to announce the availability of Spark 2.4.6! Spark 2.4.6 is a maintenance release containing stability, correctness, and security fixes. This release is based on the branch-2.4 maintenance branch of Spark. We strongly recommend all 2.4 users to upgrade to this stable release. To

Re: Spark API and immutability

2020-05-25 Thread Holden Karau
So even on RDDs cache/persist mutate the RDD object. The important thing for Spark is that the data represented/in the RDD/Dataframe isn’t mutated. On Mon, May 25, 2020 at 10:56 AM Chris Thomas wrote: > > The cache() method on the DataFrame API caught me out. > > Having learnt that DataFrames

Re: Watch "Airbus makes more of the sky with Spark - Jesse Anderson & Hassene Ben Salem" on YouTube

2020-04-25 Thread Holden Karau
Also it’s ok if Spark and Flink evolve in different directions, we’re both part of the same open source foundation. Sometimes being everything to everyone isn’t as important as being the best at what you need. I like to think of our relationship with other Apache projects as less competitive and

Re: Copyright Infringment

2020-04-25 Thread Holden Karau
f Apache >>> foundation's free licence agreement ? >>> >>> >>> >>> On Sat, 25 Apr 2020, 16:18 Sean Owen, wrote: >>> >>>> You'll want to ask the authors directly ; the book is not produced by >>>> the project itself, so

Re: Copyright Infringment

2020-04-25 Thread Holden Karau
o not want to commit an unlawful act. >>> Can you please clarify if I would be infringing copyright due to this >>> text. >>> *Book: High Performance Spark * >>> *authors: holden Karau Rachel Warren.* >>> *page xii:* >>> >>> * This book is h

Re: Going it alone.

2020-04-16 Thread Holden Karau
I want to be clear I believe the language in janethrope1s email is unacceptable for the mailing list and possibly a violation of the Apache code of conduct. I’m glad we don’t see messages like this often. I know this is a stressful time for many of us, but let’s try and do our best to not take it

Re: SPARK Suitable IDE

2020-03-04 Thread Holden Karau
I work in emacs with ensime. I think really any IDE is ok, so go with the one you feel most at home in. On Wed, Mar 4, 2020 at 5:49 PM tianlangstudio wrote: > We use IntelliJ IDEA,Whether it's Java, Scala or Python > > >

Re: PySpark Pandas UDF

2019-11-12 Thread Holden Karau
Thanks for sharing that. I think we should maybe add some checks around this so it’s easier to debug. I’m CCing Bryan who might have some thoughts. On Tue, Nov 12, 2019 at 7:42 AM gal.benshlomo wrote: > SOLVED! > thanks for the help - I found the issue. it was the version of pyarrow > (0.15.1)

Re: PySpark Pandas UDF

2019-11-10 Thread Holden Karau
Can you switch the write for a count just so we can isolate if it’s the write or the count? Also what’s the output path you’re using? On Sun, Nov 10, 2019 at 7:31 AM Gal Benshlomo wrote: > > > Hi, > > > > I’m using pandas_udf and not able to run it from cluster mode, even though > the same code

Re: Why Spark generates Java code and not Scala?

2019-11-10 Thread Holden Karau
If you look inside of the generation we generate java code and compile it with Janino. For interested folks the conversation moved over to the dev@ list On Sat, Nov 9, 2019 at 10:37 AM Marcin Tustin wrote: > What do you mean by this? Spark is written in a combination of Scala and > Java, and

Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-11-01 Thread Holden Karau
On Thu, Oct 31, 2019 at 10:04 PM Nicolas Paris wrote: > have you deactivated the spark.ui ? > I have read several thread explaining the ui can lead to OOM because it > stores 1000 dags by default > > > On Sun, Oct 20, 2019 at 03:18:20AM -0700, Paul Wais wrote: > > Dear List, > > > > I've

Re: Loop through Dataframes

2019-10-06 Thread Holden Karau
So if you want to process the contents of a dataframe locally but not pull all of the data back at once, toLocalIterator is probably what you're looking for. It's still not great though, so maybe you can share the root problem you're trying to solve and folks might have some suggestions there.

Re: Announcing .NET for Apache Spark 0.5.0

2019-09-30 Thread Holden Karau
Congratulations on the release :) On Mon, Sep 30, 2019 at 9:38 AM Terry Kim wrote: > We are thrilled to announce that .NET for Apache Spark 0.5.0 has been just > released ! > > > > Some of the highlights of this release include: > >-

Re: Release Apache Spark 2.4.4

2019-08-14 Thread Holden Karau
>> [SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch in >> EpochTracker (to support Python UDFs) >> <https://github.com/apache/spark/pull/24946> >> >> Thanks, >> Terry >> >> On Tue, Aug 13, 2019 at 10:24 PM Wenchen Fan wrote: >> >

Re: Release Apache Spark 2.4.4

2019-08-13 Thread Holden Karau
+1 Does anyone have any critical fixes they’d like to see in 2.4.4? On Tue, Aug 13, 2019 at 5:22 PM Sean Owen wrote: > Seems fine to me if there are enough valuable fixes to justify another > release. If there are any other important fixes imminent, it's fine to > wait for those. > > > On Tue,

Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Holden Karau
+1 On Fri, May 31, 2019 at 5:41 PM Bryan Cutler wrote: > +1 and the draft sounds good > > On Thu, May 30, 2019, 11:32 AM Xiangrui Meng wrote: > >> Here is the draft announcement: >> >> === >> Plan for dropping Python 2 support >> >> As many of you already knew, Python core development team and

Re: How to preserve event order per key in Structured Streaming Repartitioning By Key?

2018-12-11 Thread Holden Karau
So it's been a while since I poked at the streaming code base, but I don't think we make any promises about stable sort during repartition, and there are notes in there about how some of these components should be re-written into core so even if we did have stable sort I wouldn't depend on it unless

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-11-15 Thread Holden Karau
If folks are interested, while it's not on Amazon, I've got a live stream of getting client mode with Jupyter notebook to work on GCP/GKE : https://www.youtube.com/watch?v=eMj0Pv1-Nfo=3=PLRLebp9QyZtZflexn4Yf9xsocrR_aSryx On Wed, Oct 31, 2018 at 5:55 PM Zhang, Yuqi wrote: > Hi Li, > > > > Thank

Re: Is there any Spark source in Java

2018-11-03 Thread Holden Karau
Parts of it are indeed written in Java. You probably want to reach out to the developers list to talk about changing Spark. On Sat, Nov 3, 2018, 11:42 AM Soheil Pourbafrani Hi, I want to customize some part of Spark. I was wondering if there any > Spark source is written in Java language, or all

Code review and Coding livestreams today

2018-10-12 Thread Holden Karau
I’ll be doing my regular weekly code review at 10am Pacific today - https://youtu.be/IlH-EGiWXK8 with a look at the current RC, and in the afternoon at 3pm Pacific I’ll be doing some live coding around WIP graceful decommissioning PR - https://youtu.be/4FKuYk2sbQ8 -- Twitter:

Re: Live Streamed Code Review today at 11am Pacific

2018-09-21 Thread Holden Karau
batches) is my current plan to start with :) On Thu, Jul 19, 2018 at 11:38 PM Holden Karau wrote: > Heads up tomorrows Friday review is going to be at 8:30 am instead of 9:30 > am because I had to move some flights around. > > On Fri, Jul 13, 2018 at 12:03 PM, Holden Ka

Re: Use Arrow instead of Pickle without pandas_udf

2018-07-25 Thread Holden Karau
Not currently. What's the problem with pandas_udf for your use case? On Wed, Jul 25, 2018 at 1:27 PM, Hichame El Khalfi wrote: > Hi There, > > > Is there a way to use Arrow format instead of Pickle but without using > pandas_udf ? > > > Thank for your help, > > > Hichame > -- Twitter:

Live Code Reviews, Coding, and Dev Tools

2018-07-24 Thread Holden Karau
Tomorrow afternoon @ 3pm pacific I'll be doing some dev tools poking for Beam and Spark - https://www.youtube.com/watch?v=6cTmC_fP9B0 for mention-bot. On Friday I'll be doing my normal code reviews - https://www.youtube.com/watch?v=O4rRx-3PTiM On Monday July 30th @ 9:30am I'll be doing some more

Re: Live Streamed Code Review today at 11am Pacific

2018-07-20 Thread Holden Karau
Heads up tomorrow's Friday review is going to be at 8:30 am instead of 9:30 am because I had to move some flights around. On Fri, Jul 13, 2018 at 12:03 PM, Holden Karau wrote: > This afternoon @ 3pm pacific I'll be looking at review tooling for Spark & > Beam https://www.youtube.c

Re: Pyspark access to scala/java libraries

2018-07-15 Thread Holden Karau
If you want to see some examples, this library shows a way to do it - https://github.com/sparklingpandas/sparklingml and High Performance Spark also talks about it. On Sun, Jul 15, 2018, 11:57 AM <0xf0f...@protonmail.com.invalid> wrote: > Check >

Re: Live Streamed Code Review today at 11am Pacific

2018-07-13 Thread Holden Karau
Jun 27, 2018 at 10:44 AM, Holden Karau wrote: > Today @ 1:30pm pacific I'll be looking at the current Spark 2.1.3 RC and > see how we validate Spark releases - https://www.twitch.tv/events/ > VAg-5PKURQeH15UAawhBtw / https://www.youtube.com/watch?v=1_XLrlKS26o . > Tomorrow @ 12:30 li

[ANNOUNCE] Apache Spark 2.1.3

2018-07-01 Thread Holden Karau
We are happy to announce the availability of Spark 2.1.3! Apache Spark 2.1.3 is a maintenance release, based on the branch-2.1 maintenance branch of Spark. We strongly recommend all 2.1.x users to upgrade to this stable release. The release notes are available at

Re: Live Streamed Code Review today at 11am Pacific

2018-06-27 Thread Holden Karau
user/holdenkarau & https://www.twitch.tv/holdenkarau/events . Hopefully this can encourage more folks to help with RC validation & PR reviews :) On Thu, Jun 14, 2018 at 6:07 AM, Holden Karau wrote: > Next week is pride in San Francisco but I'm still going to do two quick > session. One w

Re: Live Streamed Code Review today at 11am Pacific

2018-06-14 Thread Holden Karau
and the other will be the regular Friday code review ( https://www.youtube.com/watch?v=IAWm4OLRoyY / https://www.twitch.tv/events/v0qzXxnNQ_K7a8JYFsIiKQ ) also at 9am. On Thu, Jun 7, 2018 at 9:10 PM, Holden Karau wrote: > I'll be doing another one tomorrow morning at 9am pacific focused on > Python

Re: Live Streamed Code Review today at 11am Pacific

2018-06-07 Thread Holden Karau
I'll be doing another one tomorrow morning at 9am pacific focused on Python + K8s support & improved JSON support - https://www.youtube.com/watch?v=Z7ZEkvNwneU & https://www.twitch.tv/events/xU90q9RGRGSOgp2LoNsf6A :) On Fri, Mar 9, 2018 at 3:54 PM, Holden Karau wrote: > If anyone wan

Spark ML online serving

2018-06-06 Thread Holden Karau
At Spark Summit some folks were talking about model serving and we wanted to collect requirements from the community. -- Twitter: https://twitter.com/holdenkarau

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread Holden Karau
If it’s one 33mb file which decompressed to 1.5g then there is also a chance you need to split the inputs since gzip is a non-splittable compression format. On Tue, Jun 5, 2018 at 11:55 AM Anastasios Zouzias wrote: > Are you sure that your JSON file has the right format? > >
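
To illustrate why a gzip file cannot be split across readers, here is a small standard-library sketch (no Spark involved). The payload and the helper name `can_decode_from` are made up for the example; the point is only that decompression has to start at the beginning of the stream, so a single gzipped input ends up being read by one task.

```python
import gzip
import zlib

# Build a compressible payload and gzip it as one stream.
payload = b'{"id": 1, "value": "example"}\n' * 10_000
blob = gzip.compress(payload)


def can_decode_from(offset):
    """Return True if decompression can start at the given byte offset."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # No gzip header and no usable stream state at this position.
        return False


# Starting at the beginning works; starting in the middle does not.
print(can_decode_from(0))
print(can_decode_from(len(blob) // 2))
```

In Spark terms, a common workaround is to repartition after reading, or to store the data in a splittable format or as several smaller files.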

Re: testing frameworks

2018-05-30 Thread Holden Karau
So Jesse has an excellent blog post on how to use it with Java applications - http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/ On Wed, May 30, 2018 at 4:14 AM Spico Florin wrote: > Hello! > I'm also looking for unit testing spark Java application. I've seen the > great

Re: testing frameworks

2018-05-21 Thread Holden Karau
So I’m biased as the author of spark-testing-base but I think it’s pretty ok. Are you looking for unit or integration or something else? On Mon, May 21, 2018 at 5:24 AM Steve Pruitt wrote: > Hi, > > > > Can anyone recommend testing frameworks suitable for Spark jobs. >

Re: [Spark on Google Kubernetes Engine] Properties File Error

2018-04-30 Thread Holden Karau
So, while it's not perfect, I have a guide focused on running custom Spark on GKE https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark-releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc and if you want to run pre-built Spark on GKE there is a solutions article

Re: Live Stream Code Reviews :)

2018-04-13 Thread Holden Karau
about the time > zone I guess. > > Regards, > Gourav Sengupta > > On Thu, Apr 12, 2018 at 8:23 PM, Holden Karau <hol...@pigscanfly.ca> > wrote: > >> Hi Y'all, >> >> If your interested in learning more about how the development process in >&g

Re: Live Stream Code Reviews :)

2018-04-12 Thread Holden Karau
> wrote: > >> Hi, >> 11 am in which timezone? >> >> Il gio 12 apr 2018, 21:23 Holden Karau <hol...@pigscanfly.ca> ha scritto: >> >>> Hi Y'all, >>> >>> If your interested in learning more about how the development process in >

Live Stream Code Reviews :)

2018-04-12 Thread Holden Karau
Hi Y'all, If you're interested in learning more about how the development process in Apache Spark works I've been doing a weekly live streamed code review most Fridays at 11am. This week's will be on twitch/youtube ( https://www.twitch.tv/holdenkarau / https://www.youtube.com/watch?v=vGVSa9KnD80 ).

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-21 Thread Holden Karau
Super exciting! I look forward to digging through it this weekend. On Wed, Mar 21, 2018 at 9:33 PM ☼ R Nair (रविशंकर नायर) < ravishankar.n...@gmail.com> wrote: > Excellent. You filled a missing link. > > Best, > Passion > > On Wed, Mar 21, 2018 at 11:36 PM, Rohit Karlupia >

Re: Live Streamed Code Review today at 11am Pacific

2018-03-09 Thread Holden Karau
If anyone wants to watch the recording: https://www.youtube.com/watch?v=lugG_2QU6YU I'll do one next week as well - March 16th @ 11am - https://www.youtube.com/watch?v=pXzVtEUjrLc On Fri, Mar 9, 2018 at 9:28 AM, Holden Karau <hol...@pigscanfly.ca> wrote: > Hi folks, > >

Live Streamed Code Review today at 11am Pacific

2018-03-09 Thread Holden Karau
Hi folks, If you’re curious about learning more about how Spark is developed, I’m going to experiment with doing a live code review where folks can watch and see how that part of our process works. I have two volunteers already for having their PRs looked at live, and if you have a Spark PR you’re working on

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-23 Thread Holden Karau
You can also look at the shuffle file cleanup tricks we do inside of the ALS algorithm in Spark. On Fri, Feb 23, 2018 at 6:20 PM, vijay.bvp wrote: > have you looked at > http://apache-spark-user-list.1001560.n3.nabble.com/Limit- > Spark-Shuffle-Disk-Usage-td23279.html > >

Re: Can spark handle this scenario?

2018-02-16 Thread Holden Karau
I'm not sure what you mean by it could be hard to serialize complex operations? Regardless I think the question is do you want to parallelize this on multiple machines or just one? On Feb 17, 2018 4:20 PM, "Lian Jiang" wrote: > Thanks Ayan. RDD may support map better

Re: pyspark+spacy throwing pickling exception

2018-02-15 Thread Holden Karau
So you left out the exception. On one hand I’m also not sure how well spacy serializes, so to debug this I would start off by moving the nlp = inside of my function and see if it still fails. On Thu, Feb 15, 2018 at 9:08 PM Selvam Raman wrote: > import spacy > > nlp =
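
The debugging step suggested above can be sketched without spaCy. `FakeModel` is a hypothetical stand-in that holds a thread lock so it cannot be pickled; it is not spaCy's API, and whether a real spaCy model pickles depends on the version. The second pattern builds the model inside the function that runs on the worker, so nothing heavyweight has to be serialized.

```python
import pickle
import threading
from functools import lru_cache


class FakeModel:
    """Hypothetical stand-in for a heavyweight NLP model that cannot be pickled."""

    def __init__(self):
        self._lock = threading.Lock()  # lock objects are not picklable

    def tokenize(self, text):
        return text.split()


# Pattern 1: a model created up front would have to be pickled to be shipped.
driver_model = FakeModel()
try:
    pickle.dumps(driver_model)
    picklable = True
except TypeError:
    picklable = False
print(picklable)


# Pattern 2: create the model lazily, once per process, inside the worker code.
@lru_cache(maxsize=None)
def get_model():
    return FakeModel()


def tokenize(text):
    return get_model().tokenize(text)


print(tokenize("spark and spacy"))
```

With PySpark, a function shaped like `tokenize` would be the one passed to a map or UDF, so each Python worker loads its own copy of the model.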

FOSDEM mini-office hour?

2018-01-31 Thread Holden Karau
Hi Spark Friends, If any folks are around for FOSDEM this year I was planning on doing a coffee office hour on the last day after my talks . Maybe like 6pm? I'm also going to see if any BEAM folks are around and interested :) Cheers,

Re: Spark Tuning Tool

2018-01-22 Thread Holden Karau
That's very interesting, and might also get some interest on the dev@ list if it was open source. On Tue, Jan 23, 2018 at 4:02 PM, Roger Marin wrote: > I'd be very interested. > > On 23 Jan. 2018 4:01 pm, "Rohit Karlupia" wrote: > >> Hi, >> >> I have

Re: Access to Applications metrics

2017-12-05 Thread Holden Karau
I've done a SparkListener to record metrics for validation (it's a bit out of date). Are you just looking to have graphing/alerting set up on the Spark metrics? On Tue, Dec 5, 2017 at 1:53 PM, Thakrar, Jayesh < jthak...@conversantmedia.com> wrote: > You can also get the metrics from the Spark

Re: Recommended way to serialize Hadoop Writables' in Spark

2017-12-03 Thread Holden Karau
So is there a reason you want to shuffle Hadoop types rather than the Java types? As for your specific question, for Kryo you also need to register your serializers, did you do that? On Sun, Dec 3, 2017 at 10:02 AM pradeepbaji wrote: > Hi, > > Is there any recommended

Re: Is Databricks REST API open source ?

2017-12-02 Thread Holden Karau
That API is not open source. There are some other options as separate projects you can check out (like Livy, spark-jobserver, etc). On Sat, Dec 2, 2017 at 8:30 PM kant kodali wrote: > HI All, > > Is REST API (https://docs.databricks.com/api/index.html) open source? > where I

Re: NLTK with Spark Streaming

2017-11-26 Thread Holden Karau
So it’s certainly doable (it’s not super easy mind you), but until the arrow udf release goes out it will be rather slow. On Sun, Nov 26, 2017 at 8:01 AM ashish rawat wrote: > Hi, > > Has someone tried running NLTK (python) with Spark Streaming (scala)? I > was wondering if

What do you pay attention to when validating Spark jobs?

2017-11-21 Thread Holden Karau
Hi Folks, I'm working on updating a talk and I was wondering if any folks in the community wanted to share their best practices for validating your Spark jobs? Are there any counters folks have found useful for monitoring/validating your Spark jobs? Cheers, Holden :) -- Twitter:

Re: PySpark 2.2.0, Kafka 0.10 DataFrames

2017-11-20 Thread Holden Karau
What command did you use to launch your Spark application? The https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying documentation suggests using spark-submit with the `--packages` flag to include the required Kafka package. e.g. ./bin/spark-submit --packages

Re: Use of Accumulators

2017-11-14 Thread Holden Karau
to just toggle it saying there is some change while > processing the data. > > > > Please let me know if we can runtime do this. > > > > > > Thanks! > > *~Kedar Dixit* > > Bigdata Analytics at Persistent Systems Ltd. > > > > *From:* Holden Karau

Re: Use of Accumulators

2017-11-13 Thread Holden Karau
So you want to set an accumulator to 1 after a transformation has fully completed? Or what exactly do you want to do? On Mon, Nov 13, 2017 at 9:47 PM vaquar khan wrote: > Confirmed ,you can use Accumulators :) > > Regards, > Vaquar khan > > On Mon, Nov 13, 2017 at 10:58

[ANNOUNCE] Apache Spark 2.1.2

2017-10-25 Thread Holden Karau
We are happy to announce the availability of Spark 2.1.2! Apache Spark 2.1.2 is a maintenance release, based on the branch-2.1 maintenance branch of Spark. We strongly recommend all 2.1.x users to upgrade to this stable release. To download Apache Spark 2.1.2 visit

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Holden Karau
My assumption is it would be similar though, in memory sink of all of your records would quickly overwhelm your cluster, but in aggregation it could be reasonable. But there might be additional reasons on top of that. On Fri, Aug 18, 2017 at 11:44 AM Holden Karau <hol...@pigscanfly.ca>

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Holden Karau
.. > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On Fri, Aug 18, 2017 at 6:35 PM, Holden Karau <hol...@pigscanfly.ca>

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Holden Karau
So performing complete output without an aggregation would require building up a table of the entire input to write out at each micro batch. This would get prohibitively expensive quickly. With an aggregation we just need to keep track of the aggregates and update them every batch, so the memory
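
A plain-Python sketch of the memory argument above, not Structured Streaming itself. The micro-batches are made-up data: with an aggregation the retained state is one counter per distinct key, while complete output without an aggregation would need every input row seen so far.

```python
from collections import Counter

# Made-up micro-batches of keys arriving over time.
batches = [["a", "b", "a"], ["b", "c"], ["a"]]

counts = Counter()  # state kept when aggregating: one entry per distinct key
all_rows = []       # state needed without aggregation: every input row

for batch in batches:
    counts.update(batch)
    all_rows.extend(batch)
    # Aggregate state is bounded by the number of keys; the raw table keeps growing.
    print(len(counts), len(all_rows))
```

The aggregate state stops growing once all keys have been seen, whereas the raw table grows with every batch.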

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Holden Karau
The memory overhead is based less on the total amount of data and more on what you end up doing with the data (e.g. if you're doing a lot of off-heap processing or using Python you need to increase it). Honestly most people find this number for their job "experimentally" (e.g. they try a few

With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Holden Karau
Hi wonderful Python + Spark folks, I'm excited to announce that with Spark 2.2.0 we finally have PySpark published on PyPI (see https://pypi.python.org/pypi/pyspark / https://twitter.com/holdenkarau/status/885207416173756417). This has been a long time coming (previous releases included pip

Re: Can we access files on Cluster mode

2017-06-24 Thread Holden Karau
addFile is supposed to not depend on a shared FS unless the semantics have changed recently. On Sat, Jun 24, 2017 at 11:55 AM varma dantuluri wrote: > Hi Sudhir, > > I believe you have to use a shared file system that is accused by all > nodes. > > > On Jun 24, 2017, at

Re: Spark checkpoint - nonstreaming

2017-05-26 Thread Holden Karau
In non-streaming Spark, checkpoints aren't for inter-application recovery; rather, you can think of them as doing a persist, but to HDFS rather than each node's local memory / storage. On Fri, May 26, 2017 at 3:06 PM Priya wrote: > Hi, > > With nonstreaming spark application,

Re: [Spark Core]: Python and Scala generate different DAGs for identical code

2017-05-10 Thread Holden Karau
> On Wed, May 10, 2017 at 7:18 PM, Holden Karau <hol...@pigscanfly.ca> > wrote: > >> In PySpark the filter and then map steps are combined into a single >> transformation from the JVM point of view. This allows us to avoid copying >> the data back to Scala in between the

Re: [Spark Core]: Python and Scala generate different DAGs for identical code

2017-05-10 Thread Holden Karau
In PySpark the filter and then map steps are combined into a single transformation from the JVM point of view. This allows us to avoid copying the data back to Scala in between the filter and the map steps. The debugging experience is certainly much harder in PySpark and I think is an interesting
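
The idea of combining a filter and a map into one step can be sketched in plain Python. This is an illustration of function composition over a partition iterator, not PySpark's actual implementation; `pipeline`, `filter_stage`, and `map_stage` are hypothetical names.

```python
def pipeline(*stages):
    """Compose per-partition functions into one function applied in a single pass."""
    def run(partition):
        for stage in stages:
            partition = stage(partition)
        return partition
    return run


def filter_stage(pred):
    # Lazily keep only the elements that satisfy the predicate.
    return lambda it: (x for x in it if pred(x))


def map_stage(fn):
    # Lazily apply the function to each element.
    return lambda it: (fn(x) for x in it)


# One fused function: records flow through filter and map without an
# intermediate copy between the two steps.
fused = pipeline(
    filter_stage(lambda x: x % 2 == 0),
    map_stage(lambda x: x * 10),
)

print(list(fused(range(6))))
```

Because the caller only ever sees the fused function, it also sees a single step, which is one way to picture why the Python and Scala DAGs look different.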

Re: Spark Testing Library Discussion

2017-04-26 Thread Holden Karau
Sorry about that, hangouts on air broke in the first one :( On Wed, Apr 26, 2017 at 8:41 AM, Marco Mistroni <mmistr...@gmail.com> wrote: > Uh i stayed online in the other link but nobody joinedWill follow > transcript > Kr > > On 26 Apr 2017 9:35 am, "Holden Ka

Re: Spark Testing Library Discussion

2017-04-26 Thread Holden Karau
And the recording of our discussion is at https://www.youtube.com/watch?v=2q0uAldCQ8M A few of us have follow up things and we will try and do another meeting in about a month or two :) On Tue, Apr 25, 2017 at 1:04 PM, Holden Karau <hol...@pigscanfly.ca> wrote: > Urgh hangouts did
