Re: External shuffle service on K8S
Hi there,

Please see https://issues.apache.org/jira/browse/SPARK-25299 for more discussion around this matter.

-Matt Cheah

From: Li Gao
Date: Friday, October 26, 2018 at 9:10 AM
To: "vincent.gromakow...@gmail.com"
Cc: "caolijun1...@gmail.com", "user@spark.apache.org"
Subject: Re: External shuffle service on K8S

There is an existing 2.2-based external shuffle service on the fork: https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html
You can modify it to suit your needs.

-Li

On Fri, Oct 26, 2018 at 3:22 AM vincent gromakowski wrote:
> No, it's on the roadmap for >2.4.
>
> On Fri, Oct 26, 2018 at 11:15, 曹礼俊 wrote:
>> Hi all:
>>
>> Does Spark 2.3.2 support external shuffle service on Kubernetes?
>>
>> I have looked up the documentation (https://spark.apache.org/docs/latest/running-on-kubernetes.html), but couldn't find related suggestions.
>>
>> If it is supported, how can I enable it?
>>
>> Best Regards
>>
>> Lijun Cao
Re: java vs scala for Apache Spark - is there a performance difference ?
On Oct 27, 2018 3:34 AM, "karan alang" wrote:

Hello - is there a "performance" difference when using Java or Scala for Apache Spark? I understand there are other obvious differences (less code with Scala, easier to focus on logic, etc.), but with respect to performance I think there would not be much of a difference, since both of them are JVM-based. Please let me know if this is not the case.

Thanks!
Is spark not good for ingesting into updatable databases?
Hi All,

My problem is as explained below.

Environment: Spark 2.2.0 installed on CDH

Use case: reading from Kafka, cleansing the data, and ingesting into an insert-only (non-updatable) database.

Problem: my streaming batch duration is 1 minute and I am receiving 3000 messages/min. I am observing a weird case where, in the map transformations, some of the messages are reprocessed more than once by the downstream transformations. Because of this, I have been seeing duplicates in the downstream insert-only database.

It would have made sense if the reprocessing happened for the entire task, in which case I would have assumed the problem was a task failure. But in my case I don't see any task failures, and only one or two particular messages in the task are reprocessed. Every time I relaunch the Spark job to process the Kafka messages from the starting offset, it duplicates the exact same messages, irrespective of the number of relaunches. I added the messages that were getting duplicated back to Kafka at a different offset to see if I would observe the same problem, but this time they were not duplicated.

Workaround for now: I added a cache at the end, before ingestion into the DB, which is updated with each processed event, thus making sure an event won't be reprocessed again.

My question here is: why am I seeing this weird behavior (only one particular message in the entire batch getting reprocessed)? Is there some configuration that would help me fix this problem, or is this a bug? Any solution apart from maintaining a cache would be of great help.
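For reference, the dedup-cache workaround described above can be sketched roughly as follows. This is plain Python, not the actual job code: `insert_into_db` is a hypothetical stand-in for the real database sink, and message IDs stand in for the Kafka records.

```python
# Sketch of the workaround: keep a cache of already-processed message IDs
# and drop duplicates before the insert-only database sees them.
# `insert_into_db` is a hypothetical stand-in for the real sink.

def make_idempotent_sink(insert_into_db):
    seen = set()  # in production this would be an external cache, not in-memory

    def process(message_id, payload):
        if message_id in seen:
            return False  # duplicate: already ingested, skip it
        insert_into_db(message_id, payload)
        seen.add(message_id)
        return True

    return process

# Example: the second delivery of message "m1" is dropped.
db = []
sink = make_idempotent_sink(lambda mid, p: db.append((mid, p)))
sink("m1", "hello")
sink("m1", "hello")  # reprocessed duplicate, not inserted
sink("m2", "world")
```

In a real deployment the `seen` set would have to survive job relaunches (e.g. a key-value store keyed by a stable message ID), since an in-memory set is lost when the driver restarts.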
java vs scala for Apache Spark - is there a performance difference ?
Hello - is there a "performance" difference when using Java or Scala for Apache Spark? I understand there are other obvious differences (less code with Scala, easier to focus on logic, etc.), but with respect to performance I think there would not be much of a difference, since both of them are JVM-based. Please let me know if this is not the case.

Thanks!
Re: conflicting version question
Thanks for the suggestion. Ouch. That looks painful. On Fri, Oct 26, 2018 at 1:28 PM Anastasios Zouzias wrote: > Hi Nathan, > > You can try to shade the dependency version that you want to use. That > said, shading is a tricky technique. Good luck. > > > https://softwareengineering.stackexchange.com/questions/297276/what-is-a-shaded-java-dependency > > > See also elasticsearch's discussion on shading > > https://www.elastic.co/de/blog/to-shade-or-not-to-shade > > Best, > Anastasios > > > On Fri, 26 Oct 2018, 15:45 Nathan Kronenfeld, > wrote: > >> Our code is currently using Gson 2.8.5. Spark, through Hadoop-API, pulls >> in Gson 2.2.4. >> >> At the moment, we just get "method X not found" exceptions because of >> this - because when we run in Spark, 2.2.4 is what gets loaded. >> >> Is there any way to have both versions exist simultaneously? To load >> 2.8.5 so that our code uses it, without messing up spark? >> >> Thanks, >> -Nathan Kronenfeld >> >
Re: conflicting version question
Hi Nathan, You can try to shade the dependency version that you want to use. That said, shading is a tricky technique. Good luck. https://softwareengineering.stackexchange.com/questions/297276/what-is-a-shaded-java-dependency See also elasticsearch's discussion on shading https://www.elastic.co/de/blog/to-shade-or-not-to-shade Best, Anastasios On Fri, 26 Oct 2018, 15:45 Nathan Kronenfeld, wrote: > Our code is currently using Gson 2.8.5. Spark, through Hadoop-API, pulls > in Gson 2.2.4. > > At the moment, we just get "method X not found" exceptions because of this > - because when we run in Spark, 2.2.4 is what gets loaded. > > Is there any way to have both versions exist simultaneously? To load 2.8.5 > so that our code uses it, without messing up spark? > > Thanks, > -Nathan Kronenfeld >
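To make the shading suggestion concrete: with a Maven build, the usual approach is the maven-shade-plugin's class relocation feature, roughly as below. This is a sketch, not a tested build file; the `shadedPattern` package name is an arbitrary choice, and the plugin version should match whatever your build already uses.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <!-- Rewrite our bundled Gson 2.8.5 classes (and all references
                 to them) into a private package, so the Gson 2.2.4 that
                 Spark/Hadoop loads can no longer shadow them -->
            <pattern>com.google.gson</pattern>
            <shadedPattern>myorg.shaded.com.google.gson</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

After relocation, your fat jar contains Gson 2.8.5 under `myorg.shaded.com.google.gson`, so both versions coexist on the executor classpath without conflict.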
Re: External shuffle service on K8S
There is an existing 2.2-based external shuffle service on the fork: https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html
You can modify it to suit your needs.

-Li

On Fri, Oct 26, 2018 at 3:22 AM vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
> No, it's on the roadmap for >2.4.
>
> On Fri, Oct 26, 2018 at 11:15, 曹礼俊 wrote:
>> Hi all:
>>
>> Does Spark 2.3.2 support external shuffle service on Kubernetes?
>>
>> I have looked up the documentation (https://spark.apache.org/docs/latest/running-on-kubernetes.html), but couldn't find related suggestions.
>>
>> If it is supported, how can I enable it?
>>
>> Best Regards
>>
>> Lijun Cao
conflicting version question
Our code is currently using Gson 2.8.5. Spark, through the Hadoop API, pulls in Gson 2.2.4.

At the moment, we just get "method X not found" exceptions because of this - because when we run in Spark, 2.2.4 is what gets loaded.

Is there any way to have both versions exist simultaneously? To load 2.8.5 so that our code uses it, without messing up Spark?

Thanks,
-Nathan Kronenfeld
[PySpark] Sharing testing library and requesting feedback
We recently open sourced mockrdd, a library for testing PySpark code.

github.com/LiveRamp/mockrdd

The mockrdd.MockRDD class offers similar behavior to pyspark.RDD with the following extra benefits:

* Extensive sanity checks to identify invalid inputs
* More meaningful error messages for debugging issues
* Straightforward to run within pdb
* Removes Spark dependencies from development and testing environments
* No Spark overhead when running through a large test suite

More details in this blog post: liveramp.com/engineering/introducing-mockrdd-for-testing-pyspark-code

Would anyone find this useful? What other features would make this more useful? Are there benefits to using PySpark in local mode for testing that we're not considering?

Thanks!
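To make the idea concrete for readers who haven't seen the pattern, here is a toy illustration. This is not mockrdd's actual API, just a hypothetical minimal stand-in showing why a Spark-free RDD double makes unit tests fast and easy to step through in pdb.

```python
# Hypothetical minimal RDD stand-in, in the spirit of mockrdd: it mimics a
# couple of pyspark.RDD methods in plain Python, so transformation logic can
# be unit-tested (and debugged in pdb) with no Spark installation at all.

class TinyMockRDD:
    def __init__(self, items):
        self.items = list(items)

    def map(self, f):
        return TinyMockRDD(f(x) for x in self.items)

    def filter(self, pred):
        return TinyMockRDD(x for x in self.items if pred(x))

    def collect(self):
        return self.items

# The code under test only assumes an RDD-like object, so the same function
# runs against a real pyspark.RDD in production and the mock in tests:
def keep_even_squares(rdd):
    return rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

result = keep_even_squares(TinyMockRDD([1, 2, 3, 4])).collect()
# result == [4, 16]
```

The real library presumably adds the sanity checks and error reporting listed above on top of this basic substitution idea.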
Re: External shuffle service on K8S
No, it's on the roadmap for >2.4.

On Fri, Oct 26, 2018 at 11:15, 曹礼俊 wrote:
> Hi all:
>
> Does Spark 2.3.2 support external shuffle service on Kubernetes?
>
> I have looked up the documentation (https://spark.apache.org/docs/latest/running-on-kubernetes.html), but couldn't find related suggestions.
>
> If it is supported, how can I enable it?
>
> Best Regards
>
> Lijun Cao
External shuffle service on K8S
Hi all:

Does Spark 2.3.2 support external shuffle service on Kubernetes?

I have looked up the documentation (https://spark.apache.org/docs/latest/running-on-kubernetes.html), but couldn't find related suggestions.

If it is supported, how can I enable it?

Best Regards

Lijun Cao