Re: External shuffle service on K8S

2018-10-26 Thread Matt Cheah
Hi there,

 

Please see https://issues.apache.org/jira/browse/SPARK-25299 for more 
discussion around this matter.

 

-Matt Cheah

 

From: Li Gao 
Date: Friday, October 26, 2018 at 9:10 AM
To: "vincent.gromakow...@gmail.com" 
Cc: "caolijun1...@gmail.com" , "user@spark.apache.org" 

Subject: Re: External shuffle service on K8S

 

There is an existing 2.2-based external shuffle service on the fork:

https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html

 

You can modify it to suit your needs.

 

-Li

 

 

On Fri, Oct 26, 2018 at 3:22 AM vincent gromakowski 
 wrote:

No, it's on the roadmap for releases after 2.4.

 

On Fri, Oct 26, 2018 at 11:15, 曹礼俊 wrote:

Hi all: 

 

Does Spark 2.3.2 support the external shuffle service on Kubernetes?

I have looked through the documentation
(https://spark.apache.org/docs/latest/running-on-kubernetes.html), but
couldn't find anything related.

If it is supported, how can I enable it?

 

Best Regards

 

Lijun Cao

 

 





Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-26 Thread Battini Lakshman
On Oct 27, 2018 3:34 AM, "karan alang"  wrote:

Hello
- Is there a "performance" difference when using Java or Scala for Apache
Spark?

I understand there are other obvious differences (less code with Scala,
easier to focus on logic, etc.),
but with respect to performance I think there would not be much of a
difference, since both of them are JVM-based.
Please let me know if this is not the case.

thanks!


Is spark not good for ingesting into updatable databases?

2018-10-26 Thread ravidspark
Hi All,

My problem is as follows:

Environment: Spark 2.2.0 installed on CDH
Use case: Reading from Kafka, cleansing the data, and ingesting into a
non-updatable database.

Problem: My streaming batch duration is 1 minute and I am receiving 3000
messages/min. I am observing a weird case where, in the map transformations,
some of the messages are processed more than once and passed on to the
downstream transformations. Because of this I have been seeing duplicates in
the downstream insert-only database.

It would have made sense if the entire task were reprocessed, in which case
I would have assumed the cause was a task failure. But in my case I don't
see any task failures, and only one or two particular messages in the task
are reprocessed.

Every time I relaunch the Spark job to process Kafka messages from the
starting offset, it duplicates the exact same messages, regardless of the
number of relaunches.

I added the duplicated messages back to Kafka at a different offset to see
whether I would observe the same problem, but this time they were not
duplicated.

Workaround for now:
As a workaround, I added a cache at the end, just before ingestion into the
DB. It is updated with each processed event, which makes sure that event is
not processed again.
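
For readers following along, the cache workaround described above could look roughly like the sketch below. It is only an illustration, not the poster's code: the Record fields and the IdempotentSink class are hypothetical names, and it assumes each duplicate arrives with the same Kafka topic, partition, and offset as the original, which the message does not confirm. The in-memory set would also need to live in durable storage to survive a job restart.

```python
from typing import Callable, Iterable, NamedTuple, Set, Tuple


class Record(NamedTuple):
    """One consumed Kafka message; the field names are illustrative."""
    topic: str
    partition: int
    offset: int
    value: str


class IdempotentSink:
    """Wraps an insert function and skips records already written.

    Hypothetical helper: a record counts as "already written" when its
    (topic, partition, offset) triple has been seen before.
    """

    def __init__(self, insert: Callable[[Record], None]) -> None:
        self._insert = insert
        # Assumption: kept in memory here; a real job would persist this.
        self._seen: Set[Tuple[str, int, int]] = set()

    def write(self, records: Iterable[Record]) -> int:
        """Insert each unseen record once and return how many were inserted."""
        written = 0
        for rec in records:
            key = (rec.topic, rec.partition, rec.offset)
            if key in self._seen:
                continue  # duplicate delivery, skip it
            self._insert(rec)
            self._seen.add(key)
            written += 1
        return written
```

Keying on the Kafka coordinates rather than on the message payload keeps two legitimately identical payloads at different offsets from being dropped.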


My question is: why am I seeing this weird behavior (only one particular
message in the entire batch getting reprocessed)? Is there some
configuration that would help me fix this problem, or is this a bug?

Any solution apart from maintaining a cache would be of great help.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



java vs scala for Apache Spark - is there a performance difference ?

2018-10-26 Thread karan alang
Hello
- Is there a "performance" difference when using Java or Scala for Apache
Spark?

I understand there are other obvious differences (less code with Scala,
easier to focus on logic, etc.),
but with respect to performance I think there would not be much of a
difference, since both of them are JVM-based.
Please let me know if this is not the case.

thanks!


Re: conflicting version question

2018-10-26 Thread Nathan Kronenfeld
Thanks for the suggestion.

Ouch.  That looks painful.

On Fri, Oct 26, 2018 at 1:28 PM Anastasios Zouzias 
wrote:

> Hi Nathan,
>
> You can try to shade the dependency version that you want to use. That
> said, shading is a tricky technique. Good luck.
>
>
> https://softwareengineering.stackexchange.com/questions/297276/what-is-a-shaded-java-dependency
>
>
> See also Elasticsearch's discussion on shading
>
> https://www.elastic.co/de/blog/to-shade-or-not-to-shade
>
> Best,
> Anastasios
>
>
> On Fri, 26 Oct 2018, 15:45 Nathan Kronenfeld,
>  wrote:
>
>> Our code is currently using Gson 2.8.5.  Spark, through Hadoop-API, pulls
>> in Gson 2.2.4.
>>
>> At the moment, we just get "method X not found" exceptions because of
>> this - because when we run in Spark, 2.2.4 is what gets loaded.
>>
>> Is there any way to have both versions exist simultaneously? To load
>> 2.8.5 so that our code uses it, without messing up spark?
>>
>> Thanks,
>>   -Nathan Kronenfeld
>>
>


Re: conflicting version question

2018-10-26 Thread Anastasios Zouzias
Hi Nathan,

You can try to shade the dependency version that you want to use. That
said, shading is a tricky technique. Good luck.

https://softwareengineering.stackexchange.com/questions/297276/what-is-a-shaded-java-dependency


See also Elasticsearch's discussion on shading

https://www.elastic.co/de/blog/to-shade-or-not-to-shade

Best,
Anastasios


On Fri, 26 Oct 2018, 15:45 Nathan Kronenfeld,
 wrote:

> Our code is currently using Gson 2.8.5.  Spark, through Hadoop-API, pulls
> in Gson 2.2.4.
>
> At the moment, we just get "method X not found" exceptions because of this
> - because when we run in Spark, 2.2.4 is what gets loaded.
>
> Is there any way to have both versions exist simultaneously? To load 2.8.5
> so that our code uses it, without messing up spark?
>
> Thanks,
>   -Nathan Kronenfeld
>


Re: External shuffle service on K8S

2018-10-26 Thread Li Gao
There is an existing 2.2-based external shuffle service on the fork:
https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html

You can modify it to suit your needs.

-Li


On Fri, Oct 26, 2018 at 3:22 AM vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> No, it's on the roadmap for releases after 2.4.
>
> On Fri, Oct 26, 2018 at 11:15, 曹礼俊 wrote:
>
>> Hi all:
>>
>> Does Spark 2.3.2 support the external shuffle service on Kubernetes?
>>
>> I have looked through the documentation (
>> https://spark.apache.org/docs/latest/running-on-kubernetes.html), but
>> couldn't find anything related.
>>
>> If it is supported, how can I enable it?
>>
>> Best Regards
>>
>> Lijun Cao
>>
>>
>>
>


conflicting version question

2018-10-26 Thread Nathan Kronenfeld
Our code is currently using Gson 2.8.5.  Spark, through Hadoop-API, pulls
in Gson 2.2.4.

At the moment, we just get "method X not found" exceptions because of this
- because when we run in Spark, 2.2.4 is what gets loaded.

Is there any way to have both versions exist simultaneously? To load 2.8.5
so that our code uses it, without messing up spark?

Thanks,
  -Nathan Kronenfeld


[PySpark] Sharing testing library and requesting feedback

2018-10-26 Thread Matt Hagy
We recently open sourced mockrdd, a library for testing PySpark code.
github.com/LiveRamp/mockrdd

The mockrdd.MockRDD class offers similar behavior to pyspark.RDD with the
following extra benefits.
* Extensive sanity checks to identify invalid inputs
* More meaningful error messages for debugging issues
* Straightforward to run within pdb
* Removes Spark dependencies from development and testing environments
* No Spark overhead when running through a large test suite

More details in this blog post:
liveramp.com/engineering/introducing-mockrdd-for-testing-pyspark-code
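
To illustrate the general idea of testing RDD-style code without a Spark dependency, here is a minimal sketch. It does not use or reproduce mockrdd's API: ListBackedRDD and clean_events are hypothetical names invented for this example, and the stand-in covers only map, filter, and collect, three methods that pyspark.RDD also provides.

```python
from typing import Any, Callable, Iterable, List


class ListBackedRDD:
    """Hypothetical in-memory stand-in with a few pyspark.RDD-style methods."""

    def __init__(self, items: Iterable[Any]) -> None:
        self._items: List[Any] = list(items)

    def map(self, f: Callable[[Any], Any]) -> "ListBackedRDD":
        return ListBackedRDD(f(x) for x in self._items)

    def filter(self, f: Callable[[Any], bool]) -> "ListBackedRDD":
        return ListBackedRDD(x for x in self._items if f(x))

    def collect(self) -> List[Any]:
        return list(self._items)


def clean_events(rdd):
    """Code under test: strips whitespace and drops empty strings.

    It calls only map and filter, so it should accept either a real
    pyspark.RDD or the stand-in above.
    """
    return rdd.map(lambda s: s.strip()).filter(lambda s: s != "")
```

A test can then call clean_events(ListBackedRDD([...])).collect() and compare the result with a plain list, with no SparkContext involved.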

Would anyone find this useful? What other features would make this more
useful? Are there benefits to using PySpark in local mode for testing that
we're not considering?

Thanks!


Re: External shuffle service on K8S

2018-10-26 Thread vincent gromakowski
No, it's on the roadmap for releases after 2.4.

On Fri, Oct 26, 2018 at 11:15, 曹礼俊 wrote:

> Hi all:
>
> Does Spark 2.3.2 support the external shuffle service on Kubernetes?
>
> I have looked through the documentation (
> https://spark.apache.org/docs/latest/running-on-kubernetes.html), but
> couldn't find anything related.
>
> If it is supported, how can I enable it?
>
> Best Regards
>
> Lijun Cao
>
>
>


External shuffle service on K8S

2018-10-26 Thread 曹礼俊
Hi all:

Does Spark 2.3.2 support the external shuffle service on Kubernetes?

I have looked through the documentation (
https://spark.apache.org/docs/latest/running-on-kubernetes.html), but
couldn't find anything related.

If it is supported, how can I enable it?

Best Regards

Lijun Cao