Kafka streams vs Spark streaming

2017-10-11 Thread Mich Talebzadeh
Hi, Has anyone had experience using Kafka Streams versus Spark? I am not familiar with the Kafka Streams concept except that it is a set of libraries. Any feedback will be appreciated. Regards, Mich LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8

Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sachin Mittal
Kafka Streams has a lower learning curve, and if your source data is in Kafka topics it is pretty simple to integrate with. It can run as a library inside your main programs. So as compared to Spark Streaming: 1. It is much simpler to implement. 2. It is not as heavy on hardware as Spark. On th

Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sabarish Sasidharan
@Sachin >>is not elastic. You need to anticipate before hand on volume of data you will have. Very difficult to add and reduce topic partitions later on. Why do you say so, Sachin? Kafka Streams will readjust once we add more partitions to the Kafka topic. And when we add more machines, rebalancing

Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sabarish Sasidharan
@Sachin >>The partition key is very important if you need to run multiple instances of streams application and certain instance processing certain partitions only. Again, depending on the partition key is optional. It's actually a feature enabler, so we can use local state stores to improve throughput

Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sachin Mittal
No, it won't work this way. Say you have 9 partitions and 3 instances: 1 = {1, 2, 3} 2 = {4, 5, 6} 3 = {7, 8, 9} And let's say a particular key (k1) is always written to partition 4. Now say you increase partitions to 12; you may have: 1 = {1, 2, 3, 4} 2 = {5, 6, 7, 8} 3 = {9, 10, 11, 12} Now it is po
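Sachin's scenario can be sketched numerically. Below is a minimal model of key-to-partition assignment, using CRC32 for determinism (Kafka's default partitioner actually applies murmur2 to the serialized key bytes, but the consequence is the same): the target partition is hash(key) mod num_partitions, so growing the partition count silently remaps many keys to new partitions, and hence to different Streams instances.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Simplified stand-in for Kafka's default partitioner
    # (the real one uses murmur2 over the serialized key bytes).
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Many keys land on a different partition after growing
# a 9-partition topic to 12 partitions.
moved = [k for k in (f"key-{i}" for i in range(100))
         if partition_for(k, 9) != partition_for(k, 12)]
print(f"{len(moved)} of 100 keys changed partition going from 9 to 12")
```

This is why per-partition state in a local store no longer lines up with the data after repartitioning; the usual remedy is to over-partition the topic up front.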

Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sachin Mittal
Well, it depends upon the use case. Say the metric you are evaluating is grouped by a key, and you want to parallelize the operation by adding more instances so that a certain instance deals with only a particular group; it is always better to have partitioning also done on that key. This way a particular instance w

Job spark blocked and runs indefinitely

2017-10-11 Thread amine_901
We encounter a problem with a Spark 1.6 job (on YARN) that never ends when several jobs are launched simultaneously. We found that by launching the job in yarn-client mode we do not have this problem, unlike launching it in yarn-cluster mode. That could be a lead to find the cause. We changed the

Re: Job spark blocked and runs indefinitely

2017-10-11 Thread Sebastian Piu
We do have this issue randomly too, so I am interested in hearing if someone was able to get to the bottom of it. On Wed, 11 Oct 2017, 13:40 amine_901, wrote: > We encounter a problem with a Spark 1.6 job (on YARN) that never ends when > several jobs are launched simultaneously. > We found that by launchin

Re: Job spark blocked and runs indefinitely

2017-10-11 Thread Amine CHERIFI
It seems that the job blocks when we call newAPIHadoopRDD to get data from HBase. That may be the issue! Is there another API to load data from HBase?

add jars to spark's runtime

2017-10-11 Thread David Capwell
We want to emit the metrics out of Spark into our own custom store. To do this we built our own sink and tried to add it to Spark via --jars path/to/jar, defining the class in the metrics.properties file supplied with the job. We noticed that Spark kept crashing with the below exception
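For context, registering a custom sink generally looks like the fragment below (the class name here is a placeholder, not David's actual sink or a real Spark class). One commonly reported pitfall, which may be related: metrics.properties is read very early in JVM startup, before --jars has made the sink class visible on the driver's classpath.

```properties
# Register a custom sink on all instances (driver, executor, master, worker).
# com.example.metrics.CustomStoreSink is a placeholder class name.
*.sink.custom.class=com.example.metrics.CustomStoreSink
*.sink.custom.period=10
*.sink.custom.unit=seconds
```

Shipping the sink jar via spark.driver.extraClassPath / spark.executor.extraClassPath instead of --jars is one avenue worth trying, since those paths are applied before the metrics system initializes.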

Java Rdd of String to dataframe

2017-10-11 Thread sk skk
Can we create a DataFrame from a Java pair RDD of Strings? I don’t have a schema, as it will be dynamic JSON. I used the Encoders.STRING class. Any help is appreciated!! Thanks, SK

Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
When attempting to run any example program with IntelliJ I am running into Guava versioning issues: Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/cache/CacheLoader at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73) at org.apache.spark.SparkConf.

PySpark pickling behavior

2017-10-11 Thread Naveen Swamy
Hello fellow users, 1) I am wondering if there is documentation or guidelines to understand in what situations PySpark decides to pickle the functions I use in the map method. 2) Are there best practices to avoid pickling and sharing variables, etc.? I have a situation where I want to pass to
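As background on why this matters: PySpark serializes the functions passed to map and ships them to executors. Plain pickle, shown below, illustrates the core constraint behind most of the guidelines: module-level functions pickle by reference, while lambdas do not (PySpark sidesteps this with cloudpickle, which serializes the function's code itself, but any objects captured by the function still get shipped with every task unless broadcast).

```python
import pickle

def top_level(x):
    # Module-level functions pickle by reference (module + qualified name),
    # which is why PySpark guidelines favor them over closures.
    return x * 2

# Succeeds: only a reference is serialized, not the function body.
roundtrip = pickle.loads(pickle.dumps(top_level))
print(roundtrip(21))  # 42

# A lambda has no importable name, so plain pickle rejects it.
try:
    pickle.dumps(lambda x: x * 2)
    lambda_picklable = True
except Exception:
    lambda_picklable = False
print("lambda picklable with plain pickle:", lambda_picklable)
```

The practical upshot: keep mapped functions small and self-contained, and move large shared state into `sc.broadcast(...)` rather than capturing it in the function.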

Dynamic Accumulators in 2.x?

2017-10-11 Thread David Capwell
I wrote a Spark instrumentation tool that instruments RDDs to give more fine-grained details on what is going on within a task. This is working right now, but it uses volatiles and CAS to pass around this state (which slows down the task). We want to lower the overhead of this and make the main call p

Re: Running spark examples in Intellij

2017-10-11 Thread Paul
You say you did the Maven package, but did you do a Maven install and define your local Maven repo in SBT? -Paul Sent from my iPhone > On Oct 11, 2017, at 5:48 PM, Stephen Boesch wrote: > > When attempting to run any example program w/ Intellij I am running into > guava versioning issues: >

Re: Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
A clarification here: the example is being run *from the Spark codebase*. Therefore the mvn install step would not be required, as the classes are available directly within the project. The reason for needing the `mvn package` to be invoked is to pick up the changes of having updated the spark depe

Re: Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
Thinking more carefully on your comment: - There may be some ambiguity as to whether the repo-provided libraries are actually being used here - as you indicate - instead of the in-project classes. That would have to do with how the classpath inside IJ was constructed. When I click t
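One pattern worth checking (a guess from the symptom, not a confirmed diagnosis): Guava is effectively a provided/shaded dependency in Spark, so it can be missing from the IDE run classpath even when `mvn package` succeeds on the command line. An illustrative workaround is to add Guava explicitly, at compile scope, to the module being launched; the version below is only an example and should match whatever your Spark branch pins:

```xml
<!-- Illustrative only: make Guava visible on the IDE run classpath.
     Align the version with the one declared by your Spark branch. -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>14.0.1</version>
  <scope>compile</scope>
</dependency>
```

Some IntelliJ versions also offer an "Include dependencies with Provided scope" checkbox on the run configuration, which avoids touching the pom at all.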