Re: Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
Thinking more carefully on your comment: there may be some ambiguity as to whether the repo-provided libraries are actually being used here - as you indicate - instead of the in-project classes. That would have to do with how the classpath inside IJ was constructed. When I click

Re: Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
A clarification here: the example is being run *from the Spark codebase*. Therefore the mvn install step would not be required, as the classes are available directly within the project. The reason `mvn package` needs to be invoked is to pick up the changes from having updated the spark

Re: Running spark examples in Intellij

2017-10-11 Thread Paul
You say you did the maven package, but did you do a maven install and define your local maven repo in SBT? -Paul
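For reference, if the example project builds with SBT against artifacts that `mvn install` put into the local Maven repository (~/.m2/repository), a minimal build.sbt sketch would look like the following (the version number is illustrative only):

```scala
// build.sbt -- minimal sketch; assumes `mvn install` has populated
// the local Maven repository with the freshly built Spark artifacts.
resolvers += Resolver.mavenLocal

// illustrative coordinates/version, not taken from the thread
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
```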

Dynamic Accumulators in 2.x?

2017-10-11 Thread David Capwell
I wrote a spark instrumentation tool that instruments RDDs to give more fine-grained details on what is going on within a Task. This works right now, but uses volatiles and CAS to pass this state around (which slows down the task). We want to lower the overhead of this and make the main call
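For context, a minimal sketch of a Spark 2.x AccumulatorV2 registered at runtime (the class and names here are illustrative). Accumulators sidestep volatile/CAS coordination because each task mutates its own local copy, which is merged on the driver:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.util.AccumulatorV2

// Illustrative accumulator tracking a running maximum across tasks.
class MaxLongAccumulator extends AccumulatorV2[Long, Long] {
  private var _max = Long.MinValue
  override def isZero: Boolean = _max == Long.MinValue
  override def copy(): MaxLongAccumulator = {
    val acc = new MaxLongAccumulator
    acc._max = _max
    acc
  }
  override def reset(): Unit = _max = Long.MinValue
  override def add(v: Long): Unit = _max = math.max(_max, v)
  override def merge(other: AccumulatorV2[Long, Long]): Unit =
    _max = math.max(_max, other.value)
  override def value: Long = _max
}

// Registered "dynamically" at runtime rather than declared up front.
def registerMax(sc: SparkContext, name: String): MaxLongAccumulator = {
  val acc = new MaxLongAccumulator
  sc.register(acc, name)
  acc
}
```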

PySpark pickling behavior

2017-10-11 Thread Naveen Swamy
Hello fellow users, 1) I am wondering if there is documentation or guidelines to understand in what situations Pyspark decides to pickle the functions I use in the map method. 2) Are there best practices to avoid pickling and sharing variables, etc.? I have a situation where I want to pass to

Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
When attempting to run any example program w/ Intellij I am running into guava versioning issues:

Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/cache/CacheLoader
    at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
    at

Java Rdd of String to dataframe

2017-10-11 Thread sk skk
Can we create a dataframe from a Java pair RDD of String? I don't have a schema, as it will be dynamic JSON. I supplied the Encoders.STRING class. Any help is appreciated!! Thanks, SK
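One common approach when the JSON is dynamic is to skip the encoder-based path and let Spark infer the schema from the strings themselves. A sketch (Spark 2.2+, shown in Scala; from a JavaPairRDD, take the JSON side first, e.g. pairRdd.values()):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}

// Sketch: build a DataFrame from an RDD of JSON strings with no
// predeclared schema; spark.read.json infers it from the data.
def jsonRddToDataFrame(spark: SparkSession, json: RDD[String]): DataFrame = {
  val ds = spark.createDataset(json)(Encoders.STRING)
  spark.read.json(ds) // schema inferred per run, so it can vary with the data
}
```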

add jars to spark's runtime

2017-10-11 Thread David Capwell
We want to emit the metrics out of spark into our own custom store. To do this we built our own sink and tried to add it to spark by doing --jars path/to/jar and defining the class in metrics.properties which is supplied with the job. We noticed that spark kept crashing with the below exception
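For context, custom sinks generally follow the shape sketched below. Spark's Sink trait is private[spark], so custom sinks are conventionally placed under the org.apache.spark package; the class name and store-specific parts here are illustrative:

```scala
// Sketch of a custom metrics sink; Spark's MetricsSystem instantiates it
// reflectively with this three-argument constructor (Spark 2.x era).
package org.apache.spark.metrics.sink

import java.util.Properties
import com.codahale.metrics.MetricRegistry
import org.apache.spark.SecurityManager

class CustomStoreSink(
    val property: Properties,
    val registry: MetricRegistry,
    securityMgr: SecurityManager)
  extends Sink {
  override def start(): Unit = { /* open connection to the custom store */ }
  override def stop(): Unit = { /* flush and close */ }
  override def report(): Unit = { /* push registry.getGauges etc. to the store */ }
}
```

One commonly cited cause of crashes like this is that the metrics system initializes before --jars lands on the classpath, so shipping the sink jar via spark.driver.extraClassPath / spark.executor.extraClassPath instead may help.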

Re: Job spark blocked and runs indefinitely

2017-10-11 Thread Amine CHERIFI
It seems that the job blocks when we call newAPIHadoopRDD to get data from HBase; that may be the issue! Is there another API to load data from HBase?
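For reference, the call under discussion usually looks like the sketch below (the table name is a placeholder); narrowing the scan (columns, start/stop rows) in the configuration can reduce how much each task pulls. As for another API, the hbase-spark connector module is one alternative:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.SparkContext

// Sketch of the standard newAPIHadoopRDD path into HBase.
def loadFromHBase(sc: SparkContext, table: String) = {
  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, table)
  sc.newAPIHadoopRDD(
    conf,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result])
}
```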

Re: Job spark blocked and runs indefinitely

2017-10-11 Thread Sebastian Piu
We do have this issue randomly too, so I'm interested in hearing if someone was able to get to the bottom of it.

Job spark blocked and runs indefinitely

2017-10-11 Thread amine_901
We encounter a problem with a Spark 1.6 job (on YARN) that never ends when several jobs are launched simultaneously. We found that by launching the job in yarn-client mode we do not have this problem, unlike launching it in yarn-cluster mode; that could be a lead toward finding the cause. We changed the

Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sachin Mittal
Well, it depends on the use case. Say the metric you are evaluating is grouped by a key and you want to parallelize the operation by adding more instances, so that a given instance deals with only a particular group; then it is always better to have partitioning also done on that key. This way a particular instance

Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sachin Mittal
No, it won't work this way. Say you have 9 partitions and 3 instances: 1 = {1, 2, 3}, 2 = {4, 5, 6}, 3 = {7, 8, 9}. And let's say a particular key (k1) is always written to partition 4. Now say you increase partitions to 12; you may have: 1 = {1, 2, 3, 4}, 2 = {5, 6, 7, 8}, 3 = {9, 10, 11, 12}. Now it is
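A small sketch of why the mapping moves: Kafka's default partitioner hashes the key and takes it modulo the partition count, so the same key can land on a different partition once the count changes:

```scala
import java.nio.charset.StandardCharsets
import org.apache.kafka.common.utils.Utils

// Mirrors Kafka's DefaultPartitioner for keyed records:
// partition = toPositive(murmur2(keyBytes)) % numPartitions
def partitionFor(key: String, numPartitions: Int): Int = {
  val bytes = key.getBytes(StandardCharsets.UTF_8)
  Utils.toPositive(Utils.murmur2(bytes)) % numPartitions
}

// e.g. partitionFor("k1", 9) and partitionFor("k1", 12) can differ,
// breaking any instance-to-key assignment built on the old layout.
```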

Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sabarish Sasidharan
@Sachin >>The partition key is very important if you need to run multiple instances of a streams application with certain instances processing certain partitions only. Again, depending on the partition key is optional. It's actually a feature enabler, so we can use local state stores to improve

Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sabarish Sasidharan
@Sachin >>is not elastic. You need to anticipate beforehand the volume of data you will have. It is very difficult to add and reduce topic partitions later on. Why do you say so, Sachin? Kafka Streams will readjust once we add more partitions to the Kafka topic. And when we add more machines,

Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sachin Mittal
Kafka Streams has a lower learning curve, and if your source data is in Kafka topics it is pretty simple to integrate with. It can run as a library inside your main programs. So compared to Spark Streaming it: 1. Is much simpler to implement. 2. Is not as heavy on hardware, unlike Spark.

Kafka streams vs Spark streaming

2017-10-11 Thread Mich Talebzadeh
Hi, Has anyone had experience of using Kafka Streams versus Spark? I am not familiar with the Kafka Streams concept except that it is a set of libraries. Any feedback will be appreciated. Regards, Mich

Re: best spark spatial lib?

2017-10-11 Thread Imran Rajjad
Thanks guys for the responses. Basically I am migrating an Oracle PL/SQL procedure to Spark (Java). In Oracle I have a table with a geometry column, on which I am able to do a "where col = 1 and geom.within(another_geom)". I am looking for a less complicated port into Spark for such queries. I will
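One lightweight port is a JTS-backed UDF over WKT strings -- a sketch, assuming the geometries are stored as WKT (the JTS package is com.vividsolutions.jts in older releases, org.locationtech.jts in newer ones); dedicated libraries such as GeoSpark, Magellan, or GeoMesa offer richer spatial support:

```scala
import com.vividsolutions.jts.io.WKTReader
import org.apache.spark.sql.SparkSession

// Sketch: expose geom.within(other) as a SQL UDF over WKT strings.
// WKTReader is not thread-safe, so construct it inside the function.
val spark = SparkSession.builder.appName("spatial-port").getOrCreate()

spark.udf.register("st_within", (a: String, b: String) => {
  val reader = new WKTReader()
  reader.read(a).within(reader.read(b))
})

// Then roughly the original predicate:
// spark.sql("SELECT * FROM t WHERE col = 1 AND st_within(geom, another_geom)")
```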