Re: Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
Thinking more carefully on your comment:

   - There may be some ambiguity as to whether the repo-provided libraries
   are actually being used here - as you indicate - instead of the in-project
   classes. That would depend on how the classpath inside IJ is constructed.
   When I click through any of the spark classes in the IDE they do go
   directly to the in-project versions and not the repo jars, but that may
   not be definitive.
   - In any case I had already performed the maven install and just verified
   that the jars do have the correct timestamps in the local maven repo.
   - The local maven repo is included by default, so nothing special should
   be needed there.

The same errors from the original post continue to occur.


2017-10-11 20:05 GMT-07:00 Stephen Boesch :

> A clarification here: the example is being run *from the Spark codebase*.
> Therefore the mvn install step would not be required as the classes are
> available directly within the project.
>
> The reason for needing the `mvn package` to be invoked is to pick up the
> changes of having updated the spark dependency scopes from *provided *to
> *compile*.
>
> As mentioned the spark unit tests are working (and within Intellij and
> without `mvn install`): only the examples are not.
>
> 2017-10-11 19:43 GMT-07:00 Paul :
>
>> You say you did the maven package but did you do a maven install and
>> define your local maven repo in SBT?
>>
>> -Paul
>>
>> Sent from my iPhone
>>
>> On Oct 11, 2017, at 5:48 PM, Stephen Boesch  wrote:
>>
>> When attempting to run any example program w/ Intellij I am running into
>> guava versioning issues:
>>
>> Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/cache/CacheLoader
>> at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
>> at org.apache.spark.SparkConf.<init>(SparkConf.scala:68)
>> at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
>> at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:919)
>> at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>> at scala.Option.getOrElse(Option.scala:121)
>> at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>> at org.apache.spark.examples.ml.KMeansExample$.main(KMeansExample.scala:40)
>> at org.apache.spark.examples.ml.KMeansExample.main(KMeansExample.scala)
>> Caused by: java.lang.ClassNotFoundException: com.google.common.cache.CacheLoader
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> ... 9 more
>>
>> The *scope*s for the spark dependencies were already changed from
>> *provided* to *compile* .  Both `sbt assembly` and `mvn package` had
>> already been run (successfully) from command line - and the (mvn) project
>> completely rebuilt inside intellij.
>>
>> The spark testcases run fine: this is a problem only in the examples
>> module.  Anyone running these successfully in IJ?  I have tried for
>> 2.1.0-SNAPSHOT and 2.3.0-SNAPSHOT - with the same outcome.
>>
>>
>>
>


Re: Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
A clarification here: the example is being run *from the Spark codebase*.
Therefore the mvn install step would not be required as the classes are
available directly within the project.

The reason `mvn package` needs to be invoked is to pick up the change of
having updated the spark dependency scopes from *provided* to *compile*.

As mentioned, the spark unit tests are working (within Intellij and
without `mvn install`); only the examples are not.

2017-10-11 19:43 GMT-07:00 Paul :

> You say you did the maven package but did you do a maven install and
> define your local maven repo in SBT?
>
> -Paul
>
> Sent from my iPhone
>
> On Oct 11, 2017, at 5:48 PM, Stephen Boesch  wrote:
>
> When attempting to run any example program w/ Intellij I am running into
> guava versioning issues:
>
> Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/cache/CacheLoader
> at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
> at org.apache.spark.SparkConf.<init>(SparkConf.scala:68)
> at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:919)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
> at org.apache.spark.examples.ml.KMeansExample$.main(KMeansExample.scala:40)
> at org.apache.spark.examples.ml.KMeansExample.main(KMeansExample.scala)
> Caused by: java.lang.ClassNotFoundException: com.google.common.cache.CacheLoader
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 9 more
>
> The *scope*s for the spark dependencies were already changed from
> *provided* to *compile* .  Both `sbt assembly` and `mvn package` had
> already been run (successfully) from command line - and the (mvn) project
> completely rebuilt inside intellij.
>
> The spark testcases run fine: this is a problem only in the examples
> module.  Anyone running these successfully in IJ?  I have tried for
> 2.1.0-SNAPSHOT and 2.3.0-SNAPSHOT - with the same outcome.
>
>
>


Re: Running spark examples in Intellij

2017-10-11 Thread Paul
You say you did the maven package but did you do a maven install and define 
your local maven repo in SBT?

-Paul

Sent from my iPhone

> On Oct 11, 2017, at 5:48 PM, Stephen Boesch  wrote:
> 
> When attempting to run any example program w/ Intellij I am running into 
> guava versioning issues:
> 
> Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/cache/CacheLoader
> at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
> at org.apache.spark.SparkConf.<init>(SparkConf.scala:68)
> at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:919)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
> at org.apache.spark.examples.ml.KMeansExample$.main(KMeansExample.scala:40)
> at org.apache.spark.examples.ml.KMeansExample.main(KMeansExample.scala)
> Caused by: java.lang.ClassNotFoundException: com.google.common.cache.CacheLoader
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 9 more
> 
> The *scope*s for the spark dependencies were already changed from *provided* 
> to *compile* .  Both `sbt assembly` and `mvn package` had already been run 
> (successfully) from command line - and the (mvn) project completely rebuilt 
> inside intellij.
> 
> The spark testcases run fine: this is a problem only in the examples module.  
> Anyone running these successfully in IJ?  I have tried for 2.1.0-SNAPSHOT and 
> 2.3.0-SNAPSHOT - with the same outcome.
> 
> 


Dynamic Accumulators in 2.x?

2017-10-11 Thread David Capwell
I wrote a spark instrumentation tool that instruments RDDs to give more
fine-grained details on what is going on within a Task.  This is working
right now, but it uses volatiles and CAS to pass around this state (which
slows down the task).  We want to lower the overhead, make the main call
path single-threaded, and pass around the result when the task completes;
which sounds like AccumulatorV2.

I started rewriting the instrumentation logic to be based on accumulators,
but I am having a hard time getting these to show up in the UI/API (I am
using that to check whether I am wiring things up properly).

So my question is as follows.

When running in the executor and creating an accumulator (one that was not
created from SparkContext), how would I wire things up properly so it shows
up alongside the accumulators defined from the spark context?  If this
differs across versions that is fine, since we can (hopefully) figure that
out quickly and change the instrumentation.

Approaches taken:

Looked at SparkContext.register and copied the same logic, but at runtime

this.hasNextTotal = new LongAccumulator();
this.hasNextTotal.metadata_$eq(new AccumulatorMetadata(
    AccumulatorContext.newId(), createName("hasNextTotal"), false));
AccumulatorContext.register(hasNextTotal);


That didn't end up working

Tried getting the context from SparkContext.getActive, but it's not
defined at runtime:


Option<SparkContext> opt = SparkContext$.MODULE$.getActive();
if (opt.isDefined()) {
    SparkContext sc = opt.get();
    hasNextTotal.register(sc, Option.apply("hasNext"), false);
    nextTotal.register(sc, Option.apply("next"), false);
}
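
For reference, the normal driver-side path I am trying to mimic looks
roughly like this (an illustrative sketch only; the names are mine):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.LongAccumulator;

public class DriverSideAccumulatorSketch {
    public static void main(String[] args) {
        // local master only so the sketch is self-contained
        JavaSparkContext jsc = new JavaSparkContext(
                new SparkConf().setAppName("acc-sketch").setMaster("local[2]"));

        // Created and registered on the driver; the name is what shows up in the
        // UI / status API for the stages that use it.
        LongAccumulator hasNextTotal = jsc.sc().longAccumulator("hasNextTotal");

        jsc.parallelize(Arrays.asList(1, 2, 3, 4))
           .foreach(x -> hasNextTotal.add(1));

        System.out.println("hasNextTotal = " + hasNextTotal.value());
        jsc.stop();
    }
}

The question is how to get the same UI wiring when the accumulator is first
created on the executor side instead.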


Any help on this would be appreciated! I would really rather not reinvent
the wheel if I can piggyback on Accumulators.

Thanks for your help!

Target spark version: 2.2.0


PySpark pickling behavior

2017-10-11 Thread Naveen Swamy
Hello fellow users,

1) I am wondering if there is documentation or guidelines to understand in
what situations PySpark decides to pickle the functions I use in the map
method.
2) Are there best practices to avoid pickling and to share variables, etc.?

I have a situation where I want to pass methods to map; however, those
methods use C++ libraries underneath, and PySpark decides to pickle the
entire object and fails when trying to do that.

I tried to use broadcast, but the moment I change my function to take
additional parameters that must be passed through map, Spark decides to
create an object and tries to serialize that.

Now I can probably create a dummy function that just does the sharing of
the variables and initializes them locally, and chain that into the map
method, but I think it would be pretty awkward if I have to resort to that.

Here is my situation in code:

class Model(object):
    __metaclass__ = Singleton
    model_loaded = False
    mod = None

    @staticmethod
    def load(args):
        # load model
        ...

    @staticmethod
    def predict(input, args):
        if not Model.model_loaded:
            Model.load(args)
        return Model.mod.predict(input)

def spark_main():
    args = parse_args()
    lines = read()
    rdd = sc.parallelize(lines)
    rdd = rdd.map(lambda x: Model.predict(x, args))  # fails here with:
    # pickle.PicklingError: Could not serialize object: TypeError: can't
    # pickle thread.lock objects

Thanks, Naveen


Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
When attempting to run any example program w/ Intellij I am running into
guava versioning issues:

Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/cache/CacheLoader
at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:68)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:919)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
at org.apache.spark.examples.ml.KMeansExample$.main(KMeansExample.scala:40)
at org.apache.spark.examples.ml.KMeansExample.main(KMeansExample.scala)
Caused by: java.lang.ClassNotFoundException: com.google.common.cache.CacheLoader
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 9 more

The *scope*s for the spark dependencies were already changed from
*provided* to *compile*.  Both `sbt assembly` and `mvn package` had already
been run (successfully) from the command line, and the (mvn) project was
completely rebuilt inside Intellij.

The spark testcases run fine; this is a problem only in the examples
module.  Is anyone running these successfully in IJ?  I have tried
2.1.0-SNAPSHOT and 2.3.0-SNAPSHOT, with the same outcome.


Java Rdd of String to dataframe

2017-10-11 Thread sk skk
Can we create a dataframe from a Java pair RDD of String? I don't have a
schema, as it will be dynamic JSON. I used the Encoders.STRING class.
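
For reference, this is roughly the shape I have been attempting (a sketch
only; it assumes a JavaPairRDD<String, String> whose values hold the JSON
payload, and a Spark 2.2-style json reader that accepts a Dataset<String>):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonRddToDataFrameSketch {
    // pairs: hypothetical pair RDD whose values are JSON strings
    static Dataset<Row> toDataFrame(SparkSession spark, JavaPairRDD<String, String> pairs) {
        // keep only the JSON payload
        JavaRDD<String> json = pairs.values();
        // wrap it as a Dataset<String> using the string encoder
        Dataset<String> ds = spark.createDataset(json.rdd(), Encoders.STRING());
        // let spark infer the schema from the JSON itself
        return spark.read().json(ds);
    }
}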

Any help is appreciated !!

Thanks,
SK


add jars to spark's runtime

2017-10-11 Thread David Capwell
We want to emit the metrics out of spark into our own custom store.  To do
this we built our own sink and tried to add it to spark by doing --jars
path/to/jar and defining the class in metrics.properties which is supplied
with the job.  We noticed that spark kept crashing with the below exception

17/10/11 09:42:37 ERROR metrics.MetricsSystem: Sink class
com.example.ExternalSink cannot be instantiated
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1707)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: com.example.ExternalSink
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:230)
at
org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
at
org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at
org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
at
org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:366)
at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:201)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:223)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
... 4 more

We then added this jar into the spark tarball that we use for testing, and
saw that it was able to load just fine and publish.

My question is, how can I add this jar to the spark runtime rather than the
user runtime?  In production we don't have permissions to write to the jars
dir of spark, so that trick to get this working won't work for us.
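
One thing we are considering (a sketch only, and it assumes we can get the
jar onto every node at a fixed path out-of-band, e.g. via configuration
management or --files) is putting it on the JVM system classpath through the
extraClassPath settings rather than --jars:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ExternalSinkClasspathSketch {
    public static void main(String[] args) {
        // Hypothetical path: the sink jar must already exist at this location on
        // each node; these settings do not ship the jar themselves.
        String sinkJar = "/opt/metrics/external-sink.jar";

        SparkConf conf = new SparkConf()
                .setAppName("external-sink-example")
                // Prepend the jar to the executor JVM's classpath so the metrics
                // system can see the sink class when the executor starts up.
                .set("spark.executor.extraClassPath", sinkJar)
                // Same idea for the driver; in client mode this generally has to be
                // passed at submit time (e.g. --driver-class-path) instead, since
                // the driver JVM is already running by the time SparkConf is read.
                .set("spark.driver.extraClassPath", sinkJar);

        JavaSparkContext jsc = new JavaSparkContext(conf);
        // ... run the job ...
        jsc.stop();
    }
}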

Thanks for your time reading this email.

Tested on spark 2.2.0


Re: Job spark blocked and runs indefinitely

2017-10-11 Thread Amine CHERIFI
It seems that the job blocks when we call newAPIHadoopRDD to get data from
HBase; that may be the issue!
Is there another API to load data from HBase?



2017-10-11 14:45 GMT+02:00 Sebastian Piu :

> We do have this issue randomly too, so interested in hearing if someone
> was able to get to the bottom of it
>
> On Wed, 11 Oct 2017, 13:40 amine_901, 
> wrote:
>
>> We encounter a problem on a Spark job 1.6(on yarn) that never ends, whene
>> several jobs launched simultaneously.
>> We found that by launching the job spark in yarn-client mode we do not
>> have
>> this problem, unlike launching it in yarn-cluster mode.
>> it could be a trail to find the cause.
>>
>> we changed the code to add a sparkContext.stop ()
>> Indeed, the SparkContext was created (val sparkContext =
>> createSparkContext)
>> but not stopped. this solution has allowed us to decrease the number of
>> jobs
>> that remains blocked but nevertheless we still have some jobs blocked.
>>
>> by analyzing the logs we have found this log that repeats without
>> stopping:
>> /17/09/29 11:04:37 DEBUG SparkEventPublisher: Enqueue
>> SparkListenerExecutorMetricsUpdate(1,WrappedArray())
>> 17/09/29 11:04:41 DEBUG ApplicationMaster: Sending progress
>> 17/09/29 11:04:41 DEBUG ApplicationMaster: Number of pending allocations
>> is
>> 0. Sleeping for 5000. /
>>
>> Does someone have an idea about this issue ?
>> Thank you in advance
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


-- 
CHERIFI Mohamed Amine
Développeur Big data/Data scientist
07 81 65 17 03


Re: Job spark blocked and runs indefinitely

2017-10-11 Thread Sebastian Piu
We do have this issue randomly too, so interested in hearing if someone was
able to get to the bottom of it

On Wed, 11 Oct 2017, 13:40 amine_901,  wrote:

> We encounter a problem on a Spark job 1.6(on yarn) that never ends, whene
> several jobs launched simultaneously.
> We found that by launching the job spark in yarn-client mode we do not have
> this problem, unlike launching it in yarn-cluster mode.
> it could be a trail to find the cause.
>
> we changed the code to add a sparkContext.stop ()
> Indeed, the SparkContext was created (val sparkContext =
> createSparkContext)
> but not stopped. this solution has allowed us to decrease the number of
> jobs
> that remains blocked but nevertheless we still have some jobs blocked.
>
> by analyzing the logs we have found this log that repeats without stopping:
> /17/09/29 11:04:37 DEBUG SparkEventPublisher: Enqueue
> SparkListenerExecutorMetricsUpdate(1,WrappedArray())
> 17/09/29 11:04:41 DEBUG ApplicationMaster: Sending progress
> 17/09/29 11:04:41 DEBUG ApplicationMaster: Number of pending allocations is
> 0. Sleeping for 5000. /
>
> Does someone have an idea about this issue ?
> Thank you in advance
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Job spark blocked and runs indefinitely

2017-10-11 Thread amine_901
We encounter a problem with a Spark 1.6 job (on YARN) that never ends when
several jobs are launched simultaneously.
We found that by launching the spark job in yarn-client mode we do not have
this problem, unlike launching it in yarn-cluster mode;
that could be a lead to finding the cause.

We changed the code to add a sparkContext.stop():
indeed, the SparkContext was created (val sparkContext = createSparkContext)
but never stopped. This change allowed us to decrease the number of jobs
that remain blocked, but nevertheless we still have some blocked jobs.

By analyzing the logs we found this message that repeats without stopping:
17/09/29 11:04:37 DEBUG SparkEventPublisher: Enqueue
SparkListenerExecutorMetricsUpdate(1,WrappedArray())
17/09/29 11:04:41 DEBUG ApplicationMaster: Sending progress
17/09/29 11:04:41 DEBUG ApplicationMaster: Number of pending allocations is
0. Sleeping for 5000.

Does someone have an idea about this issue?
Thank you in advance
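
(For reference, a minimal sketch, in Java, of the shape of that
sparkContext.stop() change; illustrative only, not our actual code:)

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class JobMain {
    public static void main(String[] args) {
        JavaSparkContext sparkContext =
                new JavaSparkContext(new SparkConf().setAppName("blocked-job-example"));
        try {
            // ... the actual job logic ...
        } finally {
            // Stop the context even if the job throws, so the YARN application
            // can shut down instead of hanging around.
            sparkContext.stop();
        }
    }
}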



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Job spark blocked and runs indefinitely

2017-10-11 Thread amine_901
We encounter a problem with a Spark 1.6 job (on YARN) that never ends when
several jobs are launched simultaneously.
We found that by launching the spark job in yarn-client mode we do not have
this problem, unlike launching it in yarn-cluster mode;
that could be a lead to finding the cause.

We changed the code to add a sparkContext.stop():
indeed, the SparkContext was created (val sparkContext = createSparkContext)
but never stopped. This change allowed us to decrease the number of jobs
that remain blocked, but nevertheless we still have some blocked jobs.

By analyzing the logs we found this message that repeats without stopping:
17/09/29 11:04:37 DEBUG SparkEventPublisher: Enqueue
SparkListenerExecutorMetricsUpdate(1,WrappedArray())
17/09/29 11:04:41 DEBUG ApplicationMaster: Sending progress
17/09/29 11:04:41 DEBUG ApplicationMaster: Number of pending allocations is
0. Sleeping for 5000.

Does someone have an idea about this issue?
Thank you in advance



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sachin Mittal
Well, it depends on the use case. Say the metric you are evaluating is
grouped by a key and you want to parallelize the operation by adding more
instances so that certain instances deal with only a particular group; it is
then always better to have the partitioning done on that key as well. This
way a particular instance will always compute over certain partitions and
hence certain keys.

So in such a case you need to make sure the producers are also producing
based on that key.

It's optional, yes, but for good performance one needs to ensure topics are
partitioned based on key hashes.

In spark this is not needed, as it is not backed by a topic.

In short, kafka streams is backed by a topic, and that does create some
downsides (alongside some upsides too).


On Wed, Oct 11, 2017 at 2:00 PM, Sabarish Sasidharan  wrote:

> @Sachin
> >>The partition key is very important if you need to run multiple
> instances of streams application and certain instance processing certain
> partitions only.
>
> Again, depending on partition key is optional. It's actually a feature
> enabler, so we can use local state stores to improve throughput. I don't
> see this as a downside.
>
> Regards
> Sab
>
> On 11 Oct 2017 1:44 pm, "Sachin Mittal"  wrote:
>
>> Kafka streams has a lower learning curve and if your source data is in
>> kafka topics it is pretty simple to integrate it with.
>> It can run like a library inside your main programs.
>>
>> So as compared to spark streams
>> 1. Is much simpler to implement.
>> 2. Is not much heavy on hardware unlike spark.
>>
>>
>> On the downside
>> 1. It is not elastic. You need to anticipate before hand on volume of
>> data you will have. Very difficult to add and reduce topic partitions later
>> on.
>> 2. The partition key is very important if you need to run multiple
>> instances of streams application and certain instance processing certain
>> partitions only.
>>  In case you need aggregation on a different key you may need to
>> re-partition the data to a new topic and run new streams app against that.
>>
>> So yes if you have good idea about your data and if it comes from kafka
>> and you want to build something quick without much hardware kafka streams
>> is a way to go.
>>
>> We had first tried spark streaming but given hardware limitation and
>> complexity of fetching data from mongodb we decided kafka streams as way to
>> go forward.
>>
>> Thanks
>> Sachin
>>
>>
>>
>>
>>
>> On Wed, Oct 11, 2017 at 1:01 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Has anyone had an experience of using Kafka streams versus Spark?
>>>
>>> I am not familiar with Kafka streams concept except that it is a set of
>>> libraries.
>>>
>>> Any feedback will be appreciated.
>>>
>>> Regards,
>>>
>>> Mich
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>>


Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sachin Mittal
No, it won't work that way.
Say you have 9 partitions and 3 instances:
1 = {1, 2, 3}
2 = {4, 5, 6}
3 = {7, 8, 9}
And let's say a particular key (k1) is always written to partition 4.
Now say you increase the partitions to 12; you may have:
1 = {1, 2, 3, 4}
2 = {5, 6, 7, 8}
3 = {9, 10, 11, 12}

Now it is possible that k1 is still written to partition 4 and is now
processed by instance 1, or it may be redistributed to some new partition,
say 9, and processed by instance 3.
So we would have some old data for that key written to partition 4 and new
data, after the partitions were added, written to partition 9.

Hence an aggregation over a time window spanning both partitions would
report incorrect results.

As a result it is general practice to create slightly more partitions than
the anticipated data volume requires. Hence kafka topics and streaming are
not as elastic as spark.
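
To see the effect, here is a rough illustration (assuming the kafka-clients
jar is on the classpath) of the hash-mod step the default partitioner
applies to keyed records; the exact mapping can vary across client versions,
but the point is that the same key may land in a different partition once
the partition count changes:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class KeyPartitionDemo {
    // Mirrors the hash-mod step used for keyed records: murmur2(key) % numPartitions.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        String key = "k1"; // hypothetical key
        System.out.println("9 partitions  -> partition " + partitionFor(key, 9));
        System.out.println("12 partitions -> partition " + partitionFor(key, 12));
    }
}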



On Wed, Oct 11, 2017 at 1:56 PM, Sabarish Sasidharan  wrote:

> @Sachin
> >>is not elastic. You need to anticipate before hand on volume of data you
> will have. Very difficult to add and reduce topic partitions later on.
>
> Why do you say so Sachin? Kafka Streams will readjust once we add more
> partitions to the Kafka topic. And when we add more machines, rebalancing
> auto distributes the partitions among the new stream threads.
>
> Regards
> Sab
>
> On 11 Oct 2017 1:44 pm, "Sachin Mittal"  wrote:
>
>> Kafka streams has a lower learning curve and if your source data is in
>> kafka topics it is pretty simple to integrate it with.
>> It can run like a library inside your main programs.
>>
>> So as compared to spark streams
>> 1. Is much simpler to implement.
>> 2. Is not much heavy on hardware unlike spark.
>>
>>
>> On the downside
>> 1. It is not elastic. You need to anticipate before hand on volume of
>> data you will have. Very difficult to add and reduce topic partitions later
>> on.
>> 2. The partition key is very important if you need to run multiple
>> instances of streams application and certain instance processing certain
>> partitions only.
>>  In case you need aggregation on a different key you may need to
>> re-partition the data to a new topic and run new streams app against that.
>>
>> So yes if you have good idea about your data and if it comes from kafka
>> and you want to build something quick without much hardware kafka streams
>> is a way to go.
>>
>> We had first tried spark streaming but given hardware limitation and
>> complexity of fetching data from mongodb we decided kafka streams as way to
>> go forward.
>>
>> Thanks
>> Sachin
>>
>>
>>
>>
>>
>> On Wed, Oct 11, 2017 at 1:01 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Has anyone had an experience of using Kafka streams versus Spark?
>>>
>>> I am not familiar with Kafka streams concept except that it is a set of
>>> libraries.
>>>
>>> Any feedback will be appreciated.
>>>
>>> Regards,
>>>
>>> Mich
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>>


Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sabarish Sasidharan
@Sachin
>>The partition key is very important if you need to run multiple instances
of streams application and certain instance processing certain partitions
only.

Again, depending on partition key is optional. It's actually a feature
enabler, so we can use local state stores to improve throughput. I don't
see this as a downside.

Regards
Sab

On 11 Oct 2017 1:44 pm, "Sachin Mittal"  wrote:

> Kafka streams has a lower learning curve and if your source data is in
> kafka topics it is pretty simple to integrate it with.
> It can run like a library inside your main programs.
>
> So as compared to spark streams
> 1. Is much simpler to implement.
> 2. Is not much heavy on hardware unlike spark.
>
>
> On the downside
> 1. It is not elastic. You need to anticipate before hand on volume of data
> you will have. Very difficult to add and reduce topic partitions later on.
> 2. The partition key is very important if you need to run multiple
> instances of streams application and certain instance processing certain
> partitions only.
>  In case you need aggregation on a different key you may need to
> re-partition the data to a new topic and run new streams app against that.
>
> So yes if you have good idea about your data and if it comes from kafka
> and you want to build something quick without much hardware kafka streams
> is a way to go.
>
> We had first tried spark streaming but given hardware limitation and
> complexity of fetching data from mongodb we decided kafka streams as way to
> go forward.
>
> Thanks
> Sachin
>
>
>
>
>
> On Wed, Oct 11, 2017 at 1:01 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> Has anyone had an experience of using Kafka streams versus Spark?
>>
>> I am not familiar with Kafka streams concept except that it is a set of
>> libraries.
>>
>> Any feedback will be appreciated.
>>
>> Regards,
>>
>> Mich
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>


Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sabarish Sasidharan
@Sachin
>>is not elastic. You need to anticipate before hand on volume of data you
will have. Very difficult to add and reduce topic partitions later on.

Why do you say so Sachin? Kafka Streams will readjust once we add more
partitions to the Kafka topic. And when we add more machines, rebalancing
auto distributes the partitions among the new stream threads.

Regards
Sab

On 11 Oct 2017 1:44 pm, "Sachin Mittal"  wrote:

> Kafka streams has a lower learning curve and if your source data is in
> kafka topics it is pretty simple to integrate it with.
> It can run like a library inside your main programs.
>
> So as compared to spark streams
> 1. Is much simpler to implement.
> 2. Is not much heavy on hardware unlike spark.
>
>
> On the downside
> 1. It is not elastic. You need to anticipate before hand on volume of data
> you will have. Very difficult to add and reduce topic partitions later on.
> 2. The partition key is very important if you need to run multiple
> instances of streams application and certain instance processing certain
> partitions only.
>  In case you need aggregation on a different key you may need to
> re-partition the data to a new topic and run new streams app against that.
>
> So yes if you have good idea about your data and if it comes from kafka
> and you want to build something quick without much hardware kafka streams
> is a way to go.
>
> We had first tried spark streaming but given hardware limitation and
> complexity of fetching data from mongodb we decided kafka streams as way to
> go forward.
>
> Thanks
> Sachin
>
>
>
>
>
> On Wed, Oct 11, 2017 at 1:01 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> Has anyone had an experience of using Kafka streams versus Spark?
>>
>> I am not familiar with Kafka streams concept except that it is a set of
>> libraries.
>>
>> Any feedback will be appreciated.
>>
>> Regards,
>>
>> Mich
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>


Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sachin Mittal
Kafka streams has a lower learning curve, and if your source data is in
kafka topics it is pretty simple to integrate with.
It can run like a library inside your main programs.

So, compared to spark streaming:
1. It is much simpler to implement.
2. It is not as heavy on hardware as spark.


On the downside:
1. It is not elastic. You need to anticipate beforehand the volume of data
you will have; it is very difficult to add and reduce topic partitions later on.
2. The partition key is very important if you need to run multiple
instances of the streams application with certain instances processing
certain partitions only.
 In case you need aggregation on a different key you may need to
re-partition the data to a new topic and run a new streams app against that.

So yes, if you have a good idea about your data, it comes from kafka, and
you want to build something quick without much hardware, kafka streams is a
way to go.

We had first tried spark streaming, but given hardware limitations and the
complexity of fetching data from mongodb we decided kafka streams was the
way to go forward.

Thanks
Sachin





On Wed, Oct 11, 2017 at 1:01 PM, Mich Talebzadeh 
wrote:

> Hi,
>
> Has anyone had an experience of using Kafka streams versus Spark?
>
> I am not familiar with Kafka streams concept except that it is a set of
> libraries.
>
> Any feedback will be appreciated.
>
> Regards,
>
> Mich
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Kafka streams vs Spark streaming

2017-10-11 Thread Mich Talebzadeh
Hi,

Has anyone had experience of using Kafka streams versus Spark?

I am not familiar with the Kafka streams concept except that it is a set of
libraries.

Any feedback will be appreciated.

Regards,

Mich



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: best spark spatial lib?

2017-10-11 Thread Imran Rajjad
Thanks, guys, for the response.

Basically I am migrating an oracle pl/sql procedure to spark-java. In
oracle I have a table with a geometry column, on which I am able to do a

"where col = 1 and geom.within(another_geom)"

I am looking for a less complicated port into spark for such queries. I
will give these libraries a shot.

@Ram .. Magellan I guess does not support Java

regards,
Imran

On Wed, Oct 11, 2017 at 12:07 AM, Ram Sriharsha 
wrote:

> why can't you do this in Magellan?
> Can you post a sample query that you are trying to run that has spatial
> and logical operators combined? Maybe I am not understanding the issue
> properly
>
> Ram
>
> On Tue, Oct 10, 2017 at 2:21 AM, Imran Rajjad  wrote:
>
>> I need to have a location column inside my Dataframe so that I can do
>> spatial queries and geometry operations. Are there any third-party packages
>> that perform this kind of operation? I have seen a few, like GeoSpark and
>> Magellan, but they don't support operations where spatial and logical
>> operators can be combined.
>>
>> regards,
>> Imran
>>
>> --
>> I.R
>>
>
>


-- 
I.R