How do I link JavaEsSpark.saveToEs() to a sparkConf?

2015-12-14 Thread Spark Enthusiast
Folks,
I have the following program :
SparkConf conf = new SparkConf()
        .setMaster("local")
        .setAppName("Indexer")
        .set("spark.driver.maxResultSize", "2g");
conf.set("es.index.auto.create", "true");
conf.set("es.nodes", "localhost");
conf.set("es.port", "9200");
conf.set("es.write.operation", "index");
JavaSparkContext sc = new JavaSparkContext(conf);
        ...
JavaEsSpark.saveToEs(filteredFields, "foo");

I get an error saying it cannot find storage. It looks like the driver program cannot reach the Elasticsearch server. Looking at the program, I have not associated JavaEsSpark with the SparkConf.
Question: How do I associate JavaEsSpark with the SparkConf?
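
For what it is worth (please verify against the elasticsearch-hadoop docs for your version), JavaEsSpark does not need to be wired to the SparkConf explicitly: saveToEs() reads the es.* settings from the SparkConf of the JavaSparkContext that produced the RDD, so the configuration above should already be in effect. A "cannot find storage"-style error usually points at es.nodes or es.port not being reachable from the driver and executors rather than at missing wiring. A minimal sketch, where the "foo/docs" index/type resource and the sample document are illustrative assumptions:

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

SparkConf conf = new SparkConf()
        .setMaster("local")
        .setAppName("Indexer")
        .set("es.nodes", "localhost")          // must be reachable from driver and executors
        .set("es.port", "9200")
        .set("es.index.auto.create", "true");
JavaSparkContext sc = new JavaSparkContext(conf);

// A throw-away RDD of documents, just to keep the sketch self-contained.
Map<String, Object> doc = new HashMap<String, Object>();
doc.put("endpoint", "/index.html");
doc.put("count", 42);
JavaRDD<Map<String, Object>> docs = sc.parallelize(Arrays.asList(doc));

// No explicit association is needed: saveToEs() picks up the es.* settings
// from the SparkConf of the SparkContext behind the RDD it is given.
JavaEsSpark.saveToEs(docs, "foo/docs");

// Individual settings can also be overridden per call with an explicit map.
JavaEsSpark.saveToEs(docs, "foo/docs",
        Collections.singletonMap("es.write.operation", "index"));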



Getting an error when trying to read a GZIPPED file

2015-09-02 Thread Spark Enthusiast
Folks,
I have an input file which is gzipped. I use sc.textFile("foo.gz"), and I see the following problem. Can someone help me figure out how to fix this?
15/09/03 10:05:32 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/09/03 10:05:32 INFO CodecPool: Got brand-new decompressor [.gz]
15/09/03 10:06:15 WARN MemoryStore: Not enough space to cache rdd_2_0 in memory! (computed 216.3 MB so far)
15/09/03 10:06:15 INFO MemoryStore: Memory use = 156.2 KB (blocks) + 213.1 MB (scratch space shared across 1 thread(s)) = 213.3 MB. Storage limit = 265.1 MB.
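
Not a direct answer, but the messages above are a WARN rather than a hard error: gzip is not splittable, so sc.textFile("foo.gz") puts the whole file into a single partition, and that one 200+ MB block does not fit within the 265 MB storage limit, so Spark declines to cache it (the job itself should still proceed, just without the cache). A hedged Java sketch of the usual workarounds; the partition count of 8 is an arbitrary illustration and the same calls exist in PySpark:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

SparkConf conf = new SparkConf().setAppName("GzipReader").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);

// gzip input always arrives as exactly one partition, so spread it out first.
JavaRDD<String> lines = sc.textFile("foo.gz")
        .repartition(8);                      // tune to the file size and cluster

// Cache only if the data is reused, and prefer MEMORY_AND_DISK so partitions
// that do not fit in memory spill to disk instead of being dropped.
lines.persist(StorageLevel.MEMORY_AND_DISK());

Raising spark.executor.memory (or spark.storage.memoryFraction on these older releases) is the other common lever if the partition genuinely has to stay in memory.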





Spark

2015-08-24 Thread Spark Enthusiast
I was running a Spark job to crunch a 9 GB Apache log file when I saw the following error:


15/08/25 04:25:16 WARN scheduler.TaskSetManager: Lost task 99.0 in stage 37.0 (TID 4115, ip-10-150-137-100.ap-southeast-1.compute.internal): ExecutorLostFailure (executor 29 lost)
15/08/25 04:25:16 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(37, 40), so marking it as still running
15/08/25 04:25:16 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(37, 86), so marking it as still running
15/08/25 04:25:16 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(37, 84), so marking it as still running
15/08/25 04:25:16 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(37, 22), so marking it as still running
15/08/25 04:25:16 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(37, 48), so marking it as still running
15/08/25 04:25:16 INFO scheduler.DAGScheduler: Resubmitted ShuffleMapTask(37, 12), so marking it as still running
15/08/25 04:25:16 INFO scheduler.DAGScheduler: Executor lost: 29 (epoch 59)
15/08/25 04:25:16 INFO storage.BlockManagerMasterActor: Trying to remove executor 29 from BlockManagerMaster.
15/08/25 04:25:16 INFO storage.BlockManagerMasterActor: Removing block manager BlockManagerId(29, ip-10-150-137-100.ap-southeast-1.compute.internal, 39411)
    ...
Encountered Exception An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:699)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:698)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:698)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1411)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1346)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1380)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:143)
    ...
Looking further, it seems like takeOrdered (called by my application) uses collect() internally and hence drains all the driver memory.
line 361, in top10EndPoints
    topEndpoints = endpointCounts.takeOrdered(10, lambda s: -1 * s[1])
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 1174, in takeOrdered
    return self.mapPartitions(lambda it: [heapq.nsmallest(num, it, key)]).reduce(merge)
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 739, in reduce
    vals = self.mapPartitions(func).collect()
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 713, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
    self.target_id, self.name)
  File "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    format(target_id, '.', name), value)
How can I rewrite this code



endpointCounts = (access_logs
                  .map(lambda log: (log.endpoint, 1))
                  .reduceByKey(lambda a, b: a + b))
# endpointCounts is now a list of tuples: [(endpoint1, count1), (endpoint2, count2), ...]
topEndpoints = endpointCounts.takeOrdered(10, lambda s: -1 * s[1])

so that this error does not happen?
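
For what it is worth, the traceback above suggests takeOrdered itself should not flood the driver: each partition keeps only its own top N via heapq.nsmallest(num, ...) before the small per-partition lists are merged, so roughly 10 rows per partition come back. The ExecutorLostFailure lines earlier point at executors (not the driver) dying, often from memory pressure during the shuffle, so more partitions and/or more executor memory is the more likely fix than rewriting takeOrdered. A rough Java sketch of the same pipeline with an explicit partition count; the AccessLog type and its getEndpoint() accessor are hypothetical stand-ins for the PySpark log object:

// Imports assumed: org.apache.spark.api.java.*, org.apache.spark.api.java.function.*,
// scala.Tuple2, java.io.Serializable, java.util.Comparator, java.util.List.

// The comparator is shipped to the executors, so it must be Serializable.
class CountDescending implements Comparator<Tuple2<String, Integer>>, Serializable {
    @Override
    public int compare(Tuple2<String, Integer> a, Tuple2<String, Integer> b) {
        return b._2().compareTo(a._2());
    }
}

// Hypothetical: accessLogs is a JavaRDD<AccessLog>.
JavaPairRDD<String, Integer> endpointCounts = accessLogs
        .mapToPair(new PairFunction<AccessLog, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(AccessLog log) {
                return new Tuple2<String, Integer>(log.getEndpoint(), 1);
            }
        })
        .reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer a, Integer b) { return a + b; }
        }, 200);                                   // explicit partition count spreads the shuffle

// Only about 10 rows per partition ever reach the driver.
List<Tuple2<String, Integer>> topEndpoints =
        endpointCounts.takeOrdered(10, new CountDescending());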

How to parse multiple event types using Kafka

2015-08-23 Thread Spark Enthusiast
Folks,
I use the following Streaming API from KafkaUtils :
public JavaPairInputDStream<String, String> inputDStream() {

    HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(topics.split(",")));
    HashMap<String, String> kafkaParams = new HashMap<String, String>();
    kafkaParams.put(Tokens.KAFKA_BROKER_LIST_TOKEN.getRealTokenName(), brokers);

    return KafkaUtils.createDirectStream(
            streamingContext,
            String.class,
            String.class,
            StringDecoder.class,
            StringDecoder.class,
            kafkaParams,
            topicsSet
    );
}

I catch the messages using:

JavaDStream<String> messages = inputDStream.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> tuple2) {
        return tuple2._2();
    }
});
My problem is that each of these Kafka topics streams in a different message type. How do I distinguish messages that are of Type1 from messages that are of Type2, and so on?
I tried the following:
private class ParseEvents<T> implements Function<String, T> {
    final Class<T> parameterClass;

    private ParseEvents(Class<T> parameterClass) {
        this.parameterClass = parameterClass;
    }

    public T call(String message) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        T parsedMessage = null;

        try {
            parsedMessage = mapper.readValue(message, this.parameterClass);
        } catch (Exception e1) {
            logger.error("Ignoring Unknown Message %s", message);
        }
        return parsedMessage;
    }
}

JavaDStream<Type1> type1Events = messages.map(new ParseEvents<Type1>(Type1.class));
JavaDStream<Type2> type2Events = messages.map(new ParseEvents<Type2>(Type2.class));
JavaDStream<Type3> type3Events = messages.map(new ParseEvents<Type3>(Type3.class));
But this does not work, because the Type1 stream also catches Type2 messages and just ignores them. Is there a clean way of handling this?
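
One approach that may help (a sketch, not the only answer): since the type is determined entirely by the topic, create one direct stream per topic so each stream only ever sees one message type, then drop the nulls left by genuine parse failures. The topic name "type1-topic" below is an illustrative assumption:

// One direct stream for a single topic, reusing the same kafkaParams.
private JavaPairInputDStream<String, String> streamFor(String topic) {
    return KafkaUtils.createDirectStream(
            streamingContext,
            String.class, String.class,
            StringDecoder.class, StringDecoder.class,
            kafkaParams,
            new HashSet<String>(Collections.singletonList(topic)));
}

// Each topic now maps to exactly one type, so Type1 never sees Type2 traffic.
JavaDStream<Type1> type1Events = streamFor("type1-topic")          // hypothetical topic name
        .map(new Function<Tuple2<String, String>, String>() {
            public String call(Tuple2<String, String> t) { return t._2(); }
        })
        .map(new ParseEvents<Type1>(Type1.class))
        .filter(new Function<Type1, Boolean>() {
            public Boolean call(Type1 e) { return e != null; }      // drop real parse failures
        });

If you would rather keep a single stream, there is (if I remember correctly) a createDirectStream overload that takes a messageHandler over MessageAndMetadata, which lets you carry the topic name alongside each record and branch on it downstream.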




How to automatically relaunch a Driver program after crashes?

2015-08-19 Thread Spark Enthusiast
Folks,
As I see it, the driver program is a single point of failure. Now, I have seen ways to make it recover from failures on a restart (using checkpointing), but I have not seen anything on how to restart it automatically if it crashes.
Will running the driver as a Hadoop YARN application do it? Can someone educate me as to how?

Re: How to automatically relaunch a Driver program after crashes?

2015-08-19 Thread Spark Enthusiast
Thanks for the reply.
Are Standalone and Mesos the only options? Is there a way to auto-relaunch if the driver runs as a Hadoop YARN application?


 On Wednesday, 19 August 2015 12:49 PM, Todd bit1...@163.com wrote:
   

There is an option for spark-submit (Spark standalone or Mesos with cluster deploy mode only):
  --supervise    If given, restarts the driver on failure.




At 2015-08-19 14:55:39, Spark Enthusiast sparkenthusi...@yahoo.in wrote:
 
Folks,
As I see it, the driver program is a single point of failure. Now, I have seen ways to make it recover from failures on a restart (using checkpointing), but I have not seen anything on how to restart it automatically if it crashes.
Will running the driver as a Hadoop YARN application do it? Can someone educate me as to how?


  

Re: Not seeing Log messages

2015-08-11 Thread Spark Enthusiast
Forgot to mention. Here is how I run the program :
 ./bin/spark-submit --conf spark.app.master=local[1] 
~/workspace/spark-python/ApacheLogWebServerAnalysis.py


 On Wednesday, 12 August 2015 10:28 AM, Spark Enthusiast 
sparkenthusi...@yahoo.in wrote:
   

 I wrote a small python program :
def parseLogs(self):
    """ Read and parse log file """
    self._logger.debug("Parselogs() start")
    self.parsed_logs = (self._sc
                        .textFile(self._logFile)
                        .map(self._parseApacheLogLine)
                        .cache())

    self.access_logs = (self.parsed_logs
                        .filter(lambda s: s[1] == 1)
                        .map(lambda s: s[0])
                        .cache())

    self.failed_logs = (self.parsed_logs
                        .filter(lambda s: s[1] == 0)
                        .map(lambda s: s[0]))
    failed_logs_count = self.failed_logs.count()
    if failed_logs_count > 0:
        self._logger.debug('Number of invalid logline: %d' % self.failed_logs.count())

        for line in self.failed_logs.take(20):
            self._logger.debug('Invalid logline: %s' % line)

    self._logger.debug('Read %d lines, successfully parsed %d lines, failed to parse %d lines' %
                       (self.parsed_logs.count(), self.access_logs.count(), self.failed_logs.count()))

    return (self.parsed_logs, self.access_logs, self.failed_logs)


def main(argv):
    try:
        logger = createLogger("pyspark", logging.DEBUG, "LogAnalyzer.log", "./")
        logger.debug("Starting LogAnalyzer")
        myLogAnalyzer = ApacheLogAnalyzer(logger)
        (parsed_logs, access_logs, failed_logs) = myLogAnalyzer.parseLogs()
    except Exception as e:
        print "Encountered Exception %s" % str(e)

    logger.debug('Read %d lines, successfully parsed %d lines, failed to parse %d lines' %
                 (parsed_logs.count(), access_logs.count(), failed_logs.count()))
    logger.info("DONE. ALL TESTS PASSED")

I see some log messages:

    Starting LogAnalyzer
    Parselogs() start
    DONE. ALL TESTS PASSED

But I do not see others, such as 'Read %d lines, successfully parsed %d lines, failed to parse %d lines'.

For this line:

    logger.debug('Read %d lines, successfully parsed %d lines, failed to parse %d lines' %
                 (parsed_logs.count(), access_logs.count(), failed_logs.count()))

I get the following error:

    Encountered Exception Cannot pickle files that are not opened for reading

I do not have a clue as to what's happening. Any help will be appreciated.



  

Not seeing Log messages

2015-08-11 Thread Spark Enthusiast
I wrote a small python program :
def parseLogs(self):
    """ Read and parse log file """
    self._logger.debug("Parselogs() start")
    self.parsed_logs = (self._sc
                        .textFile(self._logFile)
                        .map(self._parseApacheLogLine)
                        .cache())

    self.access_logs = (self.parsed_logs
                        .filter(lambda s: s[1] == 1)
                        .map(lambda s: s[0])
                        .cache())

    self.failed_logs = (self.parsed_logs
                        .filter(lambda s: s[1] == 0)
                        .map(lambda s: s[0]))
    failed_logs_count = self.failed_logs.count()
    if failed_logs_count > 0:
        self._logger.debug('Number of invalid logline: %d' % self.failed_logs.count())

        for line in self.failed_logs.take(20):
            self._logger.debug('Invalid logline: %s' % line)

    self._logger.debug('Read %d lines, successfully parsed %d lines, failed to parse %d lines' %
                       (self.parsed_logs.count(), self.access_logs.count(), self.failed_logs.count()))

    return (self.parsed_logs, self.access_logs, self.failed_logs)


def main(argv):
    try:
        logger = createLogger("pyspark", logging.DEBUG, "LogAnalyzer.log", "./")
        logger.debug("Starting LogAnalyzer")
        myLogAnalyzer = ApacheLogAnalyzer(logger)
        (parsed_logs, access_logs, failed_logs) = myLogAnalyzer.parseLogs()
    except Exception as e:
        print "Encountered Exception %s" % str(e)

    logger.debug('Read %d lines, successfully parsed %d lines, failed to parse %d lines' %
                 (parsed_logs.count(), access_logs.count(), failed_logs.count()))
    logger.info("DONE. ALL TESTS PASSED")

I see some log messages:

    Starting LogAnalyzer
    Parselogs() start
    DONE. ALL TESTS PASSED

But I do not see others, such as 'Read %d lines, successfully parsed %d lines, failed to parse %d lines'.

For this line:

    logger.debug('Read %d lines, successfully parsed %d lines, failed to parse %d lines' %
                 (parsed_logs.count(), access_logs.count(), failed_logs.count()))

I get the following error:

    Encountered Exception Cannot pickle files that are not opened for reading

I do not have a clue as to what's happening. Any help will be appreciated.



How do I Process Streams that span multiple lines?

2015-08-03 Thread Spark Enthusiast
All the examples of Spark Streaming programming that I see assume streams of lines that are then tokenised and acted upon (like the WordCount example).
How do I process streams whose records span multiple lines? Are there examples that I can use?

Can a Spark Driver Program be a REST Service by itself?

2015-07-01 Thread Spark Enthusiast
Folks,
My Use case is as follows:
My driver program will be aggregating a bunch of event streams and acting on them. The action on the aggregated events is configurable and can change dynamically.
One way I can think of is to run the Spark driver as a service, where a config push can be caught via an API that the driver exports. Can I have a Spark driver program run as a REST service by itself? Is this a common use case?
Is there a better way to solve my problem?
Thanks

Can I do Joins across Event Streams ?

2015-07-01 Thread Spark Enthusiast
Hi,
I have to build a system that reacts to a set of events. Each of these events is a separate stream by itself, consumed from a different Kafka topic, and hence will have its own InputDStream.
Questions:
Will I be able to do joins across multiple InputDStreams and collate the output using a single Accumulator?
These event streams can each have their own frequency of occurrence. How will I be able to coordinate the out-of-sync behaviour?
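
For reference, joins across DStreams are supported as long as both sides are pair (keyed) DStreams registered on the same StreamingContext; records are joined batch by batch, and windowing is the usual way to tolerate streams that drift out of sync by more than one batch interval. A minimal sketch, where the OrderEvent/PaymentEvent types, the getEventId() correlation key and the window sizes are illustrative assumptions:

// Key both input streams by a common correlation id (hypothetical accessors).
JavaPairDStream<String, OrderEvent> orders = orderStream.mapToPair(
        new PairFunction<OrderEvent, String, OrderEvent>() {
            public Tuple2<String, OrderEvent> call(OrderEvent e) {
                return new Tuple2<String, OrderEvent>(e.getEventId(), e);
            }
        });
JavaPairDStream<String, PaymentEvent> payments = paymentStream.mapToPair(
        new PairFunction<PaymentEvent, String, PaymentEvent>() {
            public Tuple2<String, PaymentEvent> call(PaymentEvent e) {
                return new Tuple2<String, PaymentEvent>(e.getEventId(), e);
            }
        });

// Window both sides so events arriving a few batches apart can still meet.
JavaPairDStream<String, Tuple2<OrderEvent, PaymentEvent>> joined =
        orders.window(Durations.minutes(2), Durations.minutes(1))
              .join(payments.window(Durations.minutes(2), Durations.minutes(1)));

As for the accumulator: accumulators are write-only counters that are read back on the driver, so they work for tallying the joined output inside a foreachRDD, but they are not a good vehicle for collating the joined records themselves.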

Serialization Exception

2015-06-29 Thread Spark Enthusiast
For prototyping purposes, I created a test program injecting dependencies using Spring.

Nothing fancy. This is just a re-write of KafkaDirectWordCount. When I run 
this, I get the following exception:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
    at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
    at 
org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
    at 
org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
    at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:681)
    at 
org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:258)
    at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:527)
    at 
org.apache.spark.streaming.api.java.JavaDStreamLike$class.map(JavaDStreamLike.scala:157)
    at 
org.apache.spark.streaming.api.java.AbstractJavaDStreamLike.map(JavaDStreamLike.scala:43)
    at 
com.olacabs.spark.examples.WordCountProcessorKafkaImpl.process(WordCountProcessorKafkaImpl.java:45)
    at com.olacabs.spark.examples.WordCountApp.main(WordCountApp.java:49)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: Object of 
org.apache.spark.streaming.kafka.DirectKafkaInputDStream is being serialized  
possibly as a part of closure of an RDD operation. This is because  the DStream 
object is being referred to from within the closure.  Please rewrite the RDD 
operation inside this DStream to avoid this.  This has been enforced to avoid 
bloating of Spark tasks  with unnecessary objects.
Serialization stack:
    - object not serializable (class: 
org.apache.spark.streaming.api.java.JavaStreamingContext, value: 
org.apache.spark.streaming.api.java.JavaStreamingContext@7add323c)
    - field (class: com.olacabs.spark.examples.WordCountProcessorKafkaImpl, 
name: streamingContext, type: class 
org.apache.spark.streaming.api.java.JavaStreamingContext)
    - object (class com.olacabs.spark.examples.WordCountProcessorKafkaImpl, 
com.olacabs.spark.examples.WordCountProcessorKafkaImpl@29a1505c)
    - field (class: com.olacabs.spark.examples.WordCountProcessorKafkaImpl$1, 
name: this$0, type: class 
com.olacabs.spark.examples.WordCountProcessorKafkaImpl)
    - object (class com.olacabs.spark.examples.WordCountProcessorKafkaImpl$1, 
com.olacabs.spark.examples.WordCountProcessorKafkaImpl$1@c6c82aa)
    - field (class: 
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, 
type: interface org.apache.spark.api.java.function.Function)
    - object (class 
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, function1)
    at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
    at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
    ... 23 more



Can someone help me figure out why?


Here is the Code :

public interface EventProcessor extends Serializable {
    void process();
}

public class WordCountProcessorKafkaImpl implements EventProcessor {

    private static final Pattern SPACE = Pattern.compile(" ");

    @Autowired
    @Qualifier("streamingContext")
    JavaStreamingContext streamingContext;

    @Autowired
    @Qualifier("inputDStream")
    JavaPairInputDStream<String, String> inputDStream;

    @Override
    public void process() {
        // Get the lines, split them into words, count the words and print
        JavaDStream<String> lines = inputDStream.map(new Function<Tuple2<String, String>, String>() {
            @Override
            public String call(Tuple2<String, String> tuple2) {
                return
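
The snippet is cut off in the archive, but the serialization stack above already shows the chain: the anonymous Function (WordCountProcessorKafkaImpl$1) carries a hidden this$0 reference to WordCountProcessorKafkaImpl, which holds the non-serializable JavaStreamingContext, so the whole Spring bean gets pulled into the task closure. One hedged way out (a sketch under those assumptions; the ExtractValue name is mine) is to give the mapping function its own static nested class so it captures nothing from the bean:

public class WordCountProcessorKafkaImpl implements EventProcessor {

    @Autowired
    @Qualifier("streamingContext")
    JavaStreamingContext streamingContext;       // only ever touched on the driver

    @Autowired
    @Qualifier("inputDStream")
    JavaPairInputDStream<String, String> inputDStream;

    // A static nested class has no hidden this$0 reference to the outer bean,
    // so serializing it does not drag the JavaStreamingContext into the closure.
    private static class ExtractValue implements Function<Tuple2<String, String>, String> {
        @Override
        public String call(Tuple2<String, String> tuple2) {
            return tuple2._2();
        }
    }

    @Override
    public void process() {
        JavaDStream<String> lines = inputDStream.map(new ExtractValue());
        lines.print();                           // placeholder action for the sketch
    }
}

The same applies to any lambda or anonymous class used in a DStream/RDD operation: as long as it references an instance field or method of the bean, the bean itself must be serializable.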

Re: Spark or Storm

2015-06-17 Thread Spark Enthusiast
Again, by Storm, you mean Storm Trident, correct? 


 On Wednesday, 17 June 2015 10:09 PM, Michael Segel 
msegel_had...@hotmail.com wrote:
   

 Actually the reverse.
Spark Streaming is really a micro-batch system where the smallest window is half a second (500 ms), so for CEP it's not really a good idea.
So in terms of options... Spark Streaming, Storm, Samza, Akka and others...
Storm is probably the easiest to pick up; Spark Streaming / Akka may give you more flexibility, and Akka would work for CEP.
Just my $0.02

On Jun 16, 2015, at 9:40 PM, Spark Enthusiast sparkenthusi...@yahoo.in wrote:
I have a use-case where a stream of incoming events has to be aggregated and joined to create complex events. The aggregation will have to happen at an interval of 1 minute (or less).
The pipeline is:
Upstream services --(send events)--> KAFKA --> Event Stream Processor --(enrich event)--> Complex Event Processor --> Elastic Search
From what I understand, Storm will make a very good ESP and Spark Streaming 
will make a good CEP.
But, we are also evaluating Storm with Trident.
How does Spark Streaming compare with Storm with Trident?
Sridhar Chellappa


  


 On Wednesday, 17 June 2015 10:02 AM, ayan guha guha.a...@gmail.com wrote:
   

 I have a similar scenario where we need to bring data from Kinesis to HBase. Data velocity is 20k per 10 mins. Little manipulation of the data will be required, but that's regardless of the tool, so we will be writing that piece as a Java POJO. All env is on AWS. HBase is on a long-running EMR and Kinesis on a separate cluster. TIA.
Best
Ayan

On 17 Jun 2015 12:13, Will Briggs wrbri...@gmail.com wrote:

The programming models for the two frameworks are conceptually rather 
different; I haven't worked with Storm for quite some time, but based on my old 
experience with it, I would equate Spark Streaming more with Storm's Trident 
API, rather than with the raw Bolt API. Even then, there are significant 
differences, but it's a bit closer.

If you can share your use case, we might be able to provide better guidance.

Regards,
Will

On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote:

Hi All,

I am evaluating Spark vs. Storm (Spark Streaming) and I am not able to see what the equivalent of a Storm Bolt is in Spark.

Any help will be appreciated on this.

Thanks,
Ashish




   



  

Re: Spark or Storm

2015-06-17 Thread Spark Enthusiast
When you say Storm, did you mean Storm with Trident or Storm?
My use case does not have simple transformation. There are complex events that 
need to be generated by joining the incoming event stream.
Also, what do you mean by no back pressure?

 


 On Wednesday, 17 June 2015 11:57 AM, Enno Shioji eshi...@gmail.com wrote:
   

 We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm.
Some of the important drawbacks are:
Spark has no back pressure (the receiver rate limit can alleviate this to a certain point, but it's far from ideal).
There is also no exactly-once semantics (updateStateByKey can achieve this, but it is not practical if you have any significant amount of state, because it dumps the entire state on every checkpoint).

There are also some minor drawbacks that I'm sure will be fixed quickly, like 
no task timeout, not being able to read from Kafka using multiple nodes, data 
loss hazard with Kafka.
It's also not possible to attain very low latency in Spark, if that's what you 
need.
The pros for Spark are the concise and, IMO, more intuitive syntax, especially if you compare it with Storm's Java API.
I admit I might be a bit biased towards Storm though, as I'm more familiar with it.
Also, you can do some processing with Kinesis. If all you need to do is a straightforward transformation and you are reading from Kinesis to begin with, it might be an easier option to just do the transformation in Kinesis.




On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan 
sabarish.sasidha...@manthan.com wrote:

Whatever you write in bolts would be the logic you want to apply to your events. In Spark, that logic would be coded in map() or similar transformations and/or actions. Spark doesn't enforce a structure for capturing your processing logic the way Storm does.
Regards
Sab

Probably overloading the question a bit.

In Storm, Bolts have the functionality of getting triggered on events. Is that kind of functionality possible with Spark Streaming? During each phase of the data processing, the transformed data is stored in the database, and this transformed data should then be sent to a new pipeline for further processing.

How can this be achieved using Spark?



On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast sparkenthusi...@yahoo.in 
wrote:

I have a use-case where a stream of incoming events has to be aggregated and joined to create complex events. The aggregation will have to happen at an interval of 1 minute (or less).
The pipeline is:
Upstream services --(send events)--> KAFKA --> Event Stream Processor --(enrich event)--> Complex Event Processor --> Elastic Search
From what I understand, Storm will make a very good ESP and Spark Streaming 
will make a good CEP.
But, we are also evaluating Storm with Trident.
How does Spark Streaming compare with Storm with Trident?
Sridhar Chellappa


  


 On Wednesday, 17 June 2015 10:02 AM, ayan guha guha.a...@gmail.com wrote:
   

 I have a similar scenario where we need to bring data from Kinesis to HBase. Data velocity is 20k per 10 mins. Little manipulation of the data will be required, but that's regardless of the tool, so we will be writing that piece as a Java POJO. All env is on AWS. HBase is on a long-running EMR and Kinesis on a separate cluster. TIA.
Best
Ayan

On 17 Jun 2015 12:13, Will Briggs wrbri...@gmail.com wrote:

The programming models for the two frameworks are conceptually rather 
different; I haven't worked with Storm for quite some time, but based on my old 
experience with it, I would equate Spark Streaming more with Storm's Trident 
API, rather than with the raw Bolt API. Even then, there are significant 
differences, but it's a bit closer.

If you can share your use case, we might be able to provide better guidance.

Regards,
Will

On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote:

Hi All,

I am evaluating Spark vs. Storm (Spark Streaming) and I am not able to see what the equivalent of a Storm Bolt is in Spark.

Any help will be appreciated on this.

Thanks,
Ashish




   





  

Re: Spark or Storm

2015-06-16 Thread Spark Enthusiast
I have a use-case where a stream of incoming events has to be aggregated and joined to create complex events. The aggregation will have to happen at an interval of 1 minute (or less).
The pipeline is:
Upstream services --(send events)--> KAFKA --> Event Stream Processor --(enrich event)--> Complex Event Processor --> Elastic Search
From what I understand, Storm will make a very good ESP and Spark Streaming 
will make a good CEP.
But, we are also evaluating Storm with Trident.
How does Spark Streaming compare with Storm with Trident?
Sridhar Chellappa


  


 On Wednesday, 17 June 2015 10:02 AM, ayan guha guha.a...@gmail.com wrote:
   

 I have a similar scenario where we need to bring data from Kinesis to HBase. Data velocity is 20k per 10 mins. Little manipulation of the data will be required, but that's regardless of the tool, so we will be writing that piece as a Java POJO. All env is on AWS. HBase is on a long-running EMR and Kinesis on a separate cluster. TIA.
Best
Ayan

On 17 Jun 2015 12:13, Will Briggs wrbri...@gmail.com wrote:

The programming models for the two frameworks are conceptually rather 
different; I haven't worked with Storm for quite some time, but based on my old 
experience with it, I would equate Spark Streaming more with Storm's Trident 
API, rather than with the raw Bolt API. Even then, there are significant 
differences, but it's a bit closer.

If you can share your use case, we might be able to provide better guidance.

Regards,
Will

On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote:

Hi All,

I am evaluating Spark vs. Storm (Spark Streaming) and I am not able to see what the equivalent of a Storm Bolt is in Spark.

Any help will be appreciated on this.

Thanks,
Ashish