Regarding anomaly detection in real time streaming data

2020-05-11 Thread Hemant Garg
Hello sir,
I'm currently working on a project where I need to detect anomalies in
real-time streaming data pushed from Kafka into Apache Spark. I chose
the streaming k-means clustering algorithm, but I couldn't find much
about it. Do you think it is a suitable algorithm, or should I
consider something else?
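
For readers weighing the same choice, here is a minimal plain-Python sketch of how a streaming k-means model can flag anomalies. It is not Spark's API: the class name, the `decay` parameter, and the distance `threshold` are all assumptions made for illustration. Each micro-batch nudges the centers toward the new points, older data is discounted by `decay`, and a point counts as anomalous when it lies farther than `threshold` from every center.

```python
import math


class StreamingKMeansSketch:
    """Hypothetical, simplified streaming k-means (not the Spark MLlib class)."""

    def __init__(self, centers, decay=1.0):
        self.centers = [list(c) for c in centers]
        self.weights = [0.0] * len(centers)  # discounted point counts per center
        self.decay = decay                   # 1.0 keeps all history, 0.0 forgets it

    def nearest(self, point):
        """Return (index, distance) of the closest center."""
        dists = [math.dist(point, c) for c in self.centers]
        idx = min(range(len(dists)), key=dists.__getitem__)
        return idx, dists[idx]

    def update(self, batch):
        """Fold one micro-batch of points into the centers."""
        dim = len(self.centers[0])
        sums = [[0.0] * dim for _ in self.centers]
        counts = [0] * len(self.centers)
        for p in batch:
            idx, _ = self.nearest(p)
            counts[idx] += 1
            for d in range(dim):
                sums[idx][d] += p[d]
        for k, m in enumerate(counts):
            old = self.weights[k] * self.decay  # discount what was seen before
            if m > 0:
                total = old + m
                # Weighted average of the old center and the new points.
                self.centers[k] = [
                    (self.centers[k][d] * old + sums[k][d]) / total
                    for d in range(dim)
                ]
                old = total
            self.weights[k] = old

    def is_anomaly(self, point, threshold):
        """Flag a point that is far from every center."""
        _, dist = self.nearest(point)
        return dist > threshold


model = StreamingKMeansSketch(centers=[(0.0, 0.0), (10.0, 10.0)], decay=0.9)
model.update([(0.0, 1.0), (1.0, 0.0), (10.0, 11.0)])
far = model.is_anomaly((5.0, 5.0), threshold=3.0)    # far from both centers
near = model.is_anomaly((0.6, 0.4), threshold=3.0)   # close to the first center
```

The threshold is the hard part in practice: it has to be tuned per data set, and a fixed distance may not suit clusters of very different spread.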


Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread Chanh Le
The difference between streaming and micro-batching is about the ordering of messages:
> Spark Streaming guarantees ordered processing of RDDs in one DStream. Since 
> each RDD is processed in parallel, there is not order guaranteed within the 
> RDD. This is a tradeoff design Spark made. If you want to process the 
> messages in order within the RDD, you have to process them in one thread, 
> which does not have the benefit of parallelism.

More about that:
http://samza.apache.org/learn/documentation/0.10/comparisons/spark-streaming.html
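
The ordering trade-off described in the quoted passage can be mimicked in plain Python. This is only an analogy, not Spark code: `process_batches` and `handle` are hypothetical names. Batches are handled strictly one after another, while the records inside a batch are handed to a thread pool, so their completion order is not guaranteed.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batches(batches, handle, workers=4):
    """Process batches in order; records within a batch run in parallel."""
    log = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch_id, batch in enumerate(batches):
            futures = [pool.submit(handle, record) for record in batch]
            # as_completed yields results in completion order, which
            # need not match the order in which records were submitted.
            for future in as_completed(futures):
                log.append((batch_id, future.result()))
            # The next batch starts only after this one is fully done.
    return log


log = process_batches([[1, 2, 3], [4, 5]], lambda x: x * 10)
batch_ids = [batch_id for batch_id, _ in log]
```

Here `batch_ids` never goes backwards, but the results within one batch may appear in any order. Forcing in-batch order would mean `workers=1`, which gives up the parallelism.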








Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread Alonso Isidoro Roman
Mini-batch (near real time): frames are processed within 500 ms or more.

Real time: frames are processed within 5-10 ms.

The main difference is processing speed (latency), I think.

Apache Spark Streaming is mini-batch, not true real time.
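
One way to see why the batch interval cannot simply be shrunk to a millisecond is a back-of-the-envelope cost model. The sketch below is an assumption, not a measurement: it supposes each batch costs a fixed scheduling overhead plus a per-event cost, and all the numbers are made up for illustration. A batch interval T is sustainable only if the batch finishes within T.

```python
def min_stable_interval_ms(fixed_overhead_ms, per_event_ms, events_per_ms):
    """Smallest interval T with fixed_overhead + per_event * rate * T <= T."""
    load = per_event_ms * events_per_ms  # fraction of each ms spent on events
    if load >= 1.0:
        return float("inf")  # the system cannot keep up at any interval
    return fixed_overhead_ms / (1.0 - load)


def worst_case_latency_ms(interval_ms, fixed_overhead_ms, per_event_ms, events_per_ms):
    """An event arriving just after a batch boundary waits one full interval,
    then waits for its whole batch to be processed."""
    processing = fixed_overhead_ms + per_event_ms * events_per_ms * interval_ms
    return interval_ms + processing


# Hypothetical numbers: 50 ms overhead per batch, 0.01 ms per event,
# 10 events arriving per millisecond.
floor_ms = min_stable_interval_ms(50.0, 0.01, 10.0)
latency_ms = worst_case_latency_ms(500.0, 50.0, 0.01, 10.0)
```

Under these assumed numbers the interval cannot go below roughly 56 ms, and a 500 ms interval gives a worst-case latency of about 600 ms. A per-event engine pays no per-batch overhead, which is where latencies in the 5-10 ms range become plausible.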

Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman



Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread kant kodali
I understand the difference between fraud detection and fraud prevention in
general, but I am not interested in a semantic war over what these terms
precisely mean. I am more interested in understanding the difference
between mini-batch and real-time streaming from a CS perspective.
 






Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread Mich Talebzadeh
Replace mini-batch with micro-batching and do a search again. What is your
understanding of fraud detection?

Spark Streaming can be used for risk calculation and fraud detection
(including stopping fraud from going through, for example credit card
fraud) effectively "in practice". It can even be used for Complex Event
Processing.


HTH

Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread kant kodali
What is the difference between mini-batch and real-time streaming in practice
(not theory)? In theory, I understand that mini-batch collects data over a
given time frame, whereas real-time streaming does something as soon as the
data arrives. My biggest question is: why not run mini-batch with an epsilon
time frame (say, one millisecond)? Put differently, I would like to understand
why one would be a more effective solution than the other.

I recently came across an example where mini-batch (Apache Spark) is used for
fraud detection and real-time streaming (Apache Flink) is used for fraud
prevention. Someone also commented that mini-batches would not be an effective
solution for fraud prevention (since the goal is to prevent the transaction
from occurring as it happens). Now I wonder why this wouldn't be as effective
with mini-batch (Spark). Why is it not effective to run mini-batch with
1 millisecond latency? Batching is a technique used everywhere, including the
OS and the kernel TCP/IP stack, where data bound for the disk or network is
buffered, so what is the convincing factor for saying one is more effective
than the other?

Thanks,
kant

Re: real-time streaming

2014-10-28 Thread ll
Thanks, Jay. Do you think Spark is a good fit for streaming and
analyzing video in real time? In this case, we're streaming 30 frames per
second, and each frame is an image (size: roughly 500 KB to 1 MB). We need to
analyze every frame and return the analysis result immediately.
Thanks again.
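
To put rough numbers on this workload, here is a small arithmetic sketch. It assumes "500K" means 500,000 bytes and "1MB" means 1,000,000 bytes; the function name is hypothetical.

```python
def stream_budget(fps, frame_bytes_min, frame_bytes_max):
    """Per-frame time budget and sustained ingest rate for a video stream."""
    return {
        "per_frame_budget_ms": 1000.0 / fps,        # time before the next frame lands
        "min_bytes_per_sec": fps * frame_bytes_min,
        "max_bytes_per_sec": fps * frame_bytes_max,
    }


budget = stream_budget(fps=30, frame_bytes_min=500_000, frame_bytes_max=1_000_000)
```

That works out to about 33 ms per frame and 15 to 30 MB per second of sustained input for a single stream. If the results must come back within one frame time, a batch interval measured in seconds would not meet that budget.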



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/real-time-streaming-tp17526p17528.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: real-time streaming

2014-10-28 Thread jay vyas
A REAL TIME stream, by definition, delivers data every X seconds. You can
easily do this with Spark. Roughly, here is how to create a stream
gobbler and attach a Spark app that reads its data every X seconds:

- Write a Runnable thread which reads data from a source. Test that it
works independently.

- Add that thread to a custom DStream receiver, and implement onStart() so
that the thread above is launched there; add logic to onStop() to
safely destroy the thread.

- Set the batch (window) interval, e.g. to 5 seconds.

- Start your Spark streaming context, and run foreachRDD(...) in your
Spark app.

- Make sure that you launch with 2 or more workers (the receiver occupies
one of them).
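
The steps above can be sketched in plain Python to show the moving parts. This is an analogy rather than Spark code: `ReceiverSketch`, `drain`, and `run_batches` are hypothetical names, and the in-memory queue stands in for whatever storage the real receiver hands its records to.

```python
import queue
import threading
import time


class ReceiverSketch:
    """Hypothetical receiver: on_start launches a reader thread, on_stop ends it."""

    def __init__(self, source):
        self._source = source            # any iterable of records
        self._buffer = queue.Queue()     # thread-safe hand-off to the consumer
        self._stop = threading.Event()
        self._thread = None

    def on_start(self):
        self._thread = threading.Thread(target=self._read, daemon=True)
        self._thread.start()

    def on_stop(self):
        self._stop.set()                 # ask the reader to finish
        if self._thread is not None:
            self._thread.join()

    def _read(self):
        for record in self._source:
            if self._stop.is_set():
                break
            self._buffer.put(record)

    def drain(self):
        """Return everything buffered since the last call (one batch)."""
        batch = []
        while True:
            try:
                batch.append(self._buffer.get_nowait())
            except queue.Empty:
                return batch


def run_batches(receiver, interval_s, n_batches, handle):
    """Every interval_s seconds, pass the newly received records to handle."""
    receiver.on_start()
    try:
        for _ in range(n_batches):
            time.sleep(interval_s)
            handle(receiver.drain())
    finally:
        receiver.on_stop()


batches = []
run_batches(ReceiverSketch(["frame-1", "frame-2", "frame-3"]), 0.05, 2, batches.append)
```

Each element of `batches` plays the role of one RDD handed to the per-batch callback. The reader and the consumer run on separate threads, which mirrors why more than one worker is needed.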





-- 
jay vyas


real-time streaming

2014-10-28 Thread ll
The Spark tutorial shows that we can create a stream that reads "new files"
from a directory.

That seems to introduce some lag, as we have to write the data to a file first
and then wait until the Spark stream picks it up.

What is the best way to implement REAL 'REAL-TIME' streaming for analysis in
real time? For example, streaming video, sound, images, etc.
continuously?

thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/real-time-streaming-tp17526.html