Re: Flink ML Use cases

2019-05-25 Thread Abhishek Singh
Thanks for the confirmation, Fabian.


*Regards,*
*Abhishek Kumar Singh*

*Search Engine Engineer*
*Mob :+91 7709735480 *


*...*


On Sat, May 25, 2019 at 8:55 PM Fabian Hueske  wrote:

> Hi Abhishek,
>
> Your observation is correct. Right now, the Flink ML module is in a
> half-baked state and is only supported in batch mode.
> It is not integrated with the DataStream API. FLIP-23 proposes a feature
> that allows evaluating an externally trained model (stored as PMML) on a
> stream of data.
>
> There is another effort to implement a new machine learning API /
> environment based on the Table API. This will be supported for batch and
> streaming sources.
> However, this effort just started and the feature is not available yet.
>
> Best, Fabian
>
> On Sun, May 19, 2019 at 11:54 AM Abhishek Singh <
> asingh2...@gmail.com> wrote:
>
>>
>> Thanks again for the above resources.
>>
>> I went through the project and also ran the example on my system to get a
>> grasp of the architecture.
>>
>> However, this project does not use Flink ML in it at all.
>>
>> Also, after having done enough research on Flink ML, I found that it
>> does not let us persist the model, which is why I am not able to re-use
>> the model trained using Flink ML.
>>
>> It looks like Flink ML cannot really be used for real-life use cases as
>> it neither lets us persist the trained model, nor can it help us to use the
>> trained model on a *DataStream*.
>>
>> Please correct me if I am wrong.
>>
>>
>>
>>
>> *Regards,*
>> *Abhishek Kumar Singh*
>>
>> *Search Engine Engineer*
>> *Mob :+91 7709735480 *
>>
>>
>> *...*
>>
>>
>> On Wed, May 15, 2019 at 11:25 AM Abhishek Singh 
>> wrote:
>>
>>>
>>> Thanks a lot Rong and Sameer.
>>>
>>> Looks like this is what I wanted.
>>>
>>> I will try the above projects.
>>>
>>> *Regards,*
>>> *Abhishek Kumar Singh*
>>>
>>> *Search Engineer*
>>> *Mob :+91 7709735480 *
>>>
>>>
>>> *...*
>>>
>>>
>>> On Wed, May 15, 2019 at 8:00 AM Rong Rong  wrote:
>>>
>>>> Hi Abhishek,
>>>>
>>>> Based on your description, I think this FLIP proposal[1] seems to fit
>>>> perfectly for your use case.
>>>> You can also check out the GitHub repo by Boris (CCed) for the PMML
>>>> implementation[2]. This proposal is still under development [3]; you are
>>>> more than welcome to test it out and share your feedback.
>>>>
>>>> Thanks,
>>>> Rong
>>>>
>>>> [1]
>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-23+-+Model+Serving
>>>> [2] https://github.com/FlinkML/flink-modelServer /
>>>> https://github.com/FlinkML/flink-speculative-modelServer
>>>> [3] https://github.com/apache/flink/pull/7446
>>>>
>>>> On Tue, May 14, 2019 at 4:44 PM Sameer Wadkar 
>>>> wrote:
>>>>
>>>>> If you can save the model as a PMML file, you can apply it on a stream
>>>>> using one of the Java PMML libraries.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On May 14, 2019, at 4:44 PM, Abhishek Singh 
>>>>> wrote:
>>>>>
>>>>> I was looking forward to using Flink ML for my project where I think I
>>>>> can use SVM.
>>>>>
>>>>> I have been able to run a batch job using Flink ML and trained and
>>>>> tested my data.
>>>>>
>>>>> Now I want to do the following:
>>>>> 1. Applying the above-trained model to a stream of events from Kafka
>>>>> (using DataStreams): for this, I want to know if Flink ML can be used
>>>>> with DataStreams.
>>>>>
>>>>> 2. Persisting the model: I may want to save the trained model for
>>>>> future use.
>>>>>
>>>>> Can the above 2 use cases be achieved using Apache Flink?
>>>>>
>>>>> *Regards,*
>>>>> *Abhishek Kumar Singh*
>>>>>
>>>>> *Search Engineer*
>>>>> *Mob :+91 7709735480 *
>>>>>
>>>>>
>>>>> *...*
>>>>>
>>>>>


Re: Flink ML Use cases

2019-05-19 Thread Abhishek Singh
Thanks again for the above resources.

I went through the project and also ran the example on my system to get a
grasp of the architecture.

However, this project does not use Flink ML in it at all.

Also, after having done enough research on Flink ML, I found that it
does not let us persist the model, which is why I am not able to re-use the
model trained using Flink ML.

It looks like Flink ML cannot really be used for real-life use cases as it
neither lets us persist the trained model, nor can it help us to use the
trained model on a *DataStream*.

Please correct me if I am wrong.




*Regards,*
*Abhishek Kumar Singh*

*Search Engine Engineer*
*Mob :+91 7709735480 *


*...*


On Wed, May 15, 2019 at 11:25 AM Abhishek Singh 
wrote:

>
> Thanks a lot Rong and Sameer.
>
> Looks like this is what I wanted.
>
> I will try the above projects.
>
> *Regards,*
> *Abhishek Kumar Singh*
>
> *Search Engineer*
> *Mob :+91 7709735480 *
>
>
> *...*
>
>
> On Wed, May 15, 2019 at 8:00 AM Rong Rong  wrote:
>
>> Hi Abhishek,
>>
>> Based on your description, I think this FLIP proposal[1] seems to fit
>> perfectly for your use case.
>> You can also check out the GitHub repo by Boris (CCed) for the PMML
>> implementation[2]. This proposal is still under development [3]; you are
>> more than welcome to test it out and share your feedback.
>>
>> Thanks,
>> Rong
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-23+-+Model+Serving
>> [2] https://github.com/FlinkML/flink-modelServer /
>> https://github.com/FlinkML/flink-speculative-modelServer
>> [3] https://github.com/apache/flink/pull/7446
>>
>> On Tue, May 14, 2019 at 4:44 PM Sameer Wadkar 
>> wrote:
>>
>>> If you can save the model as a PMML file, you can apply it on a stream
>>> using one of the Java PMML libraries.
>>>
>>> Sent from my iPhone
>>>
>>> On May 14, 2019, at 4:44 PM, Abhishek Singh 
>>> wrote:
>>>
>>> I was looking forward to using Flink ML for my project where I think I
>>> can use SVM.
>>>
>>> I have been able to run a batch job using Flink ML and trained and tested
>>> my data.
>>>
>>> Now I want to do the following:
>>> 1. Applying the above-trained model to a stream of events from Kafka
>>> (using DataStreams): for this, I want to know if Flink ML can be used
>>> with DataStreams.
>>>
>>> 2. Persisting the model: I may want to save the trained model for
>>> future use.
>>>
>>> Can the above 2 use cases be achieved using Apache Flink?
>>>
>>> *Regards,*
>>> *Abhishek Kumar Singh*
>>>
>>> *Search Engineer*
>>> *Mob :+91 7709735480 *
>>>
>>>
>>> *...*
>>>
>>>


Re: Flink ML Use cases

2019-05-14 Thread Abhishek Singh
Thanks a lot Rong and Sameer.

Looks like this is what I wanted.

I will try the above projects.

*Regards,*
*Abhishek Kumar Singh*

*Search Engineer*
*Mob :+91 7709735480 *


*...*


On Wed, May 15, 2019 at 8:00 AM Rong Rong  wrote:

> Hi Abhishek,
>
> Based on your description, I think this FLIP proposal[1] seems to fit
> perfectly for your use case.
> You can also check out the GitHub repo by Boris (CCed) for the PMML
> implementation[2]. This proposal is still under development [3]; you are
> more than welcome to test it out and share your feedback.
>
> Thanks,
> Rong
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-23+-+Model+Serving
> [2] https://github.com/FlinkML/flink-modelServer /
> https://github.com/FlinkML/flink-speculative-modelServer
> [3] https://github.com/apache/flink/pull/7446
>
> On Tue, May 14, 2019 at 4:44 PM Sameer Wadkar  wrote:
>
>> If you can save the model as a PMML file, you can apply it on a stream
>> using one of the Java PMML libraries.
>>
>> Sent from my iPhone
>>
>> On May 14, 2019, at 4:44 PM, Abhishek Singh  wrote:
>>
>> I was looking forward to using Flink ML for my project where I think I
>> can use SVM.
>>
>> I have been able to run a batch job using Flink ML and trained and tested
>> my data.
>>
>> Now I want to do the following:
>> 1. Applying the above-trained model to a stream of events from Kafka
>> (using DataStreams): for this, I want to know if Flink ML can be used
>> with DataStreams.
>>
>> 2. Persisting the model: I may want to save the trained model for
>> future use.
>>
>> Can the above 2 use cases be achieved using Apache Flink?
>>
>> *Regards,*
>> *Abhishek Kumar Singh*
>>
>> *Search Engineer*
>> *Mob :+91 7709735480 *
>>
>>
>> *...*
>>
>>


Flink ML Use cases

2019-05-14 Thread Abhishek Singh
I was looking forward to using Flink ML for my project where I think I can
use SVM.

I have been able to run a batch job using Flink ML and trained and tested my
data.

Now I want to do the following:
1. Applying the above-trained model to a stream of events from Kafka
(using DataStreams): for this, I want to know if Flink ML can be used
with DataStreams.

2. Persisting the model: I may want to save the trained model for
future use.

Can the above 2 use cases be achieved using Apache Flink?
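For what it's worth, the workaround suggested in the replies (export the trained model, then evaluate it per event) can be sketched in plain Java. This is only an illustration: the weights, bias, and feature values below are invented, and a real pipeline would load a PMML file and run the scoring inside a Flink DataStream map function rather than a plain loop.

```java
import java.util.List;

// Hypothetical stand-in for scoring an externally trained linear SVM on a
// stream of events. All numbers here are made up for illustration.
public class ModelScoringSketch {

    // A linear SVM decision function: sign(w . x + b)
    static int score(double[] weights, double bias, double[] features) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * features[i];
        }
        return sum >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        double[] w = {0.5, -0.25};   // hypothetical trained weights
        double b = 0.1;              // hypothetical bias
        // Simulates mapping over a stream of Kafka events.
        List<double[]> events = List.of(
            new double[]{1.0, 0.2},
            new double[]{-2.0, 3.0});
        for (double[] e : events) {
            System.out.println(score(w, b, e));
        }
    }
}
```

The same decision function would run unchanged inside a map operator, which is what makes the export-then-score approach stream-friendly.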

*Regards,*
*Abhishek Kumar Singh*

*Search Engineer*
*Mob :+91 7709735480 *


*...*


Re: rocksdb without checkpointing

2017-02-15 Thread Abhishek Singh
Sorry, that was a red herring. Checkpointing was not getting triggered
because we never enabled it.

Our application is inherently restartable because we can use our own output
to rebuild state. All that is working fine for us - including restart
semantics - without having to worry about upgrading the Flink topology. Once
we have something in production, we will be happy to share more details in
the Flink forums. We are very pleased with Flink so far. Some paradigms are
messy (e.g., the scale of select), but we are very pleased overall!
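The "use our own output to rebuild state" idea can be sketched as a toy replay. The (key, count) record format below is invented purely for illustration; a real job would read back its own output topic or files on restart.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of application-level checkpointing: instead of relying on
// framework snapshots, the job's previously emitted output is replayed on
// restart to rebuild operator state.
public class RebuildStateSketch {

    // Rebuild a key -> latest-count state map by replaying emitted output.
    static Map<String, Long> rebuild(List<String[]> emittedOutput) {
        Map<String, Long> state = new HashMap<>();
        for (String[] record : emittedOutput) {
            state.put(record[0], Long.parseLong(record[1]));  // last write wins
        }
        return state;
    }

    public static void main(String[] args) {
        // Output the job wrote before it was stopped/upgraded.
        List<String[]> output = List.of(
            new String[]{"user-a", "3"},
            new String[]{"user-b", "1"},
            new String[]{"user-a", "7"});  // later update supersedes the first
        System.out.println(rebuild(output)); // state after replay
    }
}
```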


On Wed, Feb 15, 2017 at 7:54 PM vinay patil  wrote:

> Hi Abhishek,
>
> You can disable checkpointing by not calling env.enableCheckpointing
>
> What do you mean by "We are trying to do application level checkpointing"
>
> Regards,
> Vinay Patil
>
> On Thu, Feb 16, 2017 at 12:42 AM, abhishekrs [via Apache Flink User
> Mailing List archive.] <[hidden email]
> > wrote:
>
> Is it possible to set state backend as RocksDB without asking it to
> checkpoint?
>
> We are trying to do application-level checkpointing (since it gives us
> better flexibility to upgrade our Flink pipeline and also restore state in
> an application-specific, upgrade-friendly way). So we don't really need
> RocksDB to do any checkpointing. Moreover, we also observed that there is a
> 20s stall every hour that seems to correlate with RocksDB wanting to
> checkpoint.
>
> Will the following work (effectively disable checkpointing)?
>
> new RocksDBStateBackend("file:///dev/null")
>
>
> Or is there a better way?
>
> -Abhishek-
>
>
>


Re: Further aggregation possible after sink?

2017-02-13 Thread Abhishek Singh
You can keep adding stages, but then your sink is no longer a sink - it
would have transformed into a map or a flatMap!
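A minimal sketch of this idea, with an in-memory list standing in for the real store: the load step is written as a pass-through map, so an aggregation can still follow it. The names and record format are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a "sink" that must be followed by another stage stops being a sink
// and becomes a map that performs the side effect (the load) and forwards
// its input downstream.
public class SinkAsMapSketch {

    static final List<String> table = new ArrayList<>();  // pretend external store

    // The load step, written as a map: write, then pass the element through.
    static String loadThenForward(String record) {
        table.add(record);
        return record;
    }

    public static void main(String[] args) {
        List<String> stream = List.of("a", "b", "c");
        long loaded = 0;
        for (String record : stream) {
            loadThenForward(record); // side effect: the "load"
            loaded++;                // further aggregation after the load stage
        }
        System.out.println(loaded + " records loaded");
    }
}
```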

On Mon, Feb 13, 2017 at 12:34 PM Mohit Anchlia 
wrote:

> Is it possible to further add aggregation after the sink task executes? Or
> is the sink the last stage of the workflow? Is this flow possible?
>
> start stream -> transform -> load (sink) -> mark final state as loaded in
> a table after all the load was successful in previous state (sink)
>


Re: weird client failure/timeout

2017-01-23 Thread Abhishek Singh
Hi Stephan,

This did not work. For the working case I do see a better utilization of
available slots. However, the non-working case still doesn't work.

Basically I assigned a unique group to the sources in my for loop - given I
have way more slots than the parallelism I seek.

I know about the parallel source. Doesn't a source eat up a slot (like
Spark)? Since my data is pre-partitioned, I was merely monitoring from the
source (keeping it lightweight) and then fanning out to do the actual
reads/work from the next (event-driven) operator (after splitting the
stream from the source).

This is more like a batch use case. However, I want to use a single
streaming job to do streaming + batch. This batch job emits an
application-level marker that gets fanned back in to declare
success/completion for the batch.

Since my data is pre-partitioned, my windows don't need to run globally.
Also, I don't know how to have a global keyBy (shuffle) and then send an
app marker from the source to all the operators. That is why I keep things
hand-partitioned (I can send something from the source to each of my
partitions, and it gets sent to my sink for a count-up to indicate
completion). I can control how the markers are sent forward, and my keyBy
and windowing happen with a parallelism of 1 - so I know I can reach the
next stage to keep propagating my marker. Except that the pattern doesn't
scale beyond 8 partitions :(
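The marker fan-in described above can be sketched as a toy. The marker token and record encoding below are invented, but the completion rule is the same: the sink declares the batch done once one marker per partition has arrived.

```java
import java.util.List;

// Toy version of the completion pattern: each of the N pre-partitioned
// sources emits an application-level marker after its last record; the
// markers fan back in to a single sink that counts them.
public class MarkerFanInSketch {

    static final String MARKER = "__END__";  // invented marker token

    // Returns true once a marker from every partition has been seen.
    static boolean batchComplete(List<String> sinkInput, int nPartitions) {
        long markers = sinkInput.stream().filter(MARKER::equals).count();
        return markers == nPartitions;
    }

    public static void main(String[] args) {
        // Interleaved records and markers from 3 partitions.
        List<String> arrived = List.of("r1", MARKER, "r2", "r3", MARKER, MARKER);
        System.out.println(batchComplete(arrived, 3)); // all markers are in
    }
}
```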

-Abhishek-

On Mon, Jan 23, 2017 at 10:42 AM Stephan Ewen  wrote:

> Hi!
>
> I think what you are seeing is the effect of too many tasks going to the
> same task slot. Have a look here:
> https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/runtime.html#task-slots-and-resources
>
> By default, Flink shares task slots across all distinct pipelines of the
> same program, for easier "getting started" scheduling behavior.
> For proper deployments (or setups where you just have very , I would make
> sure that the program sets different "sharing groups" (via
> "..slotSharingGroup()") on the different streams.
>
> Also, rather than defining 100s of different sources, I would consider
> defining one source and making it parallel. It works better with Flink's
> default scheduling parameters.
>
> Hope that helps.
>
> Stephan
>
>
>
>
> On Mon, Jan 23, 2017 at 5:40 PM, Abhishek R. Singh <
> abhis...@tetrationanalytics.com> wrote:
>
> Actually, I take it back. It is the last union that is causing issues (of
> the job being un-submittable). If I don't combineAtEnd, I can go higher (at
> least deploy the job), all the way up to 63.
>
> After that it starts failing with "too many open files" in RocksDB (which I
> can understand and is at least better than silently not accepting my job).
>
> Caused by: java.lang.RuntimeException: Error while opening RocksDB
> instance.
> at
> org.apache.flink.contrib.streaming.state.RocksDBStateBackend.initializeForJob(RocksDBStateBackend.java:306)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.createStateBackend(StreamTask.java:821)
> at
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.setup(AbstractStreamOperator.java:118)
> ... 4 more
> Caused by: org.rocksdb.RocksDBException: IO error:
> /var/folders/l1/ncffkbq11_lg6tjk_3cvc_n0gn/T/flink-io-45a78866-a9da-40ca-be51-a894c4fac9be/3815eb68c3777ba4f504e8529db6e145/StreamSource_39_0/dummy_state/7ff48c49-b6ce-4de8-ba7e-8a240b181ae2/db/MANIFEST-01:
> Too many open files
> at org.rocksdb.RocksDB.open(Native Method)
> at org.rocksdb.RocksDB.open(RocksDB.java:239)
> at
> org.apache.flink.contrib.streaming.state.RocksDBStateBackend.initializeForJob(RocksDBStateBackend.java:304)
> ... 6 more
>
>
> On Jan 23, 2017, at 8:20 AM, Abhishek R. Singh <
> abhis...@tetrationanalytics.com> wrote:
>
> Is there a limit on how many DataStreams can be defined in a streaming
> program?
>
> Looks like Flink has problems handling too many data streams? I simplified
> my topology further. E.g., this works (parallelism of 4):
>
> 
>
> However, when I try to go beyond 51 (found empirically by parametrizing
> nParts), it barfs again. Submission fails, it wants me to increase
> akka.client.timeout
>
> Here is the reduced code for repro (union at the end itself is not an
> issue). It is the parallelism of the first for loop:
>
> int nParts = cfg.getInt("dummyPartitions", 4);
>
> boolean combineAtEnd = cfg.getBoolean("dummyCombineAtEnd", true);
>
> // create lots of streams
> List<DataStream<String>> streams = new ArrayList<>(nParts);
> for (int i = 0; i < nParts; i++) {
>   streams.add(env
>       .readFile(
>           new TextInputFormat(new Path("/tmp/input")),
>           "/tmp/input",
>           FileProcessingMode.PROCESS_CONTINUOUSLY,
>           1000,
>           FilePathFilter.createDefaultFilter())
>       .setParallelism(1).name("src"));
> }
>
> if (combineAtEnd == true) {
>   DataStream<String> combined = streams.get(0);
>   for (int i = 1; i < nParts; i++) {
>     combined = combined.union(streams.get(i));
>   }
> }

Re: checkpoint notifier not found?

2016-12-12 Thread Abhishek Singh
Will be happy to. Could you guide me a bit in terms of what I need to do?

I am a newbie to open source contributing, and I am currently at Frankfurt
airport. When I hit the ground, I will be happy to contribute back. Love the
project!

Thanks for the awesomeness.


On Mon, Dec 12, 2016 at 12:29 PM Stephan Ewen  wrote:

> Thanks for reporting this.
> It would be awesome if you could file a JIRA or a pull request for fixing
> the docs for that.
>
> On Sat, Dec 10, 2016 at 2:02 AM, Abhishek R. Singh <
> abhis...@tetrationanalytics.com> wrote:
>
> I was following the official documentation:
> https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/streaming/state.html
>
> Looks like this is the right one to be using: import
> org.apache.flink.runtime.state.CheckpointListener;
>
> -Abhishek-
>
> On Dec 9, 2016, at 4:30 PM, Abhishek R. Singh <
> abhis...@tetrationanalytics.com> wrote:
>
> I can’t seem to find CheckpointNotifier. Appreciate any help!
>
> CheckpointNotifier is not a member of package
> org.apache.flink.streaming.api.checkpoint
>
> From my pom.xml:
>
> <dependency>
>   <groupId>org.apache.flink</groupId>
>   <artifactId>flink-scala_2.11</artifactId>
>   <version>1.1.3</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.flink</groupId>
>   <artifactId>flink-streaming-scala_2.11</artifactId>
>   <version>1.1.3</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.flink</groupId>
>   <artifactId>flink-clients_2.11</artifactId>
>   <version>1.1.3</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.flink</groupId>
>   <artifactId>flink-statebackend-rocksdb_2.11</artifactId>
>   <version>1.1.3</version>
> </dependency>
>
>
>
>


Re: custom sources

2016-05-20 Thread Abhishek Singh
Thanks. I am still in theory/evaluation mode. Will try to code this up to
see if checkpointing will become an issue. I do have a high rate of ingest
and lots of in-flight data. Hopefully Flink back pressure keeps this
nicely bounded.

I doubt it will be a problem for me - because even Spark is writing all
in-flight data to disk - because all partitioning goes through disk and is
inline, i.e. sync. Flink disk usage is write-only and for the failure case
only. Looks pretty compelling so far.

On Friday, May 20, 2016, Ufuk Celebi  wrote:

> On Thu, May 19, 2016 at 7:48 PM, Abhishek R. Singh
> > wrote:
> > There seems to be some relationship between watermarks, triggers and
> > checkpoints that is somehow not being leveraged.
>
> Checkpointing is independent of this, yes. Did the state size become a
> problem for your use case? There are various users running Flink with
> very large state sizes without any issues. The recommended state
> backend for these use cases is the RocksDB backend.
>
> The barriers are triggered at the sources and flow with the data
> (
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/internals/stream_checkpointing.html
> ).
> Everything in-flight after the barrier is not relevant for the
> checkpoint. We are only interested in a consistent state snapshot.
>
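The barrier behaviour described above can be caricatured in plain Java. The barrier token and the counting "state" are invented for illustration: state snapshotted at the barrier reflects only pre-barrier records, while later in-flight records keep flowing and count toward the next checkpoint.

```java
import java.util.List;

// Toy model of a checkpoint barrier flowing through one channel with the data.
public class BarrierSketch {

    static final String BARRIER = "__BARRIER__";  // invented token

    // Process one channel; snapshot the running count when the barrier passes.
    // Returns {snapshotted state, final state}.
    static long[] process(List<String> channel) {
        long count = 0;
        long snapshot = -1;
        for (String element : channel) {
            if (BARRIER.equals(element)) {
                snapshot = count;   // checkpoint sees only pre-barrier records
            } else {
                count++;            // post-barrier records keep flowing
            }
        }
        return new long[]{snapshot, count};
    }

    public static void main(String[] args) {
        long[] r = process(List.of("a", "b", BARRIER, "c", "d", "e"));
        System.out.println("snapshot=" + r[0] + " total=" + r[1]); // snapshot=2 total=5
    }
}
```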


Re: flink async snapshots

2016-05-20 Thread Abhishek Singh
Yes. Thanks for explaining.

On Friday, May 20, 2016, Ufuk Celebi  wrote:

> On Thu, May 19, 2016 at 8:54 PM, Abhishek R. Singh
> > wrote:
> > If you can take atomic in-memory copies, then it works (at the cost of
> > doubling your instantaneous memory). For larger state (say RocksDB),
> > won’t you have to stop the world (atomic snapshot) and make a copy?
> > Doesn’t that make it synchronous, instead of background/async?
>
> Hey Abhishek,
>
> that's correct. There are two variants for RocksDB:
>
> - semi-async (default): the snapshot is taken via the RocksDB backup feature
> to back up to a directory (sync). This is then copied to the final
> checkpoint location (async, e.g. copy to HDFS).
>
> - fully-async: the snapshot is taken via the RocksDB snapshot feature (sync,
> but no full copy and essentially "free"). With this snapshot we
> iterate over all k/v-pairs and copy them to the final checkpoint
> location (async, e.g. copy to HDFS).
>
> You enable the second variant via:
> rocksDbBackend.enableFullyAsyncSnapshots();
>
> This is only part of the 1.1-SNAPSHOT version though.
>
> I'm not too familiar with the performance characteristics of both
> variants, but maybe Aljoscha can chime in.
>
> Does this clarify things for you?
>
> – Ufuk
>
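The fully-async variant Ufuk describes can be sketched in plain Java: the synchronous part takes a cheap, consistent view of the state, and the copy to the checkpoint location happens in the background while processing keeps mutating the live state. Note this is only an analogy - real RocksDB snapshots avoid the full copy; the HashMap copy here just keeps the sketch self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of an async snapshot: sync capture, background upload.
public class AsyncSnapshotSketch {

    // Sync part: capture a consistent point-in-time view of the state.
    static Map<String, Long> takeSnapshot(Map<String, Long> state) {
        return new HashMap<>(state);
    }

    public static void main(String[] args) throws Exception {
        Map<String, Long> state = new HashMap<>();
        state.put("k1", 1L);

        Map<String, Long> snapshot = takeSnapshot(state);

        // Async part: "upload" the snapshot in the background.
        Map<String, Long> checkpointLocation = new HashMap<>();
        Thread uploader = new Thread(() -> checkpointLocation.putAll(snapshot));
        uploader.start();

        state.put("k1", 100L);  // processing keeps mutating the live state
        uploader.join();        // join gives us a safe read of the uploaded copy

        System.out.println(checkpointLocation.get("k1")); // prints 1, not 100
    }
}
```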