Re: Is there any Natural Language Processing samples for flink?

2022-07-27 Thread Suneel Marthi
Yes, you can. There have been uses of Apache OpenNLP and other machine
translation libraries in Flink pipelines. Happy to chat with you offline.
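To make that concrete, here is a minimal sketch of calling OpenNLP from inside a Flink streaming job. It only shows tokenization with OpenNLP's built-in SimpleTokenizer; the class name OpenNlpTokenizeJob and the sample sentences are made up for illustration, and a real pipeline would typically load a trained model (e.g. a TokenizerME or a language detector) in open() instead.

import opennlp.tools.tokenize.SimpleTokenizer;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OpenNlpTokenizeJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> sentences = env.fromElements(
                "Flink can call OpenNLP from a map function.",
                "The tokenizer is created once per task in open().");

        sentences
            .map(new RichMapFunction<String, String>() {
                private transient SimpleTokenizer tokenizer;

                @Override
                public void open(Configuration parameters) {
                    // SimpleTokenizer is stateless, so one instance per parallel task is fine.
                    tokenizer = SimpleTokenizer.INSTANCE;
                }

                @Override
                public String map(String sentence) {
                    // Tokenize and re-join, just to have something printable.
                    return String.join(" | ", tokenizer.tokenize(sentence));
                }
            })
            .print();

        env.execute("OpenNLP tokenization in a Flink pipeline");
    }
}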

On Wed, Jul 27, 2022 at 1:07 PM John Smith  wrote:

> But we can use some sort of Java/Scala NLP lib within our Fink Jobs I
> guess...
>
> On Tue, Jul 26, 2022 at 9:59 PM Yunfeng Zhou 
> wrote:
>
>> Hi John,
>>
>> So far as I know, Flink does not have an official library or sample
>> specializing in NLP cases yet. You can refer to Flink ML[1] for machine
>> learning samples or Deep Learning on Flink[2] for deep learning samples.
>>
>> [1] https://github.com/apache/flink-ml
>> [2] https://github.com/flink-extended/dl-on-flink
>>
>> Best,
>> Yunfeng
>>
>> On Tue, Jul 26, 2022 at 10:49 PM John Smith 
>> wrote:
>>
>>> As the title asks... All I see is Spark examples.
>>>
>>> Thanks
>>>
>>


Re: AWS Client Builder with default credentials

2020-02-24 Thread Suneel Marthi
Not sure if this helps, but this is how I invoke a SageMaker endpoint model
from a Flink pipeline.

See
https://github.com/smarthi/NMT-Sagemaker-Inference/blob/master/src/main/java/de/dws/berlin/util/AwsUtil.java
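In case that link moves, the gist of it is roughly the following: build a SageMaker runtime client once per task and call invokeEndpoint for each record. This is only a sketch using the AWS SDK v1 runtime client; the endpoint name "nmt-endpoint" and the text/csv content type are placeholders, and credentials are assumed to come from the default provider chain.

import com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntime;
import com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClientBuilder;
import com.amazonaws.services.sagemakerruntime.model.InvokeEndpointRequest;
import com.amazonaws.services.sagemakerruntime.model.InvokeEndpointResult;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SageMakerScorer extends RichMapFunction<String, String> {

    private transient AmazonSageMakerRuntime runtime;

    @Override
    public void open(Configuration parameters) {
        // One client per parallel task; region and credentials come from the environment.
        runtime = AmazonSageMakerRuntimeClientBuilder.defaultClient();
    }

    @Override
    public String map(String payload) {
        InvokeEndpointRequest request = new InvokeEndpointRequest()
                .withEndpointName("nmt-endpoint")        // placeholder endpoint name
                .withContentType("text/csv")             // depends on how the model was deployed
                .withBody(ByteBuffer.wrap(payload.getBytes(StandardCharsets.UTF_8)));

        InvokeEndpointResult result = runtime.invokeEndpoint(request);
        return StandardCharsets.UTF_8.decode(result.getBody()).toString();
    }
}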



On Mon, Feb 24, 2020 at 10:08 AM David Magalhães 
wrote:

> Hi Robert, thanks for your reply.
>
> GlobalConfiguration.loadConfiguration was useful to check if a
> flink-conf.yml file was on resources, for the integration tests that I'm
> doing. On the cluster I will use the default configurations.
>
> On Fri, Feb 21, 2020 at 10:58 AM Robert Metzger 
> wrote:
>
>> There are multiple ways of passing configuration parameters to your user
>> defined code in Flink
>>
>> a)  use getRuntimeContext().getUserCodeClassLoader().getResource() to
>> load a config file from your user code jar or the classpath.
>> b)  use
>> getRuntimeContext().getExecutionConfig().getGlobalJobParameters() to
>> access a configuration object serialized from the main method.
>> you can pass a custom object to the job parameters, or use Flink's
>> "Configuration" object in your main method:
>>
>> final StreamExecutionEnvironment env = 
>> StreamExecutionEnvironment.getExecutionEnvironment();
>>
>> Configuration config = new Configuration();
>> config.setString("foo", "bar");
>> env.getConfig().setGlobalJobParameters(config);
>>
>> c) Load the flink-conf.yaml:
>>
>> Configuration conf = GlobalConfiguration.loadConfiguration();
>>
>> I'm not 100% sure if this approach works, as it is not intended to be
>> used in user code (I believe).
>>
>>
>> Let me know if this helps!
>>
>> Best,
>> Robert
>>
>> On Thu, Feb 20, 2020 at 1:50 PM Chesnay Schepler 
>> wrote:
>>
>>> First things first, we do not intend for users to use anything in the S3
>>> filesystem modules except the filesystems itself,
>>> meaning that you're somewhat treading on unsupported ground here.
>>>
>>> Nevertheless, the S3 modules contain a large variety of AWS-provided
>>> CredentialsProvider implementations,
>>> that can derive credentials from environment variables, system
>>> properties, files on the classpath and many more.
>>>
>>> Ultimately though, you're kind of asking us how to use AWS APIs, for
>>> which I would direct you to the AWS documentation.
>>>
>>> On 20/02/2020 13:16, David Magalhães wrote:
>>>
>>> I'm using
>>> org.apache.flink.fs.s3base.shaded.com.amazonaws.client.builder.AwsClientBuilder
>>> to create a S3 client to copy objects and delete object inside
>>> a TwoPhaseCommitSinkFunction.
>>>
>>> Shouldn't be another way to set up configurations without put them
>>> hardcoded ? Something like core-site.xml or flink-conf.yaml ?
>>>
>>> Right now I need to have them hardcoded like this.
>>>
>>> AmazonS3ClientBuilder.standard
>>>   .withPathStyleAccessEnabled(true)
>>>   .withEndpointConfiguration(
>>> new EndpointConfiguration("http://minio:9000", "us-east-1")
>>>   )
>>>   .withCredentials(
>>> new AWSStaticCredentialsProvider(new
>>> BasicAWSCredentials("minio", "minio123"))
>>>   )
>>>   .build
>>>
>>> Thanks
>>>
>>>
>>>


Re: Read multiline JSON/XML

2019-11-29 Thread Suneel Marthi
For XML, u could look at Mahout's XMLInputFormat (if u r using HadoopInput
Format).

On Fri, Nov 29, 2019 at 9:01 AM Chesnay Schepler  wrote:

> Why vino?
>
> He's specifically asking whether Flink offers something _like_ spark.
>
> On 29/11/2019 14:39, vino yang wrote:
>
> Hi Flavio,
>
> IMO, it would take more effect to ask this question in the Spark user
> mailing list.
>
> WDYT?
>
> Best,
> Vino
>
> Flavio Pompermaier  wrote on Fri, Nov 29, 2019 at 7:09 PM:
>
>> Hi to all,
>> is there any out-of-the-box option to read multiline JSON or XML like in
>> Spark?
>> It would be awesome to have something like
>>
>> spark.read .option("multiline", true) .json("/path/to/user.json")
>>
>> Best,
>> Flavio
>>
>
>


Re: Do flink have plans to support Deep Learning?

2019-04-18 Thread Suneel Marthi
that's a very open-ended question.

There's been enough work done on using Flink for deep learning model
inference: with TensorFlow (look at Eron Wright's flink-tensorflow
project), with Amazon SageMaker (I have code for that), and with Lightbend's
work on Flink model serving.

So yes, there's enough of that running in production, but don't expect any of
it to be part of the Flink codebase yet.

On Thu, Apr 18, 2019 at 10:01 AM Manjusha Vuyyuru 
wrote:

> Hello,
>
> Do flink have any plans to support Deep Learning, in near future?
>
> Thanks,
> Manju
>
>


Re: Flink Anomaly Detection

2017-07-20 Thread Suneel Marthi
FWIW, we have built a similar log aggregator internally using Apache NiFi
+ the KFC stack

(KFC = Kafka, Flink, Cassandra)

Using Apache NiFi for ingesting logs from Openstack via rsyslog and writing
them out to Kafka topics -> Flink Streaming + CEP for detecting anomalous
patterns -> persist the patterns with relevant metadata to Cassandra ->
Dashboard or Search Engine.

We are using Flink CEP for detecting patterns in server logs and to flag
alerts onto a dashboard.

You can check out the implementation here -
https://github.com/keedio/openstack-log-processor
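The CEP part of that pipeline boils down to a pattern like the one sketched below (Flink 1.3-style CEP API). LogEvent is a hypothetical POJO with getHost() and getSeverity(), logStream is the DataStream<LogEvent> coming out of Kafka, and the "5 errors within a minute" threshold is purely illustrative.

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.util.List;
import java.util.Map;

// Flag hosts that emit 5 ERROR events within one minute.
Pattern<LogEvent, ?> errorBurst = Pattern.<LogEvent>begin("errors")
        .where(new SimpleCondition<LogEvent>() {
            @Override
            public boolean filter(LogEvent evt) {
                return "ERROR".equals(evt.getSeverity());
            }
        })
        .times(5)
        .within(Time.minutes(1));

DataStream<String> alerts = CEP.pattern(logStream.keyBy("host"), errorBurst)
        .select(new PatternSelectFunction<LogEvent, String>() {
            @Override
            public String select(Map<String, List<LogEvent>> match) {
                List<LogEvent> errors = match.get("errors");
                return "Error burst on host " + errors.get(0).getHost();
            }
        });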



On Thu, Jul 20, 2017 at 5:47 PM, Branham, Jeremy [IT] <
jeremy.d.bran...@sprint.com> wrote:

> Raj -
> I'm looking for the same thing.
> As the ML library doesn't support DataStream api, I'm tossing ideas around
> maybe using the windowing function to build up a model that changes over
> time.
>
>
>
> Jeremy D. Branham
> Technology Architect - Sprint
> O: +1 (972) 405-2970 | M: +1 (817) 791-1627
> jeremy.d.bran...@sprint.com
> #gettingbettereveryday
>
>
> -Original Message-
> From: Raj Kumar [mailto:smallthings1...@gmail.com]
> Sent: Thursday, July 20, 2017 4:24 PM
> To: user@flink.apache.org
> Subject: Flink Anomaly Detection
>
> Hi,
>
> I don't see much discussion on Anomaly detection using Flink. we are
> working on a project where we need to monitor the server logs in real time.
> If there is any sudden spike in the number of transactions(Unusual), server
> errors, we need to create an alert.
>
> 1. How can we best achieve this?
> 2. How do we store the historical information about the patterns observed
> and compute the baseline? Do we need any external source like Elasticsearch
> to store the window snapshots to build a baseline?
> 3. Baseline should be self-learning as new patterns are discovered and
> baseline should get adjusted based on this.
> 4. Flink ML has any capabilities to achieve this?
>
> Please let me know if you have any approach/suggestions ?
>
>
>
> --
> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Anomaly-Detection-tp14370.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>
> 
>
>


Re: Integrating Flink CEP with a Rules Engine

2017-06-23 Thread Suneel Marthi
Sorry I didn't read the whole thread.

We have a similar requirement wherein users would like to add/update/delete
CEP patterns via a UI or REST API, and we started discussing building a REST
API for that. Glad to see that this is a common ask; if there's already
a community effort around this, that's great to know.

On Fri, Jun 23, 2017 at 9:54 AM, Sridhar Chellappa <flinken...@gmail.com>
wrote:

> Folks,
>
> Plenty of very good points but I see this discussion digressing from what
> I originally asked for. We need a dashboard to let the Business Analysts to
> define rules and the CEP to run them.
>
> My original question was how to solve this with Flink CEP?
>
> From what I see, this is not a solved problem. Correct me if I am wrong.
>
> On Fri, Jun 23, 2017 at 6:52 PM, Kostas Kloudas <
> k.klou...@data-artisans.com> wrote:
>
>> Hi all,
>>
>> Currently there is an ongoing effort to integrate FlinkCEP with Flink's
>> SQL API.
>> There is already an open FLIP for this:
>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-20%
>> 3A+Integration+of+SQL+and+CEP
>> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-20:+Integration+of+SQL+and+CEP>
>>
>> So, if there was an effort for integration of different
>> libraries/tools/functionality as well, it
>> would be nice to go a bit more into details on i) what is already there,
>> ii) what is planned to be
>> integrated for the SQL effort, and iii) what else is required, and
>> consolidate the resources
>> available.
>>
>> This will allow the community to move faster and with a clear roadmap.
>>
>> Kostas
>>
>> On Jun 23, 2017, at 2:51 PM, Suneel Marthi <smar...@apache.org> wrote:
>>
>> FWIW, here's an old Cloudera blog about using Drools with Spark.
>>
>> https://blog.cloudera.com/blog/2015/11/how-to-build-a-comple
>> x-event-processing-app-on-apache-spark-and-drools/
>>
>> It should be possible to invoke Drools from Flink in a similar way (I
>> have not tried it).
>>
>> It all depends on what the use case and how much of present Flink CEP
>> satisfies the use case before considering integration with more complex
>> rule engines.
>>
>>
>> Disclaimer: I work for Red Hat
>>
>> On Fri, Jun 23, 2017 at 8:43 AM, Ismaël Mejía <ieme...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> It is really interesting to see this discussion because that was one
>>> of the questions on the presentation on CEP at Berlin Buzzwords, and
>>> this is one line of work that may eventually make sense to explore.
>>>
>>> Rule engines like drools implement the Rete algorithm that if I
>>> understood correctly optimizes the analysis of a relatively big set of
>>> facts (conditions) into a simpler evaluation graph. For more details
>>> this is a really nice explanation.
>>> https://www.sparklinglogic.com/rete-algorithm-demystified-part-2/
>>>
>>> On flink's CEP I have the impression that you define this graph by
>>> hand. Using a rule engine you could infer an optimal graph from the
>>> set of rules, and then this graph could be translated into CEP
>>> patterns.
>>>
>>> Of course take all of this with a grain of salt because I am not an
>>> expert on both CEP or the Rete algorithm, but I start to see the
>>> connection of both worlds more clearly now. So if anyone else has
>>> ideas of the feasibility of this or can see some other
>>> issues/consequences please comment. I also have the impression that
>>> distribution is less of an issue because the rete network is
>>> calculated only once and updates are not 'dynamic' (but I might be
>>> wrong).
>>>
>>> Ismaël
>>>
>>> ps. I add Thomas in copy who was who made the question in the
>>> conference in case he has some comments/ideas.
>>>
>>>
>>> On Fri, Jun 23, 2017 at 1:48 PM, Kostas Kloudas
>>> <k.klou...@data-artisans.com> wrote:
>>> > Hi Jorn and Sridhar,
>>> >
>>> > It would be worth describing a bit more what these tools are and what
>>> are
>>> > your needs.
>>> > In addition, and to see what the CEP library already offers here you
>>> can
>>> > find the documentation:
>>> >
>>> > https://ci.apache.org/projects/flink/flink-docs-release-1.3/
>>> dev/libs/cep.html
>>> >
>>> >
>>> > Thanks,
>>> > Kostas
>>> >
>>> > On Jun 23, 2017,

Re: Integrating Flink CEP with a Rules Engine

2017-06-23 Thread Suneel Marthi
FWIW, here's an old Cloudera blog about using Drools with Spark.

https://blog.cloudera.com/blog/2015/11/how-to-build-a-complex-event-processing-app-on-apache-spark-and-drools/

It should be possible to invoke Drools from Flink in a similar way (I have
not tried it).
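A rough sketch of what that could look like, under the assumption that the rules are packaged as a kmodule on the job's classpath and that Order is the fact type the rules operate on (both names are made up here):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.StatelessKieSession;

public class DroolsEvaluator extends RichMapFunction<Order, Order> {

    private transient StatelessKieSession session;

    @Override
    public void open(Configuration parameters) {
        // One rule session per parallel task. Facts are only visible locally,
        // which is the caveat about Flink being distributed mentioned in this thread.
        KieContainer container = KieServices.Factory.get().getKieClasspathContainer();
        session = container.newStatelessKieSession();
    }

    @Override
    public Order map(Order order) {
        session.execute(order);   // rules may enrich or flag the fact in place
        return order;
    }
}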

It all depends on the use case and on how much of it the present Flink CEP
satisfies before considering integration with more complex rule engines.


Disclaimer: I work for Red Hat

On Fri, Jun 23, 2017 at 8:43 AM, Ismaël Mejía  wrote:

> Hello,
>
> It is really interesting to see this discussion because that was one
> of the questions on the presentation on CEP at Berlin Buzzwords, and
> this is one line of work that may eventually make sense to explore.
>
> Rule engines like drools implement the Rete algorithm that if I
> understood correctly optimizes the analysis of a relatively big set of
> facts (conditions) into a simpler evaluation graph. For more details
> this is a really nice explanation.
> https://www.sparklinglogic.com/rete-algorithm-demystified-part-2/
>
> On flink's CEP I have the impression that you define this graph by
> hand. Using a rule engine you could infer an optimal graph from the
> set of rules, and then this graph could be translated into CEP
> patterns.
>
> Of course take all of this with a grain of salt because I am not an
> expert on both CEP or the Rete algorithm, but I start to see the
> connection of both worlds more clearly now. So if anyone else has
> ideas of the feasibility of this or can see some other
> issues/consequences please comment. I also have the impression that
> distribution is less of an issue because the rete network is
> calculated only once and updates are not 'dynamic' (but I might be
> wrong).
>
> Ismaël
>
> ps. I add Thomas in copy who was who made the question in the
> conference in case he has some comments/ideas.
>
>
> On Fri, Jun 23, 2017 at 1:48 PM, Kostas Kloudas
>  wrote:
> > Hi Jorn and Sridhar,
> >
> > It would be worth describing a bit more what these tools are and what are
> > your needs.
> > In addition, and to see what the CEP library already offers here you can
> > find the documentation:
> >
> > https://ci.apache.org/projects/flink/flink-docs-
> release-1.3/dev/libs/cep.html
> >
> >
> > Thanks,
> > Kostas
> >
> > On Jun 23, 2017, at 1:41 PM, Jörn Franke  wrote:
> >
> > Hallo,
> >
> > It si possible, but some caveat : flink is a distributed system, but in
> > drools the fact are only locally available. This may lead to strange
> effects
> > when rules update the fact base.
> >
> > Best regards
> >
> > On 23. Jun 2017, at 12:49, Sridhar Chellappa 
> wrote:
> >
> > Folks,
> >
> > I am new to Flink.
> >
> > One of the reasons why I am interested in Flink is because of its CEP
> > library. Our CEP logic comprises of a set of complex business rules which
> > will have to be managed (Create, Update, Delete) by a bunch of business
> > analysts.
> >
> > Is there a way I can integrate other third party tools (Drools,
> OpenRules)
> > to let Business Analysts define rules and  execute them using Flink's CEP
> > library?
> >
> >
>


Re: Suggestion for top 'k' products

2017-03-13 Thread Suneel Marthi
For an example implementation using Flink, check out
https://github.com/bigpetstore/bigpetstore-flink/blob/master/src/main/java/org/apache/bigtop/bigpetstore/flink/java/FlinkStreamingRecommender.java

On Mon, Mar 13, 2017 at 1:29 PM, Suneel Marthi <smar...@apache.org> wrote:

> A simple way is to populate a Priority Queue of  max size 'k' and
> implement a comparator on ur records.  That would ensure that u always have
> Top k records at any instant in time.
>
> On Mon, Mar 13, 2017 at 1:25 PM, Meghashyam Sandeep V <
> vr1meghash...@gmail.com> wrote:
>
>> Hi All,
>>
>> I'm trying to use Flink for a use case where I would want to see my top
>> selling products in time windows in near real time (windows of size 1-2
>> mins if fine). I guess this is the most common use case to use streaming
>> apis in e-commerce. I see that I can iterate over records in a windowed
>> stream and do the sorting myself. I'm wondering if thats the best way. Is
>> there any in built sort functionality that I missed anywhere in Flink docs?
>>
>> Thanks,
>> Sandeep
>>
>
>


Re: Suggestion for top 'k' products

2017-03-13 Thread Suneel Marthi
A simple way is to populate a priority queue of max size 'k' and implement
a comparator on your records. That would ensure that you always have the top k
records at any instant in time.
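A sketch of that idea as a window function (Flink 1.1+ window function signature), assuming the input has already been pre-aggregated to (productId, salesCount) pairs per window; the class name and the Tuple2 shape are illustrative.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.AllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopKProducts implements AllWindowFunction<Tuple2<String, Long>, List<Tuple2<String, Long>>, TimeWindow> {

    private final int k;

    public TopKProducts(int k) {
        this.k = k;
    }

    @Override
    public void apply(TimeWindow window, Iterable<Tuple2<String, Long>> counts,
                      Collector<List<Tuple2<String, Long>>> out) {
        // Min-heap ordered by sales count: the smallest of the current top k sits at the head.
        PriorityQueue<Tuple2<String, Long>> heap =
                new PriorityQueue<>(k, Comparator.comparingLong((Tuple2<String, Long> t) -> t.f1));
        for (Tuple2<String, Long> count : counts) {
            heap.offer(count);
            if (heap.size() > k) {
                heap.poll();   // evict the current smallest, keeping only the top k
            }
        }
        out.collect(new ArrayList<>(heap));
    }
}

Applied as something like counts.timeWindowAll(Time.minutes(2)).apply(new TopKProducts(10)); a keyed variant with a WindowFunction works the same way.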

On Mon, Mar 13, 2017 at 1:25 PM, Meghashyam Sandeep V <
vr1meghash...@gmail.com> wrote:

> Hi All,
>
> I'm trying to use Flink for a use case where I would want to see my top
> selling products in time windows in near real time (windows of size 1-2
> mins if fine). I guess this is the most common use case to use streaming
> apis in e-commerce. I see that I can iterate over records in a windowed
> stream and do the sorting myself. I'm wondering if thats the best way. Is
> there any in built sort functionality that I missed anywhere in Flink docs?
>
> Thanks,
> Sandeep
>


Re: Question tableEnv.toDataSet : Table to DataSet<Tuple2<String,Integer>>

2016-08-20 Thread Suneel Marthi
I can confirm that the code you have works in Flink 1.1.0

On Sat, Aug 20, 2016 at 3:37 PM, Camelia Elena Ciolac 
wrote:

>
> Good evening,
>
>
> I started working with the beta Flink SQL in BatchTableEnvironment and I
> am interested in converting the resulting Table object into a
> DataSet<Tuple2<String, Integer>>.
>
> I give some lines of code as example:
>
>
> DataSet> ds1 = ...
>
> tableEnv.registerDataSet("EventsTable", ds1, 
> "event_id,region,latitude,longitude");
>
>
> Table result = tableEnv.sql("SELECT region, COUNT(event_id) AS counter
> FROM EventsTable GROUP BY region");
>
> TypeInformation<Tuple2<String, Integer>> typeInformation =
> TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() { });
> DataSet<Tuple2<String, Integer>> resDS = tableEnv.toDataSet(result,
> typeInformation);
>
> At runtime, I get error:
>
>
> Caused by: org.apache.flink.api.table.codegen.CodeGenException:
> Incompatible types of expression and result type.
> at org.apache.flink.api.table.codegen.CodeGenerator$$anonfun$
> generateResultExpression$2.apply(CodeGenerator.scala:327)
> at org.apache.flink.api.table.codegen.CodeGenerator$$anonfun$
> generateResultExpression$2.apply(CodeGenerator.scala:325)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.flink.api.table.codegen.CodeGenerator.
> generateResultExpression(CodeGenerator.scala:325)
> at org.apache.flink.api.table.codegen.CodeGenerator.
> generateConverterResultExpression(CodeGenerator.scala:269)
> at org.apache.flink.api.table.plan.nodes.dataset.DataSetRel$
> class.getConversionMapper(DataSetRel.scala:89)
> at org.apache.flink.api.table.plan.nodes.dataset.DataSetAggregate.
> getConversionMapper(DataSetAggregate.scala:38)
> at org.apache.flink.api.table.plan.nodes.dataset.DataSetAggregate.
> translateToPlan(DataSetAggregate.scala:142)
> at org.apache.flink.api.table.BatchTableEnvironment.translate(
> BatchTableEnvironment.scala:274)
> at org.apache.flink.api.java.table.BatchTableEnvironment.toDataSet(
> BatchTableEnvironment.scala:163)
>
>
> What is the correct way to achieve this ?
>
> According to solved JIRA issue FLINK-1991 (https://issues.
> apache.org/jira/browse/FLINK-1991) , this is now possible (in Flink
> 1.1.0).
>
>
> As a side note, it works transforming to DataSet, but when trying to
> convert to a POJO dataset (again with fields String region and Integer
> counter) it gives the same error as above.
>
>
> Thanks in advance,
>
> Best regards,
>
> Camelia
>
>
>
>


Re: flink1.0 DataStream groupby

2016-07-21 Thread Suneel Marthi
It should be keyBy(0) for the DataStream API (since Flink 0.9).

It's groupBy() in the DataSet API.
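For example (Java API shown as a rough sketch; the Scala DataStream has the same keyBy method, and the sample data below mirrors your list):

// env is a StreamExecutionEnvironment
DataStream<Tuple2<String, String>> ds = env.fromElements(
        new Tuple2<>("1", "a"), new Tuple2<>("2", "b"),
        new Tuple2<>("1", "c"), new Tuple2<>("2", "f"));

// keyBy(0) partitions the stream on the first tuple field; downstream
// operators then see the records grouped per key.
KeyedStream<Tuple2<String, String>, Tuple> byKey = ds.keyBy(0);
byKey.print();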

On Fri, Jul 22, 2016 at 1:27 AM,  wrote:

> Hi,
> today,I use flink to rewrite my spark project,in spark ,data is
> rdd,and it have much transformations and actions,but in flink,the
> DataStream does not have groupby and foreach,
>   for example,
>val env=StreamExecutionEnvironment.createLocalEnvironment()
>   val data=List(("1"->"a"),("2"->"b"),("1"->"c"),("2"->"f"))
>   val ds=env.fromCollection(data)
>   val dskeyby=ds.groupBy(0)
>   ds.print()
>  env.execute()
>
> the code "val dskeyby=ds.groupBy(0)" is error,say "value groupBy is not a
> member of org.apache.flink.streaming.api.scala.DataStream"
> so , the solution is?
> 
>
>
>
>
>


Re: Using Kafka and Flink for batch processing of a batch data source

2016-07-21 Thread Suneel Marthi
I meant to respond to this thread yesterday, but I got busy with work and it
slipped my mind.

This should be doable using Flink Streaming; others can correct me here.

*Assumption:* Both the batch and streaming processes are reading from a
single Kafka topic, and by "batched data" I am assuming it's the same data
that's being fed to streaming, but aggregated over a longer time period.

This could be done using a Lambda like Architecture.

1. A Kafka topic that's ingesting data to be distributed to various
consumers.
2. A Flink Streaming process with a small time window (minutes/seconds)
that's ingesting from Kafka and handles data over this small window.
3. Another Flink Streaming process with a very long time window (few hrs ?)
that's also ingesting from Kafka and is munging over large time periods of
data (think mini-batch that extends Streaming).

This should work, and you don't need a separate batch process. A similar
architecture using Spark Streaming (for both batch and streaming) is
demonstrated by Cloudera's Oryx 2.0 project; see http://oryx.io
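A rough sketch of points 2 and 3, assuming the 0.9 Kafka connector and a plain string schema; the topic name, group id and window sizes are placeholders. Both paths are shown in one job here for brevity, whereas in practice they would be the two separate Flink Streaming processes described above.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

import java.util.Properties;

public class TwoSpeedConsumers {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "flink-consumers");

        DataStream<String> events =
                env.addSource(new FlinkKafkaConsumer09<>("events", new SimpleStringSchema(), props));

        // Map every record to a (label, 1) pair so we can simply sum per window.
        DataStream<Tuple2<String, Integer>> counted =
                events.map(new MapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String event) {
                        return new Tuple2<>("events", 1);
                    }
                });

        // Small window: the "streaming" view over seconds/minutes.
        counted.timeWindowAll(Time.minutes(1)).sum(1).print();

        // Large window: the mini-batch view munging over hours of the same data.
        counted.timeWindowAll(Time.hours(6)).sum(1).print();

        env.execute("short and long windows over one Kafka topic");
    }
}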


On Thu, Jul 21, 2016 at 12:41 PM, milind parikh 
wrote:

> At this point in time, imo, batch processing is not why you should be
> considering Flink.
>
> That said, I predict that the stream processing (and event processing)
> will become the dominant methodology; as we begin to gravitate towards  "I
> can't wait; I want it now" phenomenon. In that methodology,  I believe
> Flink represents the cutting edge of what is possible; at this point in
> time.
>
> Regards
> Milind
>
> On Jul 20, 2016 4:57 PM, "Leith Mudge"  wrote:
>
> Thanks Milind & Till,
>
>
>
> This is what I thought from my reading of the documentation but it is nice
> to have it confirmed by people more knowledgeable.
>
>
>
> Supplementary to this question is whether Flink is the best choice for
> batch processing at this point in time or whether I would be better to look
> at a more mature and dedicated batch processing engine such as Spark? I do
> like the choices that adopting the unified programming model outlined in
> Apache Beam/Google Cloud Dataflow SDK and this purports to have runners for
> both Flink and Spark.
>
>
>
> Regards,
>
>
>
> Leith
>
> *From: *Till Rohrmann 
> *Date: *Wednesday, 20 July 2016 at 5:05 PM
> *To: *
> *Subject: *Re: Using Kafka and Flink for batch processing of a batch data
> source
>
>
>
> At the moment there is also no batch source for Kafka. I'm also not so
> sure how you would define a batch given a Kafka stream. Only reading till a
> certain offset? Or maybe until one has read n messages?
>
>
>
> I think it's best to write the batch data to HDFS or another batch data
> store.
>
>
>
> Cheers,
>
> Till
>
>
>
> On Wed, Jul 20, 2016 at 8:08 AM, milind parikh 
> wrote:
>
> It likely does not make sense to publish a file ( "batch data") into
> Kafka; unless the file is very small.
>
> An improvised pub-sub mechanism for Kafka could be to (a) write the file
> into a persistent store outside of kafka (b) publishing of a message into
> Kafka about that write so as to enable processing of that file.
>
> If you really needed to have provenance around processing, you could route
> data processing through Nifi before Flink.
>
> Regards
> Milind
>
>
>
> On Jul 19, 2016 9:37 PM, "Leith Mudge"  wrote:
>
> I am currently working on an architecture for a big data streaming and
> batch processing platform. I am planning on using Apache Kafka for a
> distributed messaging system to handle data from streaming data sources and
> then pass on to Apache Flink for stream processing. I would also like to
> use Flink's batch processing capabilities to process batch data.
>
> Does it make sense to pass the batched data through Kafka on a periodic
> basis as a source for Flink batch processing (is this even possible?) or
> should I just write the batch data to a data store and then process by
> reading into Flink?
>
>
> --
>
>
>
>
>
> --
>

Re: Random access to small global state

2016-07-09 Thread Suneel Marthi
You could use Apache Ignite too; I believe they have a plugin for Flink streaming.

Sent from my iPhone

> On Jul 9, 2016, at 8:05 AM, Sebastian  wrote:
> 
> Hi,
> 
> I'm planning to work on a streaming recommender in Flink, and one problem 
> that I have is that the algorithm needs random access to a small global state 
> (say a million counts). It should be ok if there is some inconsistency in the 
> state (e.g., delay in seeing updates).
> 
> Does anyone here have experience with such things? I'm thinking of connecting 
> Flink to a lighweight in-memory key-value store such as memcache for that.
> 
> Best,
> Sebastian


Re: Reading whole files (from S3)

2016-06-07 Thread Suneel Marthi
You can use Mahout's XMLInputFormat with Flink's HadoopInputFormat
wrappers. See


http://stackoverflow.com/questions/29429428/xmlinputformat-for-apache-flink
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Read-XML-from-HDFS-td7023.html
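For completeness, a rough sketch of wiring that up through Flink's Hadoop compatibility layer. Note that the XmlInputFormat package differs between Mahout versions (it has lived under org.apache.mahout.text.wikipedia and org.apache.mahout.classifier.bayes), and the start/end tags and S3 path below are placeholders:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.mahout.text.wikipedia.XmlInputFormat;  // package varies by Mahout version

public class XmlRead {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // XmlInputFormat splits the file on a start/end tag pair and emits each
        // XML fragment as the record value.
        Job job = Job.getInstance();
        job.getConfiguration().set("xmlinput.start", "<record>");
        job.getConfiguration().set("xmlinput.end", "</record>");

        DataSet<Tuple2<LongWritable, Text>> fragments = env.readHadoopFile(
                new XmlInputFormat(), LongWritable.class, Text.class,
                "s3://my-bucket/data/", job);

        fragments.print();
    }
}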


On Tue, Jun 7, 2016 at 10:11 PM, Jamie Grier 
wrote:

> Hi Andrea,
>
> How large are these data files?  The implementation you've mentioned here
> is only usable if they are very small.  If so, you're fine.  If not read
> on...
>
> Processing XML input files in parallel is tricky.  It's not a great format
> for this type of processing as you've seen.  They are tricky to split and
> more complex to iterate through than simpler formats. However, others have
> implemented XMLInputFormat classes for Hadoop.  Have you looked at these?
> Mahout has an XMLInputFormat implementation for example but I haven't used
> it directly.
>
> Anyway, you can reuse Hadoop InputFormat implementations in Flink
> directly.  This is likely a good route.  See Flink's HadoopInputFormat
> class.
>
> -Jamie
>
>
> On Tue, Jun 7, 2016 at 7:35 AM, Andrea Cisternino 
> wrote:
>
>> Hi all,
>>
>> I am evaluating Apache Flink for processing large sets of Geospatial data.
>> The use case I am working on will involve reading a certain number of GPX
>> files stored on Amazon S3.
>>
>> GPX files are actually XML files and therefore cannot be read on a line
>> by line basis.
>> One GPX file will produce one or more Java objects that will contain the
>> geospatial data we need to process (mostly a list of geographical points).
>>
>> To cover this use case I tried to extend the FileInputFormat class:
>>
>> public class WholeFileInputFormat extends FileInputFormat<String>
>> {
>>   private boolean hasReachedEnd = false;
>>
>>   public WholeFileInputFormat() {
>> unsplittable = true;
>>   }
>>
>>   @Override
>>   public void open(FileInputSplit fileSplit) throws IOException {
>> super.open(fileSplit);
>> hasReachedEnd = false;
>>   }
>>
>>   @Override
>>   public String nextRecord(String reuse) throws IOException {
>> // uses apache.commons.io.IOUtils
>> String fileContent = IOUtils.toString(stream, StandardCharsets.UTF_8);
>> hasReachedEnd = true;
>> return fileContent;
>>   }
>>
>>   @Override
>>   public boolean reachedEnd() throws IOException {
>> return hasReachedEnd;
>>   }
>> }
>>
>> This class returns the content of the whole file as a string.
>>
>> Is this the right approach?
>> It seems to work when run locally with local files but I wonder if it
>> would
>> run into problems when tested in a cluster.
>>
>> Thanks in advance.
>>   Andrea.
>>
>> --
>> Andrea Cisternino, Erlangen, Germany
>> GitHub: http://github.com/acisternino
>> GitLab: https://gitlab.com/u/acisternino
>>
>
>
>
> --
>
> Jamie Grier
> data Artisans, Director of Applications Engineering
> @jamiegrier 
> ja...@data-artisans.com
>
>


Re: how to convert datastream to collection

2016-05-03 Thread Suneel Marthi
DataStream<Centroid> newCentroids = ...;

Iterator<Centroid> iter = DataStreamUtils.collect(newCentroids);

List<Centroid> list = Lists.newArrayList(iter);

On Tue, May 3, 2016 at 10:26 AM, subash basnet  wrote:

> Hello all,
>
> Suppose I have the datastream as:
> DataStream<Centroid> newCentroids;
>
> How to get collection of newCentroids to be able to loop as below:
>  private Collection<Centroid> centroids;
>  for (Centroid cent : centroids) {
>   }
>
>
>
> Best Regards,
> Subash Basnet
>


Re: Gelly CommunityDetection in scala example

2016-04-27 Thread Suneel Marthi
I recall facing a similar issue while trying to contribute a Gelly Scala
example to flink-training.

See
https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/scala/com/dataartisans/flinktraining/exercises/gelly_scala/PageRankWithEdgeWeights.scala

On Wed, Apr 27, 2016 at 11:35 AM, Trevor Grant 
wrote:

> The following example in the scala shell worked as expected:
>
> import org.apache.flink.graph.library.LabelPropagation
>
> val verticesWithCommunity = graph.run(new LabelPropagation(30))
>
> // print the result
> verticesWithCommunity.print
>
>
> I tried to extend the example to use CommunityDetection:
>
> import org.apache.flink.graph.library.CommunityDetection
>
> val verticesWithCommunity = graph.run(new CommunityDetection(30, 0.5))
>
> // print the result
> verticesWithCommunity.print
>
>
> And meant the following error:
> error: polymorphic expression cannot be instantiated to expected type;
> found : [K]org.apache.flink.graph.library.CommunityDetection[K]
> required: org.apache.flink.graph.GraphAlgorithm[Long,String,Double,?]
> val verticesWithCommunity = graph.run(new CommunityDetection(30, 0.5))
> ^
>
> I haven't been able to come up with a hack to make this work. Any
> advice/bug?
>
> I invtestigated the code base a little, seems to be an issue with what
> Graph.run expects to see vs. what LabelPropagation returns vs. what
> CommunityDetection returns.
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>


Re: Powered by Flink

2016-04-06 Thread Suneel Marthi
I was going to hold off on that until we get Mahout 0.12.0 out the door
(targeted for this weekend).

I would add Apache NiFi to the list.

Future :

Apache Mahout
Apache BigTop

Openstack and Kubernetes (skunkworks)


On Wed, Apr 6, 2016 at 3:03 AM, Sebastian <s...@apache.org> wrote:

> You should also add Apache Mahout, whose new Samsara DSL also runs on
> Flink.
>
> -s
>
> On 06.04.2016 08:50, Henry Saputra wrote:
>
>> Thanks, Slim. I have just updated the wiki page with this entries.
>>
>> On Tue, Apr 5, 2016 at 10:20 AM, Slim Baltagi <sbalt...@gmail.com> wrote:
>>
>> Hi
>>
>> The following are missing in the ‘Powered by Flink’ list:
>>
>>   * king.com - https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces88
>>   * Otto Group - http://data-artisans.com/how-we-selected-apache-flink-at-otto-group/
>>   * Eura Nova - https://research.euranova.eu/flink-forward-2015-talk/
>>   * Big Data Europe - http://www.big-data-europe.eu
>>
>> Thanks
>>
>> Slim Baltagi
>>
>>
>>> On Apr 5, 2016, at 10:08 AM, Robert Metzger <rmetz...@apache.org> wrote:
>>>
>>> Hi everyone,
>>>
>>> I would like to bring the "Powered by Flink" wiki page [1] to the
>>> attention of Flink user's who recently joined the Flink community.
>>> The list tracks which organizations are using Flink.
>>> If your company / university / research institute / ... is using
>>> Flink but the name is not yet listed there, let me know and I'll
>>> add the name.
>>>
>>> Regards,
>>> Robert
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/FLINK/Powered+by+Flink
>>>
>>>
>>> On Mon, Oct 19, 2015 at 4:10 PM, Matthias J. Sax <mj...@apache.org
>>> <mailto:mj...@apache.org>> wrote:
>>>
>>> +1
>>>
>>> On 10/19/2015 04:05 PM, Maximilian Michels wrote:
>>> > +1 Let's collect in the Wiki for now. At some point in time,
>>> we might
>>> > want to have a dedicated page on the Flink homepage.
>>> >
>>> > On Mon, Oct 19, 2015 at 3:31 PM, Timo Walther
>>> <twal...@apache.org <mailto:twal...@apache.org>> wrote:
>>> >> Ah ok, sorry. I think linking to the wiki is also ok.
>>> >>
>>> >>
>>> >> On 19.10.2015 15:18, Fabian Hueske wrote:
>>> >>>
>>> >>> @Timo: The proposal was to keep the list in the wiki (can
>>> be easily
>>> >>> extended) but link from the main website to the wiki page.
>>> >>>
>>> >>> 2015-10-19 15:16 GMT+02:00 Timo Walther
>>> <twal...@apache.org <mailto:twal...@apache.org>>:
>>> >>>
>>> >>>> +1 for adding it to the website instead of wiki.
>>> >>>> "Who is using Flink?" is always a question difficult to
>>> answer to
>>> >>>> interested users.
>>> >>>>
>>> >>>>
>>> >>>> On 19.10.2015 15:08, Suneel Marthi wrote:
>>> >>>>
>>> >>>> +1 to this.
>>> >>>>
>>> >>>> On Mon, Oct 19, 2015 at 3:00 PM, Fabian Hueske
>>> <fhue...@gmail.com <mailto:fhue...@gmail.com>> wrote:
>>> >>>>
>>> >>>>> Sounds good +1
>>> >>>>>
>>> >>>>> 2015-10-19 14:57 GMT+02:00 Márton Balassi <
>>> <balassi.mar...@gmail.com <mailto:balassi.mar...@gmail.com>>
>>> >>>>> balassi.mar...@gmail.com <mailto:balassi.mar...@gmail.com
>>> >>:
>>> >>>>>
>>> >>>>>> Thanks for starting and big +1 for making it more
>>> prominent.
>>> >>>>>>
>>> >>>>>> On Mon, Oct 19, 2015 at 2:53 PM, Fabian Hueske <
>>>

Re: Flink ML 1.0.0 - Saving and Loading Models to Score a Single Feature Vector

2016-03-29 Thread Suneel Marthi
You may want to use the FlinkMLTools.persist() methods, which use
TypeSerializerFormat and don't enforce IOReadableWritable.



On Tue, Mar 29, 2016 at 2:12 PM, Sourigna Phetsarath <
gna.phetsar...@teamaol.com> wrote:

> Till,
>
> Thank you for your reply.
>
> Having this issue though, WeightVector does not extend IOReadWriteable:
>
> *public* *class* SerializedOutputFormat<*T* *extends* IOReadableWritable>
>
> *case* *class* WeightVector(weights: Vector, intercept: Double) *extends*
> Serializable {}
>
>
> However, I will use the approach to write out the weights as text.
>
>
> On Tue, Mar 29, 2016 at 5:01 AM, Till Rohrmann 
> wrote:
>
>> Hi Gna,
>>
>> there are no utilities yet to do that but you can do it manually. In the
>> end, a model is simply a Flink DataSet which you can serialize to some
>> file. Upon reading this DataSet you simply have to give it to your
>> algorithm to be used as the model. The following code snippet illustrates
>> this approach:
>>
>> mlr.fit(inputDS, parameters)
>>
>> // write model to disk using the SerializedOutputFormat
>> mlr.weightsOption.get.write(new SerializedOutputFormat[WeightVector], "path")
>>
>> // read the serialized model from disk
>> val model = env.readFile(new SerializedInputFormat[WeightVector], "path")
>>
>> // set the read model for the MLR algorithm
>> mlr.weightsOption = model
>>
>> Cheers,
>> Till
>> ​
>>
>> On Tue, Mar 29, 2016 at 10:46 AM, Simone Robutti <
>> simone.robu...@radicalbit.io> wrote:
>>
>>> To my knowledge there is nothing like that. PMML is not supported in any
>>> form and there's no custom saving format yet. If you really need a quick
>>> and dirty solution, it's not that hard to serialize the model into a file.
>>>
>>> 2016-03-28 17:59 GMT+02:00 Sourigna Phetsarath <
>>> gna.phetsar...@teamaol.com>:
>>>
 Flinksters,

 Is there an example of saving a Trained Model, loading a Trained Model
 and then scoring one or more feature vectors using Flink ML?

 All of the examples I've seen have shown only sequential fit and
 predict.

 Thank you.

 -Gna
 --


 *Gna Phetsarath*System Architect // AOL Platforms // Data Services //
 Applied Research Chapter
 770 Broadway, 5th Floor, New York, NY 10003
 o: 212.402.4871 // m: 917.373.7363
 vvmr: 8890237 aim: sphetsarath20 t: @sourigna

 * *

>>>
>>>
>>
>
>
> --
>
>
> *Gna Phetsarath*System Architect // AOL Platforms // Data Services //
> Applied Research Chapter
> 770 Broadway, 5th Floor, New York, NY 10003
> o: 212.402.4871 // m: 917.373.7363
> vvmr: 8890237 aim: sphetsarath20 t: @sourigna
>
> * *
>


Re: Upserts with Flink-elasticsearch

2016-03-28 Thread Suneel Marthi
Would it be useful to modify the existing Elasticsearch 1.x sink to be able
to handle upserts?
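Until then, something along these lines works with the 2.x connector Zach mentions below; the index name, type and fields are placeholders, and the element type (id, count) is just for illustration:

import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.connectors.elasticsearch2.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch2.RequestIndexer;
import org.elasticsearch.action.update.UpdateRequest;

import java.util.HashMap;
import java.util.Map;

public class UpsertCounts implements ElasticsearchSinkFunction<Tuple2<String, Long>> {

    @Override
    public void process(Tuple2<String, Long> element, RuntimeContext ctx, RequestIndexer indexer) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("count", element.f1);

        // docAsUpsert: insert the document if the id is new, otherwise partial-update it.
        UpdateRequest upsert = new UpdateRequest("counts", "count", element.f0)
                .doc(doc)
                .docAsUpsert(true);

        indexer.add(upsert);
    }
}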


On Mon, Mar 28, 2016 at 5:32 PM, Zach Cox  wrote:

> Hi Madhukar - with the current Elasticsearch sink in Flink 1.0.0 [1], I
> don't think an upsert is possible, since IndexRequestBuilder can only
> return an IndexRequest.
>
> In Flink 1.1, the Elasticsearch 2.x sink [2] provides a RequestIndexer [3]
> that you can pass an UpdateRequest to do an upsert.
>
> Thanks,
> Zach
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/connectors/elasticsearch.html
> [2]
> https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/connectors/elasticsearch2.html
> [3]
> https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/connectors/elasticsearch2/RequestIndexer.html
>
>
> On Mon, Mar 28, 2016 at 2:18 PM Madhukar Thota 
> wrote:
>
>> Is it possible to do Upsert with existing flink-elasticsearch connector
>> today?
>>
>


Re: Convert Datastream to Collector or List

2016-03-19 Thread Suneel Marthi
DataStream<T> ds = ...

Iterator<T> iter = DataStreamUtils.collect(ds);

List<T> list = Lists.newArrayList(iter);

Hope that helps.


On Wed, Mar 16, 2016 at 7:37 AM, Ahmed Nader 
wrote:

> Hi,
> I want to pass an object of type DataStream ,after applying map function
> on it, as a parameter to be used somewhere else. But when i do so, i get an
> error message of trying to access a null context object.
> Is there a way that i can convert this DataStream object to a list or a
> collector so as to be used somewhere else afterwards.
> Thanks,
> Ahmed
>


Re: Availability for the ElasticSearch 2 streaming connector

2016-02-18 Thread Suneel Marthi
Thanks Zach, I have a few minor changes too locally; I'll push a PR out
tomorrow that has your changes too.

On Wed, Feb 17, 2016 at 5:13 PM, Zach Cox <zcox...@gmail.com> wrote:

> I recently did exactly what Robert described: I copied the code from this
> (closed) PR https://github.com/apache/flink/pull/1479, modified it a bit,
> and just included it in my own project that uses the Elasticsearch 2 java
> api. Seems to work well. Here are the files so you can do the same:
>
> https://gist.github.com/zcox/59e486be7aeeca381be0
>
> -Zach
>
>
> On Wed, Feb 17, 2016 at 4:06 PM Suneel Marthi <suneel.mar...@gmail.com>
> wrote:
>
>> Hey I missed this thread, sorry about that.
>>
>> I have a basic connector working with ES 2.0 which I can push out.  Its
>> not optimized yet and I don't have the time to look at it, if someone would
>> like to take it over go ahead I can send a PR.
>>
>> On Wed, Feb 17, 2016 at 4:57 PM, Robert Metzger <rmetz...@apache.org>
>> wrote:
>>
>>> Hi Mihail,
>>>
>>> It seems that nobody is actively working on the elasticsearch2 connector
>>> right now. The 1.0.0 release is already feature frozen, only bug fixes or
>>> (some) pending pull requests go in.
>>>
>>> What you can always do is copy the code from our current elasticsearch
>>> connector, set the dependency to the version you would like to use and
>>> adopt our code to their API changes. I think it might take not much time to
>>> get it working.
>>> (The reason why we usually need more time for stuff like this are
>>> integration tests and documentation).
>>>
>>> Please let me know if that solution doesn't work for you.
>>>
>>> Regards,
>>> Robert
>>>
>>>
>>> On Tue, Feb 16, 2016 at 2:53 PM, Vieru, Mihail <mihail.vi...@zalando.de>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> in reference to this ticket
>>>> https://issues.apache.org/jira/browse/FLINK-3115 when do you think
>>>> that an ElasticSearch 2 streaming connector will become available? Will it
>>>> make it for the 1.0 release?
>>>>
>>>> That would be great, as we are planning to use that particular version
>>>> of ElasticSearch in the very near future.
>>>>
>>>> Best regards,
>>>> Mihail
>>>>
>>>
>>>
>>


Re: Availability for the ElasticSearch 2 streaming connector

2016-02-17 Thread Suneel Marthi
Hey I missed this thread, sorry about that.

I have a basic connector working with ES 2.0 which I can push out. It's not
optimized yet and I don't have the time to look at it; if someone would
like to take it over, go ahead and I can send a PR.

On Wed, Feb 17, 2016 at 4:57 PM, Robert Metzger  wrote:

> Hi Mihail,
>
> It seems that nobody is actively working on the elasticsearch2 connector
> right now. The 1.0.0 release is already feature frozen, only bug fixes or
> (some) pending pull requests go in.
>
> What you can always do is copy the code from our current elasticsearch
> connector, set the dependency to the version you would like to use and
> adopt our code to their API changes. I think it might take not much time to
> get it working.
> (The reason why we usually need more time for stuff like this are
> integration tests and documentation).
>
> Please let me know if that solution doesn't work for you.
>
> Regards,
> Robert
>
>
> On Tue, Feb 16, 2016 at 2:53 PM, Vieru, Mihail 
> wrote:
>
>> Hi,
>>
>> in reference to this ticket
>> https://issues.apache.org/jira/browse/FLINK-3115 when do you think that
>> an ElasticSearch 2 streaming connector will become available? Will it make
>> it for the 1.0 release?
>>
>> That would be great, as we are planning to use that particular version of
>> ElasticSearch in the very near future.
>>
>> Best regards,
>> Mihail
>>
>
>


Re: Reading Binary Data (Matrix) with Flink

2016-01-24 Thread Suneel Marthi
There should be an env.readBinaryFile() IIRC, check that

Sent from my iPhone

> On Jan 24, 2016, at 12:44 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
> 
> Thank you for the response on this, but I still have some doubt. Simply, the 
> files is not in HDFS, it's in local storage. In Flink if I run the program 
> with, say 5 parallel tasks, what I would like to do is to read a block of 
> rows in each task as shown below. I looked at the simple CSV reader and was 
> thinking to create a custom one like that, but I would need to know the task 
> number to read the relevant block. Is this possible?
> 
> 
> 
> Thank you,
> Saliya
> 
>> On Wed, Jan 20, 2016 at 12:47 PM, Till Rohrmann <trohrm...@apache.org> wrote:
>> With readHadoopFile you can use all of Hadoop’s FileInputFormats and thus 
>> you can also do everything with Flink, what you can do with Hadoop. Simply 
>> take the same Hadoop FileInputFormat which you would take for your MapReduce 
>> job.
>> 
>> Cheers,
>> Till
>> 
>> 
>>> On Wed, Jan 20, 2016 at 3:16 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>> Thank you, I saw the readHadoopFile, but I was not sure how it can be used 
>>> to the following, which is what I need. The logic of the code requires an 
>>> entire row to operate on, so in our current implementation with P tasks, 
>>> each of them will read a rectangular block of (N/P) x N from the matrix. Is 
>>> this possible with readHadoopFile? Also, the file may not be in hdfs, so is 
>>> it possible to refer to local disk in doing this?
>>> 
>>> Thank you
>>> 
>>>> On Wed, Jan 20, 2016 at 1:31 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>>>> Hi Saliya,
>>>> 
>>>> You can use the input format from Hadoop in Flink by using readHadoopFile 
>>>> method. The method returns a dataset which of type is Tuple2<Key, Value>. 
>>>> Note that MapReduce equivalent transformation in Flink is composed of map, 
>>>> groupBy, and reduceGroup.
>>>> 
>>>> > On Jan 20, 2016, at 3:04 PM, Suneel Marthi <smar...@apache.org> wrote:
>>>> >
>>>> > Guess u r looking for Flink's BinaryInputFormat to be able to read 
>>>> > blocks of data from HDFS
>>>> >
>>>> > https://ci.apache.org/projects/flink/flink-docs-release-0.10/api/java/org/apache/flink/api/common/io/BinaryInputFormat.html
>>>> >
>>>> > On Wed, Jan 20, 2016 at 12:45 AM, Saliya Ekanayake <esal...@gmail.com> 
>>>> > wrote:
>>>> > Hi,
>>>> >
>>>> > I am trying to use Flink perform a parallel batch operation on a NxN 
>>>> > matrix represented as a binary file. Each (i,j) element is stored as a 
>>>> > Java Short value. In a typical MapReduce programming with Hadoop, each 
>>>> > map task will read a block of rows of this matrix and perform 
>>>> > computation on that block and emit result to the reducer.
>>>> >
>>>> > How is this done in Flink? I am new to Flink and couldn't find a binary 
>>>> > reader so far. Any help is greatly appreciated.
>>>> >
>>>> > Thank you,
>>>> > Saliya
>>>> >
>>>> > --
>>>> > Saliya Ekanayake
>>>> > Ph.D. Candidate | Research Assistant
>>>> > School of Informatics and Computing | Digital Science Center
>>>> > Indiana University, Bloomington
>>>> > Cell 812-391-4914
>>>> > http://saliya.org
>>>> >
>>>> 
>>>> Regards,
>>>> Chiwan Park
>>> 
>>> 
>>> 
>>> -- 
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> Cell 812-391-4914
>>> http://saliya.org
> 
> 
> 
> -- 
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org


Re: Using Hadoop Input/Output formats

2015-11-24 Thread Suneel Marthi
I guess it makes sense to add readHadoopXXX() methods to
StreamExecutionEnvironment (for feature parity with what presently exists
in ExecutionEnvironment).

Also, FLINK-2949 addresses the need to add the relevant syntactic-sugar wrappers
in the DataSet API for the code snippet in Fabian's previous email. It's not
great having to instantiate a JobConf in client code and having to pass
that around.
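For reference, the Java counterpart of the Scala snippet Fabian shows below looks roughly like this (mapred API; class locations as of the 0.10/1.0 line, and the input path is a placeholder):

import org.apache.flink.api.java.hadoop.mapred.HadoopInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Wrap Hadoop's TextInputFormat and point it at an input directory.
HadoopInputFormat<LongWritable, Text> hadoopIF =
        new HadoopInputFormat<>(new TextInputFormat(), LongWritable.class, Text.class, new JobConf());
FileInputFormat.addInputPath(hadoopIF.getJobConf(), new Path("hdfs:///input"));

DataStream<Tuple2<LongWritable, Text>> textData = env.createInput(hadoopIF);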



On Tue, Nov 24, 2015 at 2:26 PM, Fabian Hueske  wrote:

> Hi Nick,
>
> you can use Flink's HadoopInputFormat wrappers also for the DataStream
> API. However, DataStream does not offer as much "sugar" as DataSet because
> StreamEnvironment does not offer dedicated createHadoopInput or
> readHadoopFile methods.
>
> In DataStream Scala you can read from a Hadoop InputFormat
> (TextInputFormat in this case) as follows:
>
> val textData: DataStream[(LongWritable, Text)] = env.createInput(
>   new HadoopInputFormat[LongWritable, Text](
> new TextInputFormat,
> classOf[LongWritable],
> classOf[Text],
> new JobConf()
> ))
>
> The Java version is very similar.
>
> Note: Flink has wrappers for both MR APIs: mapred and mapreduce.
>
> Cheers,
> Fabian
>
> 2015-11-24 19:36 GMT+01:00 Chiwan Park :
>
>> I’m not streaming expert. AFAIK, the layer can be used with only DataSet.
>> There are some streaming-specific features such as distributed snapshot in
>> Flink. These need some supports of source and sink. So you have to
>> implement I/O.
>>
>> > On Nov 25, 2015, at 3:22 AM, Nick Dimiduk  wrote:
>> >
>> > I completely missed this, thanks Chiwan. Can these be used with
>> DataStreams as well as DataSets?
>> >
>> > On Tue, Nov 24, 2015 at 10:06 AM, Chiwan Park 
>> wrote:
>> > Hi Nick,
>> >
>> > You can use Hadoop Input/Output Format without modification! Please
>> check the documentation[1] in Flink homepage.
>> >
>> > [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/hadoop_compatibility.html
>> >
>> > > On Nov 25, 2015, at 3:04 AM, Nick Dimiduk 
>> wrote:
>> > >
>> > > Hello,
>> > >
>> > > Is it possible to use existing Hadoop Input and OutputFormats with
>> Flink? There's a lot of existing code that conforms to these interfaces,
>> seems a shame to have to re-implement it all. Perhaps some adapter shim..?
>> > >
>> > > Thanks,
>> > > Nick
>> >
>> > Regards,
>> > Chiwan Park
>> >
>> >
>>
>> Regards,
>> Chiwan Park
>>
>>
>>
>>
>


Re: Powered by Flink

2015-10-19 Thread Suneel Marthi
+1 to this.

On Mon, Oct 19, 2015 at 3:00 PM, Fabian Hueske  wrote:

> Sounds good +1
>
> 2015-10-19 14:57 GMT+02:00 Márton Balassi :
>
> > Thanks for starting and big +1 for making it more prominent.
> >
> > On Mon, Oct 19, 2015 at 2:53 PM, Fabian Hueske 
> wrote:
> >
> >> Thanks for starting this Kostas.
> >>
> >> I think the list is quite hidden in the wiki. Should we link from
> >> flink.apache.org to that page?
> >>
> >> Cheers, Fabian
> >>
> >> 2015-10-19 14:50 GMT+02:00 Kostas Tzoumas :
> >>
> >>> Hi everyone,
> >>>
> >>> I started a "Powered by Flink" wiki page, listing some of the
> >>> organizations that are using Flink:
> >>>
> >>> https://cwiki.apache.org/confluence/display/FLINK/Powered+by+Flink
> >>>
> >>> If you would like to be added to the list, just send me a short email
> >>> with your organization's name and a description and I will add you to
> the
> >>> wiki page.
> >>>
> >>> Best,
> >>> Kostas
> >>>
> >>
> >>
> >
>


Re: setSlotSharing NPE: Starting a stream consumer in a thread

2015-10-03 Thread Suneel Marthi
While on that, Marton, would it make sense to have a
dataStream.writeAsJson() method?

On Sat, Oct 3, 2015 at 11:54 PM, Márton Balassi 
wrote:

> Hi Jay,
>
> As for the NPE: the file monitoring function throws it when the location
> is empty. Try running the datagenerator first! :) This behaviour is
> unwanted though, I am adding a JIRA ticket for it.
>
> Best,
>
> Marton
>
> On Sun, Oct 4, 2015 at 5:28 AM, Márton Balassi 
> wrote:
>
>> Hi Jay,
>>
>> Creating a batch and a streaming environment in a single Java source file
>> is fine, they just run separately. (If you run it from an IDE locally they
>> might conflict as the second one would try to launch a local executor on a
>> port that is most likely already taken by the first one.) I would suggest
>> to have these jobs in separate files currently, exactly for the previous
>> reason.
>>
>> Looking at your
>> code 
>> ExecutionEnvironment.getExecutionEnvironment().registerTypeWithKryoSerializer(com.github.rnowling.bps.datagenerator.datamodels.Product.class,
>> new FlinkBPSGenerator.ProductSerializer()); does not do much good for you.
>> You need to register your serializers to the environment to which you are
>> using. Currently you would need to register it to the streaming env
>> variable. If you would like to also assemble a batch job you need to add
>> them there too.
>>
>> As for the streaming job I assume that you are using Flink version 0.9.1
>> and checking out the problem shortly.
>>
>> Best,
>>
>> Marton
>>
>> On Sun, Oct 4, 2015 at 3:37 AM, jay vyas 
>> wrote:
>>
>>> Here is a distilled example of the issue, should be easier to debug for
>>> folks interested... :)
>>>
>>> public static void main(String[] args) {
>>>
>>>
>>> 
>>> ExecutionEnvironment.getExecutionEnvironment().registerTypeWithKryoSerializer(com.github.rnowling.bps.datagenerator.datamodels.Product.class,
>>>  new FlinkBPSGenerator.ProductSerializer());
>>> 
>>> ExecutionEnvironment.getExecutionEnvironment().registerTypeWithKryoSerializer(com.github.rnowling.bps.datagenerator.datamodels.Transaction.class,
>>>  new FlinkBPSGenerator.TransactionSerializer());
>>>
>>> final StreamExecutionEnvironment env = 
>>> StreamExecutionEnvironment.getExecutionEnvironment();
>>>
>>> //when running "env.execute" this stream should start consuming...
>>> DataStream dataStream = env.readFileStream("/tmp/a", 1000, 
>>> FileMonitoringFunction.WatchType.ONLY_NEW_FILES);
>>> dataStream.iterate().map(new MapFunction() {
>>> public String map(String value) throws Exception {
>>> System.out.println(value);
>>> return ">>> > > > > " + value + " < < < <  <<<";
>>> }
>>> });
>>> try {
>>> env.execute();
>>> }
>>> catch(Exception e){
>>> e.printStackTrace();
>>> }
>>> }
>>>
>>>
>>>
>>> On Sat, Oct 3, 2015 at 9:08 PM, jay vyas 
>>> wrote:
>>>
 Hi flink !

 Looks like "setSlotSharing" is throwing an NPE when I try to start a
 Thread  which runs a streaming job.

 I'm trying to do this by creating a dataStream from env.readFileStream,
 and then later starting a job which writes files out ...

 However, I get

 Exception in thread "main" java.lang.NullPointerException
 at
 org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.setSlotSharing(StreamingJobGraphGenerator.java:361)

 What is the right way to create a stream and batch job all in one
 environment?

 For reference, here is a gist of the code
 https://gist.github.com/jayunit100/c7ab61d1833708d290df, and the
 offending line is the

 DataStream dataStream = env.readFileStream("/tmp/a",1000, 
 FileMonitoringFunction.WatchType.ONLY_NEW_FILES);

 line.

 Thanks again !

 --
 jay vyas

>>>
>>>
>>>
>>> --
>>> jay vyas
>>>
>>
>>
>