Re: WELCOME to user@flink.apache.org

2015-06-04 Thread Hawin Jiang
Hi Admin

Please let me know whether you have received my email.
Thanks.



Best regards
Hawin Jiang

On Thu, Jun 4, 2015 at 10:26 AM,  wrote:

> Hi! This is the ezmlm program. I'm managing the
> user@flink.apache.org mailing list.
>
> Acknowledgment: I have added the address
>
>hawin.ji...@gmail.com
>
> to the user mailing list.
>
> Welcome to user@flink.apache.org!
>
> Please save this message so that you know the address you are
> subscribed under, in case you later want to unsubscribe or change your
> subscription address.
>
>
> --- Administrative commands for the user list ---
>
> I can handle administrative requests automatically. Please
> do not send them to the list address! Instead, send
> your message to the correct command address:
>
> To subscribe to the list, send a message to:
>
>
> To remove your address from the list, send a message to:
>
>
> Send mail to the following for info and FAQ for this list:
>
>
>
> Similar addresses exist for the digest list:
>
>
>
> To get messages 123 through 145 (a maximum of 100 per request), mail:
>
>
> To get an index with subject and author for messages 123-456 , mail:
>
>
> They are always returned as sets of 100, max 2000 per request,
> so you'll actually get 100-499.
>
> To receive all messages with the same subject as message 12345,
> send a short message to:
>
>
> The messages should contain one line or word of text to avoid being
> treated as sp@m, but I will ignore their content.
> Only the ADDRESS you send to is important.
>
> You can start a subscription for an alternate address,
> for example "john@host.domain", just add a hyphen and your
> address (with '=' instead of '@') after the command word:
> 
>
> To stop subscription for this address, mail:
> 
>
> In both cases, I'll send a confirmation message to that address. When
> you receive it, simply reply to it to complete your subscription.
>
> If despite following these instructions, you do not get the
> desired results, please contact my owner at
> user-ow...@flink.apache.org. Please be patient, my owner is a
> lot slower than I am ;-)
>
> --- Enclosed is a copy of the request I received.
>
> Return-Path: 
> Received: (qmail 97112 invoked by uid 99); 4 Jun 2015 17:26:27 -
> Received: from Unknown (HELO spamd1-us-west.apache.org) (209.188.14.142)
> by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 04 Jun 2015 17:26:27
> +
> Received: from localhost (localhost [127.0.0.1])
> by spamd1-us-west.apache.org (ASF Mail Server at
> spamd1-us-west.apache.org) with ESMTP id 47645CB606
> for  gmail@flink.apache.org>; Thu,  4 Jun 2015 17:26:27 + (UTC)
> X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org
> X-Spam-Flag: NO
> X-Spam-Score: 2.88
> X-Spam-Level: **
> X-Spam-Status: No, score=2.88 tagged_above=-999 required=6.31
> tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
> HTML_MESSAGE=3, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01,
> SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled
> Authentication-Results: spamd1-us-west.apache.org (amavisd-new);
> dkim=pass (2048-bit key) header.d=gmail.com
> Received: from mx1-us-west.apache.org ([10.40.0.8])
> by localhost (spamd1-us-west.apache.org [10.40.0.7])
> (amavisd-new, port 10024)
> with ESMTP id euvYYaufNvU8
> for  gmail@flink.apache.org>;
> Thu,  4 Jun 2015 17:26:22 + (UTC)
> Received: from mail-ig0-f182.google.com (mail-ig0-f182.google.com
> [209.85.213.182])
> by mx1-us-west.apache.org (ASF Mail Server at
> mx1-us-west.apache.org) with ESMTPS id 3DACF275E3
> for  gmail@flink.apache.org>; Thu,  4 Jun 2015 17:26:22 + (UTC)
> Received: by igbzc4 with SMTP id zc4so14034401igb.0
> for  gmail@flink.apache.org>; Thu, 04 Jun 2015 10:26:21 -0700 (PDT)
> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
> d=gmail.com; s=20120113;
>
> h=mime-version:in-reply-to:references:date:message-id:subject:from:to
>  :content-type;
> bh=jCSYooUYuk4wmxJb795WdSFxPiK/LwY/6lOcqzPoBJo=;
>
> b=yu6myf9wL4fQ2HktgrhpKjBxUzfkzzPvVht1SNTDohJJSXXr8ye6R5SEGKVdQZhFrJ
>
>  KgUEsWgweS4xxPjeS1MwKe0Rf8M1gPBG79uM1098ZffPWkYA0jlVwCMHexxNO8SOCBce
>
>  H6VOGfRTIcDrUY+QyxPW1aPyeFJSNdUhlBgbhqYYWUDXeiXIB2PULWDGxnynMEnHAmJs
>
>  Fma93iNeEKoYdn38Lt1j/bxuUKFEyYNEJ6Nn6HtA/lpOGfJmwLqeN9HqTj/a2vzP/+1L
>
>  siCCCBkN/DLnnULiOXOCB2FryvVOmlF5io8bIJHM74FPTw5VwOgXLZmKfbMjyCEmsiB9
>  YT1Q==
> MIME-Version: 1.0
> X-Received: by 10.42.203.4 with

Re: WELCOME to user@flink.apache.org

2015-06-04 Thread Hawin Jiang
Cool

Here is my question:

Does Apache Flink have insert, update, and remove
operations?

For example, I have 10 million records
in my test file, and I want to add one record, update one record, and remove
one record from that test file.

How can I implement this with Flink?

Thanks.
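(A sketch, not from the thread, of one way to express this with the batch DataSet API: Flink does not edit files in place, so the job reads the existing file, transforms it, and writes a new output. The paths and record values below are placeholders.)

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class EditRecordsJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> records = env.readTextFile("hdfs:///tmp/test-file.txt");

        DataSet<String> result = records
                // "remove": drop the record we no longer want
                .filter(line -> !line.equals("record-to-remove"))
                // "update": rewrite the record that changed
                .map(line -> line.equals("record-to-update") ? "record-updated" : line)
                // "insert": append one new record
                .union(env.fromElements("record-to-insert"));

        // write the edited data set out as a new file
        result.writeAsText("hdfs:///tmp/test-file-updated.txt");
        env.execute("insert/update/remove as a batch transformation");
    }
}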


On Thu, Jun 4, 2015 at 10:44 AM, Robert Metzger  wrote:

> Yes, I've got this message.
>
> On Thu, Jun 4, 2015 at 7:42 PM, Hawin Jiang  wrote:
>
>> Hi Admin
>>
>> Please let me know if you are received my email or not.
>> Thanks.
>>
>>
>>
>> Best regards
>> Hawin Jiang
>>
>> On Thu, Jun 4, 2015 at 10:26 AM,  wrote:
>>
>>> Hi! This is the ezmlm program. I'm managing the
>>> user@flink.apache.org mailing list.
>>>
>>> Acknowledgment: I have added the address
>>>
>>>hawin.ji...@gmail.com
>>>
>>> to the user mailing list.
>>>
>>> Welcome to user@flink.apache.org!
>>>
>>> Please save this message so that you know the address you are
>>> subscribed under, in case you later want to unsubscribe or change your
>>> subscription address.
>>>
>>>
>>> --- Administrative commands for the user list ---
>>>
>>> I can handle administrative requests automatically. Please
>>> do not send them to the list address! Instead, send
>>> your message to the correct command address:
>>>
>>> To subscribe to the list, send a message to:
>>>
>>>
>>> To remove your address from the list, send a message to:
>>>
>>>
>>> Send mail to the following for info and FAQ for this list:
>>>
>>>
>>>
>>> Similar addresses exist for the digest list:
>>>
>>>
>>>
>>> To get messages 123 through 145 (a maximum of 100 per request), mail:
>>>
>>>
>>> To get an index with subject and author for messages 123-456 , mail:
>>>
>>>
>>> They are always returned as sets of 100, max 2000 per request,
>>> so you'll actually get 100-499.
>>>
>>> To receive all messages with the same subject as message 12345,
>>> send a short message to:
>>>
>>>
>>> The messages should contain one line or word of text to avoid being
>>> treated as sp@m, but I will ignore their content.
>>> Only the ADDRESS you send to is important.
>>>
>>> You can start a subscription for an alternate address,
>>> for example "john@host.domain", just add a hyphen and your
>>> address (with '=' instead of '@') after the command word:
>>> 
>>>
>>> To stop subscription for this address, mail:
>>> 
>>>
>>> In both cases, I'll send a confirmation message to that address. When
>>> you receive it, simply reply to it to complete your subscription.
>>>
>>> If despite following these instructions, you do not get the
>>> desired results, please contact my owner at
>>> user-ow...@flink.apache.org. Please be patient, my owner is a
>>> lot slower than I am ;-)
>>>
>>> --- Enclosed is a copy of the request I received.
>>>
>>> Return-Path: 
>>> Received: (qmail 97112 invoked by uid 99); 4 Jun 2015 17:26:27 -
>>> Received: from Unknown (HELO spamd1-us-west.apache.org) (209.188.14.142)
>>> by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 04 Jun 2015 17:26:27
>>> +
>>> Received: from localhost (localhost [127.0.0.1])
>>> by spamd1-us-west.apache.org (ASF Mail Server at
>>> spamd1-us-west.apache.org) with ESMTP id 47645CB606
>>> for >> gmail@flink.apache.org>; Thu,  4 Jun 2015 17:26:27 + (UTC)
>>> X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org
>>> X-Spam-Flag: NO
>>> X-Spam-Score: 2.88
>>> X-Spam-Level: **
>>> X-Spam-Status: No, score=2.88 tagged_above=-999 required=6.31
>>> tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
>>> HTML_MESSAGE=3, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01,
>>> SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled
>>> Authentication-Results: spamd1-us-west.apache.org (amavisd-new);
>>> dkim=pass (2048-bit key) header.d=gmail.com
>>> Received: from mx1-us-west.apache.org ([10.40.0.8])
>>> by localhost (spamd1-us-west.apache.org [10.40.0.7])
>>> (amavisd-new, port 10024)
>>> with ESM

Re: Apache Flink transactions

2015-06-05 Thread Hawin Jiang
Thanks all

Actually, I want to know more about Flink SQL and Flink performance.
Here is the Spark benchmark; maybe you have already seen it before:
https://amplab.cs.berkeley.edu/benchmark/

Thanks.



Best regards
Hawin



On Fri, Jun 5, 2015 at 1:35 AM, Fabian Hueske  wrote:

> If you want to append data to a data set that is stored as files (e.g., on
> HDFS), you can go for a directory structure as follows:
>
> dataSetRootFolder
>   - part1
> - 1
> - 2
> - ...
>   - part2
> - 1
> - ...
>   - partX
>
> Flink's file format supports recursive directory scans such that you can
> add new subfolders to dataSetRootFolder and read the full data set.
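(A minimal sketch, not from the thread, of reading such a nested layout back as a single data set, assuming the DataSet API's "recursive.file.enumeration" input-format parameter; the path is a placeholder.)

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

Configuration parameters = new Configuration();
// also descend into part1/, part2/, ... below the root folder
parameters.setBoolean("recursive.file.enumeration", true);

DataSet<String> fullDataSet = env
        .readTextFile("hdfs:///data/dataSetRootFolder")
        .withParameters(parameters);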
>
> 2015-06-05 9:58 GMT+02:00 Aljoscha Krettek :
>
>> Hi,
>> I think the example could be made more concise by using the Table API.
>> http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html
>>
>> Please let us know if you have questions about that, it is still quite
>> new.
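(For a flavour of what that looks like, a small sketch modeled on the word-count example in the Table API docs linked above; the WC POJO and field names are illustrative, and the method names follow the 0.9-era Java Table API as described there, so treat them as assumptions.)

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.table.TableEnvironment;
import org.apache.flink.api.table.Table;

public class TableApiSketch {

    // simple POJO with public fields so the Table API can address them by name
    public static class WC {
        public String word;
        public int count;
        public WC() {}
        public WC(String word, int count) { this.word = word; this.count = count; }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        TableEnvironment tableEnv = new TableEnvironment();

        DataSet<WC> input = env.fromElements(
                new WC("Hello", 1), new WC("Ciao", 1), new WC("Hello", 2));

        // filter and project with expression strings instead of hand-written functions
        Table filtered = tableEnv.fromDataSet(input)
                .filter("count > 1")
                .select("word, count");

        DataSet<WC> result = tableEnv.toDataSet(filtered, WC.class);
        result.print();
    }
}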
>>
>> On Fri, Jun 5, 2015 at 9:03 AM, hawin  wrote:
>> > Hi Aljoscha
>> >
>> > Thanks for your reply.
>> > Do you have any tips for Flink SQL?
>> > I know that Spark supports the ORC format. How about Flink SQL?
>> > BTW, you have implemented the TPCHQuery10 example in 231 lines of
>> code.
>> > How can that be made as simple as possible with Flink?
>> > I am going to use Flink in my future project.  Sorry for so many
>> questions.
>> > I believe that you guys will make a world difference.
>> >
>> >
>> > @Chiwan
>> > You made a very good example for me.
>> > Thanks a lot
>> >
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1494.html
>> > Sent from the Apache Flink User Mailing List archive. mailing list
>> archive at Nabble.com.
>>
>
>


Best wishes for Kostas Tzoumas and Robert Metzger

2015-06-07 Thread Hawin Jiang
Hi All

 

As you know, Kostas Tzoumas and Robert Metzger will give two Flink
talks at the 2015 Hadoop Summit.

That is an excellent opportunity to introduce Apache Flink to the world. 

Best wishes for Kostas Tzoumas and Robert Metzger.

 

 

Here is the detailed info:

 

Topic: Apache Flink deep-dive

Time: 1:45pm - 2:25pm 2015/06/10

Speakers: Kostas Tzoumas and Robert Metzger

 

Topic: Flexible and Real-time Stream Processing with Apache Flink

Time: 3:10pm - 3:50pm 2015/06/11

Speakers: Kostas Tzoumas and Robert Metzger

 

 

 

 

 

Best regards

Hawin



Re: Apache Flink transactions

2015-06-08 Thread Hawin Jiang
Hi Aljoscha

I want to know what Apache Flink's performance would be if I ran the same SQL
as below.
Do you have any Apache Flink benchmark information,
such as https://amplab.cs.berkeley.edu/benchmark/?
Thanks.



SELECT pageURL, pageRank FROM rankings WHERE pageRank > X

Query 1A: 32,888 results
Query 1B: 3,331,851 results
Query 1C: 89,974,976 results

[Chart pasted from the benchmark page: median response times in seconds for
Redshift (HDD), Impala 1.2.3 (disk and in-memory), Shark 0.8.1 (disk and
in-memory), Hive 0.12 on YARN, and Tez 0.2.0 across the three queries.]
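(For reference, a sketch, not from the thread, of that scan query expressed with Flink's DataSet API; the input path, the (pageURL, pageRank) CSV layout, and the threshold X are assumptions for illustration.)

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class RankingsScan {
    public static void main(String[] args) throws Exception {
        final int x = 1000;  // the "X" in: WHERE pageRank > X
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // rankings as (pageURL, pageRank) pairs
        DataSet<Tuple2<String, Integer>> rankings = env
                .readCsvFile("hdfs:///benchmark/rankings")
                .types(String.class, Integer.class);

        // SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
        DataSet<Tuple2<String, Integer>> result = rankings.filter(r -> r.f1 > x);

        result.writeAsCsv("hdfs:///benchmark/query1-result");
        env.execute("benchmark scan query");
    }
}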


On Mon, Jun 8, 2015 at 2:03 AM, Aljoscha Krettek 
wrote:

> Hi,
> actually, what do you want to know about Flink SQL?
>
> Aljoscha
>
> On Sat, Jun 6, 2015 at 2:22 AM, Hawin Jiang  wrote:
> > Thanks all
> >
> > Actually, I want to know more info about Flink SQL and Flink performance
> > Here is the Spark benchmark. Maybe you already saw it before.
> > https://amplab.cs.berkeley.edu/benchmark/
> >
> > Thanks.
> >
> >
> >
> > Best regards
> > Hawin
> >
> >
> >
> > On Fri, Jun 5, 2015 at 1:35 AM, Fabian Hueske  wrote:
> >>
> >> If you want to append data to a data set that is store as files (e.g.,
> on
> >> HDFS), you can go for a directory structure as follows:
> >>
> >> dataSetRootFolder
> >>   - part1
> >> - 1
> >> - 2
> >> - ...
> >>   - part2
> >> - 1
> >> - ...
> >>   - partX
> >>
> >> Flink's file format supports recursive directory scans such that you can
> >> add new subfolders to dataSetRootFolder and read the full data set.
> >>
> >> 2015-06-05 9:58 GMT+02:00 Aljoscha Krettek :
> >>>
> >>> Hi,
> >>> I think the example could be made more concise by using the Table API.
> >>> http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html
> >>>
> >>> Please let us know if you have questions about that, it is still quite
> >>> new.
> >>>
> >>> On Fri, Jun 5, 2015 at 9:03 AM, hawin  wrote:
> >>> > Hi Aljoscha
> >>> >
> >>> > Thanks for your reply.
> >>> > Do you have any tips for Flink SQL.
> >>> > I know that Spark support ORC format. How about Flink SQL?
> >>> > BTW, for TPCHQuery10 example, you have implemented it by 231 lines of
> >>> > code.
> >>> > How to make that as simple as possible by flink.
> >>> > I am going to use Flink in my future project.  Sorry for so many
> >>> > questions.
> >>> > I believe that you guys will make a world difference.
> >>> >
> >>> >
> >>> > @Chiwan
> >>> > You made a very good example for me.
> >>> > Thanks a lot
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> > View this message in context:
> >>> >
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1494.html
> >>> > Sent from the Apache Flink User Mailing List archive. mailing list
> >>> > archive at Nabble.com.
> >>
> >>
> >
>


Re: Apache Flink transactions

2015-06-09 Thread Hawin Jiang
On Tue, Jun 9, 2015 at 2:29 AM, Aljoscha Krettek 
wrote:

> Hi,
> we don't have any current performance numbers. But the queries mentioned
> on the benchmark page should be easy to implement in Flink. It could be
> interesting if someone ported these queries and ran them with exactly the
> same data on the same machines.
>
> Bill Sparks wrote on the mailing list some days ago (
> http://mail-archives.apache.org/mod_mbox/flink-user/201506.mbox/%3cd1972778.64426%25jspa...@cray.com%3e).
> He seems to be running some tests to compare Flink, Spark and MapReduce.
>
> Regards,
> Aljoscha
>
> On Mon, Jun 8, 2015 at 9:09 PM, Hawin Jiang  wrote:
>
>> Hi Aljoscha
>>
>> I want to know what is the apache flink performance if I run the same SQL
>> as below.
>> Do you have any apache flink benchmark information?
>> Such as: https://amplab.cs.berkeley.edu/benchmark/
>> Thanks.
>>
>>
>>
>> SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
>>
>> Query 1A: 32,888 results
>> Query 1B: 3,331,851 results
>> Query 1C: 89,974,976 results
>>
>> [Chart pasted from the benchmark page: median response times in seconds for
>> Redshift (HDD), Impala 1.2.3 (disk and in-memory), Shark 0.8.1 (disk and
>> in-memory), Hive 0.12 on YARN, and Tez 0.2.0 across the three queries.]
>>
>>
>> On Mon, Jun 8, 2015 at 2:03 AM, Aljoscha Krettek 
>> wrote:
>>
>>> Hi,
>>> actually, what do you want to know about Flink SQL?
>>>
>>> Aljoscha
>>>
>>> On Sat, Jun 6, 2015 at 2:22 AM, Hawin Jiang 
>>> wrote:
>>> > Thanks all
>>> >
>>> > Actually, I want to know more info about Flink SQL and Flink
>>> performance
>>> > Here is the Spark benchmark. Maybe you already saw it before.
>>> > https://amplab.cs.berkeley.edu/benchmark/
>>> >
>>> > Thanks.
>>> >
>>> >
>>> >
>>> > Best regards
>>> > Hawin
>>> >
>>> >
>>> >
>>> > On Fri, Jun 5, 2015 at 1:35 AM, Fabian Hueske 
>>> wrote:
>>> >>
>>> >> If you want to append data to a data set that is store as files
>>> (e.g., on
>>> >> HDFS), you can go for a directory structure as follows:
>>> >>
>>> >> dataSetRootFolder
>>> >>   - part1
>>> >> - 1
>>> >> - 2
>>> >> - ...
>>> >>   - part2
>>> >> - 1
>>> >> - ...
>>> >>   - partX
>>> >>
>>> >> Flink's file format supports recursive directory scans such that you
>>> can
>>> >> add new subfolders to dataSetRootFolder and read the full data set.
>>> >>
>>> >> 2015-06-05 9:58 GMT+02:00 Aljoscha Krettek :
>>> >>>
>>> >>> Hi,
>>> >>> I think the example could be made more concise by using the Table
>>> API.
>>> >>>
>>> http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html
>>> >>>
>>> >>> Please let us know if you have questions about that, it is still
>>> quite
>>> >>> new.
>>> >>>
>>> >>> On Fri, Jun 5, 2015 at 9:03 AM, hawin  wrote:
>>> >>> > Hi Aljoscha
>>> >>> >
>>> >>> > Thanks for your reply.
>>> >>> > Do you have any tips for Flink SQL.
>>> >>> > I know that Spark support ORC format. How about Flink SQL?
>>> >>> > BTW, for TPCHQuery10 example, you have implemented it by 231 lines
>>> of
>>> >>> > code.
>>> >>> > How to make that as simple as possible by flink.
>>> >>> > I am going to use Flink in my future project.  Sorry for so many
>>> >>> > questions.
>>> >>> > I believe that you guys will make a world difference.
>>> >>> >
>>> >>> >
>>> >>> > @Chiwan
>>> >>> > You made a very good example for me.
>>> >>> > Thanks a lot
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > --
>>> >>> > View this message in context:
>>> >>> >
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1494.html
>>> >>> > Sent from the Apache Flink User Mailing List archive. mailing list
>>> >>> > archive at Nabble.com.
>>> >>
>>> >>
>>> >
>>>
>>
>>
>


Re: Apache Flink transactions

2015-06-09 Thread Hawin Jiang
Hey Aljoscha

I also sent an email to Bill asking for the latest test results. From
Bill's email, Apache Spark's performance looks better than Flink's.
What are your thoughts?



Best regards
Hawin



On Tue, Jun 9, 2015 at 2:29 AM, Aljoscha Krettek 
wrote:

> Hi,
> we don't have any current performance numbers. But the queries mentioned
> on the benchmark page should be easy to implement in Flink. It could be
> interesting if someone ported these queries and ran them with exactly the
> same data on the same machines.
>
> Bill Sparks wrote on the mailing list some days ago (
> http://mail-archives.apache.org/mod_mbox/flink-user/201506.mbox/%3cd1972778.64426%25jspa...@cray.com%3e).
> He seems to be running some tests to compare Flink, Spark and MapReduce.
>
> Regards,
> Aljoscha
>
> On Mon, Jun 8, 2015 at 9:09 PM, Hawin Jiang  wrote:
>
>> Hi Aljoscha
>>
>> I want to know what is the apache flink performance if I run the same SQL
>> as below.
>> Do you have any apache flink benchmark information?
>> Such as: https://amplab.cs.berkeley.edu/benchmark/
>> Thanks.
>>
>>
>>
>> SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
>>
>> Query 1A: 32,888 results
>> Query 1B: 3,331,851 results
>> Query 1C: 89,974,976 results
>>
>> [Chart pasted from the benchmark page: median response times in seconds for
>> Redshift (HDD), Impala 1.2.3 (disk and in-memory), Shark 0.8.1 (disk and
>> in-memory), Hive 0.12 on YARN, and Tez 0.2.0 across the three queries.]
>>
>>
>> On Mon, Jun 8, 2015 at 2:03 AM, Aljoscha Krettek 
>> wrote:
>>
>>> Hi,
>>> actually, what do you want to know about Flink SQL?
>>>
>>> Aljoscha
>>>
>>> On Sat, Jun 6, 2015 at 2:22 AM, Hawin Jiang 
>>> wrote:
>>> > Thanks all
>>> >
>>> > Actually, I want to know more info about Flink SQL and Flink
>>> performance
>>> > Here is the Spark benchmark. Maybe you already saw it before.
>>> > https://amplab.cs.berkeley.edu/benchmark/
>>> >
>>> > Thanks.
>>> >
>>> >
>>> >
>>> > Best regards
>>> > Hawin
>>> >
>>> >
>>> >
>>> > On Fri, Jun 5, 2015 at 1:35 AM, Fabian Hueske 
>>> wrote:
>>> >>
>>> >> If you want to append data to a data set that is store as files
>>> (e.g., on
>>> >> HDFS), you can go for a directory structure as follows:
>>> >>
>>> >> dataSetRootFolder
>>> >>   - part1
>>> >> - 1
>>> >> - 2
>>> >> - ...
>>> >>   - part2
>>> >> - 1
>>> >> - ...
>>> >>   - partX
>>> >>
>>> >> Flink's file format supports recursive directory scans such that you
>>> can
>>> >> add new subfolders to dataSetRootFolder and read the full data set.
>>> >>
>>> >> 2015-06-05 9:58 GMT+02:00 Aljoscha Krettek :
>>> >>>
>>> >>> Hi,
>>> >>> I think the example could be made more concise by using the Table
>>> API.
>>> >>>
>>> http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html
>>> >>>
>>> >>> Please let us know if you have questions about that, it is still
>>> quite
>>> >>> new.
>>> >>>
>>> >>> On Fri, Jun 5, 2015 at 9:03 AM, hawin  wrote:
>>> >>> > Hi Aljoscha
>>> >>> >
>>> >>> > Thanks for your reply.
>>> >>> > Do you have any tips for Flink SQL.
>>> >>> > I know that Spark support ORC format. How about Flink SQL?
>>> >>> > BTW, for TPCHQuery10 example, you have implemented it by 231 lines
>>> of
>>> >>> > code.
>>> >>> > How to make that as simple as possible by flink.
>>> >>> > I am going to use Flink in my future project.  Sorry for so many
>>> >>> > questions.
>>> >>> > I believe that you guys will make a world difference.
>>> >>> >
>>> >>> >
>>> >>> > @Chiwan
>>> >>> > You made a very good example for me.
>>> >>> > Thanks a lot
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > --
>>> >>> > View this message in context:
>>> >>> >
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1494.html
>>> >>> > Sent from the Apache Flink User Mailing List archive. mailing list
>>> >>> > archive at Nabble.com.
>>> >>
>>> >>
>>> >
>>>
>>
>>
>


Best way to write data to HDFS by Flink

2015-06-10 Thread Hawin Jiang
Hi All

 

Can someone tell me the best way to write data to HDFS when Flink
receives data from Kafka?

Big thanks for your example.

 

 

 

 

Best regards

Hawin

 



Re: Best way to write data to HDFS by Flink

2015-06-10 Thread Hawin Jiang
Thanks Marton
I will use this code to implement my testing.



Best regards
Hawin

On Wed, Jun 10, 2015 at 1:30 AM, Márton Balassi 
wrote:

> Dear Hawin,
>
> You can pass a hdfs path to DataStream's and DataSet's writeAsText and
> writeAsCsv methods.
> I assume that you are running a Streaming topology, because your source is
> Kafka, so it would look like the following:
>
> StreamExecutionEnvironment env =
> StreamExecutionEnvironment.getExecutionEnvironment();
>
> env.addSource(new PersistentKafkaSource(..))
>   .map(/* do your operations */)
>
> .writeAsText("hdfs://<namenode>:<port>/path/to/your/file");
>
> Check out the relevant section of the streaming docs for more info. [1]
>
> [1]
> http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#connecting-to-the-outside-world
>
> Best,
>
> Marton
>
> On Wed, Jun 10, 2015 at 10:22 AM, Hawin Jiang 
> wrote:
>
>> Hi All
>>
>>
>>
>> Can someone tell me what is the best way to write data to HDFS when Flink
>> received data from Kafka?
>>
>> Big thanks for your example.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Best regards
>>
>> Hawin
>>
>>
>>
>
>


Re: Best wishes for Kostas Tzoumas and Robert Metzger

2015-06-10 Thread Hawin Jiang
Hi  Michels

I don't think you can watch them online now.

Can someone share their presentations or feedback with us?
Thanks



Best regards
Hawin

On Mon, Jun 8, 2015 at 2:34 AM, Maximilian Michels  wrote:

> Thank you for your kind wishes :) Good luck from me as well!
>
> I was just wondering, is it possible to stream the talks or watch them
> later on?
>
> On Mon, Jun 8, 2015 at 2:54 AM, Hawin Jiang  wrote:
>
>> Hi All
>>
>>
>>
>> As you know that Kostas Tzoumas and Robert Metzger will give us two Flink
>> talks on 2015 Hadoop summit.
>>
>> That is an excellent opportunity to introduce Apache Flink to the world.
>>
>> Best wishes for Kostas Tzoumas and Robert Metzger.
>>
>>
>>
>>
>>
>> Here is the details info:
>>
>>
>>
>> Topic: Apache Flink deep-dive
>>
>> Time: 1:45pm - 2:25pm 2015/06/10
>>
>> Speakers: Kostas Tzoumas and Robert Metzger
>>
>>
>>
>> Topic: Flexible and Real-time Stream Processing with Apache Flink
>>
>> Time: 3:10pm - 3:50pm 2015/06/11
>>
>> Speakers: Kostas Tzoumas and Robert Metzger
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Best regards
>>
>> Hawin
>>
>
>


Re: Best wishes for Kostas Tzoumas and Robert Metzger

2015-06-10 Thread Hawin Jiang
Hi Robert

Congratulations on your presentation. I have downloaded your slides.
Hopefully Flink can move forward quickly.



Best regards
Hawin

On Wed, Jun 10, 2015 at 10:14 PM, Robert Metzger 
wrote:

> Hi Hawin,
>
> here are the slides:
> http://www.slideshare.net/robertmetzger1/apache-flink-deepdive-hadoop-summit-2015-in-san-jose-ca
> Thank you for the wishes. The talk was very well received.
>
> On Wed, Jun 10, 2015 at 10:41 AM, Hawin Jiang 
> wrote:
>
>> Hi  Michels
>>
>> I don't think you can watch them online now.
>>
>> Can someone share their presentations or feedback to us?
>> Thanks
>>
>>
>>
>> Best regards
>> Hawin
>>
>> On Mon, Jun 8, 2015 at 2:34 AM, Maximilian Michels 
>> wrote:
>>
>>> Thank you for your kind wishes :) Good luck from me as well!
>>>
>>> I was just wondering, is it possible to stream the talks or watch them
>>> later on?
>>>
>>> On Mon, Jun 8, 2015 at 2:54 AM, Hawin Jiang 
>>> wrote:
>>>
>>>> Hi All
>>>>
>>>>
>>>>
>>>> As you know that Kostas Tzoumas and Robert Metzger will give us two
>>>> Flink talks on 2015 Hadoop summit.
>>>>
>>>> That is an excellent opportunity to introduce Apache Flink to the
>>>> world.
>>>>
>>>> Best wishes for Kostas Tzoumas and Robert Metzger.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Here is the details info:
>>>>
>>>>
>>>>
>>>> Topic: Apache Flink deep-dive
>>>>
>>>> Time: 1:45pm - 2:25pm 2015/06/10
>>>>
>>>> Speakers: Kostas Tzoumas and Robert Metzger
>>>>
>>>>
>>>>
>>>> Topic: Flexible and Real-time Stream Processing with Apache Flink
>>>>
>>>> Time: 3:10pm - 3:50pm 2015/06/11
>>>>
>>>> Speakers: Kostas Tzoumas and Robert Metzger
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Best regards
>>>>
>>>> Hawin
>>>>
>>>
>>>
>>
>


Kafka0.8.2.1 + Flink0.9.0 issue

2015-06-11 Thread Hawin Jiang
Hi All

 

I am preparing a Kafka and Flink performance test now.  To rule out my own
mistakes, I downloaded the Kafka example from http://kafka.apache.org/ and the
Flink streaming Kafka example from http://flink.apache.org

I have run both producer examples on the same cluster.  There were no issues
with the example from kafka.apache.org.

But I received the errors below when I ran the Apache Flink Kafka
producer.

I have also posted both pieces of code for your reference. Please take a look.

Thanks. 

 

Exception in thread "main" java.lang.Error: Unresolved compilation problems:


The import kafka.consumer cannot be resolved

The import kafka.consumer cannot be resolved

The import kafka.consumer cannot be resolved

The import kafka.consumer cannot be resolved

The import kafka.javaapi cannot be resolved

ConsumerConnector cannot be resolved to a type

ConsumerIterator cannot be resolved to a type

ConsumerConnector cannot be resolved to a type

Consumer cannot be resolved

ConsumerConfig cannot be resolved to a type

KafkaStream cannot be resolved to a type

ConsumerConnector cannot be resolved to a type

KafkaStream cannot be resolved to a type

KafkaStream cannot be resolved to a type

ConsumerConnector cannot be resolved to a type

ConsumerIterator cannot be resolved to a type

ConsumerIterator cannot be resolved to a type

ConsumerIterator cannot be resolved to a type

ConsumerConnector cannot be resolved to a type

ConsumerConnector cannot be resolved to a type

ConsumerConnector cannot be resolved to a type

 

at
org.apache.flink.streaming.connectors.kafka.api.KafkaSource.<init>(KafkaSource.java:26)

at
org.apache.flink.streaming.connectors.kafka.KafkaConsumerExample.main(KafkaConsumerExample.java:42)

 

 

 

Here is the Apache Flink example:

*Apache
Flink***

StreamExecutionEnvironment env =
        StreamExecutionEnvironment.createLocalEnvironment().setParallelism(4);

@SuppressWarnings({ "unused", "serial" })
DataStream<String> stream1 = env.addSource(new SourceFunction<String>() {

    public void run(Collector<String> collector) throws Exception {
        for (int i = 0; i < 20; i++) {
            collector.collect("message #" + i);
            Thread.sleep(100L);
        }

        collector.collect(new String("q"));
    }

    public void cancel() {
    }

}).addSink(
        new KafkaSink<String>(host + ":" + port, topic, new JavaDefaultStringSchema())
)
.setParallelism(3);

System.out.println(host + " " + port + " " + topic);

env.execute();

 

 

 

**Apache
Kafka***

public Producer(String topic)
{
    props.put("serializer.class", "kafka.serializer.StringEncoder");
    props.put("metadata.broker.list", "192.168.0.112:9092");
    // Use random partitioner. Don't need the key type. Just set it to Integer.
    // The message is of type String.
    producer = new kafka.javaapi.producer.Producer<Integer, String>(new ProducerConfig(props));
    this.topic = topic;
}

public void run() {
    int messageNo = 1;
    while (true)
    {
        String messageStr = new String("LA_" + messageNo);
        producer.send(new KeyedMessage<Integer, String>(topic, messageStr));
        messageNo++;
    }
}

 

 

 

 

Best regards

Hawin

 

 



RE: Kafka0.8.2.1 + Flink0.9.0 issue

2015-06-11 Thread Hawin Jiang
Dear Marton

 

Thanks for your support again.

I am running these examples in the same project, and I am using the Eclipse IDE to
submit them to my Flink cluster.

 

 

Here are my dependencies:

**

<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>0.9.0-milestone-1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients</artifactId>
        <version>0.9.0-milestone-1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-connectors</artifactId>
        <version>0.9.0-milestone-1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-core</artifactId>
        <version>0.9.0-milestone-1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.10</artifactId>
        <version>0.8.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>0.8.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-auth</artifactId>
        <version>2.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>1.2.1</version>
    </dependency>
</dependencies>

*

 

 

 

 

Best regards

Email: hawin.ji...@gmail.com

 

From: Márton Balassi [mailto:balassi.mar...@gmail.com] 
Sent: Thursday, June 11, 2015 12:58 AM
To: user@flink.apache.org
Subject: Re: Kafka0.8.2.1 + Flink0.9.0 issue

 

Dear Hawin,

 

This looks like a dependency issue, the java compiler does not find the kafka 
dependency. How are you trying to run this example? Is it from an IDE or 
submitting it to a flink cluster with bin/flink run? How do you define your 
dependencies, do you use maven or sbt for instance?

 

Best,

 

Marton

 

On Thu, Jun 11, 2015 at 9:43 AM, Hawin Jiang  wrote:

 



Re: Kafka0.8.2.1 + Flink0.9.0 issue

2015-06-11 Thread Hawin Jiang
Dear Marton

What do you mean by running it locally from Eclipse with 'Run'?
Do you want me to run it on the NameNode?
My NameNode does not have Kafka installed; I only installed Kafka on my data
node servers.
Do I need to install or copy the Kafka jar onto the NameNode? Actually, I don't
want to install everything on the NameNode server.
Please advise.
Thanks.


My Flink and Hadoop cluster info is as below:

Flink on the NameNode
Kafka, ZooKeeper, and Flink slave1 on DataNode1
Kafka, ZooKeeper, and Flink slave2 on DataNode2
Kafka, ZooKeeper, and Flink slave3 on DataNode3



On Thu, Jun 11, 2015 at 1:16 AM, Márton Balassi 
wrote:

> Dear Hawin,
>
> No problem, I am glad that you are giving our Kafka connector a try. :)
> The dependencies listed look good. Can you run the example locally from
> Eclipse with 'Run'? I suspect that maybe your Flink cluster does not have
> access to the Kafka dependency then.
>
> As a quick test you could copy the Kafka jars to the lib folder of your
> Flink distribution on all the machines in your cluster. Everything that is
> there goes to the classpath of Flink. Another workaround would be to build a
> fat jar for your project containing all the dependencies with 'mvn
> assembly:assembly'. Neither of these is beautiful, but it would help track
> down the root cause.
>
> On Thu, Jun 11, 2015 at 10:04 AM, Hawin Jiang 
> wrote:
>
>> Dear Marton
>>
>>
>>
>> Thanks for supporting again.
>>
>> I am running these examples at the same project and I am using Eclipse
>> IDE to submit it to my Flink cluster.
>>
>>
>>
>>
>>
>> Here are my dependencies:
>>
>> **
>>
>> junit:junit:4.12 (test scope)
>> org.apache.flink:flink-java:0.9.0-milestone-1
>> org.apache.flink:flink-clients:0.9.0-milestone-1
>> org.apache.flink:flink-streaming-connectors:0.9.0-milestone-1
>> org.apache.flink:flink-streaming-core:0.9.0-milestone-1
>> org.apache.kafka:kafka_2.10:0.8.2.1
>> org.apache.kafka:kafka-clients:0.8.2.1
>> org.apache.hadoop:hadoop-hdfs:2.6.0
>> org.apache.hadoop:hadoop-auth:2.6.0
>> org.apache.hadoop:hadoop-common:2.6.0
>> org.apache.hadoop:hadoop-core:1.2.1
>>
>> *
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Best regards
>>
>> Email: hawin.ji...@gmail.com
>>
>>
>>
>> *From:* Márton Balassi [mailto:balassi.mar...@gmail.com]
>> *Sent:* Thursday, June 11, 2015 12:58 AM
>> *To:* user@flink.apache.org
>> *Subject:* Re: Kafka0.8.2.1 + Flink0.9.0 issue
>>
>>
>>
>> Dear Hawin,
>>
>>
>>
>> This looks like a dependency issue, the java compiler does not find the
>> kafka dependency. How are you trying to run this example? Is it from an IDE
>> or submitting it to a flink cluster with bin/flink run? How do you define
>> your dependencies, do you use maven or sbt for instance?
>>
>>
>>
>> Best,
>>
>>
>>
>> Marton
>>
>>
>>
>> On Thu, Jun 11, 2015 at 9:43 AM, Hawin Jiang 
>> wrote:
>>
>>
>>
>
>


Re: Kafka0.8.2.1 + Flink0.9.0 issue

2015-06-22 Thread Hawin Jiang
Hi  Marton

Do I have to add the whole pom.xml file, or just the plugin below?
I saw that L286 to L296 in that pom.xml no longer contain the correct information.
Thanks.



<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <version>2.4</version>
    <configuration>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
    </configuration>
</plugin>

On Thu, Jun 11, 2015 at 1:43 AM, Márton Balassi 
wrote:

> As for locally I meant the machine that you use for development to see
> whether this works without parallelism. :-) No need to install stuff on
> your Namenode of course.
> Installing Kafka on a machine and having the Kafka Java dependencies
> available for Flink are two very different things. Try adding the following
> [1] to your Maven pom. Then execute 'mvn assembly:assembly'; this will
> produce a fat jar suffixed jar-with-dependencies.jar. You should be able
> to run the example from that.
>
> [1]
> https://github.com/mbalassi/flink-dataflow/blob/master/pom.xml#L286-296
>
> On Thu, Jun 11, 2015 at 10:32 AM, Hawin Jiang 
> wrote:
>
>> Dear Marton
>>
>> What do you meaning for locally Eclipse with 'Run'.
>> Do you want to me to run it on Namenode?
>> But my namenode didn't install Kafka.  I only installed Kafka on my data
>> node servers.
>> Do I need to install or copy Kafka jar on Namenode? Actually, I don't
>> want to install everything on Name node server.
>> Please advise me.
>> Thanks.
>>
>>
>> My Flink and Hadoop cluster info as below.
>>
>> Flink on NameNode
>> Kafka,Zookeeper and FLink slave1 on Datanode1
>> Kafka,Zookeeper ,and Flink slave2 on Datanode2
>> Kafka, Zookeeper and Flink slave3 on Datanode3
>>
>>
>>
>> On Thu, Jun 11, 2015 at 1:16 AM, Márton Balassi > > wrote:
>>
>>> Dear Hawin,
>>>
>>> No problem, I am gald that you are giving our Kafka connector a try. :)
>>> The dependencies listed look good. Can you run the example locally from
>>> Eclipse with 'Run'? I suspect that maybe your Flink cluster does not have
>>> the access to the kafka dependency then.
>>>
>>> As a quick test you could copy the kafka jars to the lib folder of your
>>> Flink distribution on all the machines in your cluster. Everything that is
>>> there goes to the classpath of Flink. Another workaround with be to build a
>>> fat jar for your project containing all the dependencies with 'mvn
>>> assembly:assembly'. Neither of these are beautiful but would help tracking
>>> down the root cause.
>>>
>>> On Thu, Jun 11, 2015 at 10:04 AM, Hawin Jiang 
>>> wrote:
>>>
>>>> Dear Marton
>>>>
>>>>
>>>>
>>>> Thanks for supporting again.
>>>>
>>>> I am running these examples at the same project and I am using Eclipse
>>>> IDE to submit it to my Flink cluster.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Here are my dependencies:
>>>>
>>>> **
>>>>
>>>> junit:junit:4.12 (test scope)
>>>> org.apache.flink:flink-java:0.9.0-milestone-1
>>>> org.apache.flink:flink-clients:0.9.0-milestone-1
>>>> org.apache.flink:flink-streaming-connectors:0.9.0-milestone-1
>>>> org.apache.flink:flink-streaming-core:0.9.0-milestone-1
>>>> org.apache.kafka:kafka_2.10:0.8.2.1

Re: Best way to write data to HDFS by Flink

2015-06-22 Thread Hawin Jiang
Hi  Marton

If we receive a huge amount of data from Kafka and write it to HDFS immediately, we
should use the buffer timeout described in the page you linked.
I am not sure whether you have Flume experience; Flume can be configured with a
buffer size and a partition as well.

What is the partition?
For example:
I want to write a one-minute buffer file to HDFS under
/data/flink/year=2015/month=06/day=22/hour=21.
If the partition (/data/flink/year=2015/month=06/day=22/hour=21) already exists,
there is no need to create it; otherwise, Flume creates it automatically, and
Flume makes sure the incoming data lands in the right partition.

I am not sure whether Flink provides a similar partitioning API or configuration
for this (see the sketch below).
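(A rough sketch, not from the thread, of one way to do such time-based bucketing with a custom sink. It writes through the Hadoop FileSystem API, follows the 0.9-era streaming interfaces, and the class name, base path, and bucket format are placeholders; there is no fault tolerance or exactly-once handling here.)

import java.net.URI;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HourlyBucketSink extends RichSinkFunction<String> {

    private final String basePath;          // e.g. "hdfs://<namenode>:<port>/data/flink"
    private transient FileSystem fs;
    private transient FSDataOutputStream out;
    private transient String currentBucket;

    public HourlyBucketSink(String basePath) {
        this.basePath = basePath;
    }

    @Override
    public void open(org.apache.flink.configuration.Configuration parameters) throws Exception {
        fs = FileSystem.get(URI.create(basePath), new org.apache.hadoop.conf.Configuration());
    }

    @Override
    public void invoke(String value) throws Exception {
        // Flume/Hive-style partition path, e.g. year=2015/month=06/day=22/hour=21
        String bucket = new SimpleDateFormat("'year='yyyy/'month='MM/'day='dd/'hour='HH")
                .format(new Date());
        if (!bucket.equals(currentBucket)) {
            if (out != null) {
                out.close();
            }
            Path file = new Path(basePath + "/" + bucket + "/part-"
                    + getRuntimeContext().getIndexOfThisSubtask());
            out = fs.create(file, true);   // missing partition directories are created on the fly
            currentBucket = bucket;
        }
        out.write((value + "\n").getBytes("UTF-8"));
    }

    @Override
    public void close() throws Exception {
        if (out != null) {
            out.close();
        }
    }
}

On the job side such a sink would be attached after the Kafka source, for example
env.addSource(new PersistentKafkaSource(..)).addSink(new HourlyBucketSink("hdfs://<namenode>:<port>/data/flink")),
and env.setBufferTimeout(...) from the streaming guide linked earlier controls how eagerly records are flushed through the pipeline.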
Thanks.



Best regards
Hawin

On Wed, Jun 10, 2015 at 10:31 AM, Hawin Jiang  wrote:

> Thanks Marton
> I will use this code to implement my testing.
>
>
>
> Best regards
> Hawin
>
> On Wed, Jun 10, 2015 at 1:30 AM, Márton Balassi 
> wrote:
>
>> Dear Hawin,
>>
>> You can pass a hdfs path to DataStream's and DataSet's writeAsText and
>> writeAsCsv methods.
>> I assume that you are running a Streaming topology, because your source
>> is Kafka, so it would look like the following:
>>
>> StreamExecutionEnvironment env =
>> StreamExecutionEnvironment.getExecutionEnvironment();
>>
>> env.addSource(PerisitentKafkaSource(..))
>>   .map(/* do you operations*/)
>>
>> .wirteAsText("hdfs://:/path/to/your/file");
>>
>> Check out the relevant section of the streaming docs for more info. [1]
>>
>> [1]
>> http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#connecting-to-the-outside-world
>>
>> Best,
>>
>> Marton
>>
>> On Wed, Jun 10, 2015 at 10:22 AM, Hawin Jiang 
>> wrote:
>>
>>> Hi All
>>>
>>>
>>>
>>> Can someone tell me what is the best way to write data to HDFS when
>>> Flink received data from Kafka?
>>>
>>> Big thanks for your example.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Best regards
>>>
>>> Hawin
>>>
>>>
>>>
>>
>>
>


Re: writeAsCsv on HDFS

2015-06-25 Thread Hawin Jiang
Hi Flavio

Here is the example from Marton;
you can use the writeAsText method directly.


StreamExecutionEnvironment env = StreamExecutionEnvironment.
getExecutionEnvironment();

env.addSource(new PersistentKafkaSource(..))
  .map(/* do your operations */)

.writeAsText("hdfs://<namenode>:<port>/path/to/your/file");

Check out the relevant section of the streaming docs for more info. [1]

[1]
http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#connecting-to-the-outside-world


On Thu, Jun 25, 2015 at 6:25 AM, Stephan Ewen  wrote:

> You could also just qualify the HDFS URL, if that is simpler (put host and
> port of the namenode in there): "hdfs://myhost:40010/path/to/file"
>
> On Thu, Jun 25, 2015 at 3:20 PM, Robert Metzger 
> wrote:
>
>> You have to put it into all machines
>>
>> On Thu, Jun 25, 2015 at 3:17 PM, Flavio Pompermaier > > wrote:
>>
>>> Do I have to put the hadoop conf file on each task manager or just on
>>> the job-manager?
>>>
>>> On Thu, Jun 25, 2015 at 3:12 PM, Chiwan Park 
>>> wrote:
>>>
 It represents the folder containing the hadoop config files. :)

 Regards,
 Chiwan Park


 > On Jun 25, 2015, at 10:07 PM, Flavio Pompermaier <
 pomperma...@okkam.it> wrote:
 >
 > fs.hdfs.hadoopconf represents the folder containing the hadoop config
 files (*-site.xml) or just one specific hadoop config file (e.g.
 core-site.xml or the hdfs-site.xml)?
 >
 > On Thu, Jun 25, 2015 at 3:04 PM, Robert Metzger 
 wrote:
 > Hi Flavio,
 >
 > there is a file called "conf/flink-conf.yaml"
 > Add a new line in the file with the following contents:
 >
 > fs.hdfs.hadoopconf: /path/to/your/hadoop/config
 >
 > This should fix the problem.
 > Flink can not load the configuration file from the jar containing the
 user code, because the file system is initialized independent of the the
 job. So there is (currently) no way of initializing the file system using
 the user code classloader.
 >
 > What you can do is making the configuration file available to Flink's
 system classloader. For example by putting your user jar into the lib/
 folder of Flink. You can also add the path to the Hadoop configuration
 files into the CLASSPATH of Flink (but you need to do that on all 
 machines).
 >
 > I think the easiest approach is using Flink's configuration file.
 >
 >
 > On Thu, Jun 25, 2015 at 2:59 PM, Flavio Pompermaier <
 pomperma...@okkam.it> wrote:
 > Could you describe it better with an example please? Why Flink
 doesn't load automatically the properties of the hadoop conf files within
 the jar?
 >
 > On Thu, Jun 25, 2015 at 2:55 PM, Robert Metzger 
 wrote:
 > Hi,
 >
 > Flink is not loading the Hadoop configuration from the classloader.
 You have to specify the path to the Hadoop configuration in the flink
 configuration "fs.hdfs.hadoopconf"
 >
 > On Thu, Jun 25, 2015 at 2:50 PM, Flavio Pompermaier <
 pomperma...@okkam.it> wrote:
 > Hi to all,
 > I'm experiencing some problem in writing a file as csv on HDFS with
 flink 0.9.0.
 > The code I use is
 >   myDataset.writeAsCsv(new Path("hdfs:///tmp",
 "myFile.csv").toString());
 >
 > If I run the job from Eclipse everything works fine but when I deploy
 the job on the cluster (cloudera 5.1.3) I obtain the following exception:
 >
 > Caused by: java.io.IOException: The given HDFS file URI
 (hdfs:///tmp/myFile.csv) did not describe the HDFS NameNode. The attempt to
 use a default HDFS configuration, as specified in the 'fs.hdfs.hdfsdefault'
 or 'fs.hdfs.hdfssite' config parameter failed due to the following problem:
 Either no default file system was registered, or the provided configuration
 contains no valid authority component (fs.default.name or
 fs.defaultFS) describing the (hdfs namenode) host and port.
 >   at
 org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:291)
 >   at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:258)
 >   at org.apache.flink.core.fs.Path.getFileSystem(Path.java:309)
 >   at
 org.apache.flink.api.common.io.FileOutputFormat.initializeGlobal(FileOutputFormat.java:273)
 >   at
 org.apache.flink.runtime.jobgraph.OutputFormatVertex.initializeOnMaster(OutputFormatVertex.java:84)
 >   at
 org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:520)
 >   ... 25 more
 >
 > The core-site.xml is present in the fat jar and contains the property
 >
 > 
 > fs.defaultFS
 > hdfs://myServerX:8020
 >   
 >
 > I compiled flink with the following command:
 >
 >  mvn clean  install -Dhadoop.version=2.3.0-cdh5.1.3
 -Dhbase.version=0.98.1-c

Re: Kafka0.8.2.1 + Flink0.9.0 issue

2015-06-25 Thread Hawin Jiang
Dear Marton

I have upgraded my Flink to 0.9.0, but I could not consume data from
Kafka with Flink.
I have fully followed your example.
Please help me.
Thanks.


Here is my code:
StreamExecutionEnvironment env =
StreamExecutionEnvironment.createLocalEnvironment().setParallelism(4);

DataStream<String> kafkaStream = env
.addSource(new KafkaSource<String>(host + ":" + port, topic, new
JavaDefaultStringSchema()));
kafkaStream.print();

env.execute();


Here are some errors:

15/06/25 22:57:52 INFO consumer.ConsumerFetcherThread:
[ConsumerFetcherThread-flink-group_hawin-1435298168910-10520844-0-1],
Shutting down
15/06/25 22:57:52 INFO consumer.SimpleConsumer: Reconnect due to socket
error: java.nio.channels.ClosedByInterruptException
15/06/25 22:57:52 INFO consumer.ConsumerFetcherThread:
[ConsumerFetcherThread-flink-group_hawin-1435298168910-10520844-0-1],
Stopped
15/06/25 22:57:52 INFO consumer.ConsumerFetcherThread:
[ConsumerFetcherThread-flink-group_hawin-1435298168910-10520844-0-1],
Shutdown completed
15/06/25 22:57:52 INFO consumer.ConsumerFetcherManager:
[ConsumerFetcherManager-1435298169147] All connections stopped
15/06/25 22:57:52 INFO zkclient.ZkEventThread: Terminate ZkClient event
thread.
15/06/25 22:57:52 INFO zookeeper.ZooKeeper: Session: 0x14e2e5b2dad000a
closed
15/06/25 22:57:52 INFO consumer.ZookeeperConsumerConnector:
[flink-group_hawin-1435298168910-10520844], ZKConsumerConnector shutdown
completed in 40 ms
15/06/25 22:57:52 ERROR tasks.SourceStreamTask: Custom Source -> Stream
Sink (3/4) failed
org.apache.commons.lang3.SerializationException:
java.io.StreamCorruptedException: invalid stream header: 68617769
at
org.apache.commons.lang3.SerializationUtils.deserialize(SerializationUtils.java:232)
at
org.apache.commons.lang3.SerializationUtils.deserialize(SerializationUtils.java:268)
at
org.apache.flink.streaming.util.serialization.JavaDefaultStringSchema.deserialize(JavaDefaultStringSchema.java:40)
at
org.apache.flink.streaming.util.serialization.JavaDefaultStringSchema.deserialize(JavaDefaultStringSchema.java:24)
at
org.apache.flink.streaming.connectors.kafka.api.KafkaSource.run(KafkaSource.java:193)
at
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:49)
at
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.invoke(SourceStreamTask.java:55)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.StreamCorruptedException: invalid stream header: 68617769
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:806)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
at
org.apache.commons.lang3.SerializationUtils.deserialize(SerializationUtils.java:222)
... 8 more

On Tue, Jun 23, 2015 at 6:31 AM, Márton Balassi 
wrote:

> Dear Hawin,
>
> Sorry, I have managed to link to a pom that has been changed in the
> meantime. But we have added a section to our docs clarifying your question.
> [1] Since then Stephan has proposed an even nicer solution that did not
> make it into the docs yet: if you start from our quickstart pom, add your
> dependencies to it, and simply execute 'mvn package -Pbuild-jar', you get a
> jar with all the code that is needed to run it on the cluster, but not
> more. See [3] for more on the quickstart.
>
> [1]
> http://ci.apache.org/projects/flink/flink-docs-master/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution
> [2]
> https://github.com/apache/flink/blob/master/flink-quickstart/flink-quickstart-java/src/main/resources/archetype-resources/pom.xml
> [3]
> http://ci.apache.org/projects/flink/flink-docs-master/quickstart/java_api_quickstart.html
>
> On Tue, Jun 23, 2015 at 6:48 AM, Ashutosh Kumar <
> ashutosh.disc...@gmail.com> wrote:
>
>> I use the following dependencies and it works fine:
>>
>> <dependency>
>>     <groupId>org.apache.flink</groupId>
>>     <artifactId>flink-java</artifactId>
>>     <version>0.9-SNAPSHOT</version>
>> </dependency>
>> <dependency>
>>     <groupId>org.apache.flink</groupId>
>>     <artifactId>flink-clients</artifactId>
>>     <version>0.9-SNAPSHOT</version>
>> </dependency>
>> <dependency>
>>     <groupId>org.apache.flink</groupId>
>>     <artifactId>flink-streaming-core</artifactId>
>>     <version>0.9-SNAPSHOT</version>
>> </dependency>
>> <dependency>
>>     <groupId>org.apache.flink</groupId>
>>     <artifactId>flink-connector-kafka</artifactId>
>>     <version>0.9-SNAPSHOT</version>
>> </dependency>
>>
>> On Mon, Jun 22, 2015 at 10:07 PM, Hawin Jiang 
>> wrote:
>>
>>> Hi  Marton
>>>
>>> I have to add whole pom.xml file or just only plugin as below.
>>> I saw L286 to L296 are not correct information in pom.xml.
>>> Thanks.
>>>
>>>
>>>
>>>  org.apache.maven.plugins >> >maven-assembly-plugin 2.4 <
>>> configuration>  jar-with-dependencies>> descriptorRef>   
>>>
>>> On Thu, Jun 11, 2015 at 1:43 AM, Márton Balassi <
>>> balassi.mar...@gmail.com> wrote:
>>>
>>>> As for l

Re: Kafka0.8.2.1 + Flink0.9.0 issue

2015-06-25 Thread Hawin Jiang
Dear Marton


Here are some errors when I run KafkaProducerExample.java from Eclipse.

kafka.common.KafkaException: fetching topic metadata for topics
[Set(flink-kafka-topic)] from broker
[ArrayBuffer(id:0,host:192.168.0.112,port:2181)] failed
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:72)
at
kafka.producer.BrokerPartitionInfo.updateInfo(BrokerPartitionInfo.scala:82)
at
kafka.producer.async.DefaultEventHandler$$anonfun$handle$2.apply$mcV$sp(DefaultEventHandler.scala:78)
at kafka.utils.Utils$.swallow(Utils.scala:172)
at kafka.utils.Logging$class.swallowError(Logging.scala:106)
at kafka.utils.Utils$.swallowError(Utils.scala:45)
at
kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:78)
at kafka.producer.Producer.send(Producer.scala:77)
at kafka.javaapi.producer.Producer.send(Producer.scala:33)
at
org.apache.flink.streaming.connectors.kafka.api.KafkaSink.invoke(KafkaSink.java:183)
at
org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:35)
at
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.invoke(OneInputStreamTask.java:102)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException: Received -1 when reading from channel,
socket has likely been closed.
at kafka.utils.Utils$.read(Utils.scala:381)
at
kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
at
kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
at kafka.network.BlockingChannel.receive(BlockingChannel.scala:111)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:75)
at
kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)

On Thu, Jun 25, 2015 at 11:06 PM, Hawin Jiang  wrote:

> Dear Marton
>
> I have upgraded my Flink to 0.9.0.  But I could not consume a data from
> Kafka by Flink.
> I have fully followed your example.
> Please help me.
> Thanks.
>
>
> Here is my code
> StreamExecutionEnvironment env =
> StreamExecutionEnvironment.createLocalEnvironment().setParallelism(4);
>
> DataStream kafkaStream = env
> .addSource(new KafkaSource(host + ":" + port, topic, new
> JavaDefaultStringSchema()));
> kafkaStream.print();
>
> env.execute();
>
>
> Here are some errors:
>
> 15/06/25 22:57:52 INFO consumer.ConsumerFetcherThread:
> [ConsumerFetcherThread-flink-group_hawin-1435298168910-10520844-0-1],
> Shutting down
> 15/06/25 22:57:52 INFO consumer.SimpleConsumer: Reconnect due to socket
> error: java.nio.channels.ClosedByInterruptException
> 15/06/25 22:57:52 INFO consumer.ConsumerFetcherThread:
> [ConsumerFetcherThread-flink-group_hawin-1435298168910-10520844-0-1],
> Stopped
> 15/06/25 22:57:52 INFO consumer.ConsumerFetcherThread:
> [ConsumerFetcherThread-flink-group_hawin-1435298168910-10520844-0-1],
> Shutdown completed
> 15/06/25 22:57:52 INFO consumer.ConsumerFetcherManager:
> [ConsumerFetcherManager-1435298169147] All connections stopped
> 15/06/25 22:57:52 INFO zkclient.ZkEventThread: Terminate ZkClient event
> thread.
> 15/06/25 22:57:52 INFO zookeeper.ZooKeeper: Session: 0x14e2e5b2dad000a
> closed
> 15/06/25 22:57:52 INFO consumer.ZookeeperConsumerConnector:
> [flink-group_hawin-1435298168910-10520844], ZKConsumerConnector shutdown
> completed in 40 ms
> 15/06/25 22:57:52 ERROR tasks.SourceStreamTask: Custom Source -> Stream
> Sink (3/4) failed
> org.apache.commons.lang3.SerializationException:
> java.io.StreamCorruptedException: invalid stream header: 68617769
> at
> org.apache.commons.lang3.SerializationUtils.deserialize(SerializationUtils.java:232)
> at
> org.apache.commons.lang3.SerializationUtils.deserialize(SerializationUtils.java:268)
> at
> org.apache.flink.streaming.util.serialization.JavaDefaultStringSchema.deserialize(JavaDefaultStringSchema.java:40)
> at
> org.apache.flink.streaming.util.serialization.JavaDefaultStringSchema.deserialize(JavaDefaultStringSchema.java:24)
> at
> org.apache.flink.streaming.connectors.kafka.api.KafkaSource.run(KafkaSource.java:193)
> at
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:49)
> at
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.invoke(SourceStreamTask.java:55)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.StreamCorruptedException: invalid stream header:
> 68617769
> at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:806)
> at java.io.ObjectInputStream.(ObjectInputStream

Re: Best way to write data to HDFS by Flink

2015-06-25 Thread Hawin Jiang
Hi Stephan

Yes, that is a great idea.  If it is possible, I will try my best to
contribute some code to Flink.
But I have to run some Flink examples first to understand Apache Flink.
I just ran some Kafka-with-Flink examples, and none of them worked for me, so I
am quite discouraged right now.
I haven't had any trouble running the Kafka examples from kafka.apache.org so
far.
Please advise me.
Thanks.



Best regards
Hawin


On Wed, Jun 24, 2015 at 1:02 AM, Stephan Ewen  wrote:

> Hi Hawin!
>
> If you are creating code for such an output into different
> files/partitions, it would be amazing if you could contribute this code to
> Flink.
>
> It seems like a very common use case, so this functionality will be useful
> to other users as well!
>
> Greetings,
> Stephan
>
>
> On Tue, Jun 23, 2015 at 3:36 PM, Márton Balassi 
> wrote:
>
>> Dear Hawin,
>>
>> We do not have out of the box support for that, it is something you would
>> need to implement yourself in a custom SinkFunction.
>>
>> Best,
>>
>> Marton
>>
>> On Mon, Jun 22, 2015 at 11:51 PM, Hawin Jiang 
>> wrote:
>>
>>> Hi  Marton
>>>
>>> if we received a huge data from kafka and wrote to HDFS immediately.  We
>>> should use buffer timeout based on your URL
>>> I am not sure you have flume experience.  Flume can be configured buffer
>>> size and partition as well.
>>>
>>> What is the partition.
>>> For example:
>>> I want to write 1 minute buffer file to HDFS which is
>>> /data/flink/year=2015/month=06/day=22/hour=21.
>>> if the partition(/data/flink/year=2015/month=06/day=22/hour=21) is
>>> there, no need to create it. Otherwise, flume will create it automatically.
>>> Flume knows the coming data will come to right partition.
>>>
>>> I am not sure Flink also provided a similar partition API or
>>> configuration for this.
>>> Thanks.
>>>
>>>
>>>
>>> Best regards
>>> Hawin
>>>
>>> On Wed, Jun 10, 2015 at 10:31 AM, Hawin Jiang 
>>> wrote:
>>>
>>>> Thanks Marton
>>>> I will use this code to implement my testing.
>>>>
>>>>
>>>>
>>>> Best regards
>>>> Hawin
>>>>
>>>> On Wed, Jun 10, 2015 at 1:30 AM, Márton Balassi <
>>>> balassi.mar...@gmail.com> wrote:
>>>>
>>>>> Dear Hawin,
>>>>>
>>>>> You can pass a hdfs path to DataStream's and DataSet's writeAsText and
>>>>> writeAsCsv methods.
>>>>> I assume that you are running a Streaming topology, because your
>>>>> source is Kafka, so it would look like the following:
>>>>>
>>>>> StreamExecutionEnvironment env =
>>>>> StreamExecutionEnvironment.getExecutionEnvironment();
>>>>>
>>>>> env.addSource(PerisitentKafkaSource(..))
>>>>>   .map(/* do you operations*/)
>>>>>
>>>>> .wirteAsText("hdfs://:/path/to/your/file");
>>>>>
>>>>> Check out the relevant section of the streaming docs for more info. [1]
>>>>>
>>>>> [1]
>>>>> http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#connecting-to-the-outside-world
>>>>>
>>>>> Best,
>>>>>
>>>>> Marton
>>>>>
>>>>> On Wed, Jun 10, 2015 at 10:22 AM, Hawin Jiang 
>>>>> wrote:
>>>>>
>>>>>> Hi All
>>>>>>
>>>>>>
>>>>>>
>>>>>> Can someone tell me what is the best way to write data to HDFS when
>>>>>> Flink received data from Kafka?
>>>>>>
>>>>>> Big thanks for your example.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> Hawin
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
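As an editorial illustration of the custom SinkFunction that Marton mentions above, here is a minimal sketch of a sink writing records into the Flume-style hour partitions Hawin describes (/data/flink/year=.../month=.../day=.../hour=...). It is only a sketch under assumptions: it targets the 0.9-era streaming API (class and package names may differ between Flink versions), the class name, HDFS URI and base path are hypothetical, and a production version would need flushing, file rolling and error handling.

    import java.net.URI;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of a custom sink that writes records into hour-partitioned HDFS folders,
    // e.g. /data/flink/year=2015/month=06/day=22/hour=21/part-<subtask>.
    public class HourPartitionedHdfsSink extends RichSinkFunction<String> {

        private final String hdfsUri;   // e.g. "hdfs://namenode:8020" (placeholder)
        private final String baseDir;   // e.g. "/data/flink" (placeholder)

        private transient FileSystem fs;
        private transient FSDataOutputStream out;
        private transient String currentPartition;

        public HourPartitionedHdfsSink(String hdfsUri, String baseDir) {
            this.hdfsUri = hdfsUri;
            this.baseDir = baseDir;
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            // Plain Hadoop client configuration; fully qualified to avoid the name
            // clash with Flink's own Configuration class.
            fs = FileSystem.get(new URI(hdfsUri), new org.apache.hadoop.conf.Configuration());
        }

        @Override
        public void invoke(String value) throws Exception {
            String partition = new SimpleDateFormat("'year='yyyy/'month='MM/'day='dd/'hour='HH")
                    .format(new Date());
            if (!partition.equals(currentPartition)) {
                // Roll over to a new partition directory; create it only if it is missing.
                if (out != null) {
                    out.close();
                }
                Path dir = new Path(baseDir, partition);
                if (!fs.exists(dir)) {
                    fs.mkdirs(dir);
                }
                int subtask = getRuntimeContext().getIndexOfThisSubtask();
                out = fs.create(new Path(dir, "part-" + subtask), true);
                currentPartition = partition;
            }
            out.write((value + "\n").getBytes("UTF-8"));
        }

        @Override
        public void close() throws Exception {
            if (out != null) {
                out.close();
            }
        }
    }

It could be attached with stream.addSink(new HourPartitionedHdfsSink("hdfs://namenode:8020", "/data/flink")). The buffer timeout mentioned in the thread is a separate knob, set on the environment via env.setBufferTimeout(timeoutMillis).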


Re: Kafka0.8.2.1 + Flink0.9.0 issue

2015-06-26 Thread Hawin Jiang
Hi  Aljoscha

You are the best.
Thank you very much.
It is working now.



Best regards
Hawin

On Fri, Jun 26, 2015 at 12:28 AM, Aljoscha Krettek 
wrote:

> Hi,
> could you please try replacing JavaDefaultStringSchema() with
> SimpleStringSchema() in your first example. The one where you get this
> exception:
> org.apache.commons.lang3.SerializationException:
> java.io.StreamCorruptedException: invalid stream header: 68617769
>
> Cheers,
> Aljoscha
>
> On Fri, 26 Jun 2015 at 08:21 Hawin Jiang  wrote:
>
>> Dear Marton
>>
>>
>> Here are some errors when I run KafkaProducerExample.java from Eclipse.
>>
>> kafka.common.KafkaException: fetching topic metadata for topics
>> [Set(flink-kafka-topic)] from broker
>> [ArrayBuffer(id:0,host:192.168.0.112,port:2181)] failed
>> at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:72)
>> at
>> kafka.producer.BrokerPartitionInfo.updateInfo(BrokerPartitionInfo.scala:82)
>> at
>> kafka.producer.async.DefaultEventHandler$$anonfun$handle$2.apply$mcV$sp(DefaultEventHandler.scala:78)
>> at kafka.utils.Utils$.swallow(Utils.scala:172)
>> at kafka.utils.Logging$class.swallowError(Logging.scala:106)
>> at kafka.utils.Utils$.swallowError(Utils.scala:45)
>> at
>> kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:78)
>> at kafka.producer.Producer.send(Producer.scala:77)
>> at kafka.javaapi.producer.Producer.send(Producer.scala:33)
>> at
>> org.apache.flink.streaming.connectors.kafka.api.KafkaSink.invoke(KafkaSink.java:183)
>> at
>> org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:35)
>> at
>> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.invoke(OneInputStreamTask.java:102)
>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.io.EOFException: Received -1 when reading from channel,
>> socket has likely been closed.
>> at kafka.utils.Utils$.read(Utils.scala:381)
>> at
>> kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>> at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
>> at
>> kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
>> at kafka.network.BlockingChannel.receive(BlockingChannel.scala:111)
>> at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:75)
>> at
>> kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
>> at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
>> at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
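An editorial note on the producer stack trace above: the broker list shows port 2181, which is ZooKeeper's default port, while the old Kafka producer's metadata.broker.list must point at the Kafka broker itself (9092 by default). A hedged configuration sketch, with the host and broker port as assumptions:

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.ProducerConfig;

    // Sketch only: old-style Kafka 0.8 producer configuration. The broker address is an
    // assumption (default broker port 9092); 2181 is normally ZooKeeper, not a broker.
    public class ProducerConfigSketch {
        public static Producer<String, String> createProducer() {
            Properties props = new Properties();
            props.put("metadata.broker.list", "192.168.0.112:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            return new Producer<String, String>(new ProducerConfig(props));
        }
    }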
>>
>> On Thu, Jun 25, 2015 at 11:06 PM, Hawin Jiang 
>> wrote:
>>
>>> Dear Marton
>>>
>>> I have upgraded my Flink to 0.9.0, but I could not consume any data from
>>> Kafka with Flink.
>>> I have fully followed your example.
>>> Please help me.
>>> Thanks.
>>>
>>>
>>> Here is my code
>>> StreamExecutionEnvironment env =
>>> StreamExecutionEnvironment.createLocalEnvironment().setParallelism(4);
>>>
>>> DataStream<String> kafkaStream = env
>>> .addSource(new KafkaSource<String>(host + ":" + port, topic, new
>>> JavaDefaultStringSchema()));
>>> kafkaStream.print();
>>>
>>> env.execute();
>>>
>>>
>>> Here are some errors:
>>>
>>> 15/06/25 22:57:52 INFO consumer.ConsumerFetcherThread:
>>> [ConsumerFetcherThread-flink-group_hawin-1435298168910-10520844-0-1],
>>> Shutting down
>>> 15/06/25 22:57:52 INFO consumer.SimpleConsumer: Reconnect due to socket
>>> error: java.nio.channels.ClosedByInterruptException
>>> 15/06/25 22:57:52 INFO consumer.ConsumerFetcherThread:
>>> [ConsumerFetcherThread-flink-group_hawin-1435298168910-10520844-0-1],
>>> Stopped
>>> 15/06/25 22:57:52 INFO consumer.ConsumerFetcherThread:
>>> [ConsumerFetcherThread-flink-group_hawin-1435298168910-10520844-0-1],
>>> Shutdown completed
>>> 15/06/25 22:57:52 INFO consumer.ConsumerFetcherManager:
>>> [ConsumerFetcherManager-1435298169147] All connections stopped
>>> 15/06/25 22:57:52 INFO zkclient.ZkEventThread: Terminate ZkClient event
>>> thread.
>>> 15/06/25 22:57:52 INFO zookeeper.ZooKeeper: Session: 0x14e2e5b2dad000a
>>> closed
>>> 15/06/25 22:57:52 INFO consumer.ZookeeperConsumerConnector:
>>> [flink-group_hawin-1435298168910-10520844], ZKConsumerConnector shu
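For reference, a minimal sketch of the consumer above with Aljoscha's suggestion applied, i.e. SimpleStringSchema instead of JavaDefaultStringSchema; the host, port and topic values are placeholders:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.api.KafkaSource;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    // Sketch: same job as in the thread, but SimpleStringSchema treats each Kafka
    // message as a plain UTF-8 string instead of a serialized Java object.
    public class KafkaConsoleJob {
        public static void main(String[] args) throws Exception {
            String host = "localhost";          // placeholder
            String port = "2181";               // ZooKeeper port used by the legacy KafkaSource
            String topic = "flink-kafka-topic"; // placeholder

            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.createLocalEnvironment().setParallelism(4);

            DataStream<String> kafkaStream = env.addSource(
                    new KafkaSource<String>(host + ":" + port, topic, new SimpleStringSchema()));
            kafkaStream.print();

            env.execute();
        }
    }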

RE: Best way to write data to HDFS by Flink

2015-06-29 Thread Hawin Jiang
Dear Marton

Thanks for asking. Yes, it is working now.

But the TPS is not very good. I have run into four issues:

1. My TPS is around 2000 events per second, but at the 2015 Los Angeles Big
Data Day yesterday I saw a company report 132K events per second on a single
node and 282K events per second on two nodes; they used Kafka+Spark.
As you know, I used Kafka+Flink, so maybe I have to do more investigation on
my side.

2. For my performance testing I used JMeter to produce data to Kafka while
Flink wrote the data to HDFS. The total message count on the JMeter side does
not match what ends up on the HDFS side.

3. I found that Flink randomly created folders 1, 2, 3 and 4, but only
folders 1 and 4 contain files; folders 2 and 3 do not have any files.

4. I am going to develop some code to write data to a
/data/flink/year/month/day/hour folder layout. I think that folder structure
will also be good for the Flink Table API in the future.

Please let me know if you have any comments or suggestions for me.
Thanks.



Best regards

Hawin


From: Márton Balassi [mailto:balassi.mar...@gmail.com]
Sent: Sunday, June 28, 2015 9:09 PM
To: user@flink.apache.org
Subject: Re: Best way to write data to HDFS by Flink

Dear Hawin,

As for your issues with running the Flink Kafka examples: are those resolved
with Aljoscha's comment in the other thread? :)

Best,

Marton



Re: Benchmark results between Flink and Spark

2015-07-06 Thread Hawin Jiang
Hi  Slim and Fabian

Here is the Spark benchmark. https://amplab.cs.berkeley.edu/benchmark/
Do we have a similar report or comparison like that?
Thanks.



Best regards
Hawin



On Mon, Jul 6, 2015 at 6:32 AM, Slim Baltagi  wrote:

> Hi Fabian
>
> > I could not find which versions of Flink and Spark were compared.
> According to Norman Spangenberg, one of the authors of the conference
> paper,
> the benchmark used Spark version 1.2.0 and Flink version 0.8.0.
>
> I did ask him a few more questions about the benchmark between Flink and
> Spark.
> I'll share the answers once Norman Spangenberg gets back to me.
>
> Thanks
>
> Slim Baltagi
> Apache Flink Knowledge Base ( Now with over 300 categorized web resources!)
> http://sparkbigdata.com/component/tags/tag/27-flink
>
>
>
>
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Benchmark-results-between-Flink-and-Spark-tp1940p1957.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>


Re: Benchmark results between Flink and Spark

2015-07-06 Thread Hawin Jiang
Hi Stephan

Yes, you are correct. It looks like TPCx-HS is an industry standard for big
data, but how do we get a Flink number on it?
I think it is also difficult to get a Spark performance number based on
TPCx-HS.
If you know someone who can provide servers for performance testing, I would
like to put in my best effort.


@Slim
That link is just for your reference. At least you can see the exact time
they spent when running those queries.
BigDataBench is a good guide for big data benchmarking, but how do we run
these use cases on both Flink and Spark to get those performance numbers?


@Vasia
Thanks for sharing. If we can do some basic comparisons with Apache Spark,
the red line below will go up fast.

Thanks.




[image: Inline image 1]

On Mon, Jul 6, 2015 at 11:41 AM, Slim Baltagi  wrote:

> Hi
>
> Vasia, thanks for sharing.
> 1. I would like to add a couple resources about *BigBench*, the Big Data
> benchmark suite that you are referring to:
>  https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench
> and also
>
> http://blog.cloudera.com/blog/2014/11/bigbench-toward-an-industry-standard-benchmark-for-big-data-analytics/
>
> 2. *BigDataBench* is also an open source Big Data Benchmarking suite from
> both industry and academia.  As a subset of BigDataBench, BigDataBench-DCA
> is China’s first industry-standard big data benchmark suite:
> http://prof.ict.ac.cn/BigDataBench/industry-standard-benchmarks/
> It comes with *real-world data sets* and *many workloads*: TeraSort,
> WordCount, PageRank, K-means, NaiveBayes, Aggregation and Read/Write/Scan
> and also a *tool* that uses Hadoop, HBase and Mahout.
> This might be inspiring to build a Big Data Benchmarking suite for Flink!
>
> Regards,
>
> Slim Baltagi
> Apache Flink Knowledge Base ( Now with over 300 categorized web resources!)
> http://sparkbigdata.com/component/tags/tag/27-flink
>
>
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Benchmark-results-between-Flink-and-Spark-tp1940p1963.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>


Re: TeraSort on Flink and Spark

2015-07-12 Thread Hawin Jiang
Hi  Kim and Stephan

Kim's report shows Flink 0.9.0 sorting 3360 GB in 1427 seconds, where 3360 =
80 * 42 (80 GB per node on 42 nodes).
Based on Kim's report, the throughput is about 2.35 GB/sec for Flink 0.9.0.

Kim was using 42 nodes for testing purposes. I found that the best Spark
performance result, from Databricks, used 190 nodes.
Scaling by the node ratio 42:190, if we were to use 190 nodes for this test,
the Flink throughput should be around 10.65 GB/sec.


Here is the summary table for your reference.
Please let me know if you have any questions about this table.
Thanks.

[image: Inline image 2]


72.93 GB/sec = (1000 TB * 1024 GB/TB) / (234 min * 60 sec/min)


The performance test report from Databricks.
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html






Best regards
Hawin

On Fri, Jul 10, 2015 at 1:33 AM, Stephan Ewen  wrote:

> Hi Dongwon Kim!
>
> Thank you for trying out these changes.
>
> The OptimizedText can be sorted more efficiently, because it generates a
> binary key prefix. That way, the sorting needs to serialize/deserialize
> less and saves on CPU.
>
> In parts of the program, the CPU is then less of a bottleneck and the
> disks and the network can unfold their bandwidth better.
>
> Greetings,
> Stephan
>
>
>
> On Fri, Jul 10, 2015 at 9:35 AM, Dongwon Kim 
> wrote:
>
>> Hi Stephan,
>>
>> I just pushed changes to my github:
>> https://github.com/eastcirclek/terasort.
>> I've modified the TeraSort program so that (A) it can reuse objects
>> and (B) it can exploit OptimizedText as you suggested.
>>
>> I've also conducted few experiments and the results are as follows:
>> ORIGINAL : 1714
>> ORIGINAL+A : 1671
>> ORIGINAL+B : 1467
>> ORIGINAL+A+B : 1427
>> Your advice works as shown above :-)
>>
>> Datasets are now defined as below:
>> - val inputFile = env.readHadoopFile(teraInputFormat, classOf[Text],
>> classOf[Text], inputPath)
>> - val optimizedText = inputFile.map(tp => (new OptimizedText(tp._1),
>> tp._2))
>> - val sortedPartitioned = optimizedText.partitionCustom(partitioner,
>> 0).sortPartition(0, Order.ASCENDING)
>> - sortedPartitioned.map(tp => (tp._1.getText, tp._2)).output(hadoopOF)
>> You can see the two map transformations before and after the function
>> composition partitionCustom.sortPartition.
>>
>> Here is a question regarding the performance improvement.
>> Please see the attached Ganglia image files.
>> - ORIGINAL-[cpu, disks, network].png are for ORIGINAL.
>> - BEST-[cpu, disks, network].png are for ORIGINAL+A+B.
>> Compared to ORIGINAL, BEST shows better utilization of disks and
>> network and shows lower CPU utilization.
>> Is this because OptimizedText objects are serialized into Flink memory
>> layout?
>> What happens when keys are represented in just Text, not
>> OptimizedText? Is there another memory area to hold such objects, or
>> are they serialized anyway but in an inefficient way?
>> If latter, is the CPU utilization in ORIGINAL high because CPUs work
>> hard to serialize Text objects using Java serialization mechanism with
>> DataInput and DataOutput?
>> If true, I can explain the high throughput of network and disks in
>> ORIGINAL+A+B.
>> I, however, observed similar performance when mapping not
>> only the 10-byte keys but also the 90-byte values, which cannot be
>> explained by the above conjecture.
>> Could you make things clear? If so, I would really appreciate it ;-)
>>
>> I'm also wondering whether the two map transformations,
>> (Text, Text) to (OptimizedText, Text) and  (OptimizedText, Text) to
>> (Text, Text),
>> can prevent Flink from performing a lot better.
>> I don't have time to modify TeraInputFormat and TeraOutputFormat to
>> read (String, String) pairs from HDFS and write (String, String) pairs
>> to HDFS.
>> Do you see that one can get a better TeraSort result using an new
>> implementation of FileInputFormat?
>>
>> Regards,
>>
>> Dongwon Kim
>>
>> 2015-07-03 3:29 GMT+09:00 Stephan Ewen :
>> > Hello Dongwon Kim!
>> >
>> > Thanks you for sharing these numbers with us.
>> >
>> > I have gone through your implementation and there are two things you
>> could
>> > try:
>> >
>> > 1)
>> >
>> > I see that you sort Hadoop's Text data type with Flink. I think this
>> may be
>> > less efficient than if you sort String, or a Flink specific data type.
>> >
>> > For efficient byte operations on managed memory, Flink needs to
>> understand
>> > the binary representation of the data type. Flink understands that for
>> > "String" and many other types, but not for "Text".
>> >
>> > There are two things you can do:
>> >   - First, try what happens if you map the Hadoop Text type to a Java
>> String
>> > (only for the tera key).
>> >   - Second, you can try what happens if you wrap the Hadoop Text type
>> in a
>> > Flink type that supports optimized binary sorting. I have pasted code
>> for
>> > that at the bottom of this email.
>> >
>> > 2)
>> >
>> > You can see if it helps performance if you enable object re-use in
>> Flink.
>> > 
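The thread is truncated here; as a hedged illustration of the object re-use switch Stephan refers to in point 2, assuming the DataSet API's ExecutionConfig (the exact snippet he had in mind is not shown in the archive):

    import org.apache.flink.api.java.ExecutionEnvironment;

    // Sketch: enabling object re-use on the execution config so Flink may hand the
    // same object instances to chained user functions instead of creating copies.
    public class ObjectReuseSketch {
        public static void main(String[] args) {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            env.getConfig().enableObjectReuse();
            // ... build the TeraSort plan on env as discussed in the thread above ...
        }
    }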

Sort Benchmark infrastructure

2015-07-12 Thread Hawin Jiang
Hi Michael and George

First of all, congratulations on winning the sort benchmark again. We are
from the Flink community.
I am not sure whether it is possible to use your test environment to test
Flink for free; we saw that Apache Spark did a good job as well.
We want to challenge your records, but we don't have that many servers for
testing.
Please let me know whether you can help us.
Thank you very much.



Best regards

Hawin



Re: Building Big Data Benchmarking suite for Apache Flink

2015-07-13 Thread Hawin Jiang
Hi  Slim

I will follow this and keep you posted.
Thanks.



Best regards
Hawin

On Mon, Jul 13, 2015 at 7:04 PM, Slim Baltagi  wrote:

> Hi
>
> BigDataBench is  an open source Big Data Benchmarking suite from both
> industry and academia.  As a subset of BigDataBench, BigDataBench-DCA  is
> China’s first industry-standard big data benchmark suite:
> http://prof.ict.ac.cn/BigDataBench/industry-standard-benchmarks/
> It comes with real-world data sets and many workloads: TeraSort, WordCount,
> PageRank, K-means, NaiveBayes, Aggregation and Read/Write/Scan and also a
> tool that uses Hadoop, HBase and Mahout.
> This might be inspiring to build a Big Data Benchmarking suite for Flink!
>
> I would like to share with you the news that professor Jianfeng Zhan from
> the Institute of Computing Technology, Chinese Academy of Sciences is
> planning to support Flink in the BigDataBench project! Reference:
> https://www.linkedin.com/grp/home?gid=6777483
>
> Thanks
>
> Slim Baltagi
>
>
>
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Building-Big-Data-Benchmarking-suite-for-Apache-Flink-tp2035.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>


Re: Sort Benchmark infrastructure

2015-07-15 Thread Hawin Jiang
Hi  George and Mike

Thanks for your information.  Did you use 186 i2.8xlarge servers for
testing?
Total one hour cost = 186 * 6.82 = $1,268.52.
Do you know of any person or company that could sponsor this?

For our test approach, I have checked an industry standard from BigDataBench
(http://prof.ict.ac.cn/BigDataBench/industry-standard-benchmarks/).
Maybe we can run TeraSort to see whether the performance beats your record.

Please let me know if you have any comments.
Thanks for the support.




Best regards
Hawin



On Tue, Jul 14, 2015 at 9:42 AM, Mike Conley  wrote:

> George is correct.  We used i2.8xlarge with placement groups on Amazon
> EC2.  We ran Amazon Linux, which if I recall correctly is based on Red Hat,
> but optimized for EC2.  OS was essentially unmodified with some packages
> installed for our dependencies.
>
> Thanks,
> Mike
>
> On Tue, Jul 14, 2015 at 9:15 AM, George Porter 
> wrote:
>
>> Hello Hawin,
>>
>> Thanks for reaching out.  We wrote a paper on our efforts, which we'll be
>> posting to our website in a couple of weeks.
>>
>> However in summary, we used a cluster of i2.8xlarge instance types from
>> Amazon, and we made use of the placement group feature to ensure that we'd
>> get good bandwidth between them.  Mike can correct me if I'm wrong, but I
>> believe we used the stock AWS version of Linux (Ubuntu maybe?)
>>
>> So our environment was pretty stock--we didn't get any special support or
>> features from AWS.
>>
>> Best of luck with your profiling and benchmarking.  Do let us know how
>> you perform.  Flink looks like a pretty interesting project, and so let us
>> know if we can help y'all out in some way.
>>
>> Thanks, George
>>
>>
>> On Sun, Jul 12, 2015 at 11:12 PM, Hawin Jiang 
>> wrote:
>>
>>> Hi Michael and George
>>>
>>>
>>>
>>> First of all, congratulation you guys have won the sort game again.  We
>>> are coming from Flink community.
>>>
>>> I am not sure if it is possible to get your test environment to test our
>>> Flink for free.  we saw that Apache spark did a good job as well.
>>>
>>> We want to challenge your records. But we don’t have that much servers
>>> for testing.
>>>
>>> Please let me know if you can help us or not.
>>>
>>> Thank you very much.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Best regards
>>>
>>> Hawin
>>>
>>
>>
>


Re: Sort Benchmark infrastructure

2015-07-16 Thread Hawin Jiang
Hi  George

Thanks for the details.  It looks like I have a long way to go.
For big data benchmarking, I would like to use those test cases, test data and
test methodology to test different big data technologies.
BTW, I agree with you that no one system will necessarily be optimal for
all cases and all workloads.
I hope I can find a good one for our enterprise application. I will let
you know if I can move this forward.
Good Night.



Best regards
Hawin

On Wed, Jul 15, 2015 at 9:30 AM, George Porter  wrote:

> Hi Hawin,
>
> We used varying numbers of the i2.8xlarge servers, depending on the sort
> record category.  http://sortbenchmark.org/ is really your best source
> for what we did--all the details (should) be on our write-ups.  Note that
> we pro-rated the cost, meaning that if we ran for 15 minutes, we took the
> hourly rate and divided by 4.
>
> In terms of sponsorship, we used a combination of credits donated by
> Amazon, as well as funding from the National Science Foundation.  You can
> submit a grant proposal to Amazon and ask them for credits if you're an
> academic or researcher.  Not sure if being part of an open-source project
> counts, but you might as well try.
>
> In terms of the sort record, that webpage I provided above has all the
> details on the challenge.  Not sure about Big Data benchmark--that term is
> pretty vague.  Often when people say big data, they mean different things.
> Our system is designed for lots of bytes, but not really lots of compute
> over those bytes.  Others pick different design points.  I think you'll
> find that the needs of different users varies quite a bit, and no one
> system will necessarily be optimal for all cases for all workloads.
>
> Good luck on your attempts.
> -George
>
> 
> George Porter
> Assistant Professor, Dept. of Computer Science and Engineering
> Associate Director, UCSD Center for Networked Systems
> UC San Diego, La Jolla CA
> http://www.cs.ucsd.edu/~gmporter/
>
>
>
> On Wed, Jul 15, 2015 at 1:44 AM, Hawin Jiang 
> wrote:
>
>> Hi  George and Mike
>>
>> Thanks for your information.  Did you use 186 i2.8xlarge servers for
>> testing?
>> Total one hour cost = 186 * 6.82 = $1,268.52.
>> Do you know any person or company can sponsor this?
>>
>> For our test approach, I have checked an industry standard from big data
>> bench(http://prof.ict.ac.cn/BigDataBench/industry-standard-benchmarks/)
>> Maybe we can test TeraSort to see the performance is better than your
>> record or not.
>>
>> Please let me know if you have any comments.
>> Thanks for the support.
>>
>>
>>
>>
>> Best regards
>> Hawin
>>
>>
>>
>> On Tue, Jul 14, 2015 at 9:42 AM, Mike Conley  wrote:
>>
>>> George is correct.  We used i2.8xlarge with placement groups on Amazon
>>> EC2.  We ran Amazon Linux, which if I recall correctly is based on Red Hat,
>>> but optimized for EC2.  OS was essentially unmodified with some packages
>>> installed for our dependencies.
>>>
>>> Thanks,
>>> Mike
>>>
>>> On Tue, Jul 14, 2015 at 9:15 AM, George Porter 
>>> wrote:
>>>
>>>> Hello Hawin,
>>>>
>>>> Thanks for reaching out.  We wrote a paper on our efforts, which we'll
>>>> be posting to our website in a couple of weeks.
>>>>
>>>> However in summary, we used a cluster of i2.8xlarge instance types from
>>>> Amazon, and we made use of the placement group feature to ensure that we'd
>>>> get good bandwidth between them.  Mike can correct me if I'm wrong, but I
>>>> believe we used the stock AWS version of Linux (Ubuntu maybe?)
>>>>
>>>> So our environment was pretty stock--we didn't get any special support
>>>> or features from AWS.
>>>>
>>>> Best of luck with your profiling and benchmarking.  Do let us know how
>>>> you perform.  Flink looks like a pretty interesting project, and so let us
>>>> know if we can help y'all out in some way.
>>>>
>>>> Thanks, George
>>>>
>>>>
>>>> On Sun, Jul 12, 2015 at 11:12 PM, Hawin Jiang 
>>>> wrote:
>>>>
>>>>> Hi Michael and George
>>>>>
>>>>>
>>>>>
>>>>> First of all, congratulation you guys have won the sort game again.
>>>>> We are coming from Flink community.
>>>>>
>>>>> I am not sure if it is possible to get your test environment to test
>>>>> our Flink for free.  we saw that Apache spark did a good job as well.
>>>>>
>>>>> We want to challenge your records. But we don’t have that much servers
>>>>> for testing.
>>>>>
>>>>> Please let me know if you can help us or not.
>>>>>
>>>>> Thank you very much.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Best regards
>>>>>
>>>>> Hawin
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Flink Kafka cannot find org/I0Itec/zkclient/serialize/ZkSerializer

2015-07-21 Thread Hawin Jiang
Maybe you can post your pom.xml file so we can identify your issue.


Best regards
Hawin

On Tue, Jul 21, 2015 at 2:57 PM, Wendong  wrote:

> also tried using zkclient-0.3.jar in lib/, updated build.sbt and rebuilt.
> It doesn't help. Still got the same error of NoClassDefFoundError:
> ZkSerializer
> in flink.streaming.connectors.kafka.api.KafkaSource.open().
>
>
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Kafka-cannot-find-org-I0Itec-zkclient-serialize-ZkSerializer-tp2199p2220.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>
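An editorial note on this thread: a NoClassDefFoundError for org/I0Itec/zkclient/serialize/ZkSerializer usually means the zkclient jar is missing from the runtime classpath. If it is not pulled in transitively by the Kafka connector, declaring it explicitly may help; the Maven coordinates below are the commonly used ones for zkclient 0.3 and should be treated as an assumption to verify:

    <dependency>
        <groupId>com.101tec</groupId>
        <artifactId>zkclient</artifactId>
        <version>0.3</version>
    </dependency>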


RE: Flink Kafka cannot find org/I0Itec/zkclient/serialize/ZkSerializer

2015-07-22 Thread Hawin Jiang
Hi Wendong

Please make sure you have dependencies as below.
Good luck

<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>0.9.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients</artifactId>
        <version>0.9.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-core</artifactId>
        <version>0.9.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-connectors</artifactId>
        <version>0.9.0-milestone-1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.10</artifactId>
        <version>0.8.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>0.8.2.1</version>
    </dependency>
</dependencies>




Best regards
Hawin

-Original Message-
From: Wendong [mailto:wendong@gmail.com] 
Sent: Tuesday, July 21, 2015 5:16 PM
To: user@flink.apache.org
Subject: Re: Flink Kafka cannot find
org/I0Itec/zkclient/serialize/ZkSerializer

Hi Hawin,

I'm using sbt as shown in the original post. I tried using Maven and
pom.xml, but got a different NoClassDefFoundError: com/yammer/metrics/Metrics.
I've downloaded metrics-core-2.2.0.jar under lib/ but it doesn't help. It
seems the errors from sbt and Maven are of the same nature. Here is my
pom.xml (standard parts are omitted to make it more readable):



<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala</artifactId>
        <version>0.9.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala</artifactId>
        <version>0.9.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients</artifactId>
        <version>0.9.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka</artifactId>
        <version>0.9.0</version>
    </dependency>
    <dependency>
        <groupId>com.yammer.metrics</groupId>
        <artifactId>metrics-core</artifactId>
        <version>2.2.0</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <version>2.9</version>
            <executions>
                <execution>
                    <id>unpack</id>
                    <phase>prepare-package</phase>
                    <goals>
                        <goal>unpack</goal>
                    </goals>
                    <configuration>
                        <artifactItems>
                            <artifactItem>
                                <groupId>org.apache.flink</groupId>
                                <artifactId>flink-connector-kafka</artifactId>
                                <version>0.9.0</version>
                                <type>jar</type>
                                <overWrite>false</overWrite>
                                <outputDirectory>${project.build.directory}/classes</outputDirectory>
                                <includes>org/apache/flink/**</includes>
                            </artifactItem>
                            <artifactItem>
                                <groupId>org.apache.kafka</groupId>
                                <artifactId>kafka_2.10</artifactId>
                                <version>0.8.2.0</version>
                                <type>jar</type>
                                <overWrite>false</overWrite>
                                <outputDirectory>${project.build.directory}/classes</outputDirectory>
                                <includes>kafka/**</includes>
                            </artifactItem>
                        </artifactItems>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
                <source>1.6</source>
                <target>1.6</target>
            </configuration>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>

Re: FYI: Blog Post on Flink's Streaming Performance and Fault Tolerance

2015-08-05 Thread Hawin Jiang
Great job, Guys

Let me read it carefully.







On Wed, Aug 5, 2015 at 7:25 AM, Stephan Ewen  wrote:

> I forgot the link ;-)
>
>
> http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
>
> On Wed, Aug 5, 2015 at 4:11 PM, Stephan Ewen  wrote:
>
>> Hi all!
>>
>> We just published a blog post about how streaming fault tolerance
>> mechanisms evolved, and what kind of performance Flink gets with its
>> checkpointing mechanism.
>>
>> I think it is a pretty interesting read for people that are interested in
>> Flink or data streaming in general.
>>
>> The blog post talks about:
>>
>>   - Fault tolerance techniques, starting from acknowledgements, over
>> micro batches, to transactional updates and distributed snapshots.
>>
>>   - Performance of Flink, throughput, latency, and tradeoffs.
>>
>>   - A "chaos monkey" experiment where computation continues strongly
>> consistent even when periodically killing workers.
>>
>>
>> Comments welcome!
>>
>> Greetings,
>> Stephan
>>
>>
>>
>


Re: Udf Performance and Object Creation

2015-08-13 Thread Hawin Jiang
Thanks Timo

That is a good interview question



Best regards
Hawin

On Thu, Aug 13, 2015 at 1:11 AM, Michael Huelfenhaus <
m.huelfenh...@davengo.com> wrote:

> Hey Timo,
>
> yes that is what I needed to know.
>
> Thanks
> - Michael
>
> Am 12.08.2015 um 12:44 schrieb Timo Walther :
>
> > Hello Michael,
> >
> > every time you code a Java program you should avoid object creation if
> you want an efficient program, because every created object needs to be
> garbage collected later (which slows down your program performance).
> > You can have small Pojos, just try to avoid the call "new" in your
> functions:
> >
> > Instead of:
> >
> > class Mapper implements MapFunction {
> > public Pojo map(String s) {
> >Pojo p = new Pojo();
> >p.f = s;
> > }
> > }
> >
> > do:
> >
> > class Mapper implements MapFunction {
> > private Pojo p = new Pojo();
> > public Pojo map(String s) {
> >p.f = s;
> > }
> > }
> >
> > Then an object is only created once per Mapper and not per record.
> >
> > Hope this helps.
> >
> > Regards,
> > Timo
> >
> >
> >
> > On 12.08.2015 11:53, Michael Huelfenhaus wrote:
> >> Hello
> >>
> >> I have a question about the programming of user defined functions: is
> it still the case, as in the old Stratosphere days, that object creation
> should be avoided at all cost? Because in some of the examples there are
> now Tuples and other objects created before returning them.
> >>
> >> I am going to have a streaming plan with at least 6 steps and I am going
> to use Pojos. Performance-wise, is it a big improvement to define one big
> Pojo that can be used by all the steps, or is it better to have smaller
> ones that send less data but create more objects?
> >>
> >> Thanks
> >> Michael
> >
>
>
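To round off the thread, a small self-contained sketch of Timo's pattern (hypothetical Pojo and job; the point is only that the output object is allocated once per operator instance, not once per record):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class ObjectCreationExample {

        // A simple POJO: public no-arg constructor and public field, as Flink expects.
        // Serializable because it is also held as a field of the (serialized) function.
        public static class Pojo implements java.io.Serializable {
            public String f;
        }

        // The output Pojo is created once per Mapper instance and re-used for every record.
        public static class Mapper implements MapFunction<String, Pojo> {
            private final Pojo p = new Pojo();

            @Override
            public Pojo map(String s) {
                p.f = s;
                return p;
            }
        }

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<Pojo> result = env.fromElements("a", "b", "c").map(new Mapper());
            result.print();
        }
    }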