[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets

2017-05-15 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011085#comment-16011085
 ] 

Kenneth Knowles commented on BEAM-1439:
---

This was not selected as a GSoC project, but would still make a superb 
contribution to Beam's examples.

> Beam Example(s) exploring public document datasets
> --
>
> Key: BEAM-1439
> URL: https://issues.apache.org/jira/browse/BEAM-1439
> Project: Beam
> Issue Type: Wish
> Components: examples-java
> Reporter: Kenneth Knowles
> Assignee: Kenneth Knowles
> Priority: Minor
> Labels: gsoc2017, java, mentor, python
>
> In Beam, we have examples illustrating counting the occurrences of words and 
> performing a basic TF-IDF analysis on the works of Shakespeare (or whatever 
> you point it at). It would be even cooler to do these analyses, and more, on 
> a much larger data set that is really the subject of current investigations.
>
> In chatting with professors at the University of Washington, I've learned 
> that scholars of many fields would really like to explore new and highly 
> customized ways of processing the growing body of publicly available 
> scholarly documents, such as PubMed Central, with queries like "show me 
> documents where chemical compounds X and Y were both used in the 'method' 
> section".
>
> So I propose a Google Summer of Code project wherein a student writes some 
> large-scale Beam pipelines to perform analyses such as term frequency, bigram 
> frequency, etc.
>
> Skills required:
>  - Java or Python
>  - (nice to have) Working through the Beam getting started materials
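
As a rough illustration of the analyses named in the issue description, here 
is a minimal sketch of term frequency and bigram frequency in plain Python. It 
does not use the Beam SDK; the function names (tokenize, term_frequency, 
bigram_frequency) and the sample sentence are assumptions made for 
illustration only. In a Beam pipeline, similar per-document logic would 
presumably be expressed as transforms over a collection of documents.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequency(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def bigram_frequency(tokens):
    """Count how often each pair of adjacent tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    # Hypothetical sample input; a real run would read scholarly documents.
    tokens = tokenize("To be, or not to be")
    print(term_frequency(tokens).most_common(2))
    print(bigram_frequency(tokens).most_common(1))
```

Counting with Counter keeps the sketch short; at the scale of a corpus like 
PubMed Central the same counts would need to be computed in parallel, which is 
the part a Beam pipeline is meant to handle.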



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets

2017-03-27 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944466#comment-15944466
 ] 

Kenneth Knowles commented on BEAM-1439:
---

And also, please engage with the Beam community early - before applications are 
reviewed!

Here are some ideas for getting engaged:

# Work through Beam's "getting started" materials such as 
https://beam.apache.org/get-started/quickstart-java/
#* Especially get as familiar as you can with the runner that you are 
interested in
# Subscribe to d...@beam.apache.org and/or u...@beam.apache.org
# You are welcome to share your application on d...@beam.apache.org for early 
feedback and mentorship (this is quite normal for GSoC+Apache; even if you are 
not selected by GSoC, you will learn and make new acquaintances)
# Pick up starter bugs to get familiar with the codebase beyond our getting 
started material


[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets

2017-03-20 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933056#comment-15933056
 ] 

Kenneth Knowles commented on BEAM-1439:
---

Hi everyone,

The application period for students is now open. Please submit your very best!


[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets

2017-03-19 Thread SungJunyoung (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15932063#comment-15932063
 ] 

SungJunyoung commented on BEAM-1439:


The current Beam example counts the occurrences of each word in Shakespeare's 
works. This is, of course, a good illustration of how basic pipeline 
construction works in Beam. However, the data is static, so the example does 
not show Beam's ability to handle streaming data. What about examples with 
streaming sources such as Kafka or Spark? For example, you could save your 
computer's input log to Kafka, read it into a Beam pipeline, and then compute 
statistics on your input habits. What do you think about this?

Of course, ideas for large-scale pipelines will continue to be processed in 
parallel, like **Beam** :).


[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets

2017-03-16 Thread khalid bin huda (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928932#comment-15928932
 ] 

khalid bin huda commented on BEAM-1439:
---

Hi, I'm Khalid Bin Huda, a final-year undergraduate in the Department of 
Computer Science at the University of Karachi. I have programming experience 
with C, Java, and R, and I love working on projects related to data mining or 
machine learning. I would like to do this project for GSoC 2017 and contribute 
to it.


[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets

2017-03-07 Thread Milinda Kasun (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900652#comment-15900652
 ] 

Milinda Kasun commented on BEAM-1439:
-

Hi, I'm Milinda Kasun, a final-year undergraduate in the Department of 
Computer Science and Engineering at the University of Moratuwa, Sri Lanka. I 
have development experience with Java and Python. I would like to do this 
project for GSoC 2017. It would be greatly appreciated if you could help me get 
started.

Thank You



[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets

2017-03-02 Thread SungJunyoung (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891773#comment-15891773
 ] 

SungJunyoung commented on BEAM-1439:


Hello, I am a third-year computer engineering student at Kyunghee University 
in Korea. I came to know this project through the GSoC list. I am very 
interested in the Apache Beam project, and I have written a simple pipeline by 
following the documentation. Contributing to the project by creating examples 
and datasets that use advanced pipelines seems very interesting. If there is a 
document I could read or an email address I could contact, it would be a great 
help to me. Thank you!


[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets

2017-03-01 Thread MOHAMMAD AFFAN ZAFAR (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890149#comment-15890149
 ] 

MOHAMMAD AFFAN ZAFAR commented on BEAM-1439:


Hi, I am a final-year Computer Science student at IIT Kharagpur. I have worked 
in NLP and machine translation. I would like to contribute to this project.
