[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets
[ https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011085#comment-16011085 ]

Kenneth Knowles commented on BEAM-1439:
---

This was not selected as a GSOC project, but would still make a superb contribution to Beam's examples.

> Beam Example(s) exploring public document datasets
> --
>
> Key: BEAM-1439
> URL: https://issues.apache.org/jira/browse/BEAM-1439
> Project: Beam
> Issue Type: Wish
> Components: examples-java
> Reporter: Kenneth Knowles
> Assignee: Kenneth Knowles
> Priority: Minor
> Labels: gsoc2017, java, mentor, python
>
> In Beam, we have examples illustrating counting the occurrences of words and
> performing a basic TF-IDF analysis on the works of Shakespeare (or whatever
> you point it at). It would be even cooler to do these analyses, and more, on
> a much larger data set that is really the subject of current investigations.
>
> In chatting with professors at the University of Washington, I've learned
> that scholars of many fields would really like to explore new and highly
> customized ways of processing the growing body of publicly-available
> scholarly documents, such as PubMed Central. Queries like "show me documents
> where chemical compounds X and Y were both used in the 'method' section"
>
> So I propose a Google Summer of Code project wherein a student writes some
> large-scale Beam pipelines to perform analyses such as term frequency, bigram
> frequency, etc.
>
> Skills required:
> - Java or Python
> - (nice to have) Working through the Beam getting started materials

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
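For readers new to the analyses named in the issue description, here is a minimal sketch of term frequency and bigram frequency in plain Python. It is not Beam code and not taken from the Beam examples; the function names (`tokenize`, `term_frequency`, `bigram_frequency`) and the tokenization rule are assumptions made for illustration. In a Beam pipeline, the tokenization step would roughly correspond to an element-wise transform and the counting step to an aggregation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (hypothetical rule:
    runs of ASCII letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequency(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


def bigram_frequency(documents):
    """Count adjacent token pairs; pairs never span two documents."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        # zip pairs each token with its successor within the same document.
        counts.update(zip(tokens, tokens[1:]))
    return counts


if __name__ == "__main__":
    docs = ["to be or not to be", "to be is to do"]
    print(term_frequency(docs).most_common(3))
    print(bigram_frequency(docs).most_common(3))
```

On a real corpus such as PubMed Central, the per-document loop is the part a distributed pipeline would parallelize; the counting logic itself stays the same.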
[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets
[ https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944466#comment-15944466 ]

Kenneth Knowles commented on BEAM-1439:
---

And also, please engage with the Beam community early - before applications are reviewed! Here are some ideas for getting engaged:

1. Work through Beam's "getting started" materials such as https://beam.apache.org/get-started/quickstart-java/
   * Especially get as familiar as you can with the runner that you are interested in
2. Subscribe to d...@beam.apache.org and/or u...@beam.apache.org
3. You are welcome to share your applications for early commentary on d...@beam.apache.org to get early feedback and mentorship (this is quite normal for GSoC+Apache; even if you don't get selected by GSoC you will learn and make new acquaintances)
4. Pick up starter bugs to get familiar with the codebase beyond our getting started material
[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets
[ https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933056#comment-15933056 ]

Kenneth Knowles commented on BEAM-1439:
---

Hi everyone,

The application period for students is now open. Please submit your very best!
[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets
[ https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15932063#comment-15932063 ]

SungJunyoung commented on BEAM-1439:
---

The current Beam example counts word occurrences in Shakespeare's works. This, of course, is a good illustration of how Beam's basic pipeline construction works. However, this data is static, and the example does not show Beam's ability to handle streaming data.

What about examples whose sources carry streaming data, like Kafka or Spark? For example, you could save your computer's input log to Kafka, read it into Beam, and then compute statistics on your input habits.

What do you think about this? Of course, ideas for large-scale pipelines can continue to be processed in parallel, like **Beam** :).
[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets
[ https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928932#comment-15928932 ]

khalid bin huda commented on BEAM-1439:
---

Hi, I'm Khalid Bin Huda, a final-year undergraduate in the Department of Computer Science (University of Karachi). I have programming experience with C, Java, and R, and I love working on projects related to data mining or machine learning. I would like to do this project for GSoC 2017 and contribute to it.
[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets
[ https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900652#comment-15900652 ]

Milinda Kasun commented on BEAM-1439:
---

Hi, I'm Milinda Kasun, a final-year undergraduate in the Department of Computer Science and Engineering (University of Moratuwa, Sri Lanka). I have development experience with Java and Python. I would like to do this project for GSoC 2017. It would be greatly appreciated if you could help me get started.

Thank you.
[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets
[ https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891773#comment-15891773 ]

SungJunyoung commented on BEAM-1439:
---

Hello, I am a third-year student in computer engineering at Kyunghee University in Korea. I came to know this project through the GSoC list. I am very interested in the Apache Beam project, and I have written a simple pipeline by following the documentation. Contributing to the project by creating examples and datasets that use advanced pipelines seems very interesting. If there is a document or an email address I can use to get in touch, it would be a great help to me.

Thank you!
[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets
[ https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890149#comment-15890149 ]

MOHAMMAD AFFAN ZAFAR commented on BEAM-1439:
---

Hi, I am a final-year Computer Science student at IIT Kharagpur. I have worked in NLP and machine translation. I would like to contribute to this project.