Re: [CODE4LIB] A suggested role for text mining in library catalogs?
It's not ironic - my post was musing inspired by your work. I guess I wasn't sure if I understood your results. You were looking at the overall POS usage in the entire texts as a possible way of ranking the texts. I was wondering about POS of particular search terms - those that could take on several POS. A related question - does SOLR use stemming to widen the search to various POS? Then would it be meaningful to rank the given texts by the POS of the actual search terms? And has anyone looked at samples of user search terms - are they almost always noun phrases?

Just wanting to understand what you have explored. And I probably should have added to your thread on NGC4LIB, rather than Code4lib - I tend to conflate them.

Cindy Harper, Systems Librarian
Colgate University Libraries
char...@colgate.edu
315-228-7363

On Sat, Feb 19, 2011 at 5:42 PM, Eric Lease Morgan emor...@nd.edu wrote:

On Feb 19, 2011, at 11:26 AM, Cindy Harper wrote:

I was just testing our discovery engine for any technical issues after a reboot. I was just using random single words, and one word I used was "correct". Looking at the first ranked items, I wondered if there's some role for parts-of-speech in ranking hits - are nouns and, in this case, adjectives more indicative of aboutness than verbs? The first items were Miss Manners ... Excruciatingly Correct Behavior, then a bunch of govdocs on an act to correct. I don't think there's any reason to prefer nouns over verbs, but I thought I'd throw the thought at you anyway.

Ironically, I was playing with parts-of-speech (POS) analysis the other day. [1] Using a pseudo-random sample of texts, I found there to be surprisingly similar POS usage between texts. With such similarity, I thought it would be difficult to use general POS as a means for ranking or sorting. On the other hand, specific POS may be useful. For example, Thoreau was dominated by first-person pronouns but Austen was dominated by third-person feminine pronouns. I think there is something to be explored here.

[1] POS - http://bit.ly/hsxD2i

-- 
Eric Still Counting Tweets and Chats Morgan
Re: [CODE4LIB] Trial run of Virtual Lightning Talks
A couple of clarifications. This is just a trial run to see if the software works; a prepared talk isn't necessary or expected. The time is also 2pm EST. Room for a few more volunteers...

Peter

On Feb 21, 2011, at 12:10 PM, Peter Murray wrote:

All,

I'm looking for some volunteers to make a trial run at virtual lightning talks. This is an idea that came to me during Code4Lib earlier this month -- use a webinar tool to replicate the environment of the conference lightning talks. The outline of the concept is at: http://wiki.code4lib.org/index.php/Virtual_Lightning_Talks

LYRASIS has a subscription to a 100-seat instance of Centra Saba that we can try. It is Java-based with claimed support for sharing desktops under Mac, Linux and Windows. I'd like to test that support to see if it can be used. So I'm looking for a half dozen volunteers to sign into a test room on Wednesday at 2pm. Please let me know if you can help.

Read the presenter guidelines at the URL above to make sure you have the minimum requirements and for links to install the webinar client software. The URL to the trial run space is http://tinyurl.com/5vzd8st and it will be active on Wednesday at 2pm.

Thanks,
Peter

-- 
Peter Murray peter.mur...@lyrasis.org tel:+1-678-235-2955
Ass't Director, Technology Services Development http://dltj.org/about/
Lyrasis -- Great Libraries. Strong Communities. Innovative Answers.
The Disruptive Library Technology Jester http://dltj.org/
Attrib-Noncomm-Share http://creativecommons.org/licenses/by-nc-sa/2.5/
Re: [CODE4LIB] A suggested role for text mining in library catalogs?
And I probably should have added to your thread on NGC4LIB, rather than Code4lib - I tend to conflate them.

i'm offended ;)
[CODE4LIB] Job Posting - Scholars' Lab, University of Virginia
http://www.scholarslab.org/announcements/web-applications-specialist/

The Scholars’ Lab at the University of Virginia seeks an enthusiastic web applications specialist with a background in programming and the humanities or cultural heritage.

As a Web Applications Specialist reporting to the Head of R&D for the Scholars’ Lab, you will be responsible for building, testing, and debugging code. You should possess extreme attention to detail and a high level of accountability and responsibility. We’re looking for someone who enjoys technical challenges, likes to figure out how things work, and stays involved in the latest Web and digital humanities technologies. You will need to be able to fit into a creative and collaborative environment.

Web Applications Specialist Responsibilities

* Build, test, and debug code
* Write test cases
* Estimate coding projects
* Provide consultation on collaborative projects
* Develop documentation
* Assist in the debugging and system troubleshooting for existing software written in a variety of languages and platforms

Qualifications

* 1+ years full-time experience with web development (Rails and PHP preferred)
* 2+ years experience with standards-compliant HTML, CSS, and JavaScript
* JavaScript skills (AJAX, jQuery or similar JS framework)
* Experience with Test Driven Development (Shoulda, RSpec, PHPUnit)
* Experience with relational database management systems (MySQL, PostgreSQL)
* Familiarity with version control systems
* Understanding of software life cycle
* Strong foundation in OO programming and practices
* Experience with Omeka a plus

Salary is commensurate with experience, and expected to range between approximately $43,500 and $75,500 per annum.

We’re looking to fill this position quickly, so please don’t delay! Consideration of applications will begin immediately and continue until the position is filled.

Job posting: http://jobs.virginia.edu/applicants/Central?quickFind=63332
Re: [CODE4LIB] A suggested role for text mining in library catalogs?
Solr _can_ use stemming, but to do it with POS would be flaky, I'd think. Is "work" a verb or a noun?

Some of the (Solr-using) customers that I work with have done POS tagging (using tools like BasisTech Solr plugins for entity tagging). Payloads can be assigned to terms during indexing and then used to weight the score when query terms match. Lucene natively supports payloads and scoring based on them, but it requires some code to wire them together. Solr has limited support for payloads; using them effectively requires custom coding. See https://issues.apache.org/jira/browse/SOLR-1485 for an example.

Erik

On Feb 22, 2011, at 09:02, Cindy Harper wrote:

A related question - does SOLR use stemming to widen the search to various POS? Then would it be meaningful to rank the given texts by the POS of the actual search terms?
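The payload idea above can be illustrated outside of Lucene. The following is only a toy sketch in plain Python, not the Lucene or Solr payload API: the `POS_BOOST` values, the `build_index` and `search` helpers, and the two hand-tagged documents are all assumptions made up for illustration. It stores a POS-derived weight with each indexed term and uses that weight when a query term matches.

```python
from collections import defaultdict

# Hypothetical boosts per part of speech; the values are invented for illustration.
POS_BOOST = {"NOUN": 2.0, "ADJ": 1.5, "VERB": 1.0}


def build_index(docs):
    """Map each term to {doc_id: summed payload weight}.

    `docs` maps a doc id to a list of (term, pos) pairs, i.e. text that
    has already been POS-tagged by some external tool.
    """
    index = defaultdict(lambda: defaultdict(float))
    for doc_id, tagged_terms in docs.items():
        for term, pos in tagged_terms:
            # Unknown tags fall back to a neutral weight of 1.0.
            index[term.lower()][doc_id] += POS_BOOST.get(pos, 1.0)
    return index


def search(index, query):
    """Rank doc ids by the summed payload weight of matching query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    # Highest score first; ties broken by doc id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hand-tagged stand-ins for the two kinds of hits mentioned in the thread.
docs = {
    "manners": [("excruciatingly", "ADV"), ("correct", "ADJ"), ("behavior", "NOUN")],
    "govdoc": [("act", "NOUN"), ("to", "PART"), ("correct", "VERB")],
}
index = build_index(docs)
results = search(index, "correct")
```

With these made-up boosts, the document where "correct" is an adjective outranks the one where it is a verb. Whether such a preference actually improves relevance is exactly the open question in this thread.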
[CODE4LIB] Job Posting: Systems Engineer, Sheridan Libraries, Johns Hopkins University
We’re looking for a sysadmin at Hopkins. Come work with me. It’ll be cool, I promise.

-Sean

---

https://hrnt.jhu.edu/jhujobs/job_view.cfm?view_req_id=46964

The Systems Engineer will provide systems administration and, to a lesser extent, programming support for the Systems department’s multi-platform - primarily Linux, but also some Windows and Solaris - environment. This position will support services provided by the Systems department, including, but not limited to, library catalog, search interface, federated search tools, library web sites, blogs, file and print shares, desktop applications and mobile interfaces. The Systems department shares server infrastructure with the Digital Research and Curation Center (DRCC), and collaborates closely with the DRCC systems administrator.

Primary Duties and Responsibilities:

* Installing, upgrading and patching operating systems; installing, upgrading and maintaining server hardware and peripheral devices (disk arrays, tape libraries).
* Working with other systems administrators and programmers to proactively and appropriately monitor hardware, operating systems, and applications in support of services provided by the Systems department.
* Providing support to programmers in selecting, packaging, deploying and configuring applications across a diverse server environment.
* Managing system backup and recovery across all supported servers.
* Supporting a virtual machine infrastructure as well as stand-alone servers.
* Troubleshooting problems across several areas, including application, network, OS, hardware.
* Installing, configuring, maintaining and providing security for all Linux/Unix systems and peripheral devices.
* Installing and maintaining small to mid-range UPS equipment.
* Configuring and managing infrastructure services, which include DNS, DHCP, SMTP, SSH, FTP and SMB services and software; web servers; servlet containers; database software (MySQL, Postgres, MSSQL).
* Serving as the point of contact for software and hardware vendors and vendors' technical support staff.
* Participating in the analysis and planning of systems and services, including recommending server configurations and purchasing.
* Serving as the liaison to the University IT community on issues related to Unix/Linux and systems administration.
* Participating in the Systems Office 24x7 on-call plan - includes being available by cell phone and participating in the on-call pager rotation.
* Sharing responsibility for physical and server environment in data center.
* Programming support for optimizing system performance.
* Identifying areas for improvement in server and/or application management, and proposing/implementing solutions to improve processes.

Qualifications:

* Bachelor’s degree and five years related experience required. Additional education may substitute for required experience and additional related experience may substitute for required education, to the extent permitted by the JHU equivalency formula.
* The candidate will support a variety of applications and services running on Linux, Unix (Solaris), and Windows. Individual must work closely with other staff in the Library Systems department, DRCC, central IT department, and with external vendors and developers. Excellent oral and written communication and interpersonal skills are essential. Position may require lifting of materials less than 50 pounds occasionally.

Preferred Qualifications:

* Working experience with a virtual machine framework, such as XenServer; experience with Windows AD; experience with deploying software packages; experience with Tomcat, MySQL and PostgreSQL; programming experience in Unix shells, Ruby, Java, and Perl; and knowledge or experience with libraries are desirable.

The Sheridan Libraries encompass the Milton S. Eisenhower Library and its collections at the John Work Garrett Library, the George Peabody Library, the Albert D.
Hutzler Reading Room, and the DC Centers. Its primary constituency is the students and faculty in the schools of Arts & Sciences, Engineering, Carey Business School and the School of Education. A key partner in the academic enterprise, the library is a leader in the innovative application of information technology and has implemented notable diversity and organizational development programs.

The Sheridan Libraries are strongly committed to diversity. A strategic goal of the Libraries is to 'work toward achieving diversity when recruiting new and promoting existing staff.' The Libraries prize initiative, creativity, professionalism, and teamwork.

For information on the Sheridan Libraries, visit www.library.jhu.edu.
[CODE4LIB] Job opening in Atlanta - U.S. Court of Appeals, 11th Circuit
This is primarily a technology training position, within the Circuit Library, but will also involve technology development. Yeah, you'd have to work with me, but don't hold that against the job! ;-) http://www.ca11.uscourts.gov/hr/listings/Information_Services_Specialist_2-2011.pdf -- Carol Bean beanwo...@gmail.com
Re: [CODE4LIB] A suggested role for text mining in library catalogs?
On Tue, Feb 22, 2011 at 3:02 PM, Erik Hatcher erikhatc...@mac.com wrote:

Solr _can_ use stemming, but to do it with POS would be flakey I'd think. Is work a verb or noun?

First you detect POS on tokens, *then* you stem. The other way around wouldn't work.

-Jodi

PS - I loved your "When Solr is your hammer..." post on randomly choosing names, Erik!
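The ordering point can be seen with a deliberately crude example. This is a toy sketch, not a real tagger or stemmer: `toy_tag` guesses POS from word endings and the preceding token, and `toy_stem` just strips a few suffixes. Both helpers, their word lists, and the sample sentence are assumptions made up for illustration. Stemming first throws away the endings that the tagger relies on.

```python
PRONOUNS = {"i", "we", "she", "he", "they"}
DETERMINERS = {"a", "an", "the"}


def toy_tag(tokens):
    """Guess a POS tag per token from its suffix and the previous token (toy rules)."""
    tags = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else ""
        if tok in PRONOUNS:
            tags.append("PRON")
        elif tok in DETERMINERS:
            tags.append("DET")
        elif tok.endswith("ly"):
            tags.append("ADV")
        elif tok.endswith(("ed", "ing")):
            tags.append("VERB")
        elif prev in DETERMINERS:
            tags.append("NOUN")
        else:
            tags.append("OTHER")
    return tags


def toy_stem(token):
    """Strip one common suffix, keeping at least three characters (toy stemmer)."""
    for suffix in ("ing", "ly", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tag_then_stem(tokens):
    """Tag the surface forms first, then reduce each token to its stem."""
    tags = toy_tag(tokens)
    return [(toy_stem(tok), tag) for tok, tag in zip(tokens, tags)]


def stem_then_tag(tokens):
    """Stem first, then try to tag the stems (the order that loses information)."""
    stems = [toy_stem(tok) for tok in tokens]
    return list(zip(stems, toy_tag(stems)))


tokens = "she correctly corrected the works".split()
good = tag_then_stem(tokens)
bad = stem_then_tag(tokens)
```

In `good`, the two occurrences of the stem "correct" keep distinct tags (ADV and VERB). In `bad`, both have already lost their endings by the time the tagger sees them, so they fall through to OTHER.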
Re: [CODE4LIB] A suggested role for text mining in library catalogs?
On Feb 22, 2011, at 9:02 AM, Cindy Harper wrote:

It's not ironic - my post was musing inspired by your work. I guess I wasn't sure if I understood your results. You were looking at the overall POS usage in the entire texts as a possible way of ranking the texts. I was wondering about POS of particular search terms - those that could take on several POS.

Initially I wanted to see if I could classify works based on their POS usage. [1] I was hoping to find lots of action verbs in one work and call it an action story. I was hoping to find lots of nouns in another story and call it... I don't know, something else. Instead, after rudimentary investigation, I discovered that all of the works I analyzed had the same relative percentage of nouns, pronouns, verbs, adverbs, adjectives, etc. Maybe such a thing is characteristic of the English language.

On the other hand, I did notice a difference in the use of particular pronouns between works. In Walden by Thoreau, a story about an individual living on the banks of a pond, there was a lot of use of the word "I", but in a different story, where the author and his brother canoe down a river, the word "we" predominated. Similarly, three Jane Austen stories have many words like "she" and "her", whereas those words are less frequent in the works by Thoreau. While my analysis was trivial and thin, I think we might be able to classify some works by gender or speaking voice.

Similar things may be possible with other parts-of-speech, like adjectives, specifically colors. For example, 214 of the 117,540 words in Walden (0.18%) are color words. [2] But only 13 of the 121,917 words in Pride and Prejudice (0.01%) are color words. [3] Despite the similar lengths of the works, Walden is roughly 17 times more colorful than Pride. Interesting?

This only raises other questions. Is 0.18% a high value or a low value? Is the relative use of colors similar within a particular author or not? Has the use of color changed over time, or is it indicative of genre?
Does the use of specific colors actually denote mood?

In the past, libraries did not have a whole lot of full text with which to evaluate content. That is no longer true. It is now possible to literally count and measure a book's characteristics. Since this metadata is numeric in nature, it lends itself to visualization. (Think Karen C's presentation at Code4Lib.) And this whole thing is good fodder for search, discovery, and evaluation. Too much of our metadata is qualitative.

[1] forays into POS - http://bit.ly/aM2eZx
[2] color words in Walden - http://t.co/hlg5ibL
[3] color words in Pride - http://t.co/VflNf3n

-- 
Eric Lease Morgan
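The kind of counting described above is easy to reproduce in outline. This is a minimal sketch, not the script behind the linked results: the `COLOR_WORDS` list, the `color_share` helper, and the sample sentence are assumptions for illustration, since the actual word list and tokenization are not given in the message. The last few lines recompute the shares from the word counts reported above.

```python
import re

# Illustrative list only; the list behind the reported figures is not given.
COLOR_WORDS = {
    "red", "orange", "yellow", "green", "blue", "purple",
    "brown", "black", "white", "gray", "grey",
}


def color_share(text, color_words=COLOR_WORDS):
    """Return (color word count, total word count, share as a fraction)."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for word in words if word in color_words)
    share = hits / len(words) if words else 0.0
    return hits, len(words), share


sample = "The blue pond lay under a white sky, and the green woods were still."
hits, total, share = color_share(sample)

# Counts reported in the message above.
walden_share = 214 / 117_540
pride_share = 13 / 121_917
ratio = walden_share / pride_share

print(f"sample: {hits} of {total} words are color words ({share:.1%})")
print(f"Walden {walden_share:.2%}, Pride {pride_share:.2%}, ratio {ratio:.1f}")
```

Computed from the raw counts rather than the rounded percentages, the ratio comes to roughly 17. Because a share like this is just a number per work, it could be stored alongside other metadata and compared across authors, periods, or genres.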