Re: [CODE4LIB] A suggested role for text mining in library catalogs?

2011-02-22 Thread Cindy Harper
It's not ironic - my post was a musing inspired by your work.  I guess I
wasn't sure if I understood your results. You were looking at the overall
POS usage in the entire texts as a possible way of ranking the texts. I was
wondering about POS of particular search terms - those that could take on
several POS. A related question - does SOLR use stemming to widen the search
to various POS?  Then would it be meaningful to rank the given texts by the
POS of the actual search terms?  And has anyone looked at samples of user
search terms - are they almost always noun phrases?  Just wanting to
understand what you have explored.  And I probably should have added to your
thread on NGC4LIB, rather than Code4lib - I tend to conflate them.

Cindy Harper, Systems Librarian
Colgate University Libraries
char...@colgate.edu
315-228-7363



On Sat, Feb 19, 2011 at 5:42 PM, Eric Lease Morgan emor...@nd.edu wrote:

 On Feb 19, 2011, at 11:26 AM, Cindy Harper wrote:

  I just was testing our discovery engine for any technical issues after a
  reboot. I was just using random single words, and one word I used was
  "correct".  Looking at the first ranked items, I wondered if there's some
  role for parts-of-speech in ranking hits - are nouns and, in this case,
  adjectives more indicative of aboutness than verbs?  The first items were
  Miss Manners ... excruciatingly correct behavior, then a bunch of govdocs
  on an act to correct.  I don't think there's any reason to prefer
  nouns over verbs, but I thought I'd throw the thought at you anyway.



 Ironically, I was playing with parts-of-speech (POS) analysis the other
 day. [1]

 Using a pseudo-random sample of texts, I found there to be surprisingly
 similar POS usage between texts. With such similarity, I thought it would be
 difficult to use general POS as a means for ranking or sorting. On the other
 hand, specific POS may be useful. For example, Thoreau was dominated by
 first-person pronouns, but Austen was dominated by third-person female
 pronouns.

 I think there is something to be explored here.

 [1] POS - http://bit.ly/hsxD2i

 --
 Eric Still Counting Tweets and Chats Morgan



Re: [CODE4LIB] Trial run of Virtual Lightning Talks

2011-02-22 Thread Peter Murray
A couple of clarifications.  This is just a trial run to see if the software 
works; a prepared talk isn't necessary or expected.  The time is also 2pm EST.

Room for a few more volunteers...


Peter

On Feb 21, 2011, at 12:10 PM, Peter Murray wrote:
 
 All,
 
 I'm looking for some volunteers to make a trial run at virtual lightning 
 talks.  This is an idea that came to me during Code4Lib earlier this month -- 
 use a webinar tool to replicate the environment of the conference lightning 
 talks.  The outline of the concept is at:
 
  http://wiki.code4lib.org/index.php/Virtual_Lightning_Talks
 
 LYRASIS has a subscription to a 100-seat instance of Centra Saba that we can 
 try.  It is Java-based with claimed support for sharing desktops under Mac, 
 Linux and Windows.  I'd like to test that support to see if it can be used.  
 So I'm looking for a half dozen volunteers to sign into a test room on 
 Wednesday at 2pm.
 
 Please let me know if you can help.  Read the presenter guidelines at the URL 
 above to make sure you have the minimum requirements and for links to install 
 the webinar client software.  The URL to the trial run space is 
 http://tinyurl.com/5vzd8st and it will be active on Wednesday at 2pm.
 
 Thanks,
 
 
 Peter


-- 
Peter Murray peter.mur...@lyrasis.org tel:+1-678-235-2955
 
Ass't Director, Technology Services Development   http://dltj.org/about/
Lyrasis   --Great Libraries. Strong Communities. Innovative Answers.
The Disruptive Library Technology Jester http://dltj.org/ 
Attrib-Noncomm-Share   http://creativecommons.org/licenses/by-nc-sa/2.5/ 


Re: [CODE4LIB] A suggested role for text mining in library catalogs?

2011-02-22 Thread Rob Casson
And I probably should have added to your thread on NGC4LIB, rather than 
Code4lib - I tend to conflate them.

i'm offended ;)


[CODE4LIB] Job Posting - Scholars' Lab, University of Virginia

2011-02-22 Thread Graham, Wayne (wsg4w)
http://www.scholarslab.org/announcements/web-applications-specialist/

The Scholars’ Lab at the University of Virginia seeks an enthusiastic web 
applications specialist with a background in programming and the humanities or 
cultural heritage.  As a Web Applications Specialist reporting to the Head of 
R&D for the Scholars’ Lab, you will be responsible for building, testing, and 
debugging code. You should possess an extreme attention to detail and a high 
level of accountability and responsibility. We’re looking for someone who 
enjoys technical challenges, likes to figure out how things work, and stays 
involved in the latest Web and digital humanities technologies. You will need 
to be able to fit in to a creative and collaborative environment.

Web Applications Specialist Responsibilities

 *   Build, test, and debug code
 *   Write test cases
 *   Estimate coding projects
 *   Provide consultation on collaborative projects
 *   Develop documentation
 *   Assist in the debugging and system troubleshooting for existing software 
written in a variety of languages and platforms

Qualifications

 *   1+ years full-time experience with web development (Rails and PHP 
preferred)
 *   2+ years' experience with standards-compliant HTML, CSS, and JavaScript
 *   JavaScript skills (AJAX, jQuery, or a similar JS framework)
 *   Experience with Test Driven Development (Shoulda, RSpec, PHPUnit)
 *   Experience with relational database management systems (MySQL, PostgreSQL)
 *   Familiarity with version control systems
 *   Understanding of software life cycle
 *   Strong foundation in OO programming and practices
 *   Experience with Omeka a plus

Salary is commensurate with experience, and expected to range between 
approximately $43,500 and $75,500 per annum. We’re looking to fill this 
position quickly, so please don’t delay!

Consideration of applications will begin immediately and continue until the 
position is filled.

Job posting: http://jobs.virginia.edu/applicants/Central?quickFind=63332


Re: [CODE4LIB] A suggested role for text mining in library catalogs?

2011-02-22 Thread Erik Hatcher
Solr _can_ use stemming, but to do it with POS would be flaky, I'd think.  Is 
"work" a verb or a noun?

Some of the (Solr-using) customers that I work with have done POS tagging 
(using tools like BasisTech Solr plugins for entity tagging).  Payloads can be 
assigned to terms during indexing and then used to weight the score when query 
terms match.  Lucene supports payloads and scoring based on them natively, but 
it requires some code to wire together.  Solr supports a little in terms of 
payloads, but to really use them effectively custom coding is needed.  See 
https://issues.apache.org/jira/browse/SOLR-1485 for example.
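
The payload idea described above can be illustrated with a toy sketch in plain Python. This is not Lucene's or Solr's API; the index structure, the function names, and the POS weights are all hypothetical, and the input is assumed to have been POS-tagged already by some upstream tool. Each occurrence of a term carries a weight assigned at index time, and that weight feeds the score when a query term matches.

```python
from collections import defaultdict

# Hypothetical per-POS weights; the values are illustrative, not tuned.
POS_WEIGHT = {"NOUN": 2.0, "ADJ": 1.5, "VERB": 1.0}


def build_index(docs):
    """Map each term to {doc_id: [payload, ...]}, one payload per occurrence.

    `docs` maps doc_id -> list of (term, pos) pairs, i.e. text that has
    already been POS-tagged elsewhere.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tagged_terms in docs.items():
        for term, pos in tagged_terms:
            # The payload is attached at index time, not at query time.
            index[term.lower()][doc_id].append(POS_WEIGHT.get(pos, 1.0))
    return index


def search(index, query_term):
    """Rank documents by the summed payloads of the matching occurrences."""
    postings = index.get(query_term.lower(), {})
    scores = {doc_id: sum(payloads) for doc_id, payloads in postings.items()}
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Two made-up records echoing the "correct" example from this thread.
docs = {
    "etiquette": [("correct", "ADJ"), ("behavior", "NOUN")],
    "govdoc": [("act", "NOUN"), ("correct", "VERB")],
}
index = build_index(docs)
ranked = search(index, "correct")
```

With these made-up weights the adjectival use of "correct" outranks the verbal one. Whether such a preference is desirable is exactly the open question in this thread.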

Erik



[CODE4LIB] Job Posting: Systems Engineer, Sheridan Libraries, Johns Hopkins University

2011-02-22 Thread Sean Hannan
We’re looking for a sysadmin at Hopkins.  Come work with me.  It’ll be cool, I 
promise.

-Sean
---

https://hrnt.jhu.edu/jhujobs/job_view.cfm?view_req_id=46964

The Systems Engineer will provide systems administration and, to a lesser 
extent, programming support for the Systems department’s multi-platform 
environment (primarily Linux, but also some Windows and Solaris). This position 
will support services provided by the Systems department, including, but not 
limited to, library catalog, search interface, federated search tools, library 
web sites, blogs, file and print shares, desktop applications and mobile 
interfaces. The Systems department shares server infrastructure with Digital 
Research and Curation Center (DRCC), and collaborates closely with DRCC systems 
administrator.

Primary Duties and Responsibilities:
* Installing, upgrading and patching operating systems; installing, upgrading 
and maintaining server hardware and peripheral devices (disk arrays, tape 
libraries).
* Working with other systems administrators and programmers to proactively and 
appropriately monitor hardware, operating systems, and applications in support 
of services provided by Systems department.
* Providing support to programmers in selecting, packaging, deploying and 
configuring applications across a diverse server environment.
* Managing system backup and recovery across all supported servers.
* Supporting a virtual machine infrastructure as well as stand-alone servers.
* Troubleshooting problems across several areas, including application, 
network, OS, hardware.
* Installing, configuring, maintaining and providing security for all 
Linux/Unix systems and peripheral devices.
* Installing and maintaining small to mid-range UPS equipment.
* Configuring and managing infrastructure services, which include DNS, DHCP, 
SMTP, SSH, FTP and SMB services and software; web servers; servlet containers; 
database software (MySQL, Postgres, MSSQL).
* Serving as the point of contact for software and hardware vendors and 
vendors' technical support staff.
* Participating in the analysis and planning of systems and services, including 
recommending server configurations and purchasing.
* Serving as the liaison to the University IT community on issues related to 
Unix/Linux and systems administration.
* Participating in the Systems Office 24x7 on-call plan – includes being 
available by cell phone and participating in the on-call pager rotation.
* Sharing responsibility for physical and server environment in data center.
* Programming support for optimizing system performance.
* Identifying areas for improvement in server and/or application management, 
and proposing/implementing solutions to improve processes.

Qualifications:
* Bachelor’s degree and five years related experience required. Additional 
education may substitute for required experience and additional related 
experience may substitute for required education, to the extent permitted by 
the JHU equivalency formula.
* The candidate will support a variety of applications and services running on 
Linux, Unix (Solaris), and Windows. Individual must work closely with other 
staff in the Library Systems department, DRCC, central IT department, and with 
external vendors and developers. Excellent oral and written communication and 
interpersonal skills are essential. Position may require lifting of materials 
less than 50 pounds occasionally.

Preferred Qualifications:
* Working experience with a virtual machine framework, such as XenServer; 
experience with Windows AD; experience with deploying software packages; 
experience with Tomcat, MySQL and PostgreSQL; programming experience in Unix 
shells, Ruby, Java, and Perl; and knowledge or experience with libraries are 
desirable.

The Sheridan Libraries encompass the Milton S. Eisenhower Library and its 
collections at the John Work Garrett Library, the George Peabody Library, the 
Albert D. Hutzler Reading Room, and the DC Centers. Its primary constituency is 
the students and faculty in the schools of Arts & Sciences, Engineering, Carey 
Business School and the School of Education. A key partner in the academic 
enterprise, the library is a leader in the innovative application of 
information technology and has implemented notable diversity and organizational 
development programs. The Sheridan Libraries are strongly committed to 
diversity. A strategic goal of the Libraries is to 'work toward achieving 
diversity when recruiting new and promoting existing staff.' The Libraries 
prize initiative, creativity, professionalism, and teamwork. For information on 
the Sheridan Libraries, visit www.library.jhu.edu .


[CODE4LIB] Job opening in Atlanta - U.S. Court of Appeals, 11th Circuit

2011-02-22 Thread Carol Bean
This is primarily a technology training position, within the Circuit
Library, but will also involve technology development. Yeah, you'd have to
work with me, but don't hold that against the job! ;-)

http://www.ca11.uscourts.gov/hr/listings/Information_Services_Specialist_2-2011.pdf

-- 
Carol Bean
beanwo...@gmail.com


Re: [CODE4LIB] A suggested role for text mining in library catalogs?

2011-02-22 Thread Jodi Schneider
On Tue, Feb 22, 2011 at 3:02 PM, Erik Hatcher erikhatc...@mac.com wrote:
 Solr _can_ use stemming, but to do it with POS would be flaky, I'd think.  Is 
 "work" a verb or a noun?

First you detect POS on tokens, *then* you stem. The other way around
wouldn't work.
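
A small sketch of why the order matters, under loud assumptions: the suffix table, the tag names, and both helper functions are hypothetical toys, not a real tagger or stemmer. It only shows that word-shape cues are lost once a token is stemmed; it does not resolve context-dependent cases like "work", where a real tagger would need the surrounding words.

```python
# Toy suffix rules: (suffix, POS guessed from that suffix). Hypothetical and
# far from complete; a real tagger would use context, not just word shape.
SUFFIX_POS = [("ly", "ADV"), ("ion", "NOUN"), ("ed", "VERB"), ("ing", "VERB")]


def guess_pos(token):
    """Guess a POS tag from the token's suffix, defaulting to UNKNOWN."""
    for suffix, pos in SUFFIX_POS:
        if token.endswith(suffix):
            return pos
    return "UNKNOWN"


def stem(token):
    """Naive stemmer: strip the first matching suffix from SUFFIX_POS."""
    for suffix, _ in SUFFIX_POS:
        # Require a few leading characters so short words are left alone.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


tokens = ["correctly", "correction", "corrected", "correcting"]

# Tag first, then stem: each stem keeps the POS seen on the full token.
tag_then_stem = [(stem(t), guess_pos(t)) for t in tokens]

# Stem first, then tag: the suffix cues are gone, so no tag can be guessed.
stem_then_tag = [(stem(t), guess_pos(stem(t))) for t in tokens]
```

In the first list all four tokens share the stem "correct" but keep distinct tags; in the second they collapse to the same stem with no usable tag.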

-Jodi

PS-I loved your "When Solr is your hammer..." post on randomly
choosing names, Erik!


Re: [CODE4LIB] A suggested role for text mining in library catalogs?

2011-02-22 Thread Eric Lease Morgan
On Feb 22, 2011, at 9:02 AM, Cindy Harper wrote:

 It's not ironic - my post was musing inspired by your work.  I guess I wasn't 
 sure if I understood your results. You were looking at the overall POS usage 
 in the entire texts as a possible way of ranking the texts. I was wondering 
 about POS of particular search terms - those that could take on several 
 POS


Initially I wanted to see if I could classify works based on their POS usage. 
[1] I was hoping to find lots of action verbs in one work and call it an action 
story. I was hoping to find lots of nouns in another story and call it... I 
don't know, something else. Instead, after rudimentary investigation, I 
discovered that all of the works I analyzed had roughly the same relative 
percentages of nouns, pronouns, verbs, adverbs, adjectives, etc. Maybe such a 
thing is indicative of the English language.

On the other hand, I did notice a difference in the use of particular pronouns 
between works. In Walden by Thoreau, a story about an individual living on the 
banks of a pond, there was a lot of use of the word "I", but in a different 
story, where the author and his brother canoe down a river, the word "we" 
predominated. Similarly, three Jane Austen stories have many words like "she" 
and "her", whereas those words are less frequent in the works by Thoreau. While 
my analysis was trivial and thin, I think we might be able to classify some 
works by gender or speaking voice. 
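
The pronoun comparison above could be sketched roughly as follows. The grouping of pronouns, the function names, and the sample sentence are assumptions made for illustration; the sentence is invented, not taken from any of the works discussed, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter

# Pronoun groups suggested by the discussion; the grouping is an assumption.
PRONOUN_GROUPS = {
    "first_singular": {"i", "me", "my"},
    "first_plural": {"we", "us", "our"},
    "third_feminine": {"she", "her", "hers"},
}


def pronoun_profile(text):
    """Return each group's share of all tokens in `text` (0.0 to 1.0)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {group: 0.0 for group in PRONOUN_GROUPS}
    return {
        group: sum(counts[word] for word in words) / total
        for group, words in PRONOUN_GROUPS.items()
    }


def dominant_group(text):
    """Name the pronoun group with the largest share of tokens."""
    profile = pronoun_profile(text)
    return max(profile, key=profile.get)


# Invented sample sentence, used only to exercise the functions.
sample = "She said her sister would visit, and she was glad."
profile = pronoun_profile(sample)
voice = dominant_group(sample)
```

Run over a full text, the resulting shares are numeric and so could be compared between works or plotted.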

Similar things may be possible with other parts-of-speech, like adjectives, 
specifically colors. For example, 214 of the 117,540 words in Walden (0.18%) are 
color words. [2] But only 13 of the 121,917 words in Pride and Prejudice (0.01%) 
are color words. [3] Despite the similar lengths of the works, Walden is 
roughly 17 times more colorful than Pride. Interesting? This only raises other 
questions. Is 0.18% a high value or a low value? Is the relative use of colors 
similar within a particular author's works or not? Has the use of color changed 
over time, or is it indicative of genre? Does the use of specific colors 
actually denote mood?
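
The color-word figures can be re-derived from the raw counts given above. The helper name is hypothetical; the counts are the ones quoted in this message. Note that dividing the rounded percentages gives a somewhat different ratio than dividing the unrounded rates.

```python
def rate(color_words, total_words):
    """Fraction of a text's words that are color words."""
    return color_words / total_words


# Counts as quoted in the message above.
walden = rate(214, 117_540)
pride = rate(13, 121_917)

# Percentages rounded to two decimals, as reported.
walden_pct = round(walden * 100, 2)  # 0.18
pride_pct = round(pride * 100, 2)    # 0.01

# Ratio from the unrounded rates: a little over 17.
ratio_exact = walden / pride

# Ratio from the rounded percentages: about 18.
ratio_rounded = walden_pct / pride_pct
```

The gap comes almost entirely from rounding the smaller rate down to 0.01%, so the unrounded ratio is the safer one to quote.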

In the past, libraries did not have a whole lot of full text with which to 
evaluate content. That is not true nowadays. It is now possible to literally 
count and measure a book's characteristics. Since this metadata is numeric in 
nature, it lends itself to visualization. (Think Karen C's presentation at 
Code4Lib.) And this whole thing is good fodder for search, discovery, and 
evaluation. Too much of our metadata is qualitative.


[1] forays into POS - http://bit.ly/aM2eZx
[2] color words in Walden - http://t.co/hlg5ibL
[3] color words in Pride - http://t.co/VflNf3n

-- 
Eric Lease Morgan