Re: [CODE4LIB] Language codes
It is better to refer to BCP-47 instead: https://tools.ietf.org/html/bcp47 An RFC can be updated; when it is, it receives a new number. For language tagging, the relevant information is split across two RFCs. BCP-47 is a permanent IETF identifier referencing the latest versions of the two RFCs relating to language tagging. Andrew On 2 Jun 2016 9:24 am, "Stuart A. Yeates" wrote: > > I recommend reading https://tools.ietf.org/html/rfc5646 which seems to do > what you need. > > cheers > stuart > > -- > ...let us be heard from red core to black sky > > On Thu, Jun 2, 2016 at 10:59 AM, Greg Lindahl wrote: > > > Some of the Internet Archive's library partners are asking us about > > language metadata for regional languages that don't have standard > > codes. Is there a standard way of dealing with this situation? > > > > Overall we use MARC codes https://www.loc.gov/marc/languages/ which > > were last updated in 2007. LOC also maintains ISO639-2 > > https://www.loc.gov/standards/iso639-2/php/code_list.php last updated > > in 2014. > > > > The languages in question are regional languages which are currently > > lumped together in both standards. With the recent rise in interest > > and funding for regional languages, it's no surprise that some > > catalogers want to split these languages out into separate codes. > > > > Thanks! > > > > -- greg > >
Re: [CODE4LIB] Language codes
On 2 Jun 2016 9:40 am, "Andrew Cunningham" <lang.supp...@gmail.com> wrote: > > > Ultimately it is what a library is working on, if you are cataloguing then all you have is ISO-639-3/B > Oops, meant to input ISO-639-2/B Andrew
Re: [CODE4LIB] Language codes
Outside the library sector, the most common approach to language tagging and matching isn't ISO-639-2 or ISO-639-3, but rather BCP-47. Quite a number of ISO-639-2 language tags represent what ISO-639-3 refers to as macrolanguages. For instance, 'kar' in ISO-639-2 resolves to 20 language codes in ISO-639-3. But ISO-639-3 by itself isn't sufficient to fully identify a written language. E.g. you could have sr-Cyrl for Serbian in the Cyrillic script, sr-Latn to represent Serbian written in the Latin orthography, and sr-Latn-alalc97 for romanised Cyrillic Serbian based on the ALA-LC Cyrillic romanisation table published in 1997. It's worth noting the only ALA-LC romanisation tables that can be specified in BCP-47 are the 1997 editions. Ultimately it depends on what a library is working on; if you are cataloguing then all you have is ISO-639-3/B. If you are working on a digitisation or linked data project it is much better to correctly use BCP-47, which would align your resources more accurately with the rest of the broader information ecosystem in which your resources would exist. Andrew On 2 Jun 2016 9:15 am, "Craig Franklin" wrote: > We've never had any problems sticking to ISO639-2 codes (in cases where there > isn't a shorter ISO639-1 code available). I'm interested in what sort of > regional languages you might be dealing with where there are significant > gaps in that standard? > > You might also look at ISO 639-3, which is quite comprehensive but also > introduces a fair chunk of complexity: > > http://www-01.sil.org/iso639-3/download.asp > > Cheers, > Craig Franklin > > On 2 June 2016 at 08:59, Greg Lindahl wrote: > > > Some of the Internet Archive's library partners are asking us about > > language metadata for regional languages that don't have standard > > codes. Is there a standard way of dealing with this situation? > > > > Overall we use MARC codes https://www.loc.gov/marc/languages/ which > > were last updated in 2007. LOC also maintains ISO639-2 > > https://www.loc.gov/standards/iso639-2/php/code_list.php last updated > > in 2014. > > > > The languages in question are regional languages which are currently > > lumped together in both standards. With the recent rise in interest > > and funding for regional languages, it's no surprise that some > > catalogers want to split these languages out into separate codes. > > > > Thanks! > > > > -- greg > > >
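To make the subtag structure concrete, here is a minimal Python sketch of how those example tags are assembled from primary language, script and variant subtags. It is purely illustrative: it does no validation against the IANA Language Subtag Registry, which a real implementation would need.

def bcp47_tag(language, script=None, region=None, variants=()):
    # Assemble a BCP 47 tag from its subtags; illustrative only, no registry validation.
    parts = [language.lower()]
    if script:
        parts.append(script.title())    # script subtags are conventionally title-case, e.g. Cyrl, Latn
    if region:
        parts.append(region.upper())    # region subtags are conventionally upper-case, e.g. RS
    parts.extend(v.lower() for v in variants)
    return '-'.join(parts)

print(bcp47_tag('sr', script='Cyrl'))                          # sr-Cyrl
print(bcp47_tag('sr', script='Latn', variants=['alalc97']))    # sr-Latn-alalc97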
[CODE4LIB] Fwd: [camms-ccaam] Common encoding errors
On behalf of Charles Riley: -- Forwarded message -- From: Riley, Charles <charles.ri...@yale.edu> Date: 23 February 2016 at 05:37 Subject: [camms-ccaam] Common encoding errors To: "voyage...@listserv.nd.edu" <voyage...@listserv.nd.edu>, " lit...@lists.ala.org" <lit...@lists.ala.org>, "camms-cc...@lists.ala.org" < camms-cc...@lists.ala.org>, "ol-tech-boun...@archive.org" < ol-tech-boun...@archive.org>, "ole.technical.usergr...@kuali.org" < ole.technical.usergr...@kuali.org>, "auto...@listserv.syr.edu" < auto...@listserv.syr.edu> Hi all, This is something I've noticed happening with somewhat regular, and probably increasing, occurrence lately: a class of problems with records containing either escaped entity references from HTML or XML (like '’'), or accented characters that have become corrupted in a data migration (like 'français <https://openlibrary.org/works/OL10004281W/Les_archets_français>'). I was asked by another librarian if I could point them to any resources that deal with this class of issues, and rounded up a few that I thought would be good to share. Here's what I came across, in terms of examples and explanations for some of the more common cases: http://markmcb.com/2011/11/07/replacing-ae%E2%80%9C-ae%E2%84%A2-aeoe-etc-with-utf-8-characters-in-ruby-on-rails/ https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references (But treat this list with caution in using it to search; there will be false positives for a search for 'amp;', for example.) http://www.i18nqa.com/debug/utf8-debug.html (See also associated links on this page.) Hope this helps! Charles Riley Charles Riley Interim Librarian for African Studies and Catalog Librarian Sterling Memorial Library Yale University charles.ri...@yale.edu (203)432-7566 or (203)432-9301 -- Andrew Cunningham lang.supp...@gmail.com
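For batch cleanup of records like these, a small Python sketch can illustrate the two classes of problem Charles describes. The double html.unescape call and the latin-1/UTF-8 round trip are standard library; the ftfy library is an assumption (it is not mentioned in the thread) but is designed for exactly this kind of mojibake repair.

import html
import ftfy   # assumption: the ftfy package is installed; not part of the original thread

# Doubly escaped entity references: unescape twice to get back to the literal character.
print(html.unescape(html.unescape('Les archets fran&amp;ccedil;ais')))   # -> Les archets français

# Classic migration damage: UTF-8 bytes that were decoded as Latin-1 somewhere along the way.
print('franÃ§ais'.encode('latin-1').decode('utf-8'))   # -> français
print(ftfy.fix_text('franÃ§ais'))                      # same repair, handled more robustly by ftfy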
Re: [CODE4LIB] Best way to handle non-US keyboard chars in URLs?
Hi, On Monday, 22 February 2016, Chris Moschini <ch...@brass9.com> wrote: > On Feb 20, 2016 9:33 PM, "Stuart A. Yeates" <syea...@gmail.com> wrote: >> >> 1) With Unicode 8, sign writing and ASL, the American / international >> dichotomy is largely specious. Before that there were American indigenous >> languages (Cheyenne etc.), but in my experience Americans don't usually >> think of them as American. > > It's not about the label, so don't get too hung up on that. It's about > what's easy to type on a typical US keyboard. > If you are accessing a non-English resource, then having characters outside the Basic Latin block would seem to be perfectly acceptable to me. There are two types of users involved: those who can read the target language and those who can't. Those who can should be able to work with keyboards other than a US English layout. On most devices this is fairly trivial. Not to mention the user may not actually have the US English keyboard layout as their default input system. On a multilingual site I prefer the access points to be in the language of the resource. Obviously there are cases where people who cannot read the language need to access a resource. In those cases I would look at APIs that expose the resource in a different way, maybe through a transliteration mapping, rather than having a second URL. Ultimately it comes down to who the users are and why they are accessing the resource. It seems to me your primary concern is for users who cannot read the resource in any event. Andrew -- Andrew Cunningham lang.supp...@gmail.com
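As a rough illustration of the two options above, the sketch below shows a percent-encoded URL for the canonical, in-language access point alongside a crude transliterated alias. urllib.parse is standard library; unidecode is an assumption (a generic ASCII transliteration, not a language-specific romanisation table), and the title is a hypothetical access point.

import urllib.parse
from unidecode import unidecode   # assumption: the unidecode package is installed

title = 'Māori whakataukī'                 # hypothetical in-language access point
print(urllib.parse.quote(title))           # percent-encoded form for the canonical URL
print(unidecode(title))                    # 'Maori whakatauki', a possible transliterated alias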
[CODE4LIB]
Thanks, I will look into them. On 9 February 2016 at 03:56, Han, Yan - (yhan) <y...@email.arizona.edu> wrote: > Yes. Use iText or PDFBox > > These are common PDF libraries. > > > > > > On 2/6/16, 2:24 PM, "Code for Libraries on behalf of Andrew Cunningham" < > CODE4LIB@LISTSERV.ND.EDU on behalf of lang.supp...@gmail.com> wrote: > > >Hi all, > > > >I am working with PDF files in some South Asian and South East Asian > >languages. Each PDF has ActualText added for each tag in the PDF. Each PDF > >has ActualText as an alternative for the visible text layer in the PDF. > > > >Is anyone aware of tools that will allow me to index and search PDFs based > >on the ActualText content rather than the visible text layers in the PDF? > > > >Andrew > > > >-- > >Andrew Cunningham > >lang.supp...@gmail.com > -- Andrew Cunningham lang.supp...@gmail.com
[CODE4LIB]
Thanks Levy, I will look at PDFBox and see what I can leverage from it. Andrew On 9 February 2016 at 04:33, Levy, Michael <ml...@ushmm.org> wrote: > There is a method named getActualText() in PDFBox, but there are some listserv > postings (circa 2012) that indicate that the command-line PDFBox did not > support extraction of the ActualText contents at that time. That may have > changed. I'd like to know more. > > Thank you Andrew for sending me scurrying to learn about ActualText. I > don't think we have any in any of the PDFs that I'm indexing, but I > wouldn't have known it existed without your posting. > > > On Mon, Feb 8, 2016 at 11:56 AM, Han, Yan - (yhan) <y...@email.arizona.edu > > > wrote: > > > Yes. Use iText or PDFBox > > > > These are common PDF libraries. > > > > > > > > > > > > On 2/6/16, 2:24 PM, "Code for Libraries on behalf of Andrew Cunningham" < > > CODE4LIB@LISTSERV.ND.EDU on behalf of lang.supp...@gmail.com> wrote: > > > > >Hi all, > > > > > >I am working with PDF files in some South Asian and South East Asian > > >languages. Each PDF has ActualText added for each tag in the PDF. Each PDF > > >has ActualText as an alternative for the visible text layer in the PDF. > > > > > >Is anyone aware of tools that will allow me to index and search PDFs > based > > >on the ActualText content rather than the visible text layers in the PDF? > > > > > >Andrew > > > > > >-- > > >Andrew Cunningham > > >lang.supp...@gmail.com > > > -- Andrew Cunningham lang.supp...@gmail.com
[CODE4LIB]
Hi all, I am working with PDF files in some South Asian and South East Asian languages. Each PDF has ActualText added for each tag in the PDF. Each PDF has ActualText as an alternative for the visible text layer in the PDF. Is anyone aware of tools that will allow me to index and search PDFs based on the ActualText content rather than the visible text layers in the PDF? Andrew -- Andrew Cunningham lang.supp...@gmail.com
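One way to get at that content for indexing is to read the ActualText entries straight out of each page's marked-content sequences. The sketch below uses pikepdf, which is not mentioned in the thread and is an assumption on my part; it only handles inline property dictionaries on BDC operators (not named properties in the page resources), and the content-stream API differs slightly between pikepdf versions. 'document.pdf' is a placeholder filename.

import pikepdf   # assumption: pikepdf is installed; the thread itself discusses iText and PDFBox

def page_actual_text(page):
    # Collect /ActualText strings from marked-content sequences (BDC ... EMC) on one page.
    chunks = []
    for operands, operator in pikepdf.parse_content_stream(page):
        if str(operator) == 'BDC' and len(operands) == 2:
            props = operands[1]
            if isinstance(props, pikepdf.Dictionary) and '/ActualText' in props:
                chunks.append(str(props['/ActualText']))
    return ' '.join(chunks)

with pikepdf.open('document.pdf') as pdf:
    for number, page in enumerate(pdf.pages, start=1):
        print(number, page_actual_text(page))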
Re: [CODE4LIB] Library community web standards (was: LibGuides v2 - Templates and Nav)
Hi Brad, An interesting idea, but with many potential failure points. I have been in the position of spending considerable time to develop best practice materials on web internationalisation for our state government, without any prospect of being able to roll it out within our own library, whether we are discussing corporate or open source solutions. Web technologies within the library sector are well within the long tail of implementation. But best practice should be encouraged. Andrew On 01/10/2014 12:23 AM, Brad Coffield bcoffield.libr...@gmail.com wrote: I agree that it would be a bad idea to endeavor to create our own special standards that deviate from accepted web best practices and standards. My own thought was more towards a guide for librarians, curated by librarians, that provides a summary of best practices. On the one hand, something to help those without a deep tech background to quickly get up to speed with best practices instead of needing to conduct a lot of research and reading. But beyond that, it would also be a resource that went deeper for those who wanted to explore the literature. So, bullet points and short lists of information accompanied by links to additional resources etc. (So, right now, it sounds like a libguide lol) Though I do think there would potentially be additional information that did apply mostly/only to libraries and our particular sites etc. Off the top of my head: a thorough treatment and recommendations regarding libguides v2 and accessibility, customizing common library-used products (like Serial Solutions 360 link, Worldcat Local and all their competitors) so that they are most usable and accessible. At its core, though, what I'm picturing is something where librarians get together and cut through the noise, pull out best web practices, and display them in a quickly digested format. Everything else would be the proverbial gravy. On Tue, Sep 30, 2014 at 10:01 AM, Michael Schofield mschofi...@nova.edu wrote: I am interested but I am a little hazy about what kind of standards you all are suggesting. I would warn against creating standards that conflict with any actual web standards, because I--and, I think, many others--would honestly recommend that the #libweb should aspire to and adhere more firmly to larger web standards and best practices rather than to something that's more, ah, librarylike. Although that might not be what you folks have in mind at all : ). Michael S. -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Brad Coffield Sent: Tuesday, September 30, 2014 9:30 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Library community web standards (was: LibGuides v2 - Templates and Nav) Josh, thanks for separating this topic out and starting this new thread. I don't know of any such library standards that exist on the web. I agree that this sounds like a great idea. As for this group or not... why not! It's 2014 and they don't exist yet and they would be incredibly useful for many libraries, if not all. Now all we need is a cool 'working group' title for ourselves and we're halfway done! Right??? But seriously, I'd love to help. Brad -- Brad Coffield, MLIS Assistant Information and Web Services Librarian Saint Francis University 814-472-3315 bcoffi...@francis.edu -- Brad Coffield, MLIS Assistant Information and Web Services Librarian Saint Francis University 814-472-3315 bcoffi...@francis.edu
Re: [CODE4LIB] Natural language programming
Since you may be looking at Drupal integration down the path, I would look at using Python and the NLTK, and develop a web service that could then be used by Drupal. On 01/07/2014 11:13 PM, Katie konrad.ka...@gmail.com wrote: Hello, Has anyone here experience in the world of natural language programming (while applying information retrieval techniques)? I'm currently trying to develop a tool that will:
1. take a pdf and extract the text (paying no attention to images or formatting)
2. analyze the text via term weighting, inverse document frequency, and other natural language processing techniques
3. assemble a list of suggested terms and concepts that are weighted heavily in that document
Step 1 is straightforward and I've had much success there. Step 2 is the problem child. I've played around with a few APIs (like AlchemyAPI) but they have character length limitations or other shortcomings that keep me looking. The background behind this project is that I work for a digital library with a large pre-existing collection of pdfs with rudimentary metadata. The aforementioned tool will be used to classify and group the pdfs according to the themes of the library. Our CMS is Drupal so depending on my level of ambition, this *might* develop into a module. Does this sound like a project that has been done/attempted before? Any suggested tools or reading materials?
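For step 2, a self-hosted alternative to the hosted APIs is to compute TF-IDF weights locally. The sketch below uses scikit-learn rather than NLTK (a swap on my part; NLTK can do the same job), the document list is a placeholder for the extracted PDF text, and the cut-off of ten terms is arbitrary.

from sklearn.feature_extraction.text import TfidfVectorizer   # assumption: scikit-learn is installed

docs = ['text extracted from pdf one ...', 'text extracted from pdf two ...']   # placeholder corpus
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
weights = vectorizer.fit_transform(docs)          # rows are documents, columns are terms
terms = vectorizer.get_feature_names_out()

for row in range(weights.shape[0]):
    scores = weights[row].toarray().ravel()
    top = scores.argsort()[::-1][:10]             # ten most heavily weighted terms per document
    print([terms[i] for i in top if scores[i] > 0])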
Re: [CODE4LIB] Cataloguing Telugu
Stuart, I had a quick look at the proposal; not sure cataloguing is an appropriate term, nor are they citations. I suspect that a simple database, web interface, simple search interface and Telugu collation should suffice. No specific tools would be needed. We are talking about fairly common web infrastructure requirements; the challenge will be integrating it with Wikimedia platforms. Best to discuss that with the internationalisation team at WMF. On 08/04/2014 7:02 AM, Stuart Yeates stuart.yea...@vuw.ac.nz wrote: Currently there is a funding proposal for cataloguing Telugu works up before the Wikimedia foundation. If anyone has experience with Telugu or knows of any tools that are likely to be useful, please give your input: https://meta.wikimedia.org/wiki/Grants:IEG/Making_telugu_content_accessible cheers stuart
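For the Telugu collation piece, ICU already ships locale-aware collation rules, so very little custom code is needed. A minimal sketch with PyICU (an assumption; any ICU binding would do), sorting a few arbitrary Telugu sample words:

from icu import Collator, Locale   # assumption: PyICU is installed

collator = Collator.createInstance(Locale('te'))   # Telugu collation rules from ICU/CLDR
words = ['ఇల్లు', 'అమ్మ', 'అడవి']                     # arbitrary sample words
for word in sorted(words, key=collator.getSortKey):
    print(word)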
Re: [CODE4LIB] pdf2txt
You may want to consider how best to handle PDF files where the text would contain ligatures and glyph ids rather than the underlying characters. A. On 12/10/2013 4:58 AM, Eric Lease Morgan emor...@nd.edu wrote: On Oct 11, 2013, at 1:49 PM, Matthew Sherman matt.r.sher...@gmail.com wrote: For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8 Very slick, good work. I can see where this tool can be very helpful. It does have some issues with some characters, but this is rather common with most systems. Again, thank you for the support. Yes, there are some escaping issues to be resolved. Release early. Release often. I need help with the graphic design in general. Here's an enhancement I thought of: 1. allow readers to authenticate 2. allow readers to upload documents 3. documents get saved in readers' cache 4. allow interface to list documents in the cache 5. provide text mining services against reader-selected documents 6. go to Step #1 It would also be cool if I could figure out how to finish the installation of Tesseract to enable OCRing. [1] [1] OCRing - http://serials.infomotions.com/code4lib/archive/2013/201303/1554.html -- Eric Morgan
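Where the extractor does emit ligature codepoints (rather than bare glyph IDs, which carry no text at all), compatibility normalisation recovers the underlying characters; when only glyph IDs are present there is nothing to normalise and OCR is the fallback. A one-line standard-library sketch:

import unicodedata

# NFKC folds presentation forms such as the 'fi' and 'ff' ligatures back to plain letters.
print(unicodedata.normalize('NFKC', 'ﬁnancial aﬀairs'))   # -> financial affairs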
Re: [CODE4LIB] pdf2txt
Hi Mark, I suspect the tool will only be able to handle select languages, and it is very doubtful you could develop a tool to handle non-LCG text. For a fully internationalised tool, you would have to ignore all text layers in a PDF and run all PDFs through OCR to generate text. Then you'd need to apply very sophisticated word boundary identification routines. A. On 12/10/2013 9:40 AM, Mark Pernotto mark.perno...@gmail.com wrote: Very cool tool, thank you! Putting my devil's advocate hat on, it doesn't parse foreign documents well (I got it to break!). I also got inconsistent results feeding it PDF files with tables embedded (but haven't been able to figure out what it is about them it doesn't like). Just from a curiosity standpoint, what encoding is being utilized? I know nothing about Perl. It seemed to have no problem parsing a dash (-) if it was up against another character (2007-2012), but barfs when it's by itself (2007 � 2012). I'm only referring to 'extracted text' mode. If it helps, I can send along *most* of my test PDF files used. Thank you! .m On Fri, Oct 11, 2013 at 10:58 AM, Eric Lease Morgan emor...@nd.edu wrote: On Oct 11, 2013, at 1:49 PM, Matthew Sherman matt.r.sher...@gmail.com wrote: For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8 Very slick, good work. I can see where this tool can be very helpful. It does have some issues with some characters, but this is rather common with most systems. Again, thank you for the support. Yes, there are some escaping issues to be resolved. Release early. Release often. I need help with the graphic design in general. Here's an enhancement I thought of: 1. allow readers to authenticate 2. allow readers to upload documents 3. documents get saved in readers' cache 4. allow interface to list documents in the cache 5. provide text mining services against reader-selected documents 6. go to Step #1 It would also be cool if I could figure out how to finish the installation of Tesseract to enable OCRing. [1] [1] OCRing - http://serials.infomotions.com/code4lib/archive/2013/201303/1554.html -- Eric Morgan
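On the word-boundary point: for scripts written without spaces between words, ICU's dictionary-based break iterators are one off-the-shelf option. A small sketch with PyICU and an arbitrary Thai sample (an assumption on my part; nothing in the thread names a particular library):

from icu import BreakIterator, Locale   # assumption: PyICU is installed

text = 'ฉันรักภาษาไทย'                               # arbitrary Thai sample, written without spaces
bi = BreakIterator.createWordInstance(Locale('th'))
bi.setText(text)
boundaries = [0] + list(bi)                          # iterating yields successive boundary offsets
print([text[s:e] for s, e in zip(boundaries, boundaries[1:])])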
Re: [CODE4LIB] pdf2txt
Perl has its own encoding model: strings could be in Unicode or in a legacy encoding. Unicode is indicated by the presence of a flag on a string, and it is decided on a string-by-string basis. If it is a legacy encoding, then it could be any legacy encoding. If your data is truly multilingual, multiscript and in a variety of encodings, it becomes a challenge to manage it in Perl. In our own projects we found Perl modules to be inadequate and needed our own internal modules to handle encoding issues, especially when you factor in the fact that some CPAN modules have the nasty habit of stripping the Unicode flag from strings. Although, that said, Perl still has better Unicode support than most languages. A.
Re: [CODE4LIB] Python and Ruby
Both Ruby and Python have their strengths and weaknesses, and as others have mentioned, it will come down to need and existing projects you want to leverage. We use both Python and Ruby internally. Know your tools and their strengths and weaknesses. My personal interest more and more revolves around natural language processing and its potential in library based tools. Python is quite strong in computational linguistics and has useful libraries for natural language processing. Andrew On 30/07/2013 1:43 AM, Joshua Welker wel...@ucmo.edu wrote: Not intending to start a language flame war/holy war here, but in the library coding community, is there a particular reason to use Ruby over Python or vice-versa? I am personally comfortable with Python, but I have noticed that there is a big Ruby following in Code4Lib and similar communities. Am I going to be able to contribute and work better with the community if I use Ruby rather than Python? I am 100% aware that there is no objective way to answer which of the two languages is the best. I am interested in the much more narrow question of which will work better for library-related scripting projects in terms of the following factors: -existing modules that I can re-use that are related to libraries (MARC tools, XML/RDF tools, modules released by major vendors, etc) -availability of help from others in the community -interest/ability of others to re-use my code Thanks. Josh Welker Information Technology Librarian James C. Kirkpatrick Library University of Central Missouri Warrensburg, MO 64093 JCKL 2260 660.543.8022
Re: [CODE4LIB] Python and Ruby
White space is potentially an illusion; it isn't necessarily there, esp. when the whitespace is not a character ... ;) On 30/07/2013 8:02 AM, Michael J. Giarlo leftw...@alumni.rutgers.edu wrote: And you would think Python developers would know how to... ( •_•) ( •_•)⌐■-■ (⌐■_■) read between the (whitespace) lines? YEAH On Mon, Jul 29, 2013 at 2:57 PM, Ross Singer rossfsin...@gmail.com wrote: Muahahahahahahaha! MUAHAHAHAHAHAHA! And you walked right into it! You fools! -Ross. On Monday, July 29, 2013, Jay Luker wrote: On Mon, Jul 29, 2013 at 4:38 PM, Joshua Welker wel...@ucmo.edu javascript:; wrote: And I hate Python whitespace. Ah-ha! A more paranoid pythonista than I might suspect this whole thread was simply an exercise in Ruby shilling. --jay
Re: [CODE4LIB] tiff2pdf, then back to pdf?
Although I do find the persistent myth of PDF/A as an archival format amusing. Under very specific circumstances it can be, but it's rare for those circumstances to be deliberately met. And for many languages it is impossible to use PDF for archival purposes, ever. It is the nature of PDF. On 27/04/2013 8:28 AM, Jason Curtis cur...@sandiego.edu wrote: Hi, Edward: After reading through the string of messages and the options that you list below, I think that #3 is your best option. It seems to best fall in line with good archiving practices as I understand them (have one copy for public use and another for archival purposes). If you really want to convert the TIFF to PDF and ditch the TIFF file, I would suggest using PDF/A, the archival version of PDF, if you can. Best of luck! Sincerely, Jason __ Jason Curtis Technical Services Librarian Legal Research Center University of San Diego 5998 Alcalá Park San Diego, CA 92110 Ph: (619) 260-4600, ext.2875 Fax: (619) 260-7495 cur...@sandiego.edu -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Edward M. Corrado Sent: Friday, April 26, 2013 2:55 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] tiff2pdf, then back to pdf? On Fri, Apr 26, 2013 at 5:29 PM, Ethan Gruber ewg4x...@gmail.com wrote: What's your use case in this scenario? Do you want to provide access to the PDFs over the web or are you using them as your archival format? You probably don't want to use PDF to achieve both objectives. The problem I have is I have multipage TIFF files and I don't currently have a good way for users to view them. I also need to preserve these files. Ideally my use case would be to use PDF files created from the TIFFs for both preservation and an archival format. But, as I said, that depends on if I can recreate the original tiff. I have the option of creating a custom viewer that can deal with the display of the tiff files, but I'm looking for other options. So I have a few choices that I thought of implementing (that I haven't ruled out):
1) This is what I asked about. Make a PDF from the TIFF files. If I could embed the tiff into a pdf, and then at some point recreate the tiff if needed for archival purposes, I have my solution.
2) Convert the multipage TIFF files to individual TIFF files. This would work for my end users, but would be more clunky than a PDF for them. The new TIFF files could be my archival copy.
3) Convert the multipage TIFF files to PDF (probably in a smaller, compressed? state), use the PDF for display/access, save the TIFF for archival purposes.
4) Convert the multipage TIFFs to PDF (or PDF/A?), and don't worry about being able to recreate the original TIFF files.
I should add, the content is what is important in these documents and they are mostly typewritten or handwritten text. Still, I'd like to keep them in as high quality of a format as possible. I'm sure there are some other possible solutions as well. I really would like #1, but it may not be possible. If it isn't, I need to decide (with representatives of my user community) which of the others are better. My guess is it would be #3, but I am not positive. Edward Ethan On Apr 26, 2013 5:11 PM, Edward M. Corrado ecorr...@ecorrado.us wrote: This works sometimes. Well, it does give me a new tiff file from the pdf all of the time, but it is not always anywhere near the same size as the original tiff. My guess is that maybe there is a flag or something that would help. 
Here is what I get with one file:
ecorrado@ecorrado:~/Desktop/test$ convert -compress none A001a.tif A001a.pdf
ecorrado@ecorrado:~/Desktop/test$ convert -compress none A001a.pdf A001b.tif
ecorrado@ecorrado:~/Desktop/test$ ls -al
total 361056
drwxrwxr-x 2 ecorrado ecorrado     4096 Apr 26 17:07 .
drwxr-xr-x 7 ecorrado ecorrado    20480 Apr 26 16:54 ..
-rw-rw-r-- 1 ecorrado ecorrado 38497046 Apr 26 17:07 A001a.pdf
-rw-r--r-- 1 ecorrado ecorrado 38178650 Apr 26 17:07 A001a.tif
-rw-rw-r-- 1 ecorrado ecorrado  5871196 Apr 26 17:07 A001b.tif
In this case, the two tif files should be the same size. They are not even close. Maybe there is a flag to convert (besides compress) that I can use. FWIW: I tried three files; 2 are like this. The other one, the resulting tiff is the same size as the original. Edward On Fri, Apr 26, 2013 at 4:25 PM, Aaron Addison addi...@library.umass.edu wrote: ImageMagick's convert will do it both ways. convert a.tiff b.pdf convert b.pdf a.tiff If the pdf is more than one page, the tiff will be a multipage tiff. Aaron -- Aaron Addison Unix Administrator W. E. B. Du Bois Library UMass Amherst 413 577 2104 On Fri, 2013-04-26 at 16:08 -0400, Edward M. Corrado
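If the goal is option #3 (a derived PDF for access, with the TIFF kept as the preservation master), the conversion can also be scripted without ImageMagick. A minimal Python sketch with Pillow, assuming Pillow is installed and with 'A001a.tif' standing in for any multipage TIFF; the PDF is a lossy access copy, not a container the original TIFF can be reconstructed from.

from PIL import Image, ImageSequence   # assumption: Pillow is installed

with Image.open('A001a.tif') as tif:
    # Pull every frame out of the multipage TIFF and convert to RGB for the PDF writer.
    pages = [frame.convert('RGB') for frame in ImageSequence.Iterator(tif)]

# Write all frames into one multipage PDF for display/access.
pages[0].save('A001a-access.pdf', save_all=True, append_images=pages[1:])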
Re: [CODE4LIB] From Chinese characters to convert Pinyin and Traditional and Simplified Chinese and Hangul
Hi Wataru, very interesting script, although I'd be inclined to suggest an enhancement. It would be useful to add language tagging to the input field and each of the conversions. The page as it stands will not use appropriate fonts for each language; web browsers need appropriate language tagging to facilitate appropriate font fallback behaviours. Andrew On 18 April 2013 19:29, Wataru Ono ono.wataru.p...@gmail.com wrote: Hi, I'm Wataru ONO, librarian at Hitotsubashi University Library in Japan. This tool is From Chinese characters to convert Pinyin and Traditional and Simplified Chinese and Hangul https://googledrive.com/host/0B_vZSxPrv8xmVnZwSkk0ZmU2Zmc/han2pin.html You can convert between Simplified and Traditional Chinese and Japanese characters. This is made of pure JavaScript. If you are interested in this tool, please feel free to use and download. Best regards -- Andrew Cunningham Project Manager, Research and Development (Social and Digital Inclusion) Public Libraries and Community Engagement State Library of Victoria 328 Swanston Street Melbourne VIC 3000 Australia Ph: +61-3-8664-7430 Mobile: 0459 806 589 Email: acunning...@slv.vic.gov.au lang.supp...@gmail.com http://www.openroad.net.au/ http://www.mylanguage.gov.au/ http://www.slv.vic.gov.au/
Re: [CODE4LIB] one tool and/or resource that you recommend to newbie coders in a library?
My 2 cents worth ... and one for each cent: * Komodo Edit * www.w3.org/International On 2 November 2012 07:24, Bohyun Kim k...@fiu.edu wrote: Hi all code4lib-bers, As coders and coding librarians, what is ONE tool and/or resource that you recommend to newbie coders in a library (and why)? I promise I will create and circulate the list and make it into a Code4Lib wiki page for collective wisdom. =) Thanks in advance! Bohyun --- Bohyun Kim, MA, MSLIS Digital Access Librarian bohyun@fiu.edu 305-348-1471 Medical Library, College of Medicine Florida International University http://medlib.fiu.edu http://medlib.fiu.edu/m (Mobile) -- Andrew Cunningham Project Manager, Research and Development Social and Digital Inclusion Unit Public Libraries and Community Engagement State Library of Victoria 328 Swanston Street Melbourne VIC 3000 Australia Ph: +61-3-8664-7430 Mobile: 0459 806 589 Email: acunning...@slv.vic.gov.au lang.supp...@gmail.com http://www.openroad.net.au/ http://www.mylanguage.gov.au/ http://www.slv.vic.gov.au/
Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
on marc4j (which is used heavily by SolrMarc) is that for any significant processing of MARC records the only solution that makes sense is to translate the record data into Unicode characters as it is being read in. Of course, as you and others have stated, determining what the data actually is, in order to correctly translate it to Unicode, is no easy task. The leader byte that merely indicates "is UTF-8" or "is not UTF-8" is wrong often enough in the real world that it is of little value when it indicates "is UTF-8", and of even less value when it indicates "is not UTF-8". Significant portions of the code I've added to marc4j deal with trying to determine what the encoding of that data actually is and trying to translate the data correctly into Unicode even when the data is incorrect. You also argued in another message that cataloger entry tools should give feedback to help the cataloger not create errors. I agree. I think one possible step towards this would be that the editor must work in Unicode, irrespective of the data format that the underlying system expects the data to be in. If the underlying system expects MARC8 then the save-as process should be able to translate the data into MARC8 on output. -Robert Haschart -- Andrew Cunningham Senior Project Manager, Research and Development Vicnet State Library of Victoria Australia andr...@vicnet.net.au lang.supp...@gmail.com
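The leader byte in question is Leader/09 in MARC 21 ('a' for UCS/Unicode, blank for MARC-8). Below is a small standard-library Python sketch that flags the conclusive kind of mismatch, records that declare UTF-8 but contain byte sequences that are not valid UTF-8; 'records.mrc' is a placeholder filename, and the reverse case cannot be detected this way, since ASCII-only MARC-8 records decode as UTF-8 too.

RECORD_TERMINATOR = b'\x1d'

with open('records.mrc', 'rb') as fh:
    data = fh.read()

for index, record in enumerate(filter(None, data.split(RECORD_TERMINATOR))):
    declares_utf8 = record[9:10] == b'a'     # Leader/09 'a' means UCS/Unicode in MARC 21
    try:
        record.decode('utf-8')
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    if declares_utf8 and not valid_utf8:
        print(f'record {index}: leader says UTF-8 but the bytes are not valid UTF-8')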
Re: [CODE4LIB] Unicode font for PDF generation?
There are no pan-Unicode fonts. The last one I saw was for Unicode 2.0. There is a limit to the number of glyphs a font can contain. It is possible to create a subset of Unicode and place it in a single font, but you need to be able to identify your current and future character requirements. But I am not sure why you need a single font, unless your XML-to-PDF conversion can't process stylesheets. Andrew On Saturday, 17 March 2012, Mark Redar mark.re...@ucop.edu wrote: Hi All, We're having some fun with unicode characters in PDF generation. We have a process that automatically generates a pdf from XML input. The tool stack doesn't support multiple fonts for displaying different codepoints so we need a good pan-unicode font to bundle with the pdfs. Currently, we use the DejaVu font family for creating the pdfs. This has good coverage for Latin and Cyrillic characters but has no CJK (chinese-japanese-korean) coverage. We've looked into licensing commercial fonts, but for web server use these require annual licensing fees that are substantial (in the thousands of $). A number of our source documents contain CJK characters and some contributors have noticed the lack of support for these characters. Does anyone know of a good pan-unicode free font that includes CJK codepoints that looks good? Gnu unifont has the coverage, but it is not the best looking font. Barring that, we're thinking of rolling our own pan-unicode font. There are good open source fonts for portions of the unicode character sets. We're hoping to find some way to take a number of open source fonts and combine them into one large pan-unicode font. Does anyone have experience with font authoring and merging different fonts? It looks as though FontForge can merge fonts, but it's not clear how to deal with overlapping codepoints in the merged fonts. Thanks, Mark -- Andrew Cunningham Senior Project Manager, Research and Development Vicnet State Library of Victoria Australia andr...@vicnet.net.au lang.supp...@gmail.com
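On the FontForge question: its Python scripting interface can drive the merge, though the glyph-count ceiling (65,535 glyphs per font) is one reason a truly pan-Unicode merge tends not to fit. A rough sketch; the font filenames are placeholders, and how overlapping codepoints are resolved should be verified against the FontForge documentation rather than assumed from this example.

import fontforge   # FontForge's Python module; usually run with the Python bundled with FontForge

base = fontforge.open('DejaVuSans.ttf')   # the font whose glyphs you want to keep as the base
base.mergeFonts('SomeCJKFont.ttf')        # placeholder: a second font supplying CJK codepoints
base.fontname = 'DejaVuPlusCJK'           # hypothetical name for the merged font
base.generate('merged.ttf')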
Re: [CODE4LIB] Unicode font for PDF generation?
A couple of additional thoughts:
• The most complete CJK font projects require two fonts to handle all CJK characters
• There are language-specific glyph variations between Chinese and Japanese, so the ideal situation is to use different fonts tailored for each
On Saturday, 17 March 2012, Mark Redar mark.re...@ucop.edu wrote: Hi All, We're having some fun with unicode characters in PDF generation. We have a process that automatically generates a pdf from XML input. The tool stack doesn't support multiple fonts for displaying different codepoints so we need a good pan-unicode font to bundle with the pdfs. Currently, we use the DejaVu font family for creating the pdfs. This has good coverage for Latin and Cyrillic characters but has no CJK (chinese-japanese-korean) coverage. We've looked into licensing commercial fonts, but for web server use these require annual licensing fees that are substantial (in the thousands of $). A number of our source documents contain CJK characters and some contributors have noticed the lack of support for these characters. Does anyone know of a good pan-unicode free font that includes CJK codepoints that looks good? Gnu unifont has the coverage, but it is not the best looking font. Barring that, we're thinking of rolling our own pan-unicode font. There are good open source fonts for portions of the unicode character sets. We're hoping to find some way to take a number of open source fonts and combine them into one large pan-unicode font. Does anyone have experience with font authoring and merging different fonts? It looks as though FontForge can merge fonts, but it's not clear how to deal with overlapping codepoints in the merged fonts. Thanks, Mark -- Andrew Cunningham Senior Project Manager, Research and Development Vicnet State Library of Victoria Australia andr...@vicnet.net.au lang.supp...@gmail.com
Re: [CODE4LIB] Unicode font for PDF generation?
For additional CJKV fonts look at: http://en.wikipedia.org/wiki/List_of_CJK_fonts -- Andrew Cunningham Senior Project Manager, Research and Development Vicnet State Library of Victoria Australia andr...@vicnet.net.au lang.supp...@gmail.com
Re: [CODE4LIB] Plea for help from Horowhenua Library Trust to Koha Community
On 23 November 2011 06:32, MJ Ray m...@phonecoop.coop wrote: Mike Taylor m...@indexdata.com 2. Koha means akin to gift. The irony of trying to trademark that word in particular is mindboggling and should shame PTFS in the eyes of everyone who likes sharing information - basically all of us who are involved with libraries at some level, isn't it? I'm wondering if cultural property rights can be used to overturn a trademark. Not only is koha a Māori word, it is a cultural concept. -- Andrew Cunningham Senior Project Manager, Research and Development Vicnet State Library of Victoria Australia andr...@vicnet.net.au lang.supp...@gmail.com
Re: [CODE4LIB] Plea for help from Horowhenua Library Trust to Koha Community
I'd be inclined to have a quiet chat with Māori political activists and see what their feelings are on non-New Zealand companies applying for trademark status on Māori words in New Zealand. -- Andrew Cunningham Senior Project Manager, Research and Development Vicnet State Library of Victoria Australia andr...@vicnet.net.au lang.supp...@gmail.com
Re: [CODE4LIB] MARCXML - What is it for?
I suspect that MARCXML isn't going anywhere fast, a shame perhaps. The key difference between MARCXML and MARC is that MARCXML inherits XML's internationalisation features. It is an area in which MARC is very poor. Andrew -- Andrew Cunningham Senior Project Manager, Research and Development Vicnet State Library of Victoria Australia andr...@vicnet.net.au lang.supp...@gmail.com
Re: [CODE4LIB] character-sets for dummies?
Hi, 2009/12/17 stuart yeates stuart.yea...@vuw.ac.nz: If, however, you need to deal with characters which don't qualify for inclusion in Unicode (or which do qualify but which haven't yet been assigned code points), I recommend tei:glyph: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-glyph.html We use this to represent typographically interesting but short-lived approaches to the representation of Māori in printed works. See for example the 'wh' ligature (which looks like a 'vh' and is pronounced in modern usage like 'f') in the following text: An interesting approach, although not the only way to address that particular issue; it depends on whether you want to treat it as a ligature or as a character. Other approaches have been to: 1) use PUA assignments, e.g. the MUFI and SIL PUA assignments/registries; or 2) use U+200D to request ligation. Both these approaches would require specifically defined or modified fonts. http://www.nzetc.org/tm/scholarly/tei-Auc1911NgaM-t1-body-d4.html for the underlying TEI XML representation see: http://www.nzetc.org/tei-source/Auc1911NgaM.xml cheers stuart -- Stuart Yeates http://www.nzetc.org/ New Zealand Electronic Text Centre http://researcharchive.vuw.ac.nz/ Institutional Repository -- Andrew Cunningham Vicnet Research and Development Coordinator State Library of Victoria Australia andr...@vicnet.net.au lang.supp...@gmail.com
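To make those two options concrete in plain Python terms: the ligation request is simply a ZERO WIDTH JOINER placed between the two letters, and a PUA assignment is just an arbitrary private codepoint. Both only display as a ligated 'wh' glyph if a specifically built font supports them, and the U+E000 codepoint below is a hypothetical assignment, not taken from any registry.

# Option 2: request ligation with ZERO WIDTH JOINER between 'w' and 'h'.
wh_ligated = 'w\u200dh'
# Option 1: a hypothetical Private Use Area codepoint standing in for the 'wh' ligature glyph.
wh_pua = '\ue000'

print(wh_ligated, len(wh_ligated))   # three code points; renders as a ligature only with font support
print(hex(ord(wh_pua)))              # 0xe000, inside the Basic Multilingual Plane PUA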