On Tue, 2008-01-29 at 21:20 +0000, Niall Pemberton wrote:
> I would be happy to be a mentor for PDFBox - not done mentoring before
> so if you can get someone more experienced then I'll bow out no
> problem. Also I haven't used PDFBox or even looked at it, but I
> am interested in it (use iText a lot) - so not sure whether that's a -ve
> or not wrt mentoring.
Your willingness to monitor is a big plus point. There are others here
who can guide you where needed. If a more experienced mentor volunteers,
you can always share the job - how else would you learn it?

Mentoring is about watching and guiding the podling's relationship with
the ASF. You don't need to know the code specifically to do that, but
you do need to be prepared to monitor the lists and watch for anything
that might be a concern (e.g. "hey, I just committed this cool new GPL
library...").

So, I say go for it.

Regards, Upayavira

> I haven't looked at PDFBox (use iText a lot) but I am interested in it
> and am willing to be a mentor.
>
> On Jan 29, 2008 5:20 PM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > We're getting closer to finalizing the PDFBox proposal [1] (current
> > wiki source included below). I plan to call a vote on the proposal in
> > a few days, so any questions, comments and suggestions are welcome. If
> > you're interested, there's also a vacant spot for a third mentor.
> >
> > [1] http://wiki.apache.org/incubator/PDFBoxProposal
> >
> > BR,
> >
> > Jukka Zitting
> >
> > ----
> >
> > Proposal Draft
> >
> > = PDFBox =
> >
> > === Abstract ===
> >
> > PDFBox is an open source Java PDF library for working with PDF documents.
> >
> > === Proposal ===
> >
> > The PDFBox library supports creating new PDF documents, manipulating
> > existing documents, and extracting content from documents. PDFBox
> > also includes several command line utilities. Future
> > development plans include extending PDFBox with advanced data
> > extraction and high level PDF creation functionality.
> >
> > In addition to PDFBox, this proposal also covers the !FontBox and
> > !JempBox companion libraries. !FontBox is a Java font library used to
> > obtain low level information from font files. !JempBox is a Java
> > library that implements Adobe's XMP specification.
> > All these components would be incubated as a single Apache PDFBox
> > podling project.
> >
> > === Background ===
> >
> > The PDFBox project started in 2002, was originally written by Ben
> > Litchfield, and currently lives on SourceForge. The initial
> > purpose of PDFBox was to extract text content to be indexed by the
> > Lucene search engine. In addition to text extraction the library also
> > supports a low level API for PDF creation and manipulation. In the
> > past, several developers have helped develop specific features in
> > PDFBox but none have continued once their specific needs were met.
> >
> > In 2006 discussions began with the FOP team to collaborate on a single
> > PDF library within the Apache organization. New projects have
> > expressed interest in advancing the functionality of PDFBox.
> >
> > Recently, Tika also expressed interest in advancing the content
> > extraction capabilities of PDFBox.
> >
> > The !FontBox and !JempBox libraries have no dependencies on PDFBox,
> > but their primary purpose is to support PDFBox and the development
> > community is largely overlapping. It makes sense to include all three
> > libraries in a single project.
> >
> > === Rationale ===
> >
> > The PDF document format is a common format found on the internet and
> > across industries as a way of sharing documents. Several Apache
> > projects utilize PDF technologies but there is not a single
> > independent PDF library within the Apache organization.
> >
> > The Apache XML Graphics project (FOP/Batik) has a write-only PDF
> > library and is in need of PDF parsing functionality. Many features
> > overlap those of PDFBox. This is currently a duplication of effort;
> > bringing PDFBox into Apache and combining our efforts will result in a
> > more robust PDF library that will be able to support many more use
> > cases for working with PDF technologies.
> >
> > !FontBox, FOP and Batik all contain font loading/handling code that
> > could likely be merged into a single common library either within the
> > PDFBox podling or outside it.
> >
> > === Initial Goals ===
> >
> > The initial goals are:
> >
> > * Advanced text extraction techniques
> > * Increase community involvement
> > * Cooperation with existing Apache projects such as XML Graphics
> > * Increasing support for PDF document features
> > * Adding a high level API for document creation
> > * Adding a streaming API for document creation
> > * PDF/A creation and validation functionality
> > * Review licensing of both bundled and external dependencies
> > * Manage export control notices for cryptographic features
> > * Figure out how to handle font handling code across !FontBox, FOP, and
> > Batik
> > * Replace !JempBox with Adobe's XMP library
> >
> > == Current Status ==
> >
> > === Meritocracy ===
> >
> > Not all initial committers are familiar with the meritocracy
> > principles of Apache. Those who are not are expected to learn the
> > meritocracy rules, which will be followed throughout the life of
> > the project.
> >
> > === Community ===
> >
> > PDFBox has existed for several years on SourceForge and has an active
> > community that continues to grow each day. There are hundreds of
> > existing projects that utilize the current version of PDFBox.
> >
> > === Core Developers ===
> >
> > Ben Litchfield is the main developer on this project although it is
> > expected that developers from a variety of existing Apache projects
> > will become part of the team.
> >
> > === Alignment ===
> >
> > The ability to search PDF documents is a basic requirement for any
> > enterprise search solution. PDFBox provides the basic content that is
> > needed for content indexing. This functionality aligns with that
> > of Lucene, Nutch, Tika and UIMA and all users of these projects will
> > benefit from continued development of PDFBox.
> >
> > PDFBox has font loading and handling needs similar to those of FOP and
> > Batik, and the code in the !FontBox companion library could well be
> > merged with similar code in the other projects.
> >
> > == Known Risks ==
> >
> > === Orphaned products ===
> >
> > PDFBox has been in development for over 5 years. The rate of
> > development has varied, but the PDFBox user community has grown each
> > year. PDFBox implements the PDF specification, which is highly
> > utilized by companies across the world. The need for a PDF library is
> > strong and is unlikely to change in the near future.
> >
> > === "Competing" formats ===
> >
> > In recent times, additional paged document formats have been developed
> > (or are in development) that have similar goals/functionality:
> >
> > * Microsoft's [http://www.microsoft.com/whdc/xps/xpsspec.mspx XPS]
> > (XML-based ZIP container, proprietary core-functionality)
> > * Adobe's [http://labs.adobe.com/technologies/mars/ Mars] (XML-based
> > ZIP container, largely based on open standards, extending them where
> > necessary)
> >
> > === Inexperience with Open Source ===
> >
> > All developers have experience with Open Source projects.
> >
> > === Homogenous Developers ===
> >
> > The initial set of committers is diverse and the project is likely to
> > attract new developers.
> >
> > === Reliance on Salaried Developers ===
> >
> > PDFBox is not the primary job for any of the initial committers.
> >
> > === Relationships with Other Apache Products ===
> >
> > PDFBox has relationships with the following Apache Products:
> >
> > * [http://lucene.apache.org/java/ Apache Lucene] Lucene users
> > typically integrate with PDFBox to add PDF indexing capabilities.
> > * [http://lucene.apache.org/nutch/ Lucene Nutch] Nutch currently
> > utilizes PDFBox to index PDF documents.
> > * [http://incubator.apache.org/tika/ Tika] Tika currently utilizes
> > PDFBox for extracting PDF content.
> > * [http://incubator.apache.org/uima/ Apache UIMA] UIMA analyzes
> > unstructured content and would benefit from PDF content.
> > * [http://xmlgraphics.apache.org/fop/ Apache FOP] and
> > [http://xmlgraphics.apache.org/batik/ Apache Batik] There's an
> > experimental plug-in (currently hosted outside of the project) for FOP
> > that uses PDFBox to support embedding of existing PDFs in XSL-FO
> > documents for PDF output. Both Batik and FOP have code to parse fonts,
> > which !FontBox needs to do, too.
> >
> > === An Excessive Fascination with the Apache Brand ===
> >
> > Many existing Apache developers are already familiar with PDFBox.
> > PDFBox was initially written to complement the functionality of Lucene
> > and has worked with its developers over the past several years.
> > PDFBox will benefit from closer cooperation with several existing
> > Apache projects.
> >
> > == Documentation ==
> >
> > * PDFBox ([http://www.pdfbox.org/])
> > * !FontBox ([http://www.fontbox.org/])
> > * !JempBox ([http://www.jempbox.org/])
> >
> > == Initial Source ==
> >
> > Initial source will come from the existing SourceForge repositories of
> > the PDFBox, !FontBox, and !JempBox projects.
> >
> > == Source and Intellectual Property Submission Plan ==
> >
> > The initial IP submission will be done as a software grant to the ASF.
> >
> > == External Dependencies ==
> >
> > The "Adobe AFM License" and the "SUN JAI" licenses described below
> > need to be reviewed to ensure they comply with Apache license
> > standards.
> >
> > ||'''Library'''||'''License'''||'''Description'''||
> > ||Adobe AFM||Adobe AFM License||Resources for extracting font encoding. Bundled inside PDFBox jar file.||
> > ||Bouncycastle||BSD Variant||Support for encrypting/decrypting PDF documents.||
> > ||IKVM||BSD Variant [1]||Support of PDFBox on .NET platform||
> > ||junit||CPL||Unit Testing Framework||
> > ||Lucene||ASL||Provide classes for easy Lucene integration||
> > ||JAI-CMM||Sun JAI||Provides support for color spaces||
> >
> > ''[1] IKVM itself is BSD but contains either GNU Classpath or the
> > OpenJDK class library (both GPL with exception). This may need to be
> > reviewed, too.''
> >
> > == Cryptography ==
> >
> > PDFBox implements the RC4 encryption algorithm and utilizes Bouncy
> > Castle for additional encryption routines.
> >
> > == Required Resources ==
> >
> > Mailing lists
> >
> > * [EMAIL PROTECTED]
> > * [EMAIL PROTECTED]
> > * [EMAIL PROTECTED]
> >
> > Subversion Directory
> >
> > * https://svn.apache.org/repos/asf/incubator/pdfbox
> >
> > Issue Tracking
> >
> > * JIRA PDFBox (PDFBOX)
> >
> > Other Resources
> >
> > * none
> >
> > == Initial Committers ==
> >
> > || '''Name''' || '''Email''' || '''CLA''' ||
> > || Ben Litchfield || ben at benlitchfield dot com || No ||
> > || Daniel Wilson || williamstonconsulting at gmail dot com || No ||
> > || Philipp Koch || pkoch at apache dot org || Yes ||
> >
> > == Affiliations ==
> >
> > || '''Name''' || '''Affiliation''' ||
> > || Ben Litchfield || Independent ||
> > || Daniel Wilson || DV Brown Company ||
> > || Philipp Koch || Day Software ||
> >
> > == Sponsors ==
> >
> > Champion
> >
> > * Jukka Zitting
> >
> > Nominated Mentors
> >
> > * Jukka Zitting
> > * Jeremias Maerki
> > * ...
> >
> > Sponsoring Entity
> >
> > * Apache Incubator PMC
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]