[jira] Commented: (LUCENE-1482) Replace infoStream by a logging framework (SLF4J)

2010-04-08 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854919#action_12854919
 ] 

Jukka Zitting commented on LUCENE-1482:
---

We use SLF4J in Jackrabbit, and having logs from the embedded Lucene index 
available through the same mechanism would be quite useful in some situations.

BTW, using isDebugEnabled() is often not necessary with SLF4J, see 
http://www.slf4j.org/faq.html#logging_performance
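
To illustrate the point, here is a minimal sketch in plain Java. MiniLogger and CostlyArg are hypothetical stand-ins written for this example, not SLF4J classes; they only mimic the "{}" placeholder style to show why deferring the formatting makes an explicit isDebugEnabled() guard unnecessary in the common case. With SLF4J itself the corresponding call would be logger.debug("value: {}", arg).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a logging facade; not SLF4J itself. */
class MiniLogger {
    private final boolean debugEnabled;
    final List<String> output = new ArrayList<>();

    MiniLogger(boolean debugEnabled) {
        this.debugEnabled = debugEnabled;
    }

    /** Formats the message only when debug is enabled, so callers need no guard. */
    void debug(String format, Object arg) {
        if (!debugEnabled) {
            return; // cheap exit: arg.toString() is never called
        }
        output.add(format.replace("{}", String.valueOf(arg)));
    }
}

/** Argument with a (pretend) expensive toString(); counts how often it is rendered. */
class CostlyArg {
    int renderCount = 0;

    @Override
    public String toString() {
        renderCount++;
        return "costly";
    }
}

class ParameterizedLoggingDemo {
    public static void main(String[] args) {
        CostlyArg arg = new CostlyArg();
        new MiniLogger(false).debug("value: {}", arg);
        System.out.println("renders with debug off: " + arg.renderCount);
    }
}
```

The guard is still worth keeping when computing the argument itself is expensive, since the argument expression is evaluated before the call either way.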

 Replace infoStream by a logging framework (SLF4J)
 -

 Key: LUCENE-1482
 URL: https://issues.apache.org/jira/browse/LUCENE-1482
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-1482-2.patch, LUCENE-1482.patch, 
 slf4j-api-1.5.6.jar, slf4j-nop-1.5.6.jar


 Lucene makes use of infoStream to output messages in its indexing code only. 
 For debugging purposes, when the search application is run on the customer 
 side, getting messages from other code flows, like search, query parsing, 
 analysis etc can be extremely useful.
 There are two main problems with infoStream today:
 1. It is owned by IndexWriter, so if I want to add logging capabilities to 
 other classes I need to either expose an API or propagate infoStream to all 
 classes (see for example DocumentsWriter, which receives its infoStream 
 instance from IndexWriter).
 2. I can only turn debugging on or off for the entire code base.
 Introducing a logging framework can allow each class to control its logging 
 independently, and more importantly, allows the application to turn on 
 logging for only specific areas in the code (e.g., org.apache.lucene.index.*).
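
 As an illustration of that per-area control, here is a minimal sketch using plain java.util.logging, one of the backends a facade like SLF4J can delegate to. The class name PackageLevelDemo is made up for this example, and the logger names are just strings that mirror Lucene package and class names; nothing here is existing Lucene code.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class PackageLevelDemo {
    // Keep strong references: java.util.logging holds its loggers weakly.
    static final Logger INDEX_PKG = Logger.getLogger("org.apache.lucene.index");
    static final Logger WRITER = Logger.getLogger("org.apache.lucene.index.IndexWriter");
    static final Logger SEARCHER = Logger.getLogger("org.apache.lucene.search.IndexSearcher");

    /** Enables debug-level (FINE) logging for the index package only. */
    static void enableIndexDebug() {
        INDEX_PKG.setLevel(Level.FINE);
    }

    public static void main(String[] args) {
        enableIndexDebug();
        // Child loggers inherit the level of the nearest configured ancestor.
        System.out.println("IndexWriter debug: " + WRITER.isLoggable(Level.FINE));
        System.out.println("IndexSearcher debug: " + SEARCHER.isLoggable(Level.FINE));
    }
}
```

 Only the level check is shown; to actually see FINE messages on the console, the handler level would need to be lowered as well, since the default console handler filters below INFO.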
 I've investigated SLF4J (Simple Logging Facade for Java), which is, as its 
 name states, a facade over different logging frameworks. As such, you can 
 include the slf4j.jar in your application, and it detects at deploy time 
 which logging framework you'd like to use. SLF4J comes with several adapters 
 for Java logging, Log4j and others. If you know your application uses Java 
 logging, simply drop slf4j.jar and slf4j-jdk14.jar in your classpath, and 
 your logging statements will use Java logging under the covers.
 This makes the logging code very simple. For a class A the logger will be 
 instantiated like this:
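 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

 public class A {
   // One logger per class, named after the class
   private static final Logger logger = LoggerFactory.getLogger(A.class);
 }
 And will later be used like this:
 public class A {
   private static final Logger logger = LoggerFactory.getLogger(A.class);

   public void foo() {
     // Skip building the message when debug output is off
     if (logger.isDebugEnabled()) {
       logger.debug(message); // message: whatever String you want to log
     }
   }
 }
 That's all!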
 Checking for isDebugEnabled is very quick, at least using the JDK14 adapter 
 (but I assume it's fast also over other logging frameworks).
 The important thing is, every class controls its own logger. Not all classes 
 have to output logging messages, and we can improve Lucene's logging 
 gradually, w/o changing the API, by adding more logging messages to 
 interesting classes.
 I will submit a patch shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



JDBC access to a Lucene index

2009-10-16 Thread Jukka Zitting
Hi,

A while ago I implemented a simple JDBC to JCR bridge [1] that
allows one to query a JCR repository from any JDBC client, most
notably various reporting tools.

Now I'm wondering if something similar already exists for a normal
Lucene index. Something that would treat your entire index as one huge
table (or perhaps a set of tables based on some document type field)
and would allow you to use simple SQL SELECTs to query data.

Any pointers would be welcome.

[1] http://dev.day.com/microsling/content/blogs/main/jdbc2jcr.html

BR,

Jukka Zitting




Lucene icon and Ohloh

2009-08-31 Thread Jukka Zitting
Hi,

I was checking the Lucene Java entry on Ohloh [1] and noticed that the
full green Lucene text logo doesn't work too well in the 64x64 and
16x16 sizes used there.

So I took the liberty of dropping the "ucene" part of the logo and
coming up with a 64x64 pixel icon containing just the stylized L.
See the result in [2] and the smaller 16x16 version in [3].

WDYT, should we keep this icon or revert Ohloh to use the normal Lucene logo?

PS. I also marked myself as a manager of the Lucene entry in Ohloh.
The manager feature [4] makes it possible to prevent potential
spamming of the Ohloh records. I'd be happy to hand over the role to
someone closer to Lucene Java development.

[1] http://www.ohloh.net/p/lucene
[2] http://bits.ohloh.net/attachments/23787/lucene_med.png
[3] http://bits.ohloh.net/attachments/23787/lucene_tiny.png
[4] https://www.ohloh.net/wiki/ManagingProjects

BR,

Jukka Zitting




Re: software grants

2009-07-07 Thread Jukka Zitting
Hi,

On Tue, Jul 7, 2009 at 4:05 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> Regarding the software grant debate in
> https://issues.apache.org/jira/browse/LUCENE-1567
> IMO, it's pretty subjective what needs a software grant, and I don't
> think we should throw up any hard'n'fast rules about it.  The bottom
> line is that the PMC/committers are responsible for IP oversight for
> everything committed.

Agreed, the important thing is to ensure that we have the right to
publish and distribute the contributed code in our releases. That can
mean an existing license on the contribution, a reference to section 5
of ALv2, a CLA, a software grant, or whatever else will hold up
under a license review.

There are few people who understand the potential licensing
complexities of code developed by a number of different contributors.
Does the submitter know that the work of the previous developers was
meant to be contributed to Apache? Where's the paper trail for that? A
software grant is a simple and easy way to cover an entire
contribution.

In this case, since all the work was apparently done within IBM (who'd
then be the copyright owner), anyone listed in the Schedule A of an
IBM CCLA could also contribute the code without needing an explicit
software grant.

BR,

Jukka Zitting




[jira] Created: (LUCENE-1675) Add a link to the release archive

2009-06-01 Thread Jukka Zitting (JIRA)
Add a link to the release archive
-

 Key: LUCENE-1675
 URL: https://issues.apache.org/jira/browse/LUCENE-1675
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Reporter: Jukka Zitting
Priority: Minor


It would be nice if the [Releases 
page|http://lucene.apache.org/java/docs/releases.html] contained a link to the 
release archive at http://archive.apache.org/dist/lucene/java/.




[jira] Commented: (LUCENE-931) Some files are missing the license headers

2007-06-09 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12503034
 ] 

Jukka Zitting commented on LUCENE-931:
--

Nice, thanks!

 Some files are missing the license headers
 --

 Key: LUCENE-931
 URL: https://issues.apache.org/jira/browse/LUCENE-931
 Project: Lucene - Java
  Issue Type: Wish
  Components: Javadocs
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Trivial
 Fix For: 2.2

 Attachments: lucene-931.patch


 Jukka provided the following list of files that are missing the license 
 headers.
 In addition there might be other files (like build scripts) that don't have 
 the headers.
 src/java/org/apache/lucene/document/MapFieldSelector.java
 src/java/org/apache/lucene/search/PrefixFilter.java
 src/test/org/apache/lucene/TestHitIterator.java
 src/test/org/apache/lucene/analysis/TestISOLatin1AccentFilter.java
 src/test/org/apache/lucene/index/TestAddIndexesNoOptimize.java
 src/test/org/apache/lucene/index/TestBackwardsCompatibility.java
 src/test/org/apache/lucene/index/TestFieldInfos.java
 src/test/org/apache/lucene/index/TestIndexFileDeleter.java
 src/test/org/apache/lucene/index/TestIndexWriter.java
 src/test/org/apache/lucene/index/TestIndexWriterDelete.java
 src/test/org/apache/lucene/index/TestIndexWriterLockRelease.java
 src/test/org/apache/lucene/index/TestIndexWriterMergePolicy.java
 src/test/org/apache/lucene/index/TestNorms.java
 src/test/org/apache/lucene/index/TestParallelTermEnum.java
 src/test/org/apache/lucene/index/TestSegmentTermEnum.java
 src/test/org/apache/lucene/index/TestTerm.java
 src/test/org/apache/lucene/index/TestTermVectorsReader.java
 src/test/org/apache/lucene/search/TestRangeQuery.java
 src/test/org/apache/lucene/search/TestTermScorer.java
 src/test/org/apache/lucene/store/TestBufferedIndexInput.java
 src/test/org/apache/lucene/store/TestWindowsMMap.java
 src/test/org/apache/lucene/store/_TestHelper.java
 src/test/org/apache/lucene/util/_TestUtil.java
 contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/SimpleSloppyPhraseQueryMaker.java
 contrib/gdata-server/src/core/src/java/org/apache/lucene/gdata/server/FeedNotFoundException.java
 contrib/gdata-server/src/core/src/java/org/apache/lucene/gdata/server/registry/ComponentType.java
 contrib/gdata-server/src/core/src/java/org/apache/lucene/gdata/server/registry/RegistryException.java
 contrib/gdata-server/src/core/src/java/org/apache/lucene/gdata/storage/lucenestorage/StorageAccountWrapper.java
 contrib/gdata-server/src/core/src/test/org/apache/lucene/gdata/storage/lucenestorage/TestModifiedEntryFilter.java
 contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/AtomUriElementTest.java
 contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/GOMEntryImplTest.java
 contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/GOMFeedImplTest.java
 contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/GOMGenereatorImplTest.java
 contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/GOMSourceImplTest.java
 contrib/highlighter/src/java/org/apache/lucene/search/highlight/TokenSources.java
 contrib/javascript/queryConstructor/luceneQueryConstructor.js
 contrib/javascript/queryEscaper/luceneQueryEscaper.js
 contrib/javascript/queryValidator/luceneQueryValidator.js
 contrib/queries/src/java/org/apache/lucene/search/BooleanFilter.java
 contrib/queries/src/java/org/apache/lucene/search/BoostingQuery.java
 contrib/queries/src/java/org/apache/lucene/search/FilterClause.java
 contrib/queries/src/java/org/apache/lucene/search/FuzzyLikeThisQuery.java
 contrib/queries/src/java/org/apache/lucene/search/TermsFilter.java
 contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThisQuery.java
 contrib/queries/src/test/org/apache/lucene/search/BooleanFilterTest.java
 contrib/regex/src/test/org/apache/lucene/search/regex/TestSpanRegexQuery.java
 contrib/snowball/src/java/net/sf/snowball/Among.java
 contrib/snowball/src/java/net/sf/snowball/SnowballProgram.java
 contrib/snowball/src/java/net/sf/snowball/TestApp.java
 contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestSpellChecker.java
 contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/BooleanQueryTst.java
 contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/ExceptionQueryTst.java
 contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/SingleFieldTestDb.java
 contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/Test01Exceptions.java
 contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/Test02Boolean.java
 contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/Test03Distance.java
 contrib/wordnet/src/java

Re: Please help testing the release files

2007-06-08 Thread Jukka Zitting

Hi,

On 6/5/07, Michael Busch [EMAIL PROTECTED] wrote:

> So please help testing the release files on different platforms with
> different JVM versions.


Tested on:

  - Windows XP, Sun Java 1.4.2_12
  - Windows XP, Sun Java 1.6.0-b105
  - Ubuntu 7.04, Sun Java 1.6.0-b105

I also ran RAT (http://code.google.com/p/arat/) on the source archive,
and there seem to be some files without license headers. Nothing
really major, but you may want to check at least some of the files.
I've listed the source files below, but I think the best practice
would nowadays be to include license headers also in things like Ant
build scripts, etc.

BR,

Jukka Zitting

src/java/org/apache/lucene/document/MapFieldSelector.java
src/java/org/apache/lucene/search/PrefixFilter.java
src/test/org/apache/lucene/TestHitIterator.java
src/test/org/apache/lucene/analysis/TestISOLatin1AccentFilter.java
src/test/org/apache/lucene/index/TestAddIndexesNoOptimize.java
src/test/org/apache/lucene/index/TestBackwardsCompatibility.java
src/test/org/apache/lucene/index/TestFieldInfos.java
src/test/org/apache/lucene/index/TestIndexFileDeleter.java
src/test/org/apache/lucene/index/TestIndexWriter.java
src/test/org/apache/lucene/index/TestIndexWriterDelete.java
src/test/org/apache/lucene/index/TestIndexWriterLockRelease.java
src/test/org/apache/lucene/index/TestIndexWriterMergePolicy.java
src/test/org/apache/lucene/index/TestNorms.java
src/test/org/apache/lucene/index/TestParallelTermEnum.java
src/test/org/apache/lucene/index/TestSegmentTermEnum.java
src/test/org/apache/lucene/index/TestTerm.java
src/test/org/apache/lucene/index/TestTermVectorsReader.java
src/test/org/apache/lucene/search/TestRangeQuery.java
src/test/org/apache/lucene/search/TestTermScorer.java
src/test/org/apache/lucene/store/TestBufferedIndexInput.java
src/test/org/apache/lucene/store/TestWindowsMMap.java
src/test/org/apache/lucene/store/_TestHelper.java
src/test/org/apache/lucene/util/_TestUtil.java
contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/SimpleSloppyPhraseQueryMaker.java
contrib/gdata-server/src/core/src/java/org/apache/lucene/gdata/server/FeedNotFoundException.java
contrib/gdata-server/src/core/src/java/org/apache/lucene/gdata/server/registry/ComponentType.java
contrib/gdata-server/src/core/src/java/org/apache/lucene/gdata/server/registry/RegistryException.java
contrib/gdata-server/src/core/src/java/org/apache/lucene/gdata/storage/lucenestorage/StorageAccountWrapper.java
contrib/gdata-server/src/core/src/test/org/apache/lucene/gdata/storage/lucenestorage/TestModifiedEntryFilter.java
contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/AtomUriElementTest.java
contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/GOMEntryImplTest.java
contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/GOMFeedImplTest.java
contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/GOMGenereatorImplTest.java
contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/GOMSourceImplTest.java
contrib/highlighter/src/java/org/apache/lucene/search/highlight/TokenSources.java
contrib/javascript/queryConstructor/luceneQueryConstructor.js
contrib/javascript/queryEscaper/luceneQueryEscaper.js
contrib/javascript/queryValidator/luceneQueryValidator.js
contrib/queries/src/java/org/apache/lucene/search/BooleanFilter.java
contrib/queries/src/java/org/apache/lucene/search/BoostingQuery.java
contrib/queries/src/java/org/apache/lucene/search/FilterClause.java
contrib/queries/src/java/org/apache/lucene/search/FuzzyLikeThisQuery.java
contrib/queries/src/java/org/apache/lucene/search/TermsFilter.java
contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThisQuery.java
contrib/queries/src/test/org/apache/lucene/search/BooleanFilterTest.java
contrib/regex/src/test/org/apache/lucene/search/regex/TestSpanRegexQuery.java
contrib/snowball/src/java/net/sf/snowball/Among.java
contrib/snowball/src/java/net/sf/snowball/SnowballProgram.java
contrib/snowball/src/java/net/sf/snowball/TestApp.java
contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestSpellChecker.java
contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/BooleanQueryTst.java
contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/ExceptionQueryTst.java
contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/SingleFieldTestDb.java
contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/Test01Exceptions.java
contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/Test02Boolean.java
contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/Test03Distance.java
contrib/wordnet/src/java/org/apache/lucene/wordnet/SynExpand.java
contrib/wordnet/src/java/org/apache/lucene/wordnet/SynLookup.java
contrib/wordnet/src/java/org/apache/lucene/wordnet/Syns2Index.java


Re: Lucene 2.2 soon?

2007-06-04 Thread Jukka Zitting

Hi,

On 6/1/07, Michael Busch [EMAIL PROTECTED] wrote:

> Considering all these improvements I think it's time for a new release,
> especially since many of you voted in February to have releases more
> frequently.


Big +1 from me! We're doing a big 1.4 release of Jackrabbit in a few
months and many of the improvements you listed would be very much
welcome.

PS. When doing 2.2, it would be nice if you could upload the release
artifacts also in the Maven repository. See the instructions in
http://wiki.apache.org/jakarta-lucene/ReleaseTodo. Lucene 2.1 not
being in the Maven repository is the main thing keeping Jackrabbit
from upgrading right away.

BR,

Jukka Zitting




Re: Lucene 2.2 soon?

2007-06-04 Thread Jukka Zitting

Hi,

On 6/4/07, Michael Busch [EMAIL PROTECTED] wrote:

>> PS. When doing 2.2, it would be nice if you could upload the release
>> artifacts also in the Maven repository. See the instructions in
>> http://wiki.apache.org/jakarta-lucene/ReleaseTodo. Lucene 2.1 not
>> being in the Maven repository is the main thing keeping Jackrabbit
>> from upgrading right away.
>
> We're already working on getting the upload into the Maven repository
> done right this time.
> (See https://issues.apache.org/jira/browse/LUCENE-622)


Nice, thanks a lot to everyone involved!

BR,

Jukka Zitting




[jira] Commented: (LUCENE-523) FSDirectory.openFile(String) causes ClassCastException

2007-05-11 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495174
 ] 

Jukka Zitting commented on LUCENE-523:
--

We worked around the issue in Jackrabbit by using the new openInput method. I 
guess the underlying issue (FSDirectory.openFile throws an exception) is still 
there in Lucene, but I'm not sure if people are actually using that method.

 FSDirectory.openFile(String) causes ClassCastException
 --

 Key: LUCENE-523
 URL: https://issues.apache.org/jira/browse/LUCENE-523
 Project: Lucene - Java
  Issue Type: Bug
  Components: Store
Affects Versions: 1.9, 2.0.0
 Environment: Lucene 1.9.1
Reporter: Eric Isakson

 When you call FSDirectory.openFile(String) you get a ClassCastException since 
 FSIndexInput is not an org.apache.lucene.store.InputStream
 The workaround is to reimplement using openInput(String). I personally don't 
 need this to be fixed but wanted to document it here in case anyone else runs 
 into this for any reason.
 The reason I'm calling this is that I have a requirement on my project to 
 create read only indexes and name the index segments consistently from one 
 build to the next. So, after creating and optimizing the index, I rename the 
 files and rewrite the segments file. It would be nice if I had an API that 
 would allow me to say I only want one segment and I want its name to be 
 'foo'. For instance IndexWriter.optimize(String segmentName)




[PROPOSAL] Tika, a content analysis toolkit

2007-03-07 Thread Jukka Zitting

Hi,

[Cross-posting to announce the Tika proposal, please use
general@incubator.apache.org for followup discussion.]

This is a proposal to start a content analysis toolkit project in the
Apache Incubator. The live version of the proposal is available at
http://wiki.apache.org/incubator/TikaProposal.

Comments and questions are welcome. There is also a vacant place for a
third mentor. Once people are satisfied with the proposal I will first
call a vote on the Lucene PMC to sponsor the proposal and then a vote
on the Incubator PMC to accept the project for incubation.

PS. Based on quick Google and USPTO searches there doesn't seem to be
anything that would cause trouble with the Tika name.

BR,

Jukka Zitting


Tika, a content analysis toolkit


Abstract


Tika is a toolkit for detecting and extracting metadata and structured
text content from various documents using existing parser libraries.

Proposal


The Tika content analysis toolkit will include features for detecting
the content types, character encodings, languages, and other characteristics
of existing documents and for extracting structured text content from
the documents.

The toolkit is targeted especially for search engines and other content
indexing and analysis tools, but will be useful also for other applications
that need to extract meaningful information from documents that might
be presented as nothing else than binary streams.

Instead of implementing its own document parsers, Tika will use existing
parser libraries like Jakarta POI [1] and PDFBox [2].

Background
--

The initial idea for the Tika project was voiced in April 2006 by
Jérôme Charron and Chris A. Mattmann on the Nutch mailing list. The Nutch
parser framework and other content analysis features were seen as
value-added components that would also benefit other projects. The idea
received positive feedback, but lacked momentum.

The idea was revisited in August 2006 when Jukka Zitting from the
Jackrabbit project contacted Nutch for possible cooperation with similar
ideas. The original Tika idea gained extra momentum and a Google Code
project was set up as a staging area for prototype code before deciding
how to best handle the setup of a new project. After a few initial
commits the activity again declined.

In January 2007 the idea started gaining more momentum when Rida Benjelloun
offered to contribute the Lius project [3] to Apache Lucene and when Mark
Harwood also started looking for a generic toolkit like Tika.

This proposal is the result of the above efforts and related discussions
both in private and on various public forums. Some alternatives to
incubation, like Apache Labs [4] or Jakarta Commons [5], came up during
the discussions but we believe that taking the project to the Incubator
is the best way to start growing a viable community to sustain the Tika
toolkit.

Rationale
-

There is ever more demand for tools that automatically analyze and index
documents in various formats. Search engines, content repositories, and
other tools often need to extract metadata and text content from documents
given as nothing or little else than a simple octet stream. While there
are a number of existing parser libraries for various document types,
each of them comes with a custom API and there are no generic tools for
automatically determining which parser to use for which documents.
Currently many projects end up creating their custom content analysis
and extraction tools.

The Tika project attempts to remove this duplication of efforts. We
believe that by pooling the efforts of multiple projects we will be able
to create a generic toolkit that exceeds the capabilities and quality of
the custom solutions of any single project. A generic toolkit project
will also provide common ground for the developers of parser libraries
and content applications to interact.

Initial Goals
-

The initial goals of the proposed project are:

   * Viable community around the Tika codebase

   * Active relationships and possible cooperation with related
 projects and communities

   * Generic parser API for extracting structured text content from
 various document formats

   * Flexible metadata detection and extraction API

   * Java implementations of the metadata standards mentioned below


Current Status
==

Meritocracy
---

All the initial committers are familiar with the meritocracy principles
of Apache, and have already worked on the various source codebases. We will
follow the normal meritocracy rules also with other potential contributors.

Community
-

There is not yet a clear Tika community. Instead we have a number of people
and related projects with an understanding that a shared toolkit project
would best serve everyone's interests. The primary goal of the incubating
project is to build a self-sustaining community around this shared vision

Re: [jira] Lius into apache incubator

2007-03-01 Thread Jukka Zitting

Hi,

I am interested in a Lius/Tika project that could be used not only with
Lucene. As mentioned by Mark, there are a number of related efforts, which
leads me to believe an application-independent content analysis/parsing tool
would be very helpful for many users.

I'd like to propose taking the project to the Apache Incubator to better
attract interest also from outside Lucene.

BR,

Jukka Zitting




Re: [jira] Lius into apache incubator

2007-03-01 Thread Jukka Zitting

Hi,

On 3/1/07, Rida Benjelloun [EMAIL PROTECTED] wrote:

> Lius could be used as a starting point of Tika project, if Tika committers
> are interested in it. We can also, as Mark said, decouple Lius's parser logic
> from its indexing logic.


I'm very interested in doing that. Another very useful codebase, among
others, would be the existing parser framework in the Nutch project.


> Taking the project into Apache incubator could be also interesting, to get
> more people involved in it.


Exactly. I'd like to avoid starting just yet another codebase, and
focus more on bringing the best parts (both code and ideas) of the
existing projects together. The community-building focus of the
Incubator would likely help with that. Another aspect that would
benefit from the Incubator scrutiny are the legal implications of
pulling together multiple document parser libraries under various
different licenses.

Would there be interest within the Lucene PMC in sponsoring a proposal
along such lines? I can volunteer to put together the proposal and act
as the champion and mentor of the project.

BR,

Jukka Zitting




Re: [jira] Lius into apache incubator

2007-03-01 Thread Jukka Zitting

Hi,

On 3/1/07, Grant Ingersoll [EMAIL PROTECTED] wrote:

> Is the Droids lab at all related to that parsing project in Nutch?


Partly, yes. I've been looking at Droids and so far I think its main
focus has been on the crawling part rather than on the analysis of
retrieved content. A generic content analysis toolkit would likely be
a great companion for Droids. In fact I was earlier contemplating
starting a related effort in Apache Labs (see
http://issues.apache.org/jira/browse/JCR-728), but there seems to be
enough demand for such functionality that a more full-fledged project
might be better.


> There seem to be several efforts that are related here that could
> probably make for a nice new project under Lucene, IMO.  They all
> seem to have to do with getting and preparing text for processing by
> some type of consumer of text.


Exactly. It would be great to see some consolidation of efforts.

BR,

Jukka Zitting




Re: [jira] Lius into apache incubator

2007-03-01 Thread Jukka Zitting

Hi,

On 3/1/07, Rida Benjelloun [EMAIL PROTECTED] wrote:

> On 3/1/07, Jukka Zitting [EMAIL PROTECTED] wrote:
>> Would there be interest within the Lucene PMC in sponsoring a proposal
>> along such lines? I can volunteer to put together the proposal and act
>> as the champion and mentor of the project.
>
> -- We can put together the proposal and you can be the mentor of the
> project.


See below for a quick first draft (filled with TODOs).

PS. Will people mind if we use this list for fleshing out the details?
I've created a Google Group for Tika where we could also take the
discussion if that's preferred.

BR,

Jukka Zitting


Tika Proposal
=

This is an early draft of a possible proposal for a Tika project
within the Apache Incubator. See
http://incubator.apache.org/guides/proposal.html for a description of
the proposal template.

Abstract


Tika is a toolkit for detecting and extracting metadata and text
content from various documents using existing parser libraries.

Proposal


The Tika content analysis toolkit will include features for detecting
the content types, character encodings, languages, and other
characteristics of existing documents and for extracting structured
text content from the documents.

The toolkit is targeted especially for search engines and other
content indexing and analysis tools, but will be useful also for other
applications that need to extract meaningful information from
documents that might be presented as nothing else than binary streams.

Instead of implementing its own document parsers, Tika will use
existing parser libraries like Jakarta POI and PDFBox.

Background
--

The need for tools that automatically analyze and index content is
increasing as ever more information becomes available.

TODO: Discuss the various related projects and the lack of a common
analysis toolkit. Note how many of the existing tools have grown as
ad-hoc solutions to specific needs, and are often tightly bound to a
specific application or a parser library.

Rationale
-

TODO

Initial Goals
-

TODO

Current Status
--

TODO

Meritocracy
---

TODO

Community
-

TODO

Core Developers
---

TODO

Alignment
-

TODO

Known Risks
---

TODO: There has been on-and-off interest in something like this for
quite a while already. How can we make sure that the current increase
in interest doesn't fade away?

Orphaned products
-

TODO: See the comment above

Inexperience with Open Source
-

TODO: Many of the interested participants have open source background.

Homogenous Developers
-

TODO: There is no central company behind the proposal.

Reliance on Salaried Developers
---

TODO: Some of us are salaried for this, others are not.

Relationships with Other Apache Products


TODO: Lucene, Nutch, Jackrabbit, Droids, ...

An Excessive Fascination with the Apache Brand
-

TODO

Documentation
-

TODO

Initial Source
--

TODO: Tika, Lius, Nutch?, ...

Source and Intellectual Property Submission Plan


TODO

External Dependencies
-

TODO: Some of the potential parser libraries will be GPL-licensed or
otherwise troublesome for an ASF project. How to best handle such
cases?

Cryptography


TODO: Some of the document formats involve encryption and features
like DRM. While Tika itself will probably not include any
cryptographic code, the parser dependencies will most likely include
such code.

Required Resources
--

Mailing lists

 * [EMAIL PROTECTED]

Subversion Directory

 * https://svn.apache.org/repos/asf/incubator/tika

Issue Tracking

 * JIRA TIKA

Other Resources

 * none

Initial Committers
--

TODO

Affiliations


TODO

Sponsors


Champion

TODO (I can volunteer)

Nominated Mentors

TODO (Three mentors is the recommendation, I can volunteer as one)

Sponsoring Entity

TODO (Apache Lucene?)

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Lius into apache incubator

2007-03-01 Thread Jukka Zitting

Hi,

On 3/1/07, Doug Cutting [EMAIL PROTECTED] wrote:

Jukka Zitting wrote:
 PS. Will people mind if we use this list for fleshing out the details?
 I've created a Google Group for Tika where we could also take the
 discussion if that's preferred.

I think the Incubator Wiki would be the best place for this.

http://wiki.apache.org/incubator/?action=fullsearch&value=proposal&titlesearch=Titles

Interested folks could subscribe to the proposal page.  You could
announce the proposal page on several lists.  Will that work for you?


Sounds good. I uploaded the early draft to
http://wiki.apache.org/incubator/TikaProposal, I'll announce it in a
moment.


Also, I can probably help as a mentor if needed.


Cool, thanks!

BR,

Jukka Zitting




[jira] Created: (LUCENE-734) Upload Lucene 2.0 artifacts in the Maven 1 repository

2006-11-30 Thread Jukka Zitting (JIRA)
Upload Lucene 2.0 artifacts in the Maven 1 repository
-

 Key: LUCENE-734
 URL: http://issues.apache.org/jira/browse/LUCENE-734
 Project: Lucene - Java
  Issue Type: Task
  Components: Other
Reporter: Jukka Zitting
Priority: Minor


The Lucene 2.0 artifacts can be found in the Maven 2 repository, but not in the 
Maven 1 repository. There are still projects using Maven 1 who might be 
interested in upgrading to Lucene 2, so having the artifacts also in the Maven 
1 repository would be very helpful.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] Commented: (LUCENE-734) Upload Lucene 2.0 artifacts in the Maven 1 repository

2006-11-30 Thread Jukka Zitting (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-734?page=comments#action_12454774 ] 

Jukka Zitting commented on LUCENE-734:
--

The ReleaseTodo page is immutable so I can't modify it directly.

At least the Maven sync directory information is outdated; the new official 
path (although I think the previous one is still symlinked) is 
/www/people.apache.org/repo/m2-ibiblio-rsync-repository.

You are right in that the artifacts in the Maven 2 repository above should 
(AFAIK) get automatically copied also to the Maven 1 repository. At least it 
works the other way. I'll check that and report back.

 Upload Lucene 2.0 artifacts in the Maven 1 repository
 -

 Key: LUCENE-734
 URL: http://issues.apache.org/jira/browse/LUCENE-734
 Project: Lucene - Java
  Issue Type: Task
  Components: Other
Reporter: Jukka Zitting
Priority: Minor

 The Lucene 2.0 artifacts can be found in the Maven 2 repository, but not in 
 the Maven 1 repository. There are still projects using Maven 1 who might be 
 interested in upgrading to Lucene 2, so having the artifacts also in the 
 Maven 1 repository would be very helpful.




[jira] Commented: (LUCENE-658) upload major releases to ibiblio

2006-09-03 Thread Jukka Zitting (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-658?page=comments#action_12432389 ] 

Jukka Zitting commented on LUCENE-658:
--

This seems to be a duplicate of LUCENE-551. The releases are available at:

http://www.ibiblio.org/maven2/org/apache/lucene/lucene-core/
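
For reference, the directory in the URL above follows the Maven 2
repository layout, in which the dots of the group id become path
separators. The Java sketch below shows how a set of artifact
coordinates maps to a relative path in the Maven 2 layout and in the
legacy Maven 1 layout. The class and method names are hypothetical, and
version 2.0.0 is used only as an illustration, taken from the affected
versions listed on the issue.

```java
// Hypothetical helper; the class and method names are illustrative only.
final class MavenLayout {

    private MavenLayout() {
    }

    // Maven 2 layout: <groupId with '/' for '.'>/<artifactId>/<version>/<artifactId>-<version>.jar
    static String maven2Path(String groupId, String artifactId, String version) {
        return groupId.replace('.', '/') + "/" + artifactId + "/" + version
                + "/" + artifactId + "-" + version + ".jar";
    }

    // Maven 1 (legacy) layout: <groupId>/jars/<artifactId>-<version>.jar
    static String maven1Path(String groupId, String artifactId, String version) {
        return groupId + "/jars/" + artifactId + "-" + version + ".jar";
    }

    public static void main(String[] args) {
        // Print both relative paths for the same coordinates.
        System.out.println(maven2Path("org.apache.lucene", "lucene-core", "2.0.0"));
        System.out.println(maven1Path("org.apache.lucene", "lucene-core", "2.0.0"));
    }
}
```

The same coordinates therefore resolve to different paths in the two
repositories, which is why an artifact present in one layout is not
automatically visible to clients of the other unless it is copied or
converted.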


 upload major releases to ibiblio
 

 Key: LUCENE-658
 URL: http://issues.apache.org/jira/browse/LUCENE-658
 Project: Lucene - Java
  Issue Type: Task
  Components: Other
Affects Versions: 1.9, 2.0.0
Reporter: Ryan Sonnek

 I'm a current user of maven and the latest 1.9 and 2.0 releases are not 
 available on ibiblio.
 http://www.ibiblio.org/maven2/lucene/lucene/
 Could someone upload the latest versions so that us maven-heads can access 
 the new features?
