[jira] Commented: (LUCENE-1482) Replace infoSteram by a logging framework (SLF4J)

2010-04-08 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854919#action_12854919
 ] 

Jukka Zitting commented on LUCENE-1482:
---

We use SLF4J in Jackrabbit, and having logs from the embedded Lucene index 
available through the same mechanism would be quite useful in some situations.

BTW, using isDebugEnabled() is often not necessary with SLF4J, see 
http://www.slf4j.org/faq.html#logging_performance
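
To illustrate the point about isDebugEnabled(), here is a minimal sketch of why a placeholder-style call can make the explicit guard unnecessary. MiniLogger and ExpensiveArg are hypothetical stand-ins written for this example; they are not the SLF4J API. The logger checks the level first and renders the argument only if the message will actually be written.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustration only: a stand-in for a logging facade, not the SLF4J API. */
final class DeferredLogDemo {
    /** Counts how often the costly argument is rendered to a string. */
    static final AtomicInteger renderCount = new AtomicInteger();

    /** Hypothetical argument whose toString() is expensive. */
    static final class ExpensiveArg {
        @Override
        public String toString() {
            renderCount.incrementAndGet();
            return "expensive";
        }
    }

    /** Hypothetical minimal logger that substitutes "{}" with the argument. */
    static final class MiniLogger {
        final boolean debugEnabled;
        final StringBuilder out = new StringBuilder();

        MiniLogger(boolean debugEnabled) {
            this.debugEnabled = debugEnabled;
        }

        void debug(String format, Object arg) {
            if (!debugEnabled) {
                return; // level check comes first, so arg.toString() is never called
            }
            out.append(format.replace("{}", String.valueOf(arg))).append('\n');
        }
    }

    public static void main(String[] args) {
        MiniLogger off = new MiniLogger(false);
        off.debug("segment {}", new ExpensiveArg()); // nothing is rendered

        MiniLogger on = new MiniLogger(true);
        on.debug("segment {}", new ExpensiveArg()); // rendered exactly once

        System.out.print(on.out);
        System.out.println("renders: " + renderCount.get());
    }
}
```

The caller no longer wraps each call in a level check. An explicit guard can still pay off when merely computing the argument is costly, as opposed to formatting it.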

> Replace infoSteram by a logging framework (SLF4J)
> -
>
> Key: LUCENE-1482
> URL: https://issues.apache.org/jira/browse/LUCENE-1482
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Shai Erera
> Fix For: 3.1
>
> Attachments: LUCENE-1482-2.patch, LUCENE-1482.patch, 
> slf4j-api-1.5.6.jar, slf4j-nop-1.5.6.jar
>
>
> Lucene makes use of infoStream to output messages in its indexing code only. 
> For debugging purposes, when the search application is run on the customer 
> side, getting messages from other code flows, like search, query parsing, 
> analysis etc can be extremely useful.
> There are two main problems with infoStream today:
> 1. It is owned by IndexWriter, so if I want to add logging capabilities to 
> other classes I need to either expose an API or propagate infoStream to all 
> classes (see for example DocumentsWriter, which receives its infoStream 
> instance from IndexWriter).
> 2. I can either turn debugging on or off, for the entire code.
> Introducing a logging framework can allow each class to control its logging 
> independently, and more importantly, allows the application to turn on 
> logging for only specific areas in the code (i.e., org.apache.lucene.index.*).
> I've investigated SLF4J (which stands for Simple Logging Facade for Java), which is, 
> as its name states, a facade over different logging frameworks. As such, you 
> can include the slf4j.jar in your application, and it recognizes at deploy 
> time what is the actual logging framework you'd like to use. SLF4J comes with 
> several adapters for Java logging, Log4j and others. If you know your 
> application uses Java logging, simply drop slf4j.jar and slf4j-jdk14.jar in 
> your classpath, and your logging statements will use Java logging underneath 
> the covers.
> This makes the logging code very simple. For a class A the logger will be 
> instantiated like this:
> import org.slf4j.Logger;
> import org.slf4j.LoggerFactory;
> public class A {
>   private static final Logger logger = LoggerFactory.getLogger(A.class);
> }
> And will later be used like this:
> public class A {
>   private static final Logger logger = LoggerFactory.getLogger(A.class);
>   public void foo() {
>     // Only build and emit the message when debug output is switched on.
>     if (logger.isDebugEnabled()) {
>       logger.debug("message");
>     }
>   }
> }
> That's all !
> Checking for isDebugEnabled is very quick, at least using the JDK14 adapter 
> (but I assume it's fast also over other logging frameworks).
> The important thing is, every class controls its own logger. Not all classes 
> have to output logging messages, and we can improve Lucene's logging 
> gradually, w/o changing the API, by adding more logging messages to 
> interesting classes.
> I will submit a patch shortly
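
The per-area control described in the issue can be sketched with plain java.util.logging, which is what the JDK14 adapter would delegate to. This is only an illustration under assumptions: the logger names below are hypothetical (Lucene does not define them), and the result relies on the default JDK logging configuration, where the root level is INFO.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class PackageLevelDemo {
    // Strong references: java.util.logging holds loggers weakly, so a
    // configured logger must stay reachable for its level to persist.
    static final Logger INDEX_PKG =
        Logger.getLogger("org.apache.lucene.index");
    static final Logger INDEX_WRITER =
        Logger.getLogger("org.apache.lucene.index.IndexWriter");
    static final Logger SEARCHER =
        Logger.getLogger("org.apache.lucene.search.IndexSearcher");

    /** Enables debug-level (FINE) logging for the index package only. */
    static void enableIndexDebugging() {
        INDEX_PKG.setLevel(Level.FINE);
    }

    public static void main(String[] args) {
        enableIndexDebugging();
        // Child loggers inherit the level of the nearest configured ancestor.
        System.out.println("index:  " + INDEX_WRITER.isLoggable(Level.FINE));
        System.out.println("search: " + SEARCHER.isLoggable(Level.FINE));
    }
}
```

To actually see FINE messages on the console, the handler level would have to be lowered as well, since the default ConsoleHandler only passes INFO and above.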

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



JDBC access to a Lucene index

2009-10-16 Thread Jukka Zitting
Hi,

Some time ago I implemented a simple JDBC to JCR bridge [1] that
allows one to query a JCR repository from any JDBC client, most
notably various reporting tools.

Now I'm wondering if something similar already exists for a normal
Lucene index. Something that would treat your entire index as one huge
table (or perhaps a set of tables based on some document type field)
and would allow you to use simple SQL SELECTs to query data.

Any pointers would be welcome.

[1] http://dev.day.com/microsling/content/blogs/main/jdbc2jcr.html

BR,

Jukka Zitting




Lucene icon and Ohloh

2009-08-31 Thread Jukka Zitting
Hi,

I was checking the Lucene Java entry on Ohloh [1] and noticed that the
full green "Lucene" text logo doesn't work too well in the 64x64 and
16x16 sizes used there.

So I took the liberty of dropping the "ucene" part of the logo and
coming up with a 64x64 pixel icon containing just the stylized "L".
See the result in [2] and the smaller 16x16 version in [3].

WDYT, should we keep this icon or revert Ohloh to use the normal "Lucene" logo?

PS. I also marked myself as a "manager" of the Lucene entry in Ohloh.
The "manager" feature [4] makes it possible to prevent potential
spamming of the Ohloh records. I'd be happy to hand over the role to
someone closer to Lucene Java development.

[1] http://www.ohloh.net/p/lucene
[2] http://bits.ohloh.net/attachments/23787/lucene_med.png
[3] http://bits.ohloh.net/attachments/23787/lucene_tiny.png
[4] https://www.ohloh.net/wiki/ManagingProjects

BR,

Jukka Zitting




Re: software grants

2009-07-07 Thread Jukka Zitting
Hi,

On Tue, Jul 7, 2009 at 4:05 PM, Yonik Seeley wrote:
> Regarding the software grant debate in
> https://issues.apache.org/jira/browse/LUCENE-1567
> IMO, it's pretty subjective what needs a software grant, and I don't
> think we should throw up any hard'n'fast rules about it.  The bottom
> line is that the PMC/committers are responsible for IP oversight for
> everything committed.

Agreed, the important thing is to ensure that we have the right to
publish and distribute the contributed code in our releases. That can
mean an existing license on the contribution, a reference to section 5
of ALv2, a CLA, a software grant, or whatever else that will hold up
under a license review.

There are few people who understand the potential licensing
complexities of code developed by a number of different contributors.
Does the submitter know that the work of the previous developers was
meant to be contributed to Apache? Where's the paper trail for that? A
software grant is a simple and easy way to cover an entire
contribution.

In this case, since all the work was apparently done within IBM (who'd
then be the copyright owner), anyone listed in the "Schedule A" of an
IBM CCLA could also contribute the code without needing an explicit
software grant.

BR,

Jukka Zitting




[jira] Created: (LUCENE-1675) Add a link to the release archive

2009-06-01 Thread Jukka Zitting (JIRA)
Add a link to the release archive
-

Key: LUCENE-1675
URL: https://issues.apache.org/jira/browse/LUCENE-1675
Project: Lucene - Java
Issue Type: Improvement
Components: Website
Reporter: Jukka Zitting
Priority: Minor


It would be nice if the [Releases 
page|http://lucene.apache.org/java/docs/releases.html] contained a link to the 
release archive at http://archive.apache.org/dist/lucene/java/.




[jira] Commented: (LUCENE-931) Some files are missing the license headers

2007-06-09 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12503034
 ] 

Jukka Zitting commented on LUCENE-931:
--

Nice, thanks!

> Some files are missing the license headers
> --
>
> Key: LUCENE-931
> URL: https://issues.apache.org/jira/browse/LUCENE-931
> Project: Lucene - Java
> Issue Type: Wish
> Components: Javadocs
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Trivial
> Fix For: 2.2
>
> Attachments: lucene-931.patch
>
>
> Jukka provided the following list of files that are missing the license 
> headers.
> In addition there might be other files (like build scripts) that don't have 
> the headers.
> src/java/org/apache/lucene/document/MapFieldSelector.java
> src/java/org/apache/lucene/search/PrefixFilter.java
> src/test/org/apache/lucene/TestHitIterator.java
> src/test/org/apache/lucene/analysis/TestISOLatin1AccentFilter.java
> src/test/org/apache/lucene/index/TestAddIndexesNoOptimize.java
> src/test/org/apache/lucene/index/TestBackwardsCompatibility.java
> src/test/org/apache/lucene/index/TestFieldInfos.java
> src/test/org/apache/lucene/index/TestIndexFileDeleter.java
> src/test/org/apache/lucene/index/TestIndexWriter.java
> src/test/org/apache/lucene/index/TestIndexWriterDelete.java
> src/test/org/apache/lucene/index/TestIndexWriterLockRelease.java
> src/test/org/apache/lucene/index/TestIndexWriterMergePolicy.java
> src/test/org/apache/lucene/index/TestNorms.java
> src/test/org/apache/lucene/index/TestParallelTermEnum.java
> src/test/org/apache/lucene/index/TestSegmentTermEnum.java
> src/test/org/apache/lucene/index/TestTerm.java
> src/test/org/apache/lucene/index/TestTermVectorsReader.java
> src/test/org/apache/lucene/search/TestRangeQuery.java
> src/test/org/apache/lucene/search/TestTermScorer.java
> src/test/org/apache/lucene/store/TestBufferedIndexInput.java
> src/test/org/apache/lucene/store/TestWindowsMMap.java
> src/test/org/apache/lucene/store/_TestHelper.java
> src/test/org/apache/lucene/util/_TestUtil.java
> contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/SimpleSloppyPhraseQueryMaker.java
> contrib/gdata-server/src/core/src/java/org/apache/lucene/gdata/server/FeedNotFoundException.java
> contrib/gdata-server/src/core/src/java/org/apache/lucene/gdata/server/registry/ComponentType.java
> contrib/gdata-server/src/core/src/java/org/apache/lucene/gdata/server/registry/RegistryException.java
> contrib/gdata-server/src/core/src/java/org/apache/lucene/gdata/storage/lucenestorage/StorageAccountWrapper.java
> contrib/gdata-server/src/core/src/test/org/apache/lucene/gdata/storage/lucenestorage/TestModifiedEntryFilter.java
> contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/AtomUriElementTest.java
> contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/GOMEntryImplTest.java
> contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/GOMFeedImplTest.java
> contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/GOMGenereatorImplTest.java
> contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/GOMSourceImplTest.java
> contrib/highlighter/src/java/org/apache/lucene/search/highlight/TokenSources.java
> contrib/javascript/queryConstructor/luceneQueryConstructor.js
> contrib/javascript/queryEscaper/luceneQueryEscaper.js
> contrib/javascript/queryValidator/luceneQueryValidator.js
> contrib/queries/src/java/org/apache/lucene/search/BooleanFilter.java
> contrib/queries/src/java/org/apache/lucene/search/BoostingQuery.java
> contrib/queries/src/java/org/apache/lucene/search/FilterClause.java
> contrib/queries/src/java/org/apache/lucene/search/FuzzyLikeThisQuery.java
> contrib/queries/src/java/org/apache/lucene/search/TermsFilter.java
> contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThisQuery.java
> contrib/queries/src/test/org/apache/lucene/search/BooleanFilterTest.java
> contrib/regex/src/test/org/apache/lucene/search/regex/TestSpanRegexQuery.java
> contrib/snowball/src/java/net/sf/snowball/Among.java
> contrib/snowball/src/java/net/sf/snowball/SnowballProgram.java
> contrib/snowball/src/java/net/sf/snowball/TestApp.java
> contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestSpellChecker.java
> contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/BooleanQueryTst.java
> contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/ExceptionQueryTst.java
> contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/SingleFieldTestDb.java

Re: Please help testing the release files

2007-06-08 Thread Jukka Zitting

Hi,

On 6/5/07, Michael Busch <[EMAIL PROTECTED]> wrote:

> So please help testing the release files on different platforms with
> different JVM versions.


Tested on:

  - Windows XP, Sun Java 1.4.2_12
  - Windows XP, Sun Java 1.6.0-b105
  - Ubuntu 7.04, Sun Java 1.6.0-b105

I also ran RAT (http://code.google.com/p/arat/) on the source archive,
and there seem to be some files without license headers. Nothing
really major, but you may want to check at least some of the files.
I've listed the source files below, but I think the best practice
would nowadays be to include license headers also in things like Ant
build scripts, etc.

BR,

Jukka Zitting

src/java/org/apache/lucene/document/MapFieldSelector.java
src/java/org/apache/lucene/search/PrefixFilter.java
src/test/org/apache/lucene/TestHitIterator.java
src/test/org/apache/lucene/analysis/TestISOLatin1AccentFilter.java
src/test/org/apache/lucene/index/TestAddIndexesNoOptimize.java
src/test/org/apache/lucene/index/TestBackwardsCompatibility.java
src/test/org/apache/lucene/index/TestFieldInfos.java
src/test/org/apache/lucene/index/TestIndexFileDeleter.java
src/test/org/apache/lucene/index/TestIndexWriter.java
src/test/org/apache/lucene/index/TestIndexWriterDelete.java
src/test/org/apache/lucene/index/TestIndexWriterLockRelease.java
src/test/org/apache/lucene/index/TestIndexWriterMergePolicy.java
src/test/org/apache/lucene/index/TestNorms.java
src/test/org/apache/lucene/index/TestParallelTermEnum.java
src/test/org/apache/lucene/index/TestSegmentTermEnum.java
src/test/org/apache/lucene/index/TestTerm.java
src/test/org/apache/lucene/index/TestTermVectorsReader.java
src/test/org/apache/lucene/search/TestRangeQuery.java
src/test/org/apache/lucene/search/TestTermScorer.java
src/test/org/apache/lucene/store/TestBufferedIndexInput.java
src/test/org/apache/lucene/store/TestWindowsMMap.java
src/test/org/apache/lucene/store/_TestHelper.java
src/test/org/apache/lucene/util/_TestUtil.java
contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/SimpleSloppyPhraseQueryMaker.java
contrib/gdata-server/src/core/src/java/org/apache/lucene/gdata/server/FeedNotFoundException.java
contrib/gdata-server/src/core/src/java/org/apache/lucene/gdata/server/registry/ComponentType.java
contrib/gdata-server/src/core/src/java/org/apache/lucene/gdata/server/registry/RegistryException.java
contrib/gdata-server/src/core/src/java/org/apache/lucene/gdata/storage/lucenestorage/StorageAccountWrapper.java
contrib/gdata-server/src/core/src/test/org/apache/lucene/gdata/storage/lucenestorage/TestModifiedEntryFilter.java
contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/AtomUriElementTest.java
contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/GOMEntryImplTest.java
contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/GOMFeedImplTest.java
contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/GOMGenereatorImplTest.java
contrib/gdata-server/src/gom/src/test/org/apache/lucene/gdata/gom/core/GOMSourceImplTest.java
contrib/highlighter/src/java/org/apache/lucene/search/highlight/TokenSources.java
contrib/javascript/queryConstructor/luceneQueryConstructor.js
contrib/javascript/queryEscaper/luceneQueryEscaper.js
contrib/javascript/queryValidator/luceneQueryValidator.js
contrib/queries/src/java/org/apache/lucene/search/BooleanFilter.java
contrib/queries/src/java/org/apache/lucene/search/BoostingQuery.java
contrib/queries/src/java/org/apache/lucene/search/FilterClause.java
contrib/queries/src/java/org/apache/lucene/search/FuzzyLikeThisQuery.java
contrib/queries/src/java/org/apache/lucene/search/TermsFilter.java
contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThisQuery.java
contrib/queries/src/test/org/apache/lucene/search/BooleanFilterTest.java
contrib/regex/src/test/org/apache/lucene/search/regex/TestSpanRegexQuery.java
contrib/snowball/src/java/net/sf/snowball/Among.java
contrib/snowball/src/java/net/sf/snowball/SnowballProgram.java
contrib/snowball/src/java/net/sf/snowball/TestApp.java
contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestSpellChecker.java
contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/BooleanQueryTst.java
contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/ExceptionQueryTst.java
contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/SingleFieldTestDb.java
contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/Test01Exceptions.java
contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/Test02Boolean.java
contrib/surround/src/test/org/apache/lucene/queryParser/surround/query/Test03Distance.java
contrib/wordnet/src/java/org/apache/lucene/wordnet/SynExpand.java
contrib/wordnet/src/java/org/apache/lucene/wordnet/SynLookup.java
contrib/wordnet/src/java/org/apache/lucene/wordnet/Syns2Index.java
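
For anyone who wants to repeat a rough version of the check without RAT, here is a minimal sketch. It is not how RAT works; it only looks for one marker phrase near the top of each .java file. The marker is the opening of the standard ASF source header, and the 4096-character window is an assumption made for this example. Files carrying the older "Copyright ... The Apache Software Foundation" header would be reported as missing.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LicenseHeaderCheck {
    // Opening phrase of the standard ASF source header.
    static final String MARKER = "Licensed to the Apache Software Foundation";
    // Assumption: a header, if present, sits within the first 4096 characters.
    static final int HEAD_CHARS = 4096;

    /** Returns true if the start of the text contains the marker phrase. */
    static boolean containsHeader(String text) {
        int end = Math.min(text.length(), HEAD_CHARS);
        return text.substring(0, end).contains(MARKER);
    }

    /** Reads a file and checks its beginning for the marker phrase. */
    static boolean fileHasHeader(Path file) {
        try {
            byte[] bytes = Files.readAllBytes(file);
            return containsHeader(new String(bytes, StandardCharsets.ISO_8859_1));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lists all .java files under root that lack the marker phrase. */
    static List<Path> findMissing(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.toString().endsWith(".java"))
                        .filter(p -> !fileHasHeader(p))
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        findMissing(root).forEach(System.out::println);
    }
}
```

Run it with the unpacked source archive as the first argument. Build scripts and other non-Java files are skipped here and would need their own suffix checks.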


Re: Lucene 2.2 soon?

2007-06-04 Thread Jukka Zitting

Hi,

On 6/4/07, Michael Busch <[EMAIL PROTECTED]> wrote:

> > PS. When doing 2.2, it would be nice if you could upload the release
> > artifacts also in the Maven repository. See the instructions in
> > http://wiki.apache.org/jakarta-lucene/ReleaseTodo. Lucene 2.1 not
> > being in the Maven repository is the main blocker for Jackrabbit not
> > to upgrade right away.
>
> We're already working on getting the upload into the Maven repository
> done right this time.
> (See https://issues.apache.org/jira/browse/LUCENE-622)


Nice, thanks a lot to everyone involved!

BR,

Jukka Zitting




Re: Lucene 2.2 soon?

2007-06-04 Thread Jukka Zitting

Hi,

On 6/1/07, Michael Busch <[EMAIL PROTECTED]> wrote:

> Considering all these improvements I think it's time for a new release,
> especially since many of you voted in February to have releases more
> frequently.


Big +1 from me! We're doing a big 1.4 release of Jackrabbit in a few
months and many of the improvements you listed would be very much
welcome.

PS. When doing 2.2, it would be nice if you could upload the release
artifacts also in the Maven repository. See the instructions in
http://wiki.apache.org/jakarta-lucene/ReleaseTodo. Lucene 2.1 not
being in the Maven repository is the main blocker for Jackrabbit not
to upgrade right away.

BR,

Jukka Zitting




[jira] Commented: (LUCENE-523) FSDirectory.openFile(String) causes ClassCastException

2007-05-11 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495174
 ] 

Jukka Zitting commented on LUCENE-523:
--

We worked around the issue in Jackrabbit by using the new openInput method. I 
guess the underlying issue (FSDirectory.openFile throws an exception) is still 
there in Lucene, but I'm not sure if people are actually using that method.

> FSDirectory.openFile(String) causes ClassCastException
> --
>
> Key: LUCENE-523
> URL: https://issues.apache.org/jira/browse/LUCENE-523
> Project: Lucene - Java
> Issue Type: Bug
> Components: Store
> Affects Versions: 1.9, 2.0.0
> Environment: Lucene 1.9.1
> Reporter: Eric Isakson
>
> When you call FSDirectory.openFile(String) you get a ClassCastException since 
> FSIndexInput is not an org.apache.lucene.store.InputStream
> The workaround is to reimplement using openInput(String). I personally don't 
> need this to be fixed but wanted to document it here in case anyone else runs 
> into this for any reason.
> The reason I'm calling this is that I have a requirement on my project to 
> create read only indexes and name the index segments consistently from one 
> build to the next. So, after creating and optimizing the index, I rename the 
> files and rewrite the segments file. It would be nice if I had an API that 
> would allow me to say "I only want one segment and I want its name to be 
> 'foo'". For instance IndexWriter.optimize(String segmentName)




[PROPOSAL] Tika, a content analysis toolkit

2007-03-07 Thread Jukka Zitting

Hi,

[Cross-posting to announce the Tika proposal, please use
general@incubator.apache.org for followup discussion.]

This is a proposal to start a content analysis toolkit project in the
Apache Incubator. The live version of the proposal is available at
http://wiki.apache.org/incubator/TikaProposal.

Comments and questions are welcome. There is also a vacant place for a
third mentor. Once people are satisfied with the proposal I will first
call a vote on the Lucene PMC to sponsor the proposal and then a vote
on the Incubator PMC to accept the project for incubation.

PS. Based on quick Google and USPTO searches there doesn't seem to be
anything that would cause trouble with the "Tika" name.

BR,

Jukka Zitting


Tika, a content analysis toolkit


Abstract


Tika is a toolkit for detecting and extracting metadata and structured
text content from various documents using existing parser libraries.

Proposal


The Tika content analysis toolkit will include features for detecting
the content types, character encodings, languages, and other characteristics
of existing documents and for extracting structured text content from
the documents.

The toolkit is targeted especially for search engines and other content
indexing and analysis tools, but will be useful also for other applications
that need to extract meaningful information from documents that might
be presented as nothing else than binary streams.

Instead of implementing its own document parsers, Tika will use existing
parser libraries like Jakarta POI [1] and PDFBox [2].

Background
----------

The initial idea for the Tika project was voiced in April 2006 by
Jérôme Charron and Chris A. Mattmann on the Nutch mailing list. The Nutch
parser framework and other content analysis features were seen as
value-added components that would benefit also other projects. The idea
received positive feedback, but lacked the momentum.

The idea was revisited in August 2006 when Jukka Zitting from the
Jackrabbit project contacted Nutch for possible cooperation with similar
ideas. The original Tika idea gained extra momentum and a Google Code
project was set up as a staging area for prototype code before deciding
how to best handle the setup of a new project. After a few initial
commits the activity again declined.

In January 2007 the idea started gaining more momentum when Rida Benjelloun
offered to contribute the Lius project [3] to Apache Lucene and when Mark
Harwood also started looking for a generic toolkit like Tika.

This proposal is the result of the above efforts and related discussions
both in private and on various public forums. Some alternatives to
incubation, like Apache Labs [4] or Jakarta Commons [5], came up during
the discussions but we believe that taking the project to the Incubator
is the best way to start growing a viable community to sustain the Tika
toolkit.

Rationale
---------

There is ever more demand for tools that automatically analyze and index
documents in various formats. Search engines, content repositories, and
other tools often need to extract metadata and text content from documents
given as nothing or little else than a simple octet stream. While there
are a number of existing parser libraries for various document types,
each of them comes with a custom API and there are no generic tools for
automatically determining which parser to use for which documents.
Currently many projects end up creating their custom content analysis
and extraction tools.
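
As a rough illustration of the missing piece described above (working out which parser to use for an unlabeled octet stream), here is a minimal sketch. It is not Tika code: the class, the registry, and the stub parsers are hypothetical, and only two well-known signatures are checked ("%PDF-" for PDF, "PK" followed by bytes 3 and 4 for ZIP containers).

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class DetectAndDispatch {
    static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    /** Hypothetical parser registry keyed by media type; parsers are stubs. */
    static final Map<String, Function<byte[], String>> PARSERS = new HashMap<>();
    static {
        // A real toolkit would delegate to existing libraries here.
        PARSERS.put("application/pdf", d -> "PDF parser, " + d.length + " bytes");
        PARSERS.put("application/zip", d -> "ZIP parser, " + d.length + " bytes");
    }

    /** Returns true if data begins with the given prefix bytes. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Guesses a media type from the leading bytes of the stream. */
    static String detect(byte[] data) {
        if (startsWith(data, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(data, ZIP_MAGIC)) {
            return "application/zip";
        }
        return "application/octet-stream"; // unknown content
    }

    /** Detects the type, then hands the bytes to the registered parser. */
    static String parse(byte[] data) {
        String type = detect(data);
        Function<byte[], String> parser = PARSERS.get(type);
        return parser == null ? "no parser for " + type : parser.apply(data);
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf));
        System.out.println(parse(new byte[] {1, 2, 3}));
    }
}
```

Callers see a single entry point regardless of format, which is the kind of common API the paragraph above argues for.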

The Tika project attempts to remove this duplication of efforts. We
believe that by pooling the efforts of multiple projects we will be able
to create a generic toolkit that exceeds the capabilities and quality of
the custom solutions of any single project. A generic toolkit project
will also provide common ground for the developers of parser libraries
and content applications to interact.

Initial Goals
-------------

The initial goals of the proposed project are:

   * Viable community around the Tika codebase

   * Active relationships and possible cooperation with related
 projects and communities

   * Generic parser API for extracting structured text content from
 various document formats

   * Flexible metadata detection and extraction API

   * Java implementations of the metadata standards mentioned below


Current Status
==============

Meritocracy
-----------

All the initial committers are familiar with the meritocracy principles
of Apache, and have already worked on the various source codebases. We will
follow the normal meritocracy rules also with other potential contributors.

Community
---------

There is not yet a clear Tika community. Instead we have a number of people
and related projects with an understanding that a shared toolkit project
would best serve everyone's interests. The primary goal of the incubating
project is to build a self-sustaining communit

Re: [jira] Lius into apache incubator

2007-03-01 Thread Jukka Zitting

Hi,

On 3/1/07, Doug Cutting <[EMAIL PROTECTED]> wrote:

> Jukka Zitting wrote:
> > PS. Will people mind if we use this list for fleshing out the details?
> > I've created a Google Group for Tika where we could also take the
> > discussion if that's preferred.
>
> I think the Incubator Wiki would be the best place for this.
>
> http://wiki.apache.org/incubator/?action=fullsearch&value=proposal&titlesearch=Titles
>
> Interested folks could subscribe to the proposal page.  You could
> announce the proposal page on several lists.  Will that work for you?


Sounds good. I uploaded the early draft to
http://wiki.apache.org/incubator/TikaProposal, I'll announce it in a
moment.


> Also, I can probably help as a mentor if needed.


Cool, thanks!

BR,

Jukka Zitting




Re: [jira] Lius into apache incubator

2007-03-01 Thread Jukka Zitting

Hi,

On 3/1/07, Rida Benjelloun <[EMAIL PROTECTED]> wrote:

> On 3/1/07, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> > Would there be interest within the Lucene PMC in sponsoring a proposal
> > along such lines? I can volunteer to put together the proposal and act
> > as the champion and mentor of the project.
>
> -- >> We can put together the proposal and you can be the mentor of the
> project.


See below for a quick first draft (filled with TODOs).

PS. Will people mind if we use this list for fleshing out the details?
I've created a Google Group for Tika where we could also take the
discussion if that's preferred.

BR,

Jukka Zitting


Tika Proposal
=============

This is an early draft of a possible proposal for a Tika project
within the Apache Incubator. See
http://incubator.apache.org/guides/proposal.html for a description of
the proposal template.

Abstract
--------

Tika is a toolkit for detecting and extracting metadata and text
content from various documents using existing parser libraries.

Proposal
--------

The Tika content analysis toolkit will include features for detecting
the content types, character encodings, languages, and other
characteristics of existing documents and for extracting structured
text content from the documents.

The toolkit is targeted especially for search engines and other
content indexing and analysis tools, but will be useful also for other
applications that need to extract meaningful information from
documents that might be presented as nothing else than binary streams.

Instead of implementing its own document parsers, Tika will use
existing parser libraries like Jakarta POI and PDFBox.

Background
----------

The need for tools that automatically analyze and index content is
increasing as ever more information becomes available.

TODO: Discuss the various related projects and the lack of a common
analysis toolkit. Note how many of the existing tools have grown as
ad-hoc solutions to specific needs, and are often tightly bound to a
specific application or a parser library.

Rationale
---------

TODO

Initial Goals
-------------

TODO

Current Status
--------------

TODO

Meritocracy
-----------

TODO

Community
---------

TODO

Core Developers
---------------

TODO

Alignment
---------

TODO

Known Risks
-----------

TODO: There has been on-and-off interest in something like this for
quite a while already. How can we make sure that the current increase
in interest doesn't fade away?

Orphaned products
-----------------

TODO: See the comment above

Inexperience with Open Source
-----------------------------

TODO: Many of the interested participants have an open source background.

Homogenous Developers
---------------------

TODO: There is no central company behind the proposal.

Reliance on Salaried Developers
-------------------------------

TODO: Some of us are salaried for this, others are not.

Relationships with Other Apache Products
----------------------------------------

TODO: Lucene, Nutch, Jackrabbit, Droids, ...

An Excessive Fascination with the Apache Brand
----------------------------------------------

TODO

Documentation
-------------

TODO

Initial Source
--------------

TODO: Tika, Lius, Nutch?, ...

Source and Intellectual Property Submission Plan
------------------------------------------------

TODO

External Dependencies
---------------------

TODO: Some of the potential parser libraries will be GPL-licensed or
otherwise troublesome for an ASF project. How to best handle such
cases?

Cryptography
------------

TODO: Some of the document formats involve encryption and features
like DRM. While Tika itself will probably not include any
cryptographic code, the parser dependencies will most likely include
such code.

Required Resources
------------------

Mailing lists

 * [EMAIL PROTECTED]

Subversion Directory

 * https://svn.apache.org/repos/asf/incubator/tika

Issue Tracking

 * JIRA TIKA

Other Resources

 * none

Initial Committers
------------------

TODO

Affiliations
------------

TODO

Sponsors
--------

Champion

TODO (I can volunteer)

Nominated Mentors

TODO (Three mentors is the recommendation, I can volunteer as one)

Sponsoring Entity

TODO (Apache Lucene?)




Re: [jira] Lius into apache incubator

2007-03-01 Thread Jukka Zitting

Hi,

On 3/1/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> Is the Droids lab at all related to that parsing project in Nutch?


Partly, yes. I've been looking at Droids and so far I think its main
focus has been on the crawling part rather than on the analysis of
retrieved content. A generic content analysis toolkit would likely be
a great companion for Droids. In fact I was earlier contemplating
starting a related effort in Apache Labs (see
http://issues.apache.org/jira/browse/JCR-728), but there seems to be
enough demand for such functionality that a more full-fledged project
might be better.


> There seems to be several efforts that are related here that could
> probably make for a nice new project under Lucene, IMO.  They all
> seem to have to do with getting and preparing text for processing by
> some type of consumer of text.


Exactly. It would be great to see some consolidation of efforts.

BR,

Jukka Zitting




Re: [jira] Lius into apache incubator

2007-03-01 Thread Jukka Zitting

Hi,

On 3/1/07, Rida Benjelloun <[EMAIL PROTECTED]> wrote:

> Lius could be used as a starting point of Tika project, if Tika committers
> are interested on it. We can also as mark said decouple Lius's parser logic
> from it's indexing logic.


I'm very interested in doing that. Another very useful codebase, among
others, would be the existing parser framework in the Nutch project.


> Taking the project into Apache incubator could be also interesting, to get
> more people involved on it.


Exactly. I'd like to avoid starting yet another codebase, and
focus more on bringing the best parts (both code and ideas) of the
existing projects together. The community-building focus of the
Incubator would likely help with that. Another aspect that would
benefit from Incubator scrutiny is the legal implications of
pulling together multiple document parser libraries under various
different licenses.

Would there be interest within the Lucene PMC in sponsoring a proposal
along such lines? I can volunteer to put together the proposal and act
as the champion and mentor of the project.

BR,

Jukka Zitting




Re: [jira] Lius into apache incubator

2007-03-01 Thread Jukka Zitting

Hi,

I am interested in a Lius/Tika project that could be used not only with
Lucene. As mentioned by Mark, there are a number of related efforts, which
leads me to believe an application-independent content analysis/parsing tool
would be very helpful for many users.

I'd like to propose taking the project to the Apache Incubator to better
attract interest from outside Lucene as well.

BR,

Jukka Zitting

-- 
View this message in context: 
http://www.nabble.com/Lius-into-apache-incubator-tf3145937.html#a9247508
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.





[jira] Commented: (LUCENE-734) Upload Lucene 2.0 artifacts in the Maven 1 repository

2006-12-17 Thread Jukka Zitting (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-734?page=comments#action_12459126 ] 

Jukka Zitting commented on LUCENE-734:
--

Digging deeper I found that the artifacts are actually located in the Maven 1 
repository thanks to some URL rewrite magic, i.e. 
http://repo1.maven.org/maven/org.apache.lucene/jars/lucene-core-2.0.0.jar 
exists even though http://repo1.maven.org/maven/org.apache.lucene/jars/ returns 
a 404 error. So from my perspective it's OK to resolve this issue as Invalid.
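
For reference, the difference between the two repository layouts can be sketched as below. This is only an illustration, assuming the standard Maven 1 layout (groupId/jars/artifactId-version.jar) and the standard Maven 2 layout (groupId with dots turned into slashes, then artifactId/version/); the class and method names are made up for this example and are not part of any Maven or Lucene API.

```java
public class RepoLayouts {

    // Maven 1 layout: <groupId>/jars/<artifactId>-<version>.jar
    // (the groupId is kept as a single dotted directory name)
    public static String maven1Path(String groupId, String artifactId, String version) {
        return groupId + "/jars/" + artifactId + "-" + version + ".jar";
    }

    // Maven 2 layout: <groupId as directories>/<artifactId>/<version>/<artifactId>-<version>.jar
    public static String maven2Path(String groupId, String artifactId, String version) {
        return groupId.replace('.', '/') + "/" + artifactId + "/" + version
                + "/" + artifactId + "-" + version + ".jar";
    }

    public static void main(String[] args) {
        // Relative paths only; prepend the repository base URL as needed.
        System.out.println(maven1Path("org.apache.lucene", "lucene-core", "2.0.0"));
        System.out.println(maven2Path("org.apache.lucene", "lucene-core", "2.0.0"));
    }
}
```

Appending the first path to http://repo1.maven.org/maven/ gives the Maven 1 URL quoted above.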

> FYI: anyone can edit the wiki if you create an account and login.

Yes, thanks. I probably had the page locally cached since I still got the 
"immutable" message on the page after creating an account and logging in. Now 
it shows up as editable, so I'll update the instructions.

> Upload Lucene 2.0 artifacts in the Maven 1 repository
> -
>
> Key: LUCENE-734
> URL: http://issues.apache.org/jira/browse/LUCENE-734
> Project: Lucene - Java
> Issue Type: Task
> Components: Other
> Reporter: Jukka Zitting
> Priority: Minor
>
> The Lucene 2.0 artifacts can be found in the Maven 2 repository, but not in 
> the Maven 1 repository. There are still projects using Maven 1 who might be 
> interested in upgrading to Lucene 2, so having the artifacts also in the 
> Maven 1 repository would be very helpful.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] Commented: (LUCENE-734) Upload Lucene 2.0 artifacts in the Maven 1 repository

2006-11-30 Thread Jukka Zitting (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-734?page=comments#action_12454774 ] 

Jukka Zitting commented on LUCENE-734:
--

The ReleaseTodo page is immutable so I can't modify it directly.

At least the Maven sync directory information is outdated; the new official 
path (although I think the previous one is still symlinked) is 
/www/people.apache.org/repo/m2-ibiblio-rsync-repository.

You are right in that the artifacts in the Maven 2 repository above should 
(AFAIK) get automatically copied also to the Maven 1 repository. At least it 
works the other way. I'll check that and report back.

> Upload Lucene 2.0 artifacts in the Maven 1 repository
> -
>
> Key: LUCENE-734
> URL: http://issues.apache.org/jira/browse/LUCENE-734
> Project: Lucene - Java
> Issue Type: Task
> Components: Other
> Reporter: Jukka Zitting
> Priority: Minor
>
> The Lucene 2.0 artifacts can be found in the Maven 2 repository, but not in 
> the Maven 1 repository. There are still projects using Maven 1 who might be 
> interested in upgrading to Lucene 2, so having the artifacts also in the 
> Maven 1 repository would be very helpful.




[jira] Created: (LUCENE-734) Upload Lucene 2.0 artifacts in the Maven 1 repository

2006-11-30 Thread Jukka Zitting (JIRA)
Upload Lucene 2.0 artifacts in the Maven 1 repository
-

Key: LUCENE-734
URL: http://issues.apache.org/jira/browse/LUCENE-734
Project: Lucene - Java
Issue Type: Task
Components: Other
Reporter: Jukka Zitting
Priority: Minor


The Lucene 2.0 artifacts can be found in the Maven 2 repository, but not in the 
Maven 1 repository. There are still projects using Maven 1 who might be 
interested in upgrading to Lucene 2, so having the artifacts also in the Maven 
1 repository would be very helpful.




[jira] Commented: (LUCENE-619) Lucene 1.9.1 and 2.0.0 Maven 2 packages are incorrectly deployed

2006-09-03 Thread Jukka Zitting (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-619?page=comments#action_12432390 ] 

Jukka Zitting commented on LUCENE-619:
--

The jars seem to be in place now.

> Lucene 1.9.1 and 2.0.0 Maven 2 packages are incorrectly deployed
> 
>
> Key: LUCENE-619
> URL: http://issues.apache.org/jira/browse/LUCENE-619
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 1.9, 2.0.0
> Environment: http://www.ibiblio.org/maven2/org/apache/lucene/lucene-core/
> Reporter: Jordan Christensen
>
> The Lucene JARs at the URL listed in the Environment field only contain the 
> Maven 2 POMs, and not the actual compiled classes. The correct JARs need to 
> be uploaded so that Lucene 1.9.1 and 2.0 can work in Maven 2.
> This was listed as fixed in http://issues.apache.org/jira/browse/LUCENE-551, 
> but was not properly done. The JARs in the Apache Maven repo are incorrect as 
> well. 
> (http://www.apache.org/dist/maven-repository/org/apache/lucene/lucene-core/)
> This issue was raised and confirmed on the mailing list as well: 
> http://www.gossamer-threads.com/lists/lucene/java-user/37169




[jira] Commented: (LUCENE-658) upload major releases to ibiblio

2006-09-03 Thread Jukka Zitting (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-658?page=comments#action_12432389 ] 

Jukka Zitting commented on LUCENE-658:
--

This seems to be a duplicate of LUCENE-551. The releases are available at:

http://www.ibiblio.org/maven2/org/apache/lucene/lucene-core/


> upload major releases to ibiblio
> 
>
> Key: LUCENE-658
> URL: http://issues.apache.org/jira/browse/LUCENE-658
> Project: Lucene - Java
> Issue Type: Task
> Components: Other
> Affects Versions: 1.9, 2.0.0
> Reporter: Ryan Sonnek
>
> I'm a current user of Maven and the latest 1.9 and 2.0 releases are not 
> available on ibiblio.
> http://www.ibiblio.org/maven2/lucene/lucene/
> Could someone upload the latest versions so that us maven-heads can access 
> the new features?




[jira] Commented: (LUCENE-523) FSDirectory.openFile(String) causes ClassCastException

2006-03-19 Thread Jukka Zitting (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-523?page=comments#action_12371018 ] 

Jukka Zitting commented on LUCENE-523:
--

The related Jackrabbit issue is http://issues.apache.org/jira/browse/JCR-352

> FSDirectory.openFile(String) causes ClassCastException
> --
>
> Key: LUCENE-523
> URL: http://issues.apache.org/jira/browse/LUCENE-523
> Project: Lucene - Java
> Type: Bug
> Components: Store
> Versions: 1.9, 2.0
> Environment: Lucene 1.9.1
> Reporter: Eric Isakson
>
> When you call FSDirectory.openFile(String) you get a ClassCastException since 
> FSIndexInput is not an org.apache.lucene.store.InputStream.
> The workaround is to reimplement using openInput(String). I personally don't 
> need this to be fixed but wanted to document it here in case anyone else runs 
> into this for any reason.
> The reason I'm calling this is that I have a requirement on my project to 
> create read only indexes and name the index segments consistently from one 
> build to the next. So, after creating and optimizing the index, I rename the 
> files and rewrite the segments file. It would be nice if I had an API that 
> would allow me to say "I only want one segment and I want its name to be 
> 'foo'". For instance IndexWriter.optimize(String segmentName)
