Re: Resolving term vector even when not stored?

2007-03-16 Thread Doron Cohen
"Mike Klaas" <[EMAIL PROTECTED]> wrote on 16/03/2007 14:26:46:

> On 3/15/07, karl wettin <[EMAIL PROTECTED]> wrote:
> > I propose a change of the current IndexReader.getTermFreqVector/s-
> > code so that it /always/ returns the vector space model of a document,
> > even when fields are set as Field.TermVector.NO.
> >
> > Is that crazy? Could be really slow, but except for that.. And if it
> > is cached then that information is known by inspecting the fields.
> > People don't go fetching term vectors without knowing what they are
> > doing, do they?
>
> The highlighting contrib code does this: attempt to retrieve the
> termvector, catch InvalidArgumentException, fall back to re-analysis
> of the data.

The current way makes more sense to me.  IndexReader.getTermFreqVector() means
"it's there, just fetch it", while the fall-back is more of a
computeTermFreqVector(), which takes much more time.  Users would likely
prefer getting an exception from the get() (oops, term vectors were not
saved..) rather than automatically falling back to an expensive computation.

The fall-back seems appropriate as a reusable utility, perhaps in contrib?
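
A rough sketch of what such a utility might look like, to make the get/compute
split concrete. Nothing here is Lucene API: VectorSource, getTermFreqs and
getText are hypothetical names, IllegalStateException is an arbitrary choice
for the "not stored" signal, and a crude regex split stands in for a real
analyzer.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: none of these types are Lucene classes. */
class TermVectorFallback {

    /** Hypothetical stand-in for an index reader. */
    interface VectorSource {
        /** Returns stored term frequencies, or throws IllegalStateException if none were stored. */
        Map<String, Integer> getTermFreqs(int docId);

        /** Returns the original text of the document, for re-analysis. */
        String getText(int docId);
    }

    /** The expensive path: re-tokenize the text and count terms. */
    static Map<String, Integer> computeTermFreqs(String text) {
        Map<String, Integer> freqs = new TreeMap<>();
        // A crude regex split stands in for a real analyzer (assumption).
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    /** Opt-in utility: try the cheap lookup first, recompute only if it fails. */
    static Map<String, Integer> getOrCompute(VectorSource source, int docId) {
        try {
            return source.getTermFreqs(docId);
        } catch (IllegalStateException notStored) {
            return computeTermFreqs(source.getText(docId));
        }
    }
}
```

Callers that can afford the cost would call getOrCompute() explicitly, while
the plain get keeps failing fast when nothing was stored.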

>
> I'm not sure if that is crazy, but that is what is currently implemented.
>
> -Mike


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Indexing time taken is too long - Help Appreciated.

2007-03-16 Thread Lokeya

Hi,

I am trying to index the content of XML files that hold metadata collected
from a website with a huge collection of documents. The metadata XML contains
control characters, which cause errors when parsing with the DOM parser. I
tried encoding = UTF-8, but it looks like it doesn't cover all the Unicode
characters, and I get an error. When I tried UTF-16, I got a "Prolog content
not allowed here" error. So my guess is that no encoding is going to cover
almost all Unicode characters. So I split my metadata files into small files
and process only the records that don't throw parsing errors.
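
One possible alternative to splitting, sketched under assumptions: XML 1.0
forbids most control characters regardless of the declared encoding, so
removing them before parsing may let the DOM parser accept a whole metadata
file. XmlSanitizer and its method names are hypothetical, and the sketch
assumes the file content has already been read into a String.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Sketch: drop characters that XML 1.0 does not allow, then parse. */
class XmlSanitizer {

    /** Character ranges permitted by the XML 1.0 "Char" production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every illegal code point removed. */
    static String stripIllegalXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    /** Parses the sanitized text with the JDK's DOM parser. */
    static Document parse(String rawXml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(stripIllegalXmlChars(rawXml))));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse record", e);
        }
    }
}
```

This silently discards the offending characters, which is only acceptable if
they carry no meaning in the metadata.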

But breaking each metadata file into smaller files gives me 10,000 XML files
per metadata file. I have 70 metadata files, so altogether that becomes
700,000 files. Processing them individually with Lucene takes a really long
time; my guess is that the I/O is time consuming: opening every small XML
file, loading it into a DOM, extracting the required data and processing it.

Qn 1: Any suggestions for reducing this indexing time? That would be really
great.

Qn 2: Am I overlooking something in Lucene with respect to indexing?

Right now 12 metadata files take nearly 10 hrs, which is a really long time.

Help Appreciated.

Much Thanks.





Re: TestBackwardsCompatibility test relies on being in a certain directory

2007-03-16 Thread Daniel John Debrunner

Daniel John Debrunner wrote:
I'm building lucene in a continuous integration model using
CruiseControl. Every build fails due to TestBackwardsCompatibility
failing. This is because it expects to run in the root directory of a
lucene source tree, e.g. see line 96 in the test. The current directory
for CruiseControl is several levels higher than the lucene source tree
since that is how it works.


Never mind, I found that CruiseControl can set the working directory in
which it executes build tasks, so no need for any change in lucene.


Dan.





TestBackwardsCompatibility test relies on being in a certain directory

2007-03-16 Thread Daniel John Debrunner
I'm building lucene in a continuous integration model using
CruiseControl. Every build fails due to TestBackwardsCompatibility
failing. This is because it expects to run in the root directory of a
lucene source tree, e.g. see line 96 in the test. The current directory
for CruiseControl is several levels higher than the lucene source tree
since that is how it works.


Anyone else seeing this, any ideas for making the test more independent 
of its environment?
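
One way a test could avoid depending on the current directory, as a sketch:
resolve its files against a base directory passed in as a system property and
fall back to the working directory only when the property is absent. The
property name "test.basedir" and the TestPaths class are hypothetical, not
something the Lucene build defines.

```java
import java.io.File;

/** Sketch: resolve test resources against a configurable base directory. */
class TestPaths {

    /** Hypothetical property name; a build script would pass -Dtest.basedir=... */
    static final String BASE_DIR_PROPERTY = "test.basedir";

    /** Returns relativePath under the configured base, or under user.dir if unset. */
    static File resolve(String relativePath) {
        String base = System.getProperty(BASE_DIR_PROPERTY, System.getProperty("user.dir"));
        return new File(base, relativePath);
    }
}
```

A CI tool could then set the property to the source tree root without having
to change the directory it runs from.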


Thanks,
Dan.





Re: Monster lucene geosearch

2007-03-16 Thread Peter Keegan

I will move this to an existing java-user thread and respond there.
Peter

On 3/16/07, Eric Cone <[EMAIL PROTECTED]> wrote:


Hello Peter,

Now that the monster lucene search is live, is performance pretty good? Are
you still running it on a single 8 core server? Can you give me a rough idea
on the number of queries you can handle/second and the number of docs in the
index? Are you using dotLucene or a webservice tier and java?

How did you implement your bounding box for the searching? It sounds like
you do this outside of lucene and return a custom hitcollector. Why not use
a rangequery or functionquery for the basic bounding before sorting?

Thanks,
Eric



Re: Resolving term vector even when not stored?

2007-03-16 Thread Mike Klaas

On 3/15/07, karl wettin <[EMAIL PROTECTED]> wrote:

I propose a change of the current IndexReader.getTermFreqVector/s-
code so that it /always/ returns the vector space model of a document,
even when fields are set as Field.TermVector.NO.

Is that crazy? Could be really slow, but except for that.. And if it
is cached then that information is known by inspecting the fields.
People don't go fetching term vectors without knowing what they are
doing, do they?


The highlighting contrib code does this: attempt to retrieve the
termvector, catch InvalidArgumentException, fall back to re-analysis
of the data.

I'm not sure if that is crazy, but that is what is currently implemented.

-Mike




Monster lucene geosearch

2007-03-16 Thread Eric Cone

Hello Peter,

Now that the monster lucene search is live, is performance pretty good? Are
you still running it on a single 8 core server? Can you give me a rough idea
on the number of queries you can handle/second and the number of docs in the
index? Are you using dotLucene or a webservice tier and java?

How did you implement your bounding box for the searching? It sounds like
you do this outside of lucene and return a custom hitcollector. Why not use
a rangequery or functionquery for the basic bounding before sorting?
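
To make the "basic bounding" idea concrete, here is a rough sketch of deriving
a latitude/longitude box from a center point and a radius. It is not Peter's
implementation: BoundingBox is a hypothetical class, the formula is a simple
spherical approximation, and it ignores the poles and the 180-degree meridian.

```java
/** Sketch: approximate lat/lon box around a point on a spherical Earth. */
class BoundingBox {

    /** Approximate mean Earth radius in miles. */
    static final double EARTH_RADIUS_MILES = 3958.8;

    final double minLat, maxLat, minLon, maxLon;

    BoundingBox(double minLat, double maxLat, double minLon, double maxLon) {
        this.minLat = minLat;
        this.maxLat = maxLat;
        this.minLon = minLon;
        this.maxLon = maxLon;
    }

    /** Builds a box of roughly radiusMiles around the given center (degrees). */
    static BoundingBox around(double latDeg, double lonDeg, double radiusMiles) {
        // Angular distance along a meridian.
        double dLat = Math.toDegrees(radiusMiles / EARTH_RADIUS_MILES);
        // Longitude degrees shrink with the cosine of the latitude.
        double dLon = Math.toDegrees(
                radiusMiles / (EARTH_RADIUS_MILES * Math.cos(Math.toRadians(latDeg))));
        return new BoundingBox(latDeg - dLat, latDeg + dLat, lonDeg - dLon, lonDeg + dLon);
    }

    /** True if the point lies inside the box (edges included). */
    boolean contains(double latDeg, double lonDeg) {
        return latDeg >= minLat && latDeg <= maxLat
                && lonDeg >= minLon && lonDeg <= maxLon;
    }
}
```

The four bounds could serve as the endpoints of two range restrictions, one
per coordinate, assuming the coordinates are indexed in a form that sorts
correctly.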

Thanks,
Eric


[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-03-16 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481734
 ] 

Paul Elschot commented on LUCENE-584:
-

Hoss,

> Paul: I notice Filter.getMatcher returns null, and IndexSearcher tests for
> that and uses it to decide whether or not to iterate over the (non null)
> Matcher, or over the BitSet from Filter.bits. is there any reason that logic
> can't be put in getMatcher, so that if subclasses of Filter don't override
> the getMatcher method it will call bits and then return a Matcher that
> iterates over the set Bits?

Two reasons:
- uncertainty over performance of a Matcher instead of a BitSet,
- this way backward compatibility is very easily guaranteed.

There is also LUCENE-730, which may interfere with the removal of BitSet,
since it allows documents to be scored out of order. However, LUCENE-730
should only be used at the top level of a query search and without a Filter.
I cannot think of an actual case in which there might be interference, but
I may not have looked into that deeply enough.

> we could even change Filter.bits so it's no longer abstract ... it could have
> an implementation that would call getMatcher, and iterate over all of the
> matched docs setting bits on a BitSet that is then returned ... the class
> would still be abstract, and the class javadocs would make it clear that
> subclasses must override at least one of the methods...

I must say that creating a BitSet from a Matcher never occurred to me.
Anyway, when Filter.bits() is deprecated I have no preference about how
it is actually removed.
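
For reference, the two defaults discussed above can be sketched as follows.
This is not the code from the attached patches: DocMatcher and SketchFilter
are hypothetical, simplified stand-ins, and the real Matcher may expose a
different iteration API.

```java
import java.util.BitSet;

/** Hypothetical, simplified stand-in for the Matcher in the LUCENE-584 patches. */
interface DocMatcher {
    /** Returns the next matching document number, or -1 when there are no more. */
    int nextDoc();
}

/**
 * Hypothetical filter base class. Subclasses must override at least one of
 * the two methods; overriding neither makes the defaults call each other forever.
 */
abstract class SketchFilter {

    /** Default: adapt the BitSet from bits() into a matcher. */
    DocMatcher getMatcher(int maxDoc) {
        final BitSet bits = bits(maxDoc);
        return new DocMatcher() {
            private int from = 0; // next index to search from, -1 once exhausted

            public int nextDoc() {
                if (from < 0) {
                    return -1;
                }
                int doc = bits.nextSetBit(from);
                from = (doc < 0) ? -1 : doc + 1;
                return doc;
            }
        };
    }

    /** Default: drain the matcher into a BitSet. */
    BitSet bits(int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        DocMatcher matcher = getMatcher(maxDoc);
        for (int doc = matcher.nextDoc(); doc >= 0; doc = matcher.nextDoc()) {
            result.set(doc);
        }
        return result;
    }
}
```

The mutual defaults keep old bits()-only filters working, at the cost of an
extra wrapper object whose performance is the open question raised above.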


> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: BitsMatcher.java, Filter-20060628.patch, 
> HitCollector-20060628.patch, IndexSearcher-20060628.patch, 
> MatchCollector.java, Matcher.java, Matcher20070226.patch, 
> Scorer-20060628.patch, Searchable-20060628.patch, Searcher-20060628.patch, 
> Some Matchers.zip, SortedVIntList.java, TestSortedVIntList.java
>
>
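> {code}
> package org.apache.lucene.search;
> import java.io.IOException;
> import org.apache.lucene.index.IndexReader;
> // Proposed: bits() returns an interface rather than java.util.BitSet.
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> // Minimal read-only view of a bit set (would live in its own source file).
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}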
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.
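
As an illustration of the memory argument in the issue description, here is a
minimal sketch of a sparse implementation of the proposed interface: a sorted
int array with binary search. SortedIntBitSet is a hypothetical name and is
not the SortedVIntList attached to the issue.

```java
import java.util.Arrays;

/** The interface proposed in the issue description. */
interface AbstractBitSet {
    boolean get(int index);
}

/** Sketch: stores only the set positions, four bytes per visible document. */
class SortedIntBitSet implements AbstractBitSet {

    private final int[] sortedDocs;

    /** Copies and sorts the given document numbers. */
    SortedIntBitSet(int[] docs) {
        this.sortedDocs = docs.clone();
        Arrays.sort(this.sortedDocs);
    }

    /** Binary search: O(log n) per lookup instead of O(1) for java.util.BitSet. */
    public boolean get(int index) {
        return Arrays.binarySearch(sortedDocs, index) >= 0;
    }
}
```

The trade-off is slower lookups in exchange for memory proportional to the
number of visible documents rather than to the size of the index.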

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: New Jira Hudson plugin

2007-03-16 Thread Doron Cohen
Nigel Daley <[EMAIL PROTECTED]> wrote on 14/03/2007 11:22:44:

> I've updated Hudson with a new Jira plugin provided by Kohsuke:
>
>http://lucene.zones.apache.org:8080/
>
> Jira issue numbers should now be hyper-linked to Jira.  Also, in the
> Hadoop-Nightly build I'm experimenting with a feature of the plugin
> that will update the Jira with a link back to the Hudson build in
> which it was integrated.

I personally like this feature; it makes it easy to answer the question:
- did the issue's patch work well after it was "integrated"?

However, it is not clear to me what "integrated" means - is it the first
time this issue number appeared in CHANGES.txt? Or the first time this
issue is resolved? Sometimes there is more than a single commit for a
single issue (not best practice I guess, but sometimes an issue is
reopened) - will there be two integration points then?

Also, how long are these old builds kept on the Hudson server? If old
builds are removed after some limit, would the links added to Jira become
broken?

> For example https://issues.apache.org/jira/browse/HADOOP-1115 now has
> such a link.  If other projects (Lucene, Nutch, Solr) want this feature
> turned on then please let me know.

I think the Hudson link added to HADOOP-1115 is broken, because of a
trailing ')'.

>
> More on the new plugin here:
> http://weblogs.java.net/blog/kohsuke/archive/2007/03/hudsonjira_inte.html
>
> Comments welcome.
>
> Cheers,
> Nige





Re: Using FindBugs, JLint, or PMD?

2007-03-16 Thread Chris Hostetter

: I'm a software researcher at MIT. We are developing an algorithm to
: reprioritize warnings from FindBugs, JLint, and PMD using the software
: change history. I was wondering if you (or your project) use any of
: bug finding tools including FindBugs, JLint, and PMD in the Lucene
: development cycle.

i played around with integrating PMD into the Solr build
process (with an eye towards doing something similar with java-lucene if
it works out), but i'm not actively working on it right now...

http://issues.apache.org/jira/browse/SOLR-143




-Hoss





Re: Using FindBugs, JLint, or PMD?

2007-03-16 Thread Erik Hatcher
However, Fortify runs automated analysis of Lucene and many other  
codebases:




Search Nabble/Google for posts from Brian Chess on this forum if you're
curious about the details.


Erik


On Mar 16, 2007, at 11:09 AM, Otis Gospodnetic wrote:


I don't think we use any of those tools.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message -----
From: Sung Kim <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, March 15, 2007 10:54:52 PM
Subject: Using FindBugs, JLint, or PMD?

Dear developers,

I'm a software researcher at MIT. We are developing an algorithm to
reprioritize warnings from FindBugs, JLint, and PMD using the software
change history. I was wondering if you (or your project) use any of
bug finding tools including FindBugs, JLint, and PMD in the Lucene
development cycle.

Thanks in advance.
Sung Kim <[EMAIL PROTECTED]>




[jira] Resolved: (LUCENE-828) Term's equals() throws ClassCastException if passed something other than a Term

2007-03-16 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved LUCENE-828.
-

   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Patch applied, thanks Paul!


> Term's equals() throws ClassCastException if passed something other than a 
> Term
> ---
>
> Key: LUCENE-828
> URL: https://issues.apache.org/jira/browse/LUCENE-828
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.1
>Reporter: Paul Cowan
>Priority: Trivial
> Attachments: termequals.patch
>
>
> Term.equals(Object) does a cast to Term without checking if the other object 
> is a Term.
> It's unlikely that this would ever crop up but it violates the implied 
> contract of Object.equals().
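
For context, the usual shape of a type-safe equals() is sketched below on a
hypothetical SimpleTerm class. This is not the applied termequals.patch, only
an illustration of the instanceof check the report asks for.

```java
/** Hypothetical two-field value class, standing in for Lucene's Term. */
final class SimpleTerm {

    private final String field;
    private final String text;

    SimpleTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        // instanceof is false for null and for foreign types,
        // so the cast below can never throw ClassCastException.
        if (!(o instanceof SimpleTerm)) {
            return false;
        }
        SimpleTerm other = (SimpleTerm) o;
        return field.equals(other.field) && text.equals(other.text);
    }

    @Override
    public int hashCode() {
        // Kept consistent with equals(): same fields, same result.
        return 31 * field.hashCode() + text.hashCode();
    }
}
```

With this shape, comparing a term against a String or against null simply
returns false.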




Re: Using FindBugs, JLint, or PMD?

2007-03-16 Thread Otis Gospodnetic
I don't think we use any of those tools.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message -----
From: Sung Kim <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, March 15, 2007 10:54:52 PM
Subject: Using FindBugs, JLint, or PMD?

Dear developers,

I'm a software researcher at MIT. We are developing an algorithm to
reprioritize warnings from FindBugs, JLint, and PMD using the software
change history. I was wondering if you (or your project) use any of
bug finding tools including FindBugs, JLint, and PMD in the Lucene
development cycle.

Thanks in advance.
Sung Kim <[EMAIL PROTECTED]>




Payloads Todo

2007-03-16 Thread Grant Ingersoll
I started http://wiki.apache.org/lucene-java/Payload_Planning via
http://wiki.apache.org/lucene-java/LucenePlanning to help us plan out
what is needed for payloads support.  This is just a draft; please
feel free to edit/chop, etc. in the name of improvement.


-Grant

