: Subject: [DISCUSS] Do away with Contrib Committers and make core committers
+1
-Hoss
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
: prime-time as the new solr trunk! Lucene and Solr need to move to a
: common trunk for a host of reasons, including single patches that can
: cover both, shared tags and branches, and shared test code w/o a test
: jar.
Without a clearer picture of how people envision development "overhead"
wor
: with, "if it didn't happen on the lists, it didn't happen". It's the same as
+1
But as the IRC channel gets used more and more, it would *also* be nice if
there was an archive of the IRC channel so that there is a place to go
look to understand the back story behind an idea once it's synthesi
: build and nicely gets all dependencies to Lucene and Tika whenever I build
: or release, no problem there and certainly no need to have it merged into
: Lucene's svn!
The key distinction is that Solr is already in "Lucene's svn" -- The
question is how to reorganize things in a way that makes it easier
: In addition to what Shai mentioned, I wanted to say that there are
: other oddities about how the contrib tests run in ant. For example,
: I'm not sure why we create the junitfailed.flag files (I think it has
: something to do with detecting top-level that a single contrib
: failed).
Correct ..
: I was wondering yesterday why aren't the required libs checked in to SVN? We
Licensing issues.
we can't redistribute them (but we can provide the build.xml code to fetch
them)
-Hoss
: No, no, no, Lucene still has no need for maven or ivy for dependency
management.
: We can just hack around all issues with ant scripts.
it doesn't really matter if it's ant scripts, or ivy declarations, or
maven pom entries -- the point is the same.
We can't distribute the jars, but we can d
: > Is it possible to change it? If not, what is the policy here? To open a
: > new issue and close the old one?
...
: In this case, that would mean either closing this issue and opening a new one,
: or taking the discussion to the mailing list where subject headers may be
: modified as th
: I disabled the account by assigning a dummy eMail and gave it a random
password.
:
: I was not able to unassign the issues, as most issues were "Closed",
: where no modifications can be done anymore. Reopening and changing
Uwe: it may be too late (depending on whether you remember the dummy
Yonik and I have been looking at the memory requirements of an application
we've got. We use a lot of indexed fields, primarily so I can do a lot
of numeric tests (using RangeFilter). When I say "a lot" I mean
around 8,000 -- many of which are not used by all documents in the index.
Now there
: 2) Can you think of a clean way for individual applications to eliminate
: norms (via subclassing the lucene code base - ie: no patching)
For completeness, I should mention that one thing I briefly considered was
writing a new Directory implementation that would proxy to FSDirectory,
but
Paul, thanks for your suggestions. It seems like they mostly address the
issue of improving search time, by eliminating the need to read the norm
files from disk -- but the speed of the query isn't as big of a concern
for us as the memory footprint.
As I understand it, the point when we are reall
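For context on why the footprint adds up: Lucene keeps roughly one norm byte per document for each indexed field that has norms. A back-of-envelope sketch (my own illustration, assuming one byte per doc per field and that queries touch every field) of the memory cost being discussed:

```java
// Back-of-envelope sketch (not Lucene code): each indexed field with norms
// costs roughly one byte per document, loaded lazily the first time a
// query touches that field.
public class NormMemoryEstimate {
    // Estimated bytes of norm arrays: one byte per doc per indexed field.
    public static long normBytes(long numDocs, int numIndexedFields) {
        return numDocs * (long) numIndexedFields;
    }

    public static void main(String[] args) {
        // e.g. 1M docs with 8,000 indexed fields -> ~8 GB of norms if
        // queries eventually touch every field.
        System.out.println(normBytes(1_000_000L, 8_000));
    }
}
```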
: I'm looking to store some additional information in a Lucene index
: and I'm looking for an advise on how to implement the functionality.
: Specifically, I'm planning to store 1) collection frequency count for
: each term, 2) actual document length for each document (yes, I looked
: at the norm f
: > Doesn't this cause a problem for highly interactive and large indexes? Since
: > every update to the index requires the rewriting of the norms, and
: > constructing a new array.
:
: The original complaint was primarily about search-time memory size, not
: update speed. I like the proposed pat
: A more general solution would be to use a subclass of BooleanQuery that
: provides a Weight that flattens all the weights of the subqueries, for example
: to the maximum weight, and for the rest works like the usual Weight of
: BooleanQuery.
I'm not grasping all of the ideas in this thread comp
I've never really looked at the IntegerRangeQuery submission, but if you
think you've found a bug, you should attach your test to the JIRA issue
that the original patch's bug has been migrated to, so that it's clear to
anyone looking at applying it that it may have problems...
http://issues.apache
: Last week I proposed to the Lucene PMC that we make Yonik Seeley a
: committer on Lucene Java. I am pleased to announce that other PMC
: members agreed. Welcome, Yonik!
1) Wah-Hoo! Yonik is definitely one of the smartest guys I've worked with
in the past few years.
2) On the subject of comm
: Formally, the process is that someone nominates, and the PMC votes.
: When Lucene was part of Jakarta we used to just have the committers
: vote, since we had little contact with the Jakarta PMC. But now that
: Lucene has its own PMC we can do it the official Apache way.
Ahh, I see ... I didn'
: I am trying to perform a sort by title field search, but am receiving
: the following error. The search seems to have problems with field
: values that have multiple words. It sorts single word values with no
: problem. Any help will be appreciated. I indexed the title field as
: Field.text(
: probably you'll need http client module (commons-httpclient or something)
More specifically: when dealing with lucene, the concept of a "document"
is very specific: it is an instance of
org.apache.lucene.document.Document. How you construct one of these
Document objects in your application is
: Perhaps something like @since is what we should be using
: on that file formats page.
It's a little late now for older versions, but it might make sense to move
that documentation directly into the code base, where it can be locally
linked to from the javadocs, and included directly into jars
:
: Taking this to java-dev: Since this is such a common issue, would it
: be feasible for Lucene to have some sort of capability to be told
: what field is the unique one and automatically update (delete, and
: add) a document added with a duplicate of a unique field? This
: would probably requi
: And what's the command line to do the svn checkout? It's not apparent
: from the Lucene web site. I have the svn client installed.
The info is in the wiki; I've linked to it from the FAQ...
http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-abe69adac45ac2e9b5c04db87666a6757631
-Hoss
The first thing that occurs to me, is that if the fields you are talking
about "sharing" are always indexed, then you can leave them UnStored, and
use a FieldCache.StringIndex to get the values.
-Hoss
: > Should we dynamically decide to switch to FieldNormQuery when
: > BooleanQuery.maxClauseCount is exceeded? That way queries that
: Why not leave that decision to the program using the query?
: Something like this:
: - catch the TooManyClauses exception,
: - adapt (the offending parts of) th
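The catch-and-adapt idea quoted above can be sketched roughly like this. Note that `TooManyClauses` here is a local stand-in class so the sketch is self-contained (the real exception is an inner class of BooleanQuery), and `expand`/`plan` are hypothetical names, not Lucene APIs:

```java
// Sketch of the "catch the exception, adapt the offending parts" pattern.
// TooManyClauses is a stand-in for Lucene's BooleanQuery.TooManyClauses.
public class AdaptiveExpansion {
    static class TooManyClauses extends RuntimeException {}

    // Pretend term expansion: fails if the query would expand too widely.
    static int expand(int wildcardTerms, int maxClauses) {
        if (wildcardTerms > maxClauses) throw new TooManyClauses();
        return wildcardTerms;
    }

    // Try the normal BooleanQuery plan; on failure, fall back to a
    // coarser plan (e.g. a filter) instead of dying.
    public static String plan(int wildcardTerms, int maxClauses) {
        try {
            expand(wildcardTerms, maxClauses);
            return "boolean";
        } catch (TooManyClauses e) {
            return "filter";
        }
    }
}
```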
: It's fixed now.
: Sorry bout that... I've already set up a test script to switch my JDK
: to 14 before running "ant test".
I don't remember the specifics, but isn't there an attribute for the ant
task that you can use to tell it whether you want it to compile as
1.4 code or 1.5 code? ... I thou
Replying here to a thread from java-user.
Is there any reason not to "cvs remove" all of the files from the old CVS
repository for lucene, and check in a single README file explaining that
the CVS repository is no longer used, and where they can find the SVN
repository?
: Date: Fri, 18 Nov 2005
: I think you are thinking of target="1.4" type of thing. I always
: thought this was about binary compatibility of compiled code, not the
: language syntax, but I'm not sure. Erik will know.
Actually, I'd forgotten about "target" ... I checked and the option i was
thinking of is "source"...
I'm extremely stoked to see this topic come up, but very sad that I didn't
have time to read any Lucene mail this past weekend. I'll have to
catchup.
First off...
: Again, we're talking machine-to-machine communication here, not human-
: machine.
: While there have been several different topic
: Though, I'd be careful with proposing a variety of equivalent
: syntaxes as it may easily lead to more confusion than good. Let's
: start with one canonical syntax. If desired, other (more pleasant)
: syntaxes may then be converted to that as part of a preprocessing step.
Experience has taught
: I have seen this issue come up several times (perhaps the following is
: an oversimplification):
: Someone will suggest a performance enhancement and perhaps supply the
: code. Then there will be a general discussion about the merits of the
: change and the validity of the results, with question
:
: Query.extractTerms throws an exception if called with a non-rewritten
: query. Is it enough to document that (I could do that) or is that
: something that should be fixed (if possible)?
That seems like something that should be a checked Exception (not a
RuntimeException)
Alternately, extractT
Anyone know what happened? These URLs are 403ing...
http://lucene.apache.org/java/docs/api/
http://lucene.apache.org/java/docs/api/index.html
-Hoss
I finally got a chance to look at this code today (the best part about the
last day before vacation, is no one expects you to get anything done, so
you can ignore your "real work" and spend time on things that are more
important in the long run) and while I still haven't wrapped my head
around al
: > I think that the ideal API wouldn't require people
: > writing ObjectBuilders
: > to know anything about sax, or to ever need to
: > import anything from
: > org.xml.** or javax.xml.**
:
: Fair enough. I presume we want to maintain the
: position that Lucene should not have any dependencies
: o
: I'm personally happier to stick with one approach,
: preferably with an existing, standardized interface
: which lets me switch implementations. I didn't really
: want to have to design a general API for parsing XML
: as part of this project.
I'm not suggesting that, I'm just saying that the AP
: I'd still like to keep the parser core reasonably generic (ie
: java.lang.Object rather than Query or Filter) because I can see it being
: used for instantiating many different types of objects eg requests for
: GroupBy , highlighting, indexing, testing etc.
: As for your type-safety requiremen
: This example code looks interesting. If I understand
: correctly using this approach requires that builders
: like the "q" QueryObjectBuilder instance must be
: explicitly registered with each and every builder that
: consumes its type of output eg BQOB and FQOB. An
correct.
: provider for the
: Is there some reason not to store all field attributes in one place (*.fnm) ?
: Some of them are stored as a one byte-bit mask
: in the field infos file (*.fnm),
:
: isIndexed (IS_INDEXED)
: storeTermVector (STORE_TERMVECTOR)
: storePositionsWithTermVector (STORE_POSITIONS_WITH_TERMVECTOR)
:
I thought the purpose of this method was for applications to specify the
largest possible BooleanQuery that could be created in their application
(either programmatically, via QueryParser, or as a result of rewriting a
non-primitive).
Changing this to be non-static would (besides breaking existing
: IMO, there's no reason to allow field definitions to be spec'd more
: often than once per IndexWriter. Need to add a new field for docs
: 501-1000 of a 1000-doc indexing pass? No problem: create a new
: IndexWriter, define new fields, and you're off and running.
If I understand your argument,
: Option 1: Merge field definitions at the segment level rather than
: the Document level. The defs stay stored with individual segments,
: but everything gets moved into the .fnm file, including
: IS_COMPRESSED, IS_BINARY, etc (as I believe Robert was proposing).
:
: Option 2: Centralize the field
: 1.) We now have DateField and DateTools which use different formats. So
: QueryParser needs to know which one has been used during indexing. I've a
: local patch that adds an appropriate set... method.
As much as I dislike the "standard" mechanism for indexing Dates, I'm of
the opinion that if p
If you are flexible in the syntax you are willing to support, you can tell
your users that they need to escape the colons that aren't meant as field
identifiers...
ID:CI\:123
...alternately, you can tell them they have to quote colons...
ID:"CI:123"
...then you can avoid the who
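A sketch of the escaping approach described above; `ColonEscaper` and `escapeValueColons` are hypothetical helper names of my own, not a Lucene API:

```java
// Hypothetical helper (not part of Lucene): backslash-escape colons in the
// value portion of a "field:value" term so a query parser does not treat
// them as field separators, e.g. "ID" + "CI:123" -> "ID:CI\:123".
public class ColonEscaper {
    public static String escapeValueColons(String field, String value) {
        return field + ":" + value.replace(":", "\\:");
    }

    public static void main(String[] args) {
        System.out.println(escapeValueColons("ID", "CI:123"));
    }
}
```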
The subject of revamping the Filter API to support more compact filter
representations has come up in the past ... At least one patch comes to
mind that helps with the issue...
https://issues.apache.org/jira/browse/LUCENE-328
...I'm not intimately familiar with that code, but if I recall corr
r out of the box
public interface DocIterator {
  public int doc();
  public boolean next();
  public boolean skipTo(int target);
}
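A minimal sketch of one way such an interface might be implemented (my own illustration, not from any patch), backed by a sorted array of document numbers, assuming skipTo advances to the first doc >= target:

```java
// Sketch implementation of the proposed DocIterator contract over a
// sorted array of document numbers (ascending, no duplicates).
public class SortedIntDocIterator {
    private final int[] docs;
    private int pos = -1; // before the first doc until next() is called

    public SortedIntDocIterator(int[] sortedDocs) { this.docs = sortedDocs; }

    // Current doc; only valid after next()/skipTo() returned true.
    public int doc() { return docs[pos]; }

    public boolean next() { return ++pos < docs.length; }

    // Advance to the first doc >= target; false when exhausted.
    public boolean skipTo(int target) {
        do {
            if (!next()) return false;
        } while (doc() < target);
        return true;
    }
}
```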
:
: -Original Message-
: From: [EMAIL PROTECTED]
: [mailto:[EMAIL PROTECTED] Behalf Of Chris Hostetter
: Sent: Thursday, January 26, 2006
: > > public interface DocIterator {
: > > public int doc();
: > > public boolean next();
: > > public boolean skipTo(int target);
: > > }
: Btw. the DocNrSkipper referred to earlier has this DocIterator functionality
: in one method:
:
: int nextDocNr(int)
:
Mark,
I know you've already commited a patch along these lines (LUCENE-494) and
I can see how in a lot of cases that would be a great solution, but i'm
still interested in the orriginal idea you proposed (a 'maxDf' in
TermQuery) because i anticipate situations in which you don't want to
ignore th
: Chris, although I suggested it initially, I'm now a
: little uncomfortable in controlling this issue with a
: static variable in TermQuery because it doesnt let me
: have different settings for different queries, indexes
: or fields.
Oh, I totally agree ... it's the kind of thing you'd only want
: This is a great time to improve the javadoc. I see lots of blank boxes
: which could use a bit of descriptive text, for example:
That reminds me about a documentation/release issue that's been rolling
around in the back of my mind that seems like it's only going to get
worse as future release
: care about having contribute to the score of the hit. Along those lines I
: was thinking about adding some functionality to the code that expands prefix
: queries to create a filter and use that instead of just expanding the
: individual terms. Can anyone see any major issues with doing it this
I just noticed the IndexReader.setNorm method(s) today and was extremely
stoked -- after rebuilding my dev index from scratch three times last week
because I wanted to try out tweaks to Similarity.lengthNorm, the idea of
being able to directly change the norms without rebuilding from scratch is
loo
: I'd like to push out a 1.9 release candidate in the next week or so.
I'm not sure what the ASF/Lucene policy is on keeping Copyright/License
statements in source files up to date, but should they all be updated to
say "Copyright 2006 The Apache Software Foundation" prior to a 1.9
release?
I've
: > in the case where doc boosts and field boosts aren't used, it seems like
: > it would be very easy to write a maintenance app that did something
: > like...
: > ...does anyone see anything wrong with the overall approach?
:
: Looks good to me.
Implemented and submitted in LUCENE-496. So far
: Anyone using those addresses, even the new ones, without first
: signing up for the list is going to have some issues anyway. I
: moderate in a fair number of these sorts of messages, but I also
: reject recurring ones and request the sender sign up.
Perhaps the best course of action would be
: of query). Under the previous versions of QueryParser, I could simply
: specify 'riot???' and capture all of those variants.
I don't have a strong opinion on this issue, but it seems clear to me that
this was a bug in 1.4.3, not a change in the originally intended behavior.
queryparsersyntax.h
: In either case, what I'm arguing is that the current behavior makes more
: sense in the real world of query expressions (that is, makes the most
: common query expressions simpler), so why not continue it?
I disagree with that statement. People familiar with shell globbing are
going to be confus
: FYI, I think all of the commits to trunk since the RC1 release are safe
: to merge to the 1.9 branch. They're mostly documentation improvements.
: So my plan is currently, on Monday, to merge these changes to the 1.9
: branch, then make a 1.9-final release. I'll again announce it to the
...
: Further to our discussions some time ago I've had some time to put
: together an XML-based query parser with support for many "advanced"
: query types not supported in the current Query parser.
:
: More details and code here: http://www.inperspective.com/lucene/LXQuery2.htm
So I *finally* got a
: > A) generate an XML representation of a given
: > Query/Filter object. This would solve the current
: > parser.parse(Query.toString())round-tripping problem.
:
: This would be very useful, but couldn't it be added after this was in
: contrib? You might reorganize things so that it fits in mor
: But doesn't sticking with w3c.dom.Element allow the possibility of
: standards based tools (eg XPath implementations) to be used by builders
: if they so wish?
Hmmm... that isn't something I'd considered. You've convinced me.
: >3) I'm still confused about how state information could/would be
: : DOMUtils.getAttributeWithInheritance instead. My one scenario I came
: : across where I wanted some context passed down was "fieldName" and this
: : is handled simply by leaf nodes walking up the w3c.dom.Node tree until
: : you find an Element with this attribute set.
:
: Hmm, i can see how th
: distribution, we should start documenting their changes. I suggest that we
: add a file contrib/CHANGES.txt. This way we don't pollute the top-level
: changes file. Having one changes file per contrib project on the other
: hand makes it more difficult to get an overview, so one in contrib seems
Someone with the necessary permissions to update the javadocs on the
website might want to do so, they currently say "Lucene 1.9-rc1 API" which
might confuse people (even if the API is exactly the same as 1.9.1)
http://lucene.apache.org/java/docs/api/
-Hoss
--
[email protected] is the appropriate email list to consult with
questions about using/configuring/customizing nutch.
[EMAIL PROTECTED] is for discussing the core lucene java library.
: Date: Sat, 4 Mar 2006 18:36:25 -0800 (PST)
: From: Michael Ji <[EMAIL PROTECTED]>
: Reply-To: java-
: Any suggestions on what to do then, as the following query exhibits the same
behavior
:
: (+cat) (-dog)
:
: Due to the implied AND. Removing the parenthesis allows it to work. It
: doesn't seem that adding parenthesis in this case should cause the query
: to fail???
Adding parens causes QueryPa
: Shouldn't FilterIndexReader in 1.9.1 override IndexReader.getVersion() and
: IndexReader.isCurrent()? Currently it doesn't, so getVersion() gives a
: NullPointerException, segmentInfos is null.
I think you are right, it looks like FilterIndexReader just wasn't updated
when those methods were ad
: Maybe I'm going about this the wrong way. If you think I am, let me
: know. I now realize that this question should be in the lucene users
: list but I started it here because I was going to write a new module for
: doing this because I couldn't get lucene to do it for me. I'm going to
: look
There is a FAQ that covers it; I just updated it since it was somewhat out
of date and lacked some of the newest (bestest?) info about dealing with
this problem...
http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831
In the future, questions about using
1) not only does ConstantScoreRangeQuery use a RangeFilter, but
TestConstantScoreRangeQuery and TestRangeFilter share a base class that
creates the index.
2) perhaps the issue is that corruption is happening when segments are
merged -- and most tests don't surface the problem because they tend to
: Lucene is completely new to me. I just downloaded 1.9.1 and started
: experimenting with it. I am a bit confused though. I want to use the
: MoreLikeThis class, which appears in the javadoc, but does not exist in
: code. Where can I find it?
if you look at the way the main javadoc index is ar
: For some reason, there is a disagreement between the order the
: Documents are returned in hits, and the Documents are referenced (via
: order number, starting from 0) in the Spans?
When dealing with a Hits instance, documents are iterated over in "results
order" -- which may be by score, or ma
As Marvin mentioned, there are some UTF-8 incompatibilities between Java
Lucene and Plucene.
Incidentally: your best bet for getting assistance with Plucene is the
Plucene mailing lists, as identified at the bottom of "perldoc Plucene" ...
http://kasei.com/mailman/listinfo/plucene
...perl
: The question is when I get Spans, I get start/end positions and a
: Document order (starting from 0), not the Document object itself from
Are you sure about that? Spans.doc() should return you the internal
document identifier which you can pass to indexReader.doc(int)
: which I could get a fi
: I should have been more clear: I'm not asking for new feature requests.
: Rather for known, high-priority, bugs.
I don't know if it's high priority, but LUCENE-546 seems to be a trivial
bug with a trivial fix ("seems to be", I'm judging purely by the patch)
2.0 also seems like the best time
: Anyway, i am sending you TurkishAnalyzer as attachment.I will be VERY
: happy if you upload these codes to:
Emre, I don't know anything about Turkish -- but it's always good to have
new analyzers: thanks for the contribution. Uploading it to Jira was
definitely the best way to submit it.
One
: I am having problems running span queries with more than one
: negative clauses:
I believe you mean when the exclude clause contains a SpanNear query,
correct?
: Is the span query nested correctly?
I'm not very good at reading SpanQuery.toString() output ... but I believe
I encountered the s
A couple of responses to various comments in this thread...
: > Unless it object identity is what is being tested or intern is an
: > invariant, I think it is dangerous. It is easy to forget to intern or to
: > propagate the pattern via cut and paste to an inappropriate context.
interning the St
: It's got one difference from yours, in that the terms are allowed to
: occur in any order in the sub-phrases (so phrase "C B" from your
: original example is scored like "B C").
there's a much bigger difference, in that your technique won't reward
documents where B and C are "near" each other, b
: One of the reasons I am looking at this is because I often need just
: yes/no (matches/doesn't match) answers, and don't care for the score.
I didn't realize that was an option -- I thought you wanted integer
scoring, and the best advice I had for that was to search and replace.
But if you jus
: We found if we were using 2 IndexSearcher, we would get 10% performance
: benefit.
: But if we increased the number of IndexSearcher from 2, the performance
: improvement became slight even worse.
Why use more than 2 IndexSearchers?
Typically 1 is all you need, except for when you want to
: > I am fairly certain his code is ok, since it rechecks the initialized state
: > in the synchronized block before initializing.
:
: That "recheck" is why the pattern (or anti-pattern) is called
: double-checked locking :-)
More specifically, this is functionally half way between example labeled
: I think you could use a volatile primitive boolean to control whether or not
: the index needs to be read, and also mark the index data volatile and it
: SHOULD PROBABLY work.
:
: But as stated, I don't think the performance difference is worth it.
My understanding is:
1) volatile will only h
I'm no expert, I was just going based on what I've read, and
apparently I forgot to paste the URL in my last email...
http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#dcl
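For reference, the idiom that FAQ describes as safe under the new (JSR-133) memory model looks roughly like this; the class and field names are just for illustration:

```java
// Double-checked locking made safe post-JSR-133: the volatile modifier on
// the field guarantees a thread that sees a non-null reference also sees
// the fully constructed object behind it.
public class LazyHolder {
    private static volatile LazyHolder instance;
    private final int value;

    private LazyHolder(int value) { this.value = value; }

    public int value() { return value; }

    public static LazyHolder getInstance() {
        LazyHolder local = instance;      // first (unsynchronized) check
        if (local == null) {
            synchronized (LazyHolder.class) {
                local = instance;         // second check, inside the lock
                if (local == null) {
                    instance = local = new LazyHolder(42);
                }
            }
        }
        return local;
    }
}
```

Without `volatile`, the reordering problems discussed above mean the first unsynchronized read could observe a partially constructed object.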
: -Original Message-
: From: Chris Hostetter [mailto:[EMAIL PROTECTED]
: Sent: Wednesday, May
I'm looking into some of the issues with LUCENE-557 and it seems that a
lot of them are triggered by the way BooleanWeight.normalize is
implemented...
public void normalize(float norm) {
  norm *= getBoost(); // incorporate boost
  for (int i = 0 ; i < weights.size(); i++) {
    Weight w = (Weight) weights.elementAt(i);
    if (!((BooleanClause) clauses.elementAt(i)).isProhibited())
      w.normalize(norm);
  }
}
: > Does anyone know why normalize ignores the prohibited clauses? was that
: > just intended to be an optimization (save time calculating stuff for
: > clauses we don't care about scoring in depth) ... ?
:
: A prohibited clause will never occur in any matching document, so it
: will never need t
I'm really confused by your example ... I'm assuming eField is a
Map.Entry, and eField.getKey() is returning a FieldInfo (although I'm not
sure why there's no explicit cast in your code) ... but what is the return
type of "eField.getValue()" ?
Without understanding what that object is, I can onl
: If class Explanation would have a boolean attribute indicating whether
: or not there was a match, the Explanation for BooleanQuery could
: simply use this value from the Explanation of the prohibited clause.
I've definitely thought about that a lot initially. But my gut reaction
was to try an
: However what significantly slows us down is the hits.id(i) function.
: Can we accelerate it somehow "cleaning" Lucene code itself from
: scoring?
you said in your last message...
: We don't need any scoring in our application domain, but
: efficiency is the key because we are getting tens
: >Boolean match = null;
:
: As for the thoughts question below: this java-dev, not c-dev :)
I could not for the life of me understand this comment until I got to the
end of your message...
: null for false: long time no see...
...I'm not trying to use null for false, I'm using null to ind
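To make the tri-state idea concrete, here's a small sketch (my own illustration, not code from the thread) of using a boxed Boolean where null means "not yet computed" rather than "false":

```java
// Sketch of the tri-state idiom under discussion: a boxed Boolean where
// null means "match status not yet computed", distinct from FALSE.
public class MatchState {
    private Boolean match = null; // null = unknown, TRUE/FALSE = computed

    public boolean isKnown() { return match != null; }

    public void record(boolean didMatch) { match = Boolean.valueOf(didMatch); }

    // Only meaningful once a result has been recorded.
    public boolean matched() {
        if (match == null) throw new IllegalStateException("not computed yet");
        return match.booleanValue();
    }
}
```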
I'm curious: does the exception only occur if both the field and the value
are empty? ... are the Field.Store and Field.Index options you listed
necessary for this condition as well?
Is it clear why this situation causes the exception?
(I don't have any objection to rejecting blank field names
Is there a documented or unspoken policy about the "Resolved" vs "Closed"
bug statuses?
How/when should a resolved bug be closed?
(In my experience policy has tended towards the person fixing the bug to
"resolve" it, and the person who opened the bug to "close" once they're
verified the fix -- b
SpanNotQuery's hashCode method makes two references to include.hashCode(),
but none to exclude.hashCode() ... this is a mistake, yes/no?
-Hoss
: I measured also on different densities, and it looks about the same.
: When I find a few spare minutes will make one PerfTest that generates
: gnuplot diagrams. Would be interesting to see how all key methods behave
: as a function of density/size.
I was thinking the same thing ... I just haven'
: I put Lock in IndexReader.indexExists function, and tested for a few days.
: It worked fine. I never had that mystery problem.
:
: How can put the patch in a JIRA issue?
Please take a look at the recently added FAQ "How do I contribute an
improvement?"...
http://wiki.apache.org/jakarta-lucene/L
: I don't see anything related to searching using non-indexed fields. Could
: you maybe point me at the class(es) that implement this functionality?
I think Erik was referring more specifically to the statement...
: > it is just
: > very difficult to perform some complex queries efficiently without
: Could someone enumerate what needs to be done before 2.0 is released.
: From following this thread, it was stated that 2.0 was 1.9 with
: deprecations removed.
: Recently it appears to be becoming much more than that.
I believe Doug's suggestion was to hold off just long enough to fix any
egre
: I wouldn't seeing 415 being fixed, but I seem to be missing a way one
: changes "Fix Version".
it's a property that can be changed from the edit screen ... but 415 is
weird, there is no "Edit" link in the Operations nav (as opposed to every
other LUCENE issue I've ever looked at)
-Hoss
: Looks like QueryParser doesn't handle escaped quotes when inside a phrase:
I believe you are correct. Could you file a Jira issue for this,
preferably with your main function converted to a JUnit test function that
can be added to TestQueryParser?
(it doesn't take much to write a JUnit test f
: In case Explanation is also to explain what a Filter does, it would need to
: have both a match flag and a score value.
That's a good point, I hadn't considered the possibility of "explaining"
filters much ... but there's no reason why the "value" of an explanation
couldn't be an optional part