[jira] [Commented] (LUCENE-5327) Expose getNumericDocValues and getBinaryDocValues at toplevel reader and searcher levels
[ https://issues.apache.org/jira/browse/LUCENE-5327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818392#comment-13818392 ]

John Wang commented on LUCENE-5327:
-----------------------------------

Perhaps I should have explained our use case, which is building up the search results. After a search, you essentially get a set of internal docids; they are not useful to the application. On the IndexSearcher API, the method you use to build up the search result is, as you said, the document() call. Calling document() to extract stored fields is simply too expensive for us. Instead, we keep an application-level per-document UID in a numeric docvalue. Our search result is basically a list of these UIDs, and further result decoration is done higher up in the application logic. I have seen this pattern in numerous Lucene applications; it is essentially the motivation behind this ticket.

Currently, to do this I am essentially writing the following code: get the underlying IndexReader from the IndexSearcher, and for each ScoreDoc:
* find the atomicReader for ScoreDoc.doc
* return atomicReader.getNumericDocValues(ScoreDoc.doc - base)

This is a little cumbersome; it would be nice to let the IndexSearcher return the UID in the likeness of the document() call. I am happy to close this ticket if you guys don't think this API is useful.

Thanks
-John

Expose getNumericDocValues and getBinaryDocValues at toplevel reader and searcher levels
Key: LUCENE-5327
URL: https://issues.apache.org/jira/browse/LUCENE-5327
Project: Lucene - Core
Issue Type: Improvement
Components: core/search
Affects Versions: 4.5
Reporter: John Wang
Attachments: patch.diff

Expose NumericDocValues and BinaryDocValues in both IndexReader and IndexSearcher apis.
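For concreteness, a minimal sketch of the lookup John describes, assuming Lucene 4.x APIs; "uid" is a hypothetical NumericDocValues field name, and this is an illustration of the manual pattern, not code from the attached patch:

{code}
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;

// Sketch only: "uid" is a hypothetical per-document NumericDocValues field.
static long[] collectUids(IndexSearcher searcher, TopDocs topDocs) throws IOException {
  List<AtomicReaderContext> leaves = searcher.getIndexReader().leaves();
  long[] uids = new long[topDocs.scoreDocs.length];
  for (int i = 0; i < topDocs.scoreDocs.length; i++) {
    int doc = topDocs.scoreDocs[i].doc;
    // binary-search for the leaf containing this top-level docid
    AtomicReaderContext leaf = leaves.get(ReaderUtil.subIndex(doc, leaves));
    NumericDocValues ndv = leaf.reader().getNumericDocValues("uid");
    uids[i] = ndv.get(doc - leaf.docBase); // rebase to the leaf's doc space
  }
  return uids;
}
{code}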
[jira] [Commented] (LUCENE-5327) Expose getNumericDocValues and getBinaryDocValues at toplevel reader and searcher levels
[ https://issues.apache.org/jira/browse/LUCENE-5327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818394#comment-13818394 ]

Shai Erera commented on LUCENE-5327:
------------------------------------

It's not that I think the API isn't useful, just that it's wrong to have it on IndexSearcher where there's no matching API on IndexReader. If you use MultiDocValues, you don't need to write that complicated code; you could instead do:

{code}
NumericDocValues uid = MultiDocValues.getNumericValues(searcher.getIndexReader(), "uid");
for (ScoreDoc sd : topDocs.scoreDocs) {
  long uidValue = uid.get(sd.doc);
}
{code}

That's not so bad, I think? I mean, IndexSearcher.getNumericDocValues() would essentially save you just the first call, so I don't see any great benefit in having the API there.

If you want to avoid the binary search, you should re-sort the topDocs by increasing doc ID, then iterate on reader.leaves(), obtain the NDV from each AtomicReader and pull the right values. First, I don't think you should do that unless you're asking for thousands of hits. Second, this won't be solved by adding IndexSearcher.getNDV().

I agree w/ Uwe; I think we should close this issue as Won't Fix.
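For reference, the leaf-walking alternative Shai mentions might look roughly like this (a hedged sketch, again using the hypothetical "uid" field; imports as in the earlier sketch, plus java.util.Arrays and java.util.Comparator; null checks omitted):

{code}
// Sort hits by increasing docid so each leaf's values are pulled only once.
ScoreDoc[] hits = topDocs.scoreDocs.clone();
Arrays.sort(hits, new Comparator<ScoreDoc>() {
  @Override
  public int compare(ScoreDoc a, ScoreDoc b) {
    return a.doc - b.doc;
  }
});
List<AtomicReaderContext> leaves = searcher.getIndexReader().leaves();
int leafIdx = 0;
AtomicReaderContext leaf = leaves.get(leafIdx);
NumericDocValues ndv = leaf.reader().getNumericDocValues("uid");
for (ScoreDoc sd : hits) {
  // advance to the leaf that contains this hit (no per-hit binary search)
  while (leafIdx + 1 < leaves.size() && sd.doc >= leaves.get(leafIdx + 1).docBase) {
    leaf = leaves.get(++leafIdx);
    ndv = leaf.reader().getNumericDocValues("uid");
  }
  long uidValue = ndv.get(sd.doc - leaf.docBase);
}
{code}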
[jira] [Commented] (LUCENE-5327) Expose getNumericDocValues and getBinaryDocValues at toplevel reader and searcher levels
[ https://issues.apache.org/jira/browse/LUCENE-5327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818402#comment-13818402 ]

John Wang commented on LUCENE-5327:
-----------------------------------

Done, closed.
[jira] [Closed] (LUCENE-5327) Expose getNumericDocValues and getBinaryDocValues at toplevel reader and searcher levels
[ https://issues.apache.org/jira/browse/LUCENE-5327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Wang closed LUCENE-5327.
-----------------------------
Resolution: Won't Fix
[jira] [Commented] (SOLR-5374) Support user configured doc-centric versioning rules
[ https://issues.apache.org/jira/browse/SOLR-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818410#comment-13818410 ]

Anshum Gupta commented on SOLR-5374:
------------------------------------

Just a thought: should we change the commented-out logging to log.debug? I'm assuming that's the intention behind leaving it in there.

Support user configured doc-centric versioning rules
Key: SOLR-5374
URL: https://issues.apache.org/jira/browse/SOLR-5374
Project: Solr
Issue Type: Improvement
Reporter: Hoss Man
Assignee: Hoss Man
Fix For: 4.6, 5.0
Attachments: SOLR-5374.patch, SOLR-5374.patch, SOLR-5374.patch, SOLR-5374.patch, SOLR-5374.patch, SOLR-5374.patch

The existing optimistic concurrency features of Solr can be very handy for ensuring that you are only updating/replacing the version of the doc you think you are updating/replacing, w/o the risk of someone else adding/removing the doc in the meantime -- but I've recently encountered some situations where I really wanted to be able to let the client specify an arbitrary version, on a per document basis (ie: generated by an external system, or perhaps a timestamp of when a file was last modified) and ensure that the corresponding document update was processed only if the new version is greater than the old version -- w/o needing to check exactly which version is currently in Solr. (ie: If a client wants to index version 101 of a doc, that update should fail if version 102 is already in the index, but succeed if the currently indexed version is 99 -- w/o the client needing to ask Solr what the current version is.)

The idea Yonik brought up in SOLR-5298 (letting the client specify a {{\_new\_version\_}} that would be used by the existing optimistic concurrency code to control the assignment of the {{\_version\_}} field for documents) looked like a good direction to go -- but after digging into the way {{\_version\_}} is used internally I realized it requires a uniqueness constraint across all update commands, which would make it impossible to allow multiple independent documents to have the same {{\_version\_}}. So instead I've tackled the problem in a different way, using an UpdateProcessor that is configured with a user-defined field to track a doc-based version, and uses the RTG logic to figure out if the update is allowed.
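To make the suggestion concrete, a hedged illustration of the change Anshum proposes (the class, logger, and message here are invented for the example; the actual UpdateProcessor code may differ):

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DocBasedVersionSketch {
  private static final Logger log = LoggerFactory.getLogger(DocBasedVersionSketch.class);

  void onVersionCheck(String id, long newVersion, long oldVersion) {
    // before: // System.out.println("doc=" + id + " new=" + newVersion + " old=" + oldVersion);
    // after: stays compiled in, but silent unless debug logging is enabled;
    // parameterized form avoids string concatenation when debug is off:
    log.debug("doc={} newVersion={} oldVersion={}", id, newVersion, oldVersion);
  }
}
{code}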
[jira] [Commented] (SOLR-2894) Implement distributed pivot faceting
[ https://issues.apache.org/jira/browse/SOLR-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818422#comment-13818422 ]

Elran Dvir commented on SOLR-2894:
----------------------------------

I didn't manage to make distributed pivot on a date field blow up with toObject. Can you please attach an example query that blows Solr up, and I'll adjust it to my environment? Thanks.

Implement distributed pivot faceting
Key: SOLR-2894
URL: https://issues.apache.org/jira/browse/SOLR-2894
Project: Solr
Issue Type: Improvement
Reporter: Erik Hatcher
Fix For: 4.6
Attachments: SOLR-2894-reworked.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch

Following up on SOLR-792, pivot faceting currently only supports undistributed mode. Distributed pivot faceting needs to be implemented.
[jira] [Commented] (LUCENE-5336) Add a simple QueryParser to parse human-entered queries.
[ https://issues.apache.org/jira/browse/LUCENE-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818431#comment-13818431 ]

Paul Elschot commented on LUCENE-5336:
--------------------------------------

A realistic query parser is not likely to be any simpler than this, so why not call it simple?

Add a simple QueryParser to parse human-entered queries.
Key: LUCENE-5336
URL: https://issues.apache.org/jira/browse/LUCENE-5336
Project: Lucene - Core
Issue Type: Improvement
Reporter: Jack Conradson
Attachments: LUCENE-5336.patch

I would like to add a new simple QueryParser to Lucene that is designed to parse human-entered queries. This parser will operate on an entire entered query using a specified single field or a set of weighted fields (using term boost). All features/operations in this parser can be enabled or disabled depending on what is necessary for the user. A default operator may be specified as either 'MUST' representing 'and' or 'SHOULD' representing 'or'.

The features/operations that this parser will include are the following:
* AND specified as '+'
* OR specified as '|'
* NOT specified as '-'
* PHRASE surrounded by double quotes
* PREFIX specified as '*'
* PRECEDENCE surrounded by '(' and ')'
* WHITESPACE specified as ' ' '\n' '\r' and '\t' will cause the default operator to be used
* ESCAPE specified as '\' will allow operators to be used in terms

The key differences between this parser and other existing parsers will be the following:
* No exceptions will be thrown, and errors in syntax will be ignored. The parser will do a best-effort interpretation of any query entered.
* It uses minimal syntax to express queries. All available operators are single characters or pairs of single characters.
* The parser is hand-written and in a single Java file, making it easy to modify.
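To illustrate the proposed syntax, a hedged usage sketch; the class name and constructor are assumed from the issue's description and attached patch, not a committed API, and `analyzer` is assumed to be in scope:

{code}
// Assumed API from the patch: SimpleQueryParser(Analyzer, String field)
SimpleQueryParser parser = new SimpleQueryParser(analyzer, "body");
Query q1 = parser.parse("wifi +router -refurbished"); // '+' = AND, '-' = NOT
Query q2 = parser.parse("\"new york\" hot*");         // phrase and prefix
Query q3 = parser.parse("(cats | dogs) +adoption");   // '|' = OR, parens = precedence
Query q4 = parser.parse("1\\+1");                     // '\' escapes an operator character
{code}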
[JENKINS] Lucene-Solr-4.x-MacOSX (64bit/jdk1.6.0) - Build # 978 - Failure!
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-MacOSX/978/
Java: 64bit/jdk1.6.0 -XX:+UseCompressedOops -XX:+UseParallelGC

All tests passed

Build Log:
[...truncated 22541 lines...]
BUILD FAILED
/Users/jenkins/workspace/Lucene-Solr-4.x-MacOSX/build.xml:428: The following error occurred while executing this line:
/Users/jenkins/workspace/Lucene-Solr-4.x-MacOSX/build.xml:67: The following error occurred while executing this line:
/Users/jenkins/workspace/Lucene-Solr-4.x-MacOSX/lucene/build.xml:188: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:3209)
        at java.lang.String.<init>(String.java:215)
        at java.lang.StringBuffer.toString(StringBuffer.java:585)
        at de.thetaphi.forbiddenapis.asm.commons.Method.toString(Unknown Source)
        at java.lang.String.valueOf(String.java:2826)
        at java.lang.StringBuilder.append(StringBuilder.java:115)
        at de.thetaphi.forbiddenapis.Checker$1$1.checkMethodAccess(Checker.java:475)
        at de.thetaphi.forbiddenapis.Checker$1$1.visitMethodInsn(Checker.java:527)
        at de.thetaphi.forbiddenapis.asm.ClassReader.a(Unknown Source)
        at de.thetaphi.forbiddenapis.asm.ClassReader.b(Unknown Source)
        at de.thetaphi.forbiddenapis.asm.ClassReader.accept(Unknown Source)
        at de.thetaphi.forbiddenapis.asm.ClassReader.accept(Unknown Source)
        at de.thetaphi.forbiddenapis.Checker.checkClass(Checker.java:378)
        at de.thetaphi.forbiddenapis.Checker.run(Checker.java:563)
        at de.thetaphi.forbiddenapis.AntTask.execute(AntTask.java:166)
        at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
        at sun.reflect.GeneratedMethodAccessor459.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
        at org.apache.tools.ant.Task.perform(Task.java:348)
        at org.apache.tools.ant.Target.execute(Target.java:390)
        at org.apache.tools.ant.Target.performTasks(Target.java:411)
        at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1399)
        at org.apache.tools.ant.helper.SingleCheckExecutor.executeTargets(SingleCheckExecutor.java:38)
        at org.apache.tools.ant.Project.executeTargets(Project.java:1251)
        at org.apache.tools.ant.taskdefs.Ant.execute(Ant.java:442)
        at org.apache.tools.ant.taskdefs.SubAnt.execute(SubAnt.java:302)
        at org.apache.tools.ant.taskdefs.SubAnt.execute(SubAnt.java:221)
        at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
        at sun.reflect.GeneratedMethodAccessor459.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

Total time: 95 minutes 39 seconds
Build step 'Invoke Ant' marked build as failure
Description set: Java: 64bit/jdk1.6.0 -XX:+UseCompressedOops -XX:+UseParallelGC
Archiving artifacts
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure
[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement
[ https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818443#comment-13818443 ]

Shai Erera commented on LUCENE-5316:
------------------------------------

Mike, I looked at the results of the different runs, and the QPS column (e.g. QPS Base) varies dramatically between runs. I'm not talking about base vs comp, but base vs itself across all runs. E.g. in the last run, when you compared ALL_BUT_DIM and NO_PARENTS, AndHighLow of ALL_BUT_DIM was 70.5 vs 41.14 respectively. In the ALL_BUT_DIM run after Gilad's patch with getChildren() returning null, it's 103.19. Can you explain the great differences? Note that I don't compare that to the easy run (w/ 7 dims only), as it does not run the same thing. But I wonder if the changes in absolute QPS may hint at some instability (maybe temporary) with the machine or the test? Still, comp is slower than base, so ultimately I think it shows the abstraction hurts us, but I'd feel better if the test was more stable across runs.

Separately, I'm torn about what we should do here. On one hand the abstraction hurts us, but on the other hand, dropping it eliminates any chance of doing anything smart in the taxonomy in-memory representation. For example, if a dimension is flat and some taxonomy implementation manages to assign successive ordinals to its children, we don't even need to materialize all children in an int[], and can rather hold a start/end range (a'la SortedSetDocValuesReaderState.OrdRange) and implement ChildrenIterator on top. If we commit to an int[] on the API, it immediately kills any chance to further optimize that in the future (e.g. PackedInts even). I know Gilad is making progress w/ returning an int[] per ord, so I wonder what the performance will be with it. I really wish we could make that API abstraction without losing much -- it feels like the right thing to do ... and I'd hate to do it knowing that we lose :).

Taxonomy tree traversing improvement
Key: LUCENE-5316
URL: https://issues.apache.org/jira/browse/LUCENE-5316
Project: Lucene - Core
Issue Type: Improvement
Components: modules/facet
Reporter: Gilad Barkai
Priority: Minor
Attachments: LUCENE-5316.patch, LUCENE-5316.patch, LUCENE-5316.patch

The taxonomy traversing is done today utilizing the {{ParallelTaxonomyArrays}} -- in particular, two taxonomy-size {{int}} arrays which hold, for each ordinal, its youngest child (array #1) and older sibling (array #2). This is a compact way of holding the tree information in memory, but it's not perfect:
* Large (8 bytes per ordinal in memory)
* Exposes internal implementation
* Utilizing these arrays for tree traversing is not straightforward
* Loses reference locality while traversing (the array is accessed at increasing-only entries, but they may be distant from one another)
* In NRT, a reopen is always (not worst-case) done at O(taxonomy-size)

This issue is about making the traversing easier, the code more readable, and opening it up for future improvements (i.e. memory footprint and NRT cost) -- without changing any of the internals. A later issue(s?) could be opened to address the gaps once this one is done.
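As a sketch of the range-backed idea Shai describes above; ChildrenIterator is the abstraction under discussion, but the exact interface here is hypothetical, not from the attached patches:

{code}
// Hypothetical iterator contract: next() returns child ordinals, -1 when exhausted.
interface ChildrenIterator {
  int next();
}

// For a flat dimension whose children were assigned successive ordinals, the
// taxonomy only needs a [start, end] range (a'la OrdRange), never an int[]:
class RangeChildrenIterator implements ChildrenIterator {
  private int ord;
  private final int end; // inclusive

  RangeChildrenIterator(int start, int end) {
    this.ord = start;
    this.end = end;
  }

  @Override
  public int next() {
    return ord <= end ? ord++ : -1;
  }
}
{code}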
[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement
[ https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818457#comment-13818457 ]

Michael McCandless commented on LUCENE-5316:
--------------------------------------------

I'm not sure why there's such a difference; I do fix the static seed in the test, so it's running the same queries every time (otherwise it would pick a different set of queries out of each category). Let me go re-run ... maybe I messed something up.

It would be best if others could run too, to avoid a stupid mistake on my part causing us to abandon what would have been a good change!
[jira] [Commented] (LUCENE-5333) Support sparse faceting for heterogeneous indices
[ https://issues.apache.org/jira/browse/LUCENE-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818458#comment-13818458 ]

Michael McCandless commented on LUCENE-5333:
--------------------------------------------

bq. Why is it an overkill?

Well, I think the facet module already has too many classes / abstractions: aggregators, accumulators, ordinal policies, search params, indexing params, cat paths, encoders, decoders, etc. I think this huge API surface area is a big impediment to users adopting it and devs contributing to it. So I really don't want to make this worse by adding yet another Accumulator that has static factory methods to create yet other Accumulators that are subclasses of existing Accumulators. I think it's too much.

I also don't like separating concerns: I think that's a sign that something is wrong. I don't think a single class (AllFA) should be expected to handle both the taxonomy-based and SSDV-based cases. We already have classes that count facets using those two methods, so I think we should just add this capability to each of those classes. And if we add the enum facet method (and others), then the natural place to add sparse handling for it would be its own class, I think.

bq. So I'm curious - did you try a dedicated class and run into trouble?

No, I haven't tried: I just didn't really like that approach ... so I focused on the impl instead ...

bq. Is there a reason to not allocate the CFRs up front and set them on the FSP?

I really don't like the approach of creating a CFR for every possible dim. I realize this is a simple way to implement it, but it seems wrong. And I especially don't want the API to expose that we are somehow doing this: it's an impl detail. So I wanted to get closer to not creating all CFRs up front, and doing it transiently seemed at least a bit better than bringing the entire list into existence. But I think I can improve on the patch so that we don't even make a CFR until we see that any labels had a non-zero count ... I'll work towards that.

bq. You sort the FacetResult based on the FResNode.value (their root). Does SortedSet always assign a value to the root of a FacetResult.node?

Yes, it does, in the sparse case (I ignore the ord policy).

Support sparse faceting for heterogeneous indices
Key: LUCENE-5333
URL: https://issues.apache.org/jira/browse/LUCENE-5333
Project: Lucene - Core
Issue Type: New Feature
Components: modules/facet
Reporter: Michael McCandless
Attachments: LUCENE-5333.patch

In some search apps, e.g. a large e-commerce site, the index can have a mix of wildly different product categories and facet dimensions, and the number of dimensions could be huge. E.g. maybe the index has shirts, computer memory, hard drives, etc., and each of these many categories has different attributes. In such an index, when someone searches for "so dimm", which should match a bunch of laptop memory modules, you can't (easily) know up front which facet dimensions will be important.

But I think this is very easy for the facet module: since ords are stored row-stride (each doc lists all facet labels it has), we could simply count all facets that the hits actually saw, and then in the end see which ones got traction and return facet results for those top dims. I'm not sure what the API would look like, but conceptually this should work very well, because of how the facet module works. You shouldn't have to state up front exactly which facet dimensions to count ...
[jira] [Commented] (LUCENE-5333) Support sparse faceting for heterogeneous indices
[ https://issues.apache.org/jira/browse/LUCENE-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818466#comment-13818466 ]

Shai Erera commented on LUCENE-5333:
------------------------------------

bq. Well, I think the facet module already has too many classes

That's unrelated. It's like saying Lucene has many APIs: IndexWriter, IndexWriterConfig, Document, Field, MergePolicy, Query, QueryParser, Collector, IndexReader, IndexSearcher ... just to name a few :). What's important here is FacetAccumulator and FacetRequest .. that's it. The rest are *totally* unrelated.

This scenario fits into another accumulator. Or else we'll end up with facet code diverging left and right. Even now, for really no good reason, if you choose to index facets using SortedSetDV, you can only count them. Why? What prevents these ords from being weighted by SumScore or a ValueSource? Nothing, I think? So I'm worried that if you add this to only SortedSetDV, it will increase the difference between the two.

Rather, I prefer to pick the right API. We say that FacetsAccumulator is your entry point to accumulating facets. So far we've made FacetsAccumulator.create adhere to all existing FacetRequests and accumulators and return the proper one. I think that's a good API? And if all an AllFA needs to do is create dummy requests and filter out the uninteresting ones, why complicate the code of all other accumulators (existing and future ones)? Won't it be simpler to add EnumFacetsAccumulator support to AllFA?

Look, this is not a rocket-science feature. Besides the fact that I don't think it's such an important or common feature, I think the app doesn't really need to go out of its way to support it -- it can easily create all possible FRs using a very simple API, and filter out FacetResults whose FRN.subResults are empty. Can we make a simple utility for these apps? I'm all for it! But I prefer that we don't complicate the code of existing FAs.
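For reference, the app-side approach Shai describes might look roughly like this; a hedged sketch against the 4.x facet API of the era, where `allDimensions`, `searcher`, `query`, and `taxoReader` are assumed to be supplied by the application:

{code}
// Create a counting request per dimension the app knows about ...
List<FacetRequest> requests = new ArrayList<FacetRequest>();
for (String dim : allDimensions) {
  requests.add(new CountFacetRequest(new CategoryPath(dim), 10));
}
FacetsCollector fc = FacetsCollector.create(
    new FacetSearchParams(requests), searcher.getIndexReader(), taxoReader);
searcher.search(query, fc);

// ... then keep only the dimensions that actually saw hits.
List<FacetResult> nonEmpty = new ArrayList<FacetResult>();
for (FacetResult fres : fc.getFacetResults()) {
  if (!fres.getFacetResultNode().subResults.isEmpty()) {
    nonEmpty.add(fres);
  }
}
{code}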
[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement
[ https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818469#comment-13818469 ]

Michael McCandless commented on LUCENE-5316:
--------------------------------------------

I re-ran ALL_BUT_DIM and NO_PARENTS on the last patch:

ALL_BUT_DIM:
{noformat}
Task              QPS base  StdDev  QPS comp  StdDev  Pct diff
LowSloppyPhrase     195.79  (6.4%)    160.76  (6.5%)  -17.9% ( -28% -  -5%)
MedSpanNear         189.11  (6.1%)    155.88  (6.5%)  -17.6% ( -28% -  -5%)
AndHighLow          171.05  (5.4%)    142.46  (5.9%)  -16.7% ( -26% -  -5%)
HighPhrase          165.56  (5.6%)    140.32  (6.0%)  -15.2% ( -25% -  -3%)
HighSloppyPhrase    135.86  (4.7%)    117.90  (5.3%)  -13.2% ( -22% -  -3%)
HighSpanNear         98.69  (4.1%)     88.28  (4.5%)  -10.5% ( -18% -  -2%)
MedPhrase            89.68  (4.3%)     81.23  (3.7%)   -9.4% ( -16% -  -1%)
OrNotHighLow         93.45  (5.5%)     85.07  (4.9%)   -9.0% ( -18% -   1%)
LowTerm              87.06  (3.4%)     79.50  (3.8%)   -8.7% ( -15% -  -1%)
Fuzzy1               63.87  (2.5%)     59.39  (2.9%)   -7.0% ( -12% -  -1%)
AndHighMed           53.60  (1.9%)     50.49  (2.6%)   -5.8% ( -10% -  -1%)
OrHighLow            54.32  (2.2%)     51.18  (2.4%)   -5.8% ( -10% -  -1%)
OrNotHighHigh        62.71  (5.5%)     59.11  (5.0%)   -5.7% ( -15% -   5%)
OrNotHighMed         47.72  (3.4%)     45.35  (3.1%)   -5.0% ( -11% -   1%)
Fuzzy2               48.40  (2.2%)     46.07  (2.4%)   -4.8% (  -9% -   0%)
AndHighHigh          31.48  (1.6%)     30.33  (1.5%)   -3.7% (  -6% -   0%)
MedTerm              35.33  (2.0%)     34.06  (1.9%)   -3.6% (  -7% -   0%)
MedSloppyPhrase      17.17  (4.4%)     16.67  (4.3%)   -2.9% ( -11% -   6%)
Prefix3              27.73  (1.6%)     26.93  (1.2%)   -2.9% (  -5% -   0%)
OrHighNotMed         24.31  (2.4%)     23.79  (1.1%)   -2.1% (  -5% -   1%)
LowPhrase            14.56  (4.2%)     14.28  (4.0%)   -1.9% (  -9% -   6%)
LowSpanNear          11.25  (2.4%)     11.04  (1.7%)   -1.9% (  -5% -   2%)
OrHighHigh           17.63  (1.6%)     17.38  (1.1%)   -1.4% (  -4% -   1%)
OrHighNotLow         18.97  (1.8%)     18.69  (0.9%)   -1.4% (  -4% -   1%)
Wildcard             13.21  (1.4%)     13.03  (0.9%)   -1.4% (  -3% -   0%)
HighTerm             16.34  (1.8%)     16.14  (1.9%)   -1.3% (  -4% -   2%)
OrHighMed            18.11  (1.6%)     17.93  (1.4%)   -1.0% (  -3% -   2%)
Respell              89.31  (2.8%)     88.78  (2.2%)   -0.6% (  -5% -   4%)
OrHighNotHigh         9.09  (2.0%)      9.08  (1.4%)   -0.1% (  -3% -   3%)
IntNRQ                4.87  (1.2%)      4.90  (1.2%)    0.7% (  -1% -   3%)
{noformat}

NO_PARENTS:
{noformat}
Task              QPS base  StdDev  QPS comp  StdDev  Pct diff
LowSloppyPhrase      98.63  (4.7%)     28.73  (2.9%)  -70.9% ( -74% - -66%)
MedSpanNear          97.31  (4.7%)     28.54  (2.9%)  -70.7% ( -74% - -66%)
AndHighLow           91.63  (3.9%)     28.04  (2.9%)  -69.4% ( -73% - -65%)
HighPhrase           90.81  (3.6%)     27.94  (2.9%)  -69.2% ( -73% - -65%)
HighSloppyPhrase     80.24  (3.2%)     26.90  (3.1%)  -66.5% ( -70% - -62%)
HighSpanNear         65.93  (2.7%)     24.97  (3.3%)  -62.1% ( -66% - -57%)
OrNotHighLow         64.00  (3.3%)     24.74  (3.2%)  -61.3% ( -65% - -56%)
MedPhrase            62.06  (4.1%)     24.52  (3.3%)  -60.5% ( -65% - -55%)
LowTerm              61.33  (2.6%)     24.40  (3.3%)  -60.2% ( -64% - -55%)
OrNotHighHigh        48.27  (2.8%)     21.97  (3.4%)  -54.5% ( -58% - -49%)
Fuzzy1               47.61  (2.2%)     21.90  (3.5%)  -54.0% ( -58% - -49%)
OrHighLow            43.63  (2.6%)     21.07  (3.4%)  -51.7% ( -56% - -46%)
AndHighMed           42.86  (2.6%)     20.75  (3.4%)  -51.6% ( -56% - -46%)
OrNotHighMed         39.23  (2.0%)     19.93  (3.3%)  -49.2% ( -53% - -44%)
Fuzzy2               38.49  (2.3%)     19.76  (3.3%)  -48.6% ( -53% - -44%)
{noformat}
[jira] [Commented] (LUCENE-5212) java 7u40 causes sigsegv and corrupt term vectors
[ https://issues.apache.org/jira/browse/LUCENE-5212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818482#comment-13818482 ]

Dawid Weiss commented on LUCENE-5212:
-------------------------------------

I confirm that this patch fixes the problem. I've tested svn rev. 1523179 (trunk) against jdk8-b114 with and without Vladimir's patch. Without the patch, the test sequence ends in a sigsegv about 50% of the time. With the patch, all executions ended without any errors.

Note that the problem only affects CPUs with the AVX extension. A workaround for affected VMs is to disable vectorization with -XX:-UseSuperWord.

java 7u40 causes sigsegv and corrupt term vectors
Key: LUCENE-5212
URL: https://issues.apache.org/jira/browse/LUCENE-5212
Project: Lucene - Core
Issue Type: Bug
Reporter: Robert Muir
Attachments: crashFaster.patch, crashFaster2.0.patch, hs_err_pid32714.log, jenkins.txt
[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement
[ https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818494#comment-13818494 ]

Shai Erera commented on LUCENE-5316:
------------------------------------

Gilad still hasn't uploaded a new patch w/ the bugfix. About the results: again, the absolute QPS differs a lot? I compared that run to the one before.
RE: [JENKINS] Lucene-Solr-4.x-MacOSX (64bit/jdk1.6.0) - Build # 978 - Failure!
It looks like the whole set of all class files, taken together, is too much on MacOSX with the default heap size ... Does anybody know the default heap size on MacOSX? Is this anywhere documented?

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Policeman Jenkins Server [mailto:jenk...@thetaphi.de]
Sent: Sunday, November 10, 2013 2:13 PM
To: dev@lucene.apache.org
Subject: [JENKINS] Lucene-Solr-4.x-MacOSX (64bit/jdk1.6.0) - Build # 978 - Failure!

[...quoted build log truncated; see the failure message above...]
[jira] [Updated] (LUCENE-4753) Make forbidden API checks per-module
[ https://issues.apache.org/jira/browse/LUCENE-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-4753:
----------------------------------
Fix Version/s: 4.6

Make forbidden API checks per-module
Key: LUCENE-4753
URL: https://issues.apache.org/jira/browse/LUCENE-4753
Project: Lucene - Core
Issue Type: Improvement
Components: general/build
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Priority: Critical
Fix For: 4.6

After the forbidden API checker was released separately from Lucene as a Google Code project (forked and improved), including Maven support, the checks on Lucene should be changed to work per-module. The reason for this: the improved checker is more picky about e.g. extending classes that are forbidden, or overriding methods and calling super.method() if they are on the forbidden signatures list. For these checks it is not enough to have the class files and the rt.jar; you need the whole classpath. Forbidden APIs 1.0 now by default complains if classes are missing from the classpath. With the module architecture of Lucene/Solr it is very hard to make an uber-classpath; instead, the checks should be done per module, so the default compile/test classpath of the module can be used and no crazy path statements with **/*.jar are needed.

This needs some refactoring in the exclusion lists, but the Lucene checks could be done by a macro in common-build that allows custom exclusion lists for specific modules. Currently, the strict checking is disabled for Solr, so the checker only complains about missing classes but does not fail the build:

{noformat}
-check-forbidden-java-apis:
[forbidden-apis] Reading bundled API signatures: jdk-unsafe-1.6
[forbidden-apis] Reading bundled API signatures: jdk-deprecated-1.6
[forbidden-apis] Reading bundled API signatures: commons-io-unsafe-2.1
[forbidden-apis] Reading API signatures: C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr3\lucene\tools\forbiddenApis\executors.txt
[forbidden-apis] Reading API signatures: C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr3\lucene\tools\forbiddenApis\servlet-api.txt
[forbidden-apis] Loading classes to check...
[forbidden-apis] Scanning for API signatures and dependencies...
[forbidden-apis] WARNING: The referenced class 'org.apache.lucene.analysis.uima.ae.AEProviderFactory' cannot be loaded. Please fix the classpath!
[forbidden-apis] WARNING: The referenced class 'org.apache.lucene.analysis.uima.ae.AEProviderFactory' cannot be loaded. Please fix the classpath!
[forbidden-apis] WARNING: The referenced class 'org.apache.lucene.analysis.uima.ae.AEProvider' cannot be loaded. Please fix the classpath!
[forbidden-apis] WARNING: The referenced class 'org.apache.lucene.collation.ICUCollationKeyAnalyzer' cannot be loaded. Please fix the classpath!
[forbidden-apis] Scanned 2177 (and 1222 related) class file(s) for forbidden API invocations (in 1.80s), 0 error(s).
{noformat}

I added almost all missing jars, but those do not seem to be in the solr part of the source tree (I think they are only copied when building artifacts). With making the whole thing per module, we can use the default classpath of the module, which makes it much easier.
[jira] [Updated] (LUCENE-4753) Make forbidden API checks per-module
[ https://issues.apache.org/jira/browse/LUCENE-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-4753:
----------------------------------
Priority: Critical (was: Major)

Recently on MacOSX, with the default heap size, we get OOMs while running the forbidden checker. So we should really do this now. My proposal: move the forbidden targets into common-build.xml in Lucene and Solr. Inside common-build, also define some properties for excludes, so we can define per module which patterns/filesets should be checked.
[jira] [Commented] (LUCENE-4753) Make forbidden API checks per-module
[ https://issues.apache.org/jira/browse/LUCENE-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818499#comment-13818499 ]

Uwe Schindler commented on LUCENE-4753:
---------------------------------------

The Maven builds are already per-module! So we should get the file patterns and targets also synchronized with the definitions in the Maven POMs. I have to say: in this case, the Maven build is better than our Ant build :-( Thanks [~steve_rowe]!!!
Re: [JENKINS] Lucene-Solr-4.x-MacOSX (64bit/jdk1.6.0) - Build # 978 - Failure!
> Does anybody know the default heap size on MacOSX? Is this anywhere documented?

I think it's a heuristic that depends on the environment (didn't inspect the OpenJDK sources)? We could just dump memory limits via the MX bean -- it'd provide an interesting insight into the defaults on different systems ...

Dawid
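For reference, a minimal sketch of the MX-bean dump Dawid suggests (all values printed in bytes):

{code}
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Minimal sketch: print the heap limits the JVM picked on this machine.
public class HeapDefaults {
  public static void main(String[] args) {
    MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    System.out.println("init heap: " + heap.getInit());
    System.out.println("max heap:  " + heap.getMax());
    System.out.println("Runtime.maxMemory(): " + Runtime.getRuntime().maxMemory());
  }
}
{code}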
[jira] [Updated] (LUCENE-5333) Support sparse faceting for heterogeneous indices
[ https://issues.apache.org/jira/browse/LUCENE-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-5333:
-------------------------------
Attachment: LUCENE-5333.patch

Here's the simple way I thought about: AllFacetsAccumulator takes no requests and has two ctors, one for SSDV and another for TaxoReader, and initializes the proper FA underneath, to which it delegates .accumulate(), later filtering out any FacetResult with no children. It's just a means of showing how I think it should be done. We still need to integrate it into FA.create if we want to simplify an app's life even more, though I'd prefer to wait for feedback from anyone who actually uses it first.
[jira] [Commented] (LUCENE-5333) Support sparse faceting for heterogeneous indices
[ https://issues.apache.org/jira/browse/LUCENE-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818525#comment-13818525 ]

Shai Erera commented on LUCENE-5333:
------------------------------------

There's also a third option:
* We add the getDimensions API to SSDVReaderState
* We put an example under demo/ExploreFacetsExample (or a better name)
** We basically demonstrate how to create a ListFR for all available dimensions using either TaxoReader or SSDVReaderState
** And we show how to filter out the empty ones

If one day someone asks how to do it and the example isn't enough, we can think about porting it to an FA or inlining it into the other FAs. But until then, it's really a simple example.
[JENKINS] Lucene-Solr-4.x-MacOSX (64bit/jdk1.6.0) - Build # 979 - Still Failing!
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-MacOSX/979/
Java: 64bit/jdk1.6.0 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC

All tests passed

Build Log:
[...truncated 22550 lines...]
BUILD FAILED
/Users/jenkins/workspace/Lucene-Solr-4.x-MacOSX/build.xml:428: The following error occurred while executing this line:
/Users/jenkins/workspace/Lucene-Solr-4.x-MacOSX/build.xml:67: The following error occurred while executing this line:
/Users/jenkins/workspace/Lucene-Solr-4.x-MacOSX/lucene/build.xml:188: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
        at java.lang.StringBuilder.append(StringBuilder.java:203)
        at de.thetaphi.forbiddenapis.Checker$1$1.checkMethodAccess(Checker.java:475)
        at de.thetaphi.forbiddenapis.Checker$1$1.visitMethodInsn(Checker.java:527)
        at de.thetaphi.forbiddenapis.asm.ClassReader.a(Unknown Source)
        at de.thetaphi.forbiddenapis.asm.ClassReader.b(Unknown Source)
        at de.thetaphi.forbiddenapis.asm.ClassReader.accept(Unknown Source)
        at de.thetaphi.forbiddenapis.asm.ClassReader.accept(Unknown Source)
        at de.thetaphi.forbiddenapis.Checker.checkClass(Checker.java:378)
        at de.thetaphi.forbiddenapis.Checker.run(Checker.java:563)
        at de.thetaphi.forbiddenapis.AntTask.execute(AntTask.java:166)
        at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
        at sun.reflect.GeneratedMethodAccessor462.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
        at org.apache.tools.ant.Task.perform(Task.java:348)
        at org.apache.tools.ant.Target.execute(Target.java:390)
        at org.apache.tools.ant.Target.performTasks(Target.java:411)
        at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1399)
        at org.apache.tools.ant.helper.SingleCheckExecutor.executeTargets(SingleCheckExecutor.java:38)
        at org.apache.tools.ant.Project.executeTargets(Project.java:1251)
        at org.apache.tools.ant.taskdefs.Ant.execute(Ant.java:442)
        at org.apache.tools.ant.taskdefs.SubAnt.execute(SubAnt.java:302)
        at org.apache.tools.ant.taskdefs.SubAnt.execute(SubAnt.java:221)
        at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
        at sun.reflect.GeneratedMethodAccessor462.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)

Total time: 108 minutes 2 seconds
Build step 'Invoke Ant' marked build as failure
Description set: Java: 64bit/jdk1.6.0 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC
Archiving artifacts
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure
[jira] [Created] (SOLR-5434) Create minimal solrcloud example directory
Alan Woodward created SOLR-5434:
-----------------------------------
Summary: Create minimal solrcloud example directory
Key: SOLR-5434
URL: https://issues.apache.org/jira/browse/SOLR-5434
Project: Solr
Issue Type: Improvement
Reporter: Alan Woodward
Assignee: Alan Woodward
Priority: Minor
Fix For: 4.6, 5.0

The various intro-to-SolrCloud pages (for example https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud) currently tell new users to use the example/ directory as a basis for setting up new cloud instances. These directories contain, under the default solr/ solr home directory, a single core, defined to point to the collection1 collection. It's not at all obvious that, to change the name of your collection, you have to go and edit the core.properties file underneath the solr/ directory. A lot of users on the mailing list also seem to get confused by having to include bootstrap_confdir and numShards the first time they run Solr, but not afterwards.

So here's a suggestion:
* Have a new solrcloud/ directory in the example webapp that just contains a solr.xml file
* Change the startup example code to just include -Dsolr.solr.home and -DzkRun
* Tell the user to then run zkcli to bootstrap their configuration (Solr startup and configuration loading are kept separate)
* Tell the user to use the collections API to create a new collection, naming it however they want (config names, collection names and core names are all kept separate)

This way, there's a lot less 'magic' and hidden defaults involved, and all the steps to get a cloud up and running (start processes, upload configuration, create collection) are made distinguishable.
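To make the proposed flow concrete, a hedged sketch of the three steps against a Solr 4.x example layout; directory, config and collection names are illustrative, and the embedded ZK started by -DzkRun conventionally listens on the Solr port + 1000:

{noformat}
# 1. start Solr with a bare solr home (no pre-defined core) and embedded ZK:
java -Dsolr.solr.home=solrcloud -DzkRun -jar start.jar

# 2. upload a named configuration, decoupled from startup:
cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig \
  -confdir my_conf_dir -confname my_conf

# 3. create the collection explicitly via the collections API:
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=my_collection&numShards=2&collection.configName=my_conf'
{noformat}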
[jira] [Updated] (LUCENE-4753) Make forbidden API checks per-module
[ https://issues.apache.org/jira/browse/LUCENE-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-4753:
----------------------------------
Attachment: LUCENE-4753.patch

Patch. I will commit this soon if nobody objects. There is still room for improvement (e.g. we can now enable servlet-api checks in some Lucene modules that use servlets, or enable commons-io checks for Lucene modules that use commons-io).

Make forbidden API checks per-module
------------------------------------
Key: LUCENE-4753
URL: https://issues.apache.org/jira/browse/LUCENE-4753
Project: Lucene - Core
Issue Type: Improvement
Components: general/build
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Priority: Critical
Fix For: 4.6
Attachments: LUCENE-4753.patch

After the forbidden API checker was released separately from Lucene as a Google Code project (forked and improved), including Maven support, the checks on Lucene should be changed to work per module. The reason: the improved checker is more picky about e.g. extending classes that are forbidden, or overriding methods and calling super.method() when they are on the forbidden signatures list. For these checks it is not enough to have the class files and rt.jar; you need the whole classpath. forbidden-apis 1.0 now by default complains if classes are missing from the classpath. With the module architecture of Lucene/Solr it is very hard to build an uber-classpath; instead the checks should be done per module, so the default compile/test classpath of the module can be used and no crazy path statements with **/*.jar are needed. This needs some refactoring of the exclusion lists, but the Lucene checks could be done by a macro in common-build that allows custom exclusion lists for specific modules. Currently, strict checking is disabled for Solr, so the checker only complains about missing classes but does not fail the build:
{noformat}
-check-forbidden-java-apis:
[forbidden-apis] Reading bundled API signatures: jdk-unsafe-1.6
[forbidden-apis] Reading bundled API signatures: jdk-deprecated-1.6
[forbidden-apis] Reading bundled API signatures: commons-io-unsafe-2.1
[forbidden-apis] Reading API signatures: C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr3\lucene\tools\forbiddenApis\executors.txt
[forbidden-apis] Reading API signatures: C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr3\lucene\tools\forbiddenApis\servlet-api.txt
[forbidden-apis] Loading classes to check...
[forbidden-apis] Scanning for API signatures and dependencies...
[forbidden-apis] WARNING: The referenced class 'org.apache.lucene.analysis.uima.ae.AEProviderFactory' cannot be loaded. Please fix the classpath!
[forbidden-apis] WARNING: The referenced class 'org.apache.lucene.analysis.uima.ae.AEProviderFactory' cannot be loaded. Please fix the classpath!
[forbidden-apis] WARNING: The referenced class 'org.apache.lucene.analysis.uima.ae.AEProvider' cannot be loaded. Please fix the classpath!
[forbidden-apis] WARNING: The referenced class 'org.apache.lucene.collation.ICUCollationKeyAnalyzer' cannot be loaded. Please fix the classpath!
[forbidden-apis] Scanned 2177 (and 1222 related) class file(s) for forbidden API invocations (in 1.80s), 0 error(s).
{noformat}
I added almost all missing jars, but those do not seem to be in the Solr part of the source tree (I think they are only copied when building artifacts). With making the whole thing per module, we can use the default classpath of the module, which makes everything much easier.
[jira] [Updated] (SOLR-5152) EdgeNGramFilterFactory deletes token
[ https://issues.apache.org/jira/browse/SOLR-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Furkan KAMACI updated SOLR-5152:
--------------------------------
Attachment: SOLR-5152.patch

EdgeNGramFilterFactory deletes token
------------------------------------
Key: SOLR-5152
URL: https://issues.apache.org/jira/browse/SOLR-5152
Project: Solr
Issue Type: Improvement
Affects Versions: 4.4
Reporter: Christoph Lingg
Attachments: SOLR-5152.patch

I am using EdgeNGramFilterFactory in my schema.xml:
{code:xml}
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- ... -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10" side="front" />
  </analyzer>
</fieldType>
{code}
Some tokens in my index consist of only one character, let's say {{R}}. minGramSize is set to 2, which is bigger than the length of the token. I expected the NGramFilter to leave {{R}} unchanged, but in fact it deletes the token. For my use case this interpretation is undesirable, and probably for most use cases too.
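The reported behavior can be reproduced directly against the underlying Lucene filter. Below is a minimal sketch, assuming the Lucene 4.x analysis APIs (EdgeNGramTokenFilter is what the factory wraps); the class name is made up for illustration.

{code}
// Minimal sketch of the reported behavior: a token shorter than minGramSize
// produces no output grams at all, i.e. "R" is silently dropped.
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class EdgeNGramDropDemo {
  public static void main(String[] args) throws Exception {
    TokenStream ts = new EdgeNGramTokenFilter(Version.LUCENE_44,
        new KeywordTokenizer(new StringReader("R")), 2, 10);
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term); // never reached: no grams are emitted for "R"
    }
    ts.end();
    ts.close();
  }
}
{code}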
[jira] [Commented] (SOLR-5152) EdgeNGramFilterFactory deletes token
[ https://issues.apache.org/jira/browse/SOLR-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818583#comment-13818583 ]

Furkan KAMACI commented on SOLR-5152:
-------------------------------------

I've added a preserveOriginal capability to EdgeNGramFilterFactory and attached a patch.
[jira] [Updated] (LUCENE-4753) Make forbidden API checks per-module
[ https://issues.apache.org/jira/browse/LUCENE-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-4753:
----------------------------------
Attachment: LUCENE-4753.patch

New patch, removed useless dependency.
[jira] [Commented] (SOLR-5332) Add preserve original setting to the EdgeNGramFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818596#comment-13818596 ]

Furkan KAMACI commented on SOLR-5332:
-------------------------------------

This issue can be marked as a duplicate of SOLR-5152: https://issues.apache.org/jira/browse/SOLR-5152

Add preserve original setting to the EdgeNGramFilterFactory
-----------------------------------------------------------
Key: SOLR-5332
URL: https://issues.apache.org/jira/browse/SOLR-5332
Project: Solr
Issue Type: Wish
Reporter: Alexander S.

Hi, as described here: http://lucene.472066.n3.nabble.com/Help-to-figure-out-why-query-does-not-match-td4086967.html the problem is that if you have these two strings to index:
1. facebook.com/someuser.1
2. facebook.com/someveryandverylongusername
and the edge ngram filter factory with min and max gram size settings of 2 and 25, search requests for these URLs will fail. But search requests for:
1. facebook.com/someuser
2. facebook.com/someveryandverylonguserna
will work properly. That's because the first URL ends with 1, which is lower than the allowed min gram size, and in the second URL the user name is longer than the max gram size (27 characters). It would be good to have a preserve original option that adds the original string to the index when it does not fit the allowed gram sizes, so that the 1 and someveryandverylongusername tokens will also be added to the index. Best, Alex
[jira] [Commented] (SOLR-5332) Add preserve original setting to the EdgeNGramFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818593#comment-13818593 ]

Furkan KAMACI commented on SOLR-5332:
-------------------------------------

I've added a preserveOriginal capability to EdgeNGramFilterFactory and attached a patch to SOLR-5152. I also want to clarify something about the problem described in this issue. The schema described at http://lucene.472066.n3.nabble.com/Help-to-figure-out-why-query-does-not-match-td4086967.html uses LowerCaseFilterFactory before EdgeNGramFilterFactory. The documentation at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseTokenizerFactory says: "Creates tokens by lowercasing all letters and dropping non-letters." So non-letters will be dropped before tokens reach EdgeNGramFilterFactory. My patch preserves the original token if preserveOriginal is set to true and the token length is less than minGramSize or greater than maxGramSize.
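For illustration, the preserve condition could be wired into a token filter roughly as follows. This is a hypothetical sketch of the idea, not the attached SOLR-5152 patch, and the actual gram generation is elided.

{code}
// Hypothetical sketch of the preserveOriginal condition described above, not
// the actual patch: a term whose length falls outside [minGramSize, maxGramSize]
// is passed through unchanged instead of being dropped; in-range terms would be
// expanded into edge n-grams as before.
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

final class PreserveOriginalSketch extends TokenFilter {
  private final int minGram, maxGram;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  PreserveOriginalSketch(TokenStream input, int minGram, int maxGram) {
    super(input);
    this.minGram = minGram;
    this.maxGram = maxGram;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    int len = termAtt.length();
    if (len < minGram || len > maxGram) {
      return true; // preserve the original token rather than deleting it
    }
    // ... here the real filter would emit the edge n-grams of the term ...
    return true;
  }
}
{code}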
[jira] [Updated] (SOLR-5152) EdgeNGramFilterFactory deletes token
[ https://issues.apache.org/jira/browse/SOLR-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Furkan KAMACI updated SOLR-5152:
--------------------------------
Attachment: (was: SOLR-5152.patch)
[jira] [Updated] (SOLR-5152) EdgeNGramFilterFactory deletes token
[ https://issues.apache.org/jira/browse/SOLR-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Furkan KAMACI updated SOLR-5152:
--------------------------------
Attachment: SOLR-5152.patch
[jira] [Updated] (LUCENE-4753) Make forbidden API checks per-module
[ https://issues.apache.org/jira/browse/LUCENE-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-4753:
----------------------------------
Attachment: LUCENE-4753.patch

Final patch. Will commit in a moment.
Adding preserveOriginal Capability to EdgeNGramFilterFactory
Hi;

There were two issues about adding a preserveOriginal capability to EdgeNGramFilterFactory, and I've made a patch for it. You can check and test it here: https://issues.apache.org/jira/browse/SOLR-5152

This is the related issue that can be marked as a duplicate: https://issues.apache.org/jira/browse/SOLR-5332

Thanks;
Furkan KAMACI
[jira] [Commented] (LUCENE-4753) Make forbidden API checks per-module
[ https://issues.apache.org/jira/browse/LUCENE-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818603#comment-13818603 ]

ASF subversion and git services commented on LUCENE-4753:
----------------------------------------------------------

Commit 1540573 from [~thetaphi] in branch 'dev/trunk' [ https://svn.apache.org/r1540573 ]
LUCENE-4753: Run forbidden-apis Ant task per module. This allows more improvements and prevents OOMs after the number of class files raised recently
[jira] [Resolved] (LUCENE-4753) Make forbidden API checks per-module
[ https://issues.apache.org/jira/browse/LUCENE-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler resolved LUCENE-4753.
-----------------------------------
Resolution: Fixed
[jira] [Commented] (LUCENE-4753) Make forbidden API checks per-module
[ https://issues.apache.org/jira/browse/LUCENE-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818609#comment-13818609 ]

ASF subversion and git services commented on LUCENE-4753:
----------------------------------------------------------

Commit 1540575 from [~thetaphi] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1540575 ]
Merged revision(s) 1540573 from lucene/dev/trunk: LUCENE-4753: Run forbidden-apis Ant task per module. This allows more improvements and prevents OOMs after the number of class files raised recently
[jira] [Commented] (LUCENE-4753) Make forbidden API checks per-module
[ https://issues.apache.org/jira/browse/LUCENE-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818616#comment-13818616 ]

Uwe Schindler commented on LUCENE-4753:
---------------------------------------

FYI: I opened [https://code.google.com/p/forbidden-apis/issues/detail?id=20] to improve the memory usage of forbidden-apis.
[jira] [Commented] (SOLR-5374) Support user configured doc-centric versioning rules
[ https://issues.apache.org/jira/browse/SOLR-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818637#comment-13818637 ]

Yonik Seeley commented on SOLR-5374:
------------------------------------

bq. should we change the commented logging to log.debug?

I only left them there (commented out) in case I needed to try and debug again in the short term. They are not of the quality one would want for the long term; I'd rather they be deleted than changed to logs.

Support user configured doc-centric versioning rules
-----------------------------------------------------
Key: SOLR-5374
URL: https://issues.apache.org/jira/browse/SOLR-5374
Project: Solr
Issue Type: Improvement
Reporter: Hoss Man
Assignee: Hoss Man
Fix For: 4.6, 5.0
Attachments: SOLR-5374.patch, SOLR-5374.patch, SOLR-5374.patch, SOLR-5374.patch, SOLR-5374.patch, SOLR-5374.patch

The existing optimistic concurrency features of Solr can be very handy for ensuring that you are only updating/replacing the version of the doc you think you are updating/replacing, w/o the risk of someone else adding/removing the doc in the mean time -- but I've recently encountered some situations where I really wanted to be able to let the client specify an arbitrary version on a per-document basis (ie: generated by an external system, or perhaps a timestamp of when a file was last modified) and ensure that the corresponding document update was processed only if the new version is greater than the old version -- w/o needing to check exactly which version is currently in Solr. (ie: if a client wants to index version 101 of a doc, that update should fail if version 102 is already in the index, but succeed if the currently indexed version is 99 -- w/o the client needing to ask Solr what the current version is.)

The idea Yonik brought up in SOLR-5298 (letting the client specify a {{\_new\_version\_}} that would be used by the existing optimistic concurrency code to control the assignment of the {{\_version\_}} field for documents) looked like a good direction to go -- but after digging into the way {{\_version\_}} is used internally, I realized it requires a uniqueness constraint across all update commands, which would make it impossible to allow multiple independent documents to have the same {{\_version\_}}. So instead I've tackled the problem in a different way, using an UpdateProcessor that is configured with a user-defined field to track a DocBasedVersion and uses the RTG logic to figure out if the update is allowed.
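The acceptance rule the description spells out boils down to a single comparison. Here is a minimal sketch with a hypothetical helper, not the actual update processor from the patch.

{code}
// Hypothetical helper illustrating the doc-centric rule described above: an
// update is applied only when the externally supplied version is greater than
// the version currently indexed for that document.
public class DocVersionRule {
  static boolean acceptUpdate(Long indexedVersion, long newVersion) {
    if (indexedVersion == null) {
      return true; // no existing document: any version is accepted
    }
    return newVersion > indexedVersion;
  }

  public static void main(String[] args) {
    System.out.println(acceptUpdate(99L, 101));  // true: 101 supersedes 99
    System.out.println(acceptUpdate(102L, 101)); // false: 102 is already indexed
  }
}
{code}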
[jira] [Created] (LUCENE-5337) Add Payload support to FileDictionary (Suggest) and make it more configurable
Areek Zillur created LUCENE-5337:
------------------------------------
Summary: Add Payload support to FileDictionary (Suggest) and make it more configurable
Key: LUCENE-5337
URL: https://issues.apache.org/jira/browse/LUCENE-5337
Project: Lucene - Core
Issue Type: Improvement
Components: core/search
Reporter: Areek Zillur

It would be nice to add payload support to FileDictionary, so users can pass in an associated payload with suggestion entries. Currently FileDictionary has a hard-coded field delimiter (TAB); it would be nice to let users configure the field delimiter as well.
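For illustration, here is a parsing sketch of what a configurable, payload-carrying line format might look like. The format, class and method names are assumptions for this sketch only; the ticket merely proposes the feature.

{code}
// Hypothetical sketch, not the current FileDictionary: each line carries
// term<delim>weight<delim>payload with a configurable delimiter; the weight
// and payload columns are optional.
import java.util.regex.Pattern;

public class SuggestLineParser {
  static String[] parse(String line, String delimiter) {
    // limit 3: the payload itself may contain the delimiter
    return line.split(Pattern.quote(delimiter), 3);
  }

  public static void main(String[] args) {
    String[] fields = parse("lucene|42|{\"id\":7}", "|");
    String term = fields[0];                                          // "lucene"
    long weight = fields.length > 1 ? Long.parseLong(fields[1]) : 1;  // 42
    String payload = fields.length > 2 ? fields[2] : null;            // {"id":7}
    System.out.println(term + " " + weight + " " + payload);
  }
}
{code}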