Re: Using the Maven Indexer
Hi,

I've just created 2 pull requests:

https://github.com/apache/maven-indexer/pull/10 (indexer-cli not working as expected)
https://github.com/apache/maven-indexer/pull/11 (what I mentioned about the ArtifactInfo's repositoryID being null)

I hope you have the time to have a look. I'm particularly interested in #11. If it is accepted, I might continue exploring option 1), as detailed in my previous post, since I am currently blocked by the problem described in 1.1).

Thanks,
Eduard

On Mon, Dec 8, 2014 at 10:26 PM, Tamas Cservenak wrote:

> Hi Eduard,
>
> for additional information see:
> http://jira.codehaus.org/browse/MINDEXER-81
>
> Currently, the ArtifactInfo is hardwired and not extensible.
>
> Regarding the available index for Central: the decision driver is not
> “minimal usable” but the SIZE of the index download. We were
> experimenting with different creators, but the bandwidth they took
> (compared to artifact downloads) was really huge.
>
> Almost everyone uses MRMs, and they tend to “improve” the basic GAV
> index Central publishes (i.e. once Nexus caches a JAR file, it will
> “improve” the index with the class names in the JAR too, something
> Central does not publish).
>
> artifactInfo#repoId should not return null if asked via context.
> If it does, there is a bug lurking somewhere.
>
> Currently the “extra info” path is viable, but that would create a lot
> of cruft around the indexer classes.
>
> --
> Thanks,
> ~t~
>
> On 8 Dec 2014 at 17:15:08, Eduard Moraru (enygma2...@gmail.com) wrote:
> > [...]
Re: Using the Maven Indexer
Hi,

I have a new challenge for your maven-indexer expertise :)

What about adding additional information to the local index? I see the default indexers (min, etc.) produce really minimal information. The problem is that everybody is using these default indexers, and all the available indexes (Maven Central, etc.) offer very little information that you can actually use to make the index useful in an application beyond really basic name, description, group, artifact, etc. queries.

For instance, if I wanted to add author information (to query by author) or dependency information (to perform compatibility checks against an installation/group of installed artifacts) or anything else for that matter, what would be the recommended approach?

From what I have researched so far, I see 2 options:

1) Have a custom IndexCreator that uses the updateDocument(ArtifactInfo artifactInfo, Document document) method to fetch (HTTP GET) the pom.xml by using information from the artifactInfo object (repository, groupId, artifactId, classifier, version, etc.) so that the resulting document contains the extra information. It seems that IndexCreators are used a lot more than they are advertised in the descriptions, not only for indexing new items, but also when converting between ArtifactInfo objects and Lucene Documents.

1.1) I had initially started going down this path, but then I realized that the artifactInfo that I receive in this method does not provide basic information (i.e. artifactInfo.getRepository() always returns null ;-( ). It would be awesome if information like context and/or repository were added to the artifactInfo object (maybe in IndexUtils.constructArtifactInfo( Document doc, IndexingContext context )?), the same way the ArtifactInfo.UINFO and ArtifactInfo.LAST_MODIFIED fields are handled specially and explicitly added to a new Document that is passed to the IndexCreators.

2) Handle this separately from maven-indexer's work, and do it right after index/update operations, i.e. let maven-indexer update the local index with information from the remote index and then start manipulating the underlying Lucene index by adding information retrieved over the network (HTTP GET) from the remote repository's POM files. In rough pseudocode, something like:

    indexer.update(repoX);
    indexer.getAllIndexedArtifacts().forEach(artifact -> {
        var extraData = getExtraData(repoX, artifact);
        indexer.getLuceneIndex().add(artifact, extraData);
    });

3) Any other suggestions?

My ultimate goal is (besides basic name/description queries) to be able to perform compatibility queries on artifacts coming from multiple repositories, so I need to find a solution to add this missing information (artifact dependencies, and maybe more).

As previously, your help and suggestions are most welcome.

Thanks,
Eduard

On Wed, Nov 26, 2014 at 1:22 PM, Eduard Moraru wrote:
> [...]
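
Option 1 above depends on knowing where the POM lives, which is why a null repository in the ArtifactInfo blocks it. A minimal sketch of that URL construction follows, assuming the standard Maven 2 repository layout and a release (non-SNAPSHOT) version. `PomUrls` and `pomUrl` are hypothetical names, not part of maven-indexer, and the repository URL is passed in explicitly because the thread reports that it is not available from the ArtifactInfo:

```java
import java.net.URI;
import java.util.Objects;

/** Hypothetical helper for illustration; not part of maven-indexer. */
final class PomUrls {
    private PomUrls() {}

    /**
     * Builds the URL of an artifact's POM, assuming the standard Maven 2 layout:
     * groupId (dots as slashes) / artifactId / version / artifactId-version.pom
     */
    static URI pomUrl(String repositoryUrl, String groupId, String artifactId, String version) {
        // A missing repository is exactly the situation described in 1.1: fail early.
        Objects.requireNonNull(repositoryUrl, "repositoryUrl must be known to locate the POM");
        // Normalise the base so exactly one slash separates it from the path.
        String base = repositoryUrl.endsWith("/") ? repositoryUrl : repositoryUrl + "/";
        String path = groupId.replace('.', '/') + "/" + artifactId + "/" + version
                + "/" + artifactId + "-" + version + ".pom";
        return URI.create(base + path);
    }
}
```

With the made-up coordinates `com.example:demo:1.0` and the base `https://repo.example.org/maven2`, this yields `https://repo.example.org/maven2/com/example/demo/1.0/demo-1.0.pom`. SNAPSHOT versions use timestamped file names and would need the repository's maven-metadata.xml, which this sketch does not handle.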
Re: Using the Maven Indexer
On Tue, Nov 25, 2014 at 12:22 PM, Tamas Cservenak wrote:

> Hi there,
>
> 1) Yes, the indexing context retains the artefact “origin” (i.e. repo), so you
> need a context per origin. Sadly, one index per context is a current
> limitation of maven indexer, but this problem is known. Created
> http://jira.codehaus.org/browse/MINDEXER-93
>
> 2) Yes, the merged context basically delegates to its member contexts. Under
> the hood, it uses Lucene’s MultiReader to actually perform the search.

I have solved the search problem for now by using the SearchEngine component and issuing an IteratorSearchRequest on a list of IndexingContexts to get paginated results. I will have to see how that works in the long run.

Thanks,
Eduard

> Re ranging, there are already issues (or the problem is spread across
> multiple issues), most notably this one:
> http://jira.codehaus.org/browse/MINDEXER-8
>
> 3) I think yes. Currently, the indexer is being transitioned from Plexus to
> JSR330, and as you see in the examples, it should work with any container
> supporting it. Re “manually wiring”: in the latest releases you might be able
> to do it, but in older ones probably not, as Plexus supported field
> injection only, and some of those members were not exposed via getter/setter.
> See
> http://jira.codehaus.org/browse/MINDEXER-80
>
> --
> Thanks,
> ~t~
>
> On 21 Nov 2014 at 18:08:26, Eduard Moraru (enygma2...@gmail.com) wrote:
> > [...]
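
The paginated search over several contexts mentioned above can be pictured with a small, library-independent sketch. It assumes each per-repository index has already returned its own scored hits. `MergedSearch`, `Hit` and `page` are hypothetical names and do not reproduce SearchEngine, IteratorSearchRequest or Lucene's MultiReader; the sketch only shows hits keeping their origin repository while one page is cut from the combined list:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not maven-indexer or Lucene API. */
final class MergedSearch {
    private MergedSearch() {}

    /** One search hit, tagged with the repository (context) it came from. */
    static final class Hit {
        final String repositoryId;
        final String gav;
        final float score;

        Hit(String repositoryId, String gav, float score) {
            this.repositoryId = repositoryId;
            this.gav = gav;
            this.score = score;
        }
    }

    /** Returns up to {@code count} hits starting at offset {@code start}, best score first. */
    static List<Hit> page(Map<String, List<Hit>> hitsByRepository, int start, int count) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : hitsByRepository.values()) {
            all.addAll(hits);
        }
        // Highest score first; ties broken by GAV, then repository id, so paging is stable.
        all.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score);
            if (byScore != 0) {
                return byScore;
            }
            int byGav = a.gav.compareTo(b.gav);
            return byGav != 0 ? byGav : a.repositoryId.compareTo(b.repositoryId);
        });
        // Clamp the requested window to the available hits.
        int from = Math.min(Math.max(start, 0), all.size());
        int to = (int) Math.min((long) from + Math.max(count, 0), all.size());
        return new ArrayList<>(all.subList(from, to));
    }
}
```

Sorting by raw score is only meaningful if the scores were computed against comparable statistics, which is the concern raised in question 2 of the original post.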
Re: Using the Maven Indexer
Hi,

I have a new question: How can I index a remote repository?

All the examples I have found, and even the NexusIndexerCli, seem to be focused on indexing *only* local repositories and then publishing this index for consumption. I do not want to do that. Is there any way I can pass a URL to an IndexPackingRequest instead of a (local) directory?

Basically, my use case is:

1. Take a maven URL
2. If it has an index already created, use it through an IndexUpdatingRequest
3. If not, create a local index (IndexPackingRequest?)
4. In both cases, I then need to periodically update/synchronize my local index of the remote repository.

Any help is deeply appreciated.

Thanks,
Eduard

On Fri, Nov 21, 2014 at 7:07 PM, Eduard Moraru wrote:
> [...]
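
The branch between steps 2 and 3 of the use case above can be sketched without any indexer classes. This assumes that a published index is recognisable by a properties file under `.index/` (the location used on Central, treated here as an assumption for other repositories), and it takes the existence check as a parameter so that no network access is hard-coded. `RemoteIndexPlan`, `Action`, `indexPropertiesUrl` and `choose` are hypothetical names, not maven-indexer API:

```java
import java.net.URI;
import java.util.Objects;
import java.util.function.Predicate;

/** Hypothetical planning helper; not part of maven-indexer. */
final class RemoteIndexPlan {
    private RemoteIndexPlan() {}

    /** The two branches of steps 2 and 3. */
    enum Action { FETCH_PUBLISHED_INDEX, BUILD_LOCAL_INDEX }

    // Assumption: published indexes sit under ".index/" with this file name.
    static final String INDEX_PROPERTIES = ".index/nexus-maven-repository-index.properties";

    /** URL at which a published index's properties file would be expected. */
    static URI indexPropertiesUrl(String repositoryUrl) {
        Objects.requireNonNull(repositoryUrl, "repositoryUrl");
        String base = repositoryUrl.endsWith("/") ? repositoryUrl : repositoryUrl + "/";
        return URI.create(base + INDEX_PROPERTIES);
    }

    /**
     * Chooses step 2 when the caller-supplied probe reports that the properties
     * file exists (for example via an HTTP HEAD request), and step 3 otherwise.
     */
    static Action choose(String repositoryUrl, Predicate<URI> exists) {
        return exists.test(indexPropertiesUrl(repositoryUrl))
                ? Action.FETCH_PUBLISHED_INDEX
                : Action.BUILD_LOCAL_INDEX;
    }
}
```

The sketch deliberately stops at the decision. How to carry out BUILD_LOCAL_INDEX for a repository that is only reachable by URL is the open question of this message, and step 4 would amount to re-running the chosen action on a schedule.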
Using the Maven Indexer
Hi,

I have recently started playing with the maven indexer [1], following the examples [2], and I have some questions (since AFAIS, documentation is practically nonexistent on the matter):

1) From what I can understand, you need an IndexingContext for each repository you plan to index. This makes you end up with n Lucene indexes, one for each repository. Is there any way that I could have just 1 Lucene index, with all my repositories indexed in the same place? If the main purpose is searching, why scatter the indexed information across n indexes and make the whole process difficult? Maybe I'm missing something.

2) Along the same lines as the first question, when it comes to searching, it seems that I can use a MergedIndexingContext to perform a search on multiple (all) indexed repositories (IndexingContexts). How does this merge the search results? I assume it takes each Lucene index and queries it individually, but this probably means that the Lucene scores of these merged results are completely messed up and unreliable, right? Any suggestions on how to properly perform a search over multiple indexed repositories?

3) About the Plexus Container: Am I forced to initialize and use one, or can I (or should I) manually instantiate the default implementations and use them instead?

I'll probably come up with more questions along the way; I hope someone will find the time to guide me on the right path.

Thanks,
Eduard

--
[1] https://github.com/apache/maven-indexer/
[2] https://github.com/apache/maven-indexer/tree/master/indexer-examples/indexer-examples-basic