Re: maven indexing tweaks

Michael Bien Fri, 17 Mar 2023 17:31:33 -0700

On 17.03.23 22:38, Antonio wrote:

Hi,


These are impressive savings!

yeah, I am pretty happy about the results too. Esp. the removal of the sha1 field had a great effect. Technically we do offer this as a query through the public API; however, it doesn't appear that anything is using it. I have to take another look just to be sure. Even if something does, we could make it an option in the settings.

Out of curiosity, we don't build the index incrementally using Maven's IndexReader, do we? That's why we download the whole index, right?

first use will download the whole copy, weekly updates are incremental. And yes, it uses DefaultIndexReader (and the updater) of the maven-indexer project.

Which is the reason why we have to make some tweaks upstream to get more flexibility (and filtering). For example, some time in the future we might want to change where the temp extraction storage used by maven-indexer is located, which is also part of the PR currently proposed upstream.

https://repo1.maven.org/maven2/.index/ has the compressed data for central (apache etc. have their own locations, but those indices are smaller so you barely notice anything).

Currently the lucene index isn't moved into the new NetBeans config from old caches. This is something we could take a look at too, but things like this are super annoying to test and risky, since someone will find a way to import an index from a 10 year old backup and report that something fails (just like users who try to import nb-javac from NB 12.x, which breaks pretty much everything).


-mbien

Thanks,
Antonio


[1]
https://maven.apache.org/maven-indexer/indexer-reader/apidocs/org/apache/maven/index/reader/IndexReader.html
On 17/3/23 11:06, Michael Bien wrote:
Hello everyone,
I experimented a bit with the maven index extraction process and got some pretty good results (I think).
There might be a way to filter the index during extraction without noteworthy overhead, which allows the following:
- "sliding window" time filters, e.g. drop all documents older than 2 years (aka: who uses old libraries?)
- we can drop fields we don't need from the index. Esp. interesting for fields which don't compress well (looking at you, sha1 hash)
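To make the two filters above concrete, here is a minimal sketch in plain Java. It does not use the maven-indexer or Lucene APIs; records are modeled as simple string maps, and the class name `IndexFilterSketch` and the field names `lastModified` and `sha1` are assumptions chosen for illustration only. Keeping records that have no timestamp is likewise just a policy choice of this sketch.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class IndexFilterSketch {

    // Hypothetical field names, not taken from the real index format.
    public static final String LAST_MODIFIED = "lastModified"; // epoch millis as text
    public static final String SHA1 = "sha1";

    /**
     * Keeps records modified at or after the cutoff and removes the
     * given fields from every record that is kept.
     */
    public static List<Map<String, String>> filter(List<Map<String, String>> records,
                                                   Instant cutoff,
                                                   Set<String> droppedFields) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> record : records) {
            String stamp = record.get(LAST_MODIFIED);
            // Sliding window: skip anything older than the cutoff.
            if (stamp != null && Instant.ofEpochMilli(Long.parseLong(stamp)).isBefore(cutoff)) {
                continue;
            }
            // Field filter: copy the record, then strip unwanted fields.
            Map<String, String> slim = new LinkedHashMap<>(record);
            slim.keySet().removeAll(droppedFields);
            kept.add(slim);
        }
        return kept;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Instant cutoff = now.minus(730, ChronoUnit.DAYS); // roughly two years

        List<Map<String, String>> records = List.of(
                Map.of("id", "old-lib", LAST_MODIFIED,
                        String.valueOf(now.minus(3000, ChronoUnit.DAYS).toEpochMilli()),
                        SHA1, "aaaa"),
                Map.of("id", "new-lib", LAST_MODIFIED,
                        String.valueOf(now.minus(10, ChronoUnit.DAYS).toEpochMilli()),
                        SHA1, "bbbb"));

        // Only "new-lib" survives, and its sha1 field is gone.
        System.out.println(filter(records, cutoff, Set.of(SHA1)));
    }
}
```

Doing both checks in the same pass over each record is what keeps the overhead small in this sketch: every record is visited once and is either skipped or copied without the dropped fields.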
some results for the time cutoff filter:

full: 5.6 GB
2y: 2.6 GB
1y: 1.4 GB
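
For a quick sense of scale, the relative savings behind these numbers can be computed directly. This is only arithmetic on the three sizes quoted above; the class and method names are made up for the example.

```java
public class IndexSizeSavings {

    /** Returns the percentage saved relative to the full index size. */
    public static double savedPercent(double fullGb, double filteredGb) {
        return (1.0 - filteredGb / fullGb) * 100.0;
    }

    public static void main(String[] args) {
        double full = 5.6; // GB, unfiltered index
        System.out.printf("2y cutoff saves %.1f%%%n", savedPercent(full, 2.6));
        System.out.printf("1y cutoff saves %.1f%%%n", savedPercent(full, 1.4));
    }
}
```

That works out to a bit more than half of the index dropped with the 2 year window, and three quarters with the 1 year window.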
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@netbeans.apache.org
For additional commands, e-mail: dev-h...@netbeans.apache.org

For further information about the NetBeans mailing lists, visit:
https://cwiki.apache.org/confluence/display/NETBEANS/Mailing+lists



