[ https://issues.apache.org/jira/browse/SOLR-8492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264923#comment-15264923 ]
Cao Manh Dat commented on SOLR-8492:
------------------------------------

[~joel.bernstein] the memory leak appears in the LogitCall class: we create a SolrClient and never close it.

> Add LogisticRegressionQuery and LogitStream
> -------------------------------------------
>
>                 Key: SOLR-8492
>                 URL: https://issues.apache.org/jira/browse/SOLR-8492
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Joel Bernstein
>             Fix For: 6.1
>
>         Attachments: SOLR-8492.diff, SOLR-8492.diff, SOLR-8492.patch,
>                      SOLR-8492.patch, SOLR-8492.patch, SOLR-8492.patch,
>                      SOLR-8492.patch, SOLR-8492.patch, SOLR-8492.patch,
>                      SOLR-8492.patch, logit.csv
>
>
> This ticket is to add a new query called a LogisticRegressionQuery (LRQ).
> The LRQ extends AnalyticsQuery
> (http://joelsolr.blogspot.com/2015/12/understanding-solrs-analyticsquery.html)
> and returns a DelegatingCollector that implements a Stochastic Gradient
> Descent (SGD) optimizer for Logistic Regression.
> This ticket also adds the LogitStream, which leverages Streaming Expressions
> to provide iteration over the shards. Each call to LogitStream.read() calls
> down to the shards and executes the LogisticRegressionQuery. The model data
> is collected from the shards and the weights are averaged and sent back to
> the shards with the next iteration. Each call to read() returns a Tuple with
> the averaged weights and error from the shards. With this approach the
> LogitStream streams the changing model back to the client after each
> iteration.
> The LogitStream will return the EOF Tuple when it reaches the defined
> maxIterations. When sent as a Streaming Expression to the Stream handler this
> provides parallel iterative behavior. This same approach can be used to
> implement other parallel iterative algorithms.
> The initial patch has a test which simply tests the mechanics of the
> iteration. More work will need to be done to ensure the SGD is properly
> implemented. The distributed approach of the SGD will also need to be
> reviewed.
> This implementation is designed for use cases with a small number of features
> because each feature is its own discrete field.
> An implementation which supports a higher number of features would be
> possible by packing features into a byte array and storing as binary
> DocValues.
> This implementation is designed to support a large sample set. With a large
> number of shards, a sample set into the billions may be possible.
> Sample Streaming Expression syntax:
> {code}
> logit(collection1, features="a,b,c,d,e,f" outcome="x" maxIterations="80")
> {code}
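
For illustration only, here is a rough sketch of one iteration as the description above outlines it: each shard is called with the current weights, the returned weights are averaged, and every per-shard client is closed when its call finishes, which is the point of the leak report. This is not the code in the attached patches. ShardClient, ShardClientFactory and LogitIterationSketch are hypothetical names, the plain arithmetic mean is an assumption about how "averaged" is meant, and a real version would use a SolrJ client (SolrClient implements Closeable, so try-with-resources applies to it as well).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LogitIterationSketch {

    /** Hypothetical stand-in for a per-shard client; not a SolrJ type. */
    interface ShardClient extends AutoCloseable {
        /** Runs one training pass on the shard and returns its updated weights. */
        double[] train(double[] weights);

        /** Narrowed so that closing does not throw a checked exception. */
        @Override
        void close();
    }

    /** Hypothetical factory that opens a client for a given shard URL. */
    interface ShardClientFactory {
        ShardClient open(String shardUrl);
    }

    /** Element-wise arithmetic mean of the weight vectors returned by the shards. */
    static double[] averageWeights(List<double[]> shardWeights) {
        if (shardWeights.isEmpty()) {
            throw new IllegalArgumentException("no shard results to average");
        }
        int n = shardWeights.get(0).length;
        double[] avg = new double[n];
        for (double[] w : shardWeights) {
            if (w.length != n) {
                throw new IllegalArgumentException("weight vectors differ in length");
            }
            for (int i = 0; i < n; i++) {
                avg[i] += w[i];
            }
        }
        for (int i = 0; i < n; i++) {
            avg[i] /= shardWeights.size();
        }
        return avg;
    }

    /** One iteration: call every shard, always close its client, average the results. */
    static double[] iterate(List<String> shardUrls, ShardClientFactory factory, double[] weights) {
        List<double[]> results = new ArrayList<>();
        for (String url : shardUrls) {
            // try-with-resources closes the client even if train() throws.
            try (ShardClient client = factory.open(url)) {
                results.add(client.train(weights));
            }
        }
        return averageWeights(results);
    }

    public static void main(String[] args) {
        // Fake shards that simply add a fixed offset to each incoming weight.
        ShardClientFactory factory = url -> new ShardClient() {
            @Override
            public double[] train(double[] w) {
                double offset = url.endsWith("1") ? 1.0 : 3.0;
                double[] out = new double[w.length];
                for (int i = 0; i < w.length; i++) {
                    out[i] = w[i] + offset;
                }
                return out;
            }

            @Override
            public void close() {
                System.out.println("closed client for " + url);
            }
        };
        double[] next = iterate(Arrays.asList("shard1", "shard2"), factory, new double[] {0.0, 0.0});
        System.out.println(Arrays.toString(next));
    }
}
```

Opening the client inside try-with-resources ties its lifetime to a single shard call. If the client were cached and reused across iterations instead, it would need to be closed when the stream itself is closed.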