[ 
https://issues.apache.org/jira/browse/SOLR-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343602#comment-16343602
 ] 

Kevin Watters commented on SOLR-11838:
--------------------------------------

I'm very excited to see this integration happening.  [~gus_heck] has been 
working with me on some DL4j projects, in particular training models and 
evaluating them for classification.  I think at a high level there are 3 main 
integration patterns that we could / should consider in Solr.
 # using a model at ingest time to tag / annotate a record going into the 
index.  (primary example would be something like sentiment analysis tagging.)  
This implies the model was trained and saved somewhere.
 # using a Solr index (query) to generate a set of training and test data so 
that DL4j can "fit" the model and train it.  (there might even be a desire for 
some join functionality so you can join together 2 datasets to create ad hoc 
training datasets.)
 # (this is a bit more out there.)  indexing each node of the multi layer 
network / computation graph as a document in the index, and using a query to 
evaluate the output of the model by traversing the documents in the index to 
ultimately come up with a set of relevancy scores for the documents that 
represent the output layer of the network.

I think, perhaps, the most interesting use case is #2.  So basically, the idea 
is you want to define a network  (specify the layers, types of layers, 
activation function, etc.) and then specify a query that matches a set of 
documents in the index that would be used to train that model.  Currently DL4j 
uses "datavec" to handle all the data normalization prior to going into the 
model for training.  That exposes a DataSetIterator.  The DataSetIterator could 
be replaced with an iterator that sits on top of the export handler or even 
just a raw search result.  The general use cases here for pagination would be:
 # iterating the full result set  (presumably multiple times, as the model will 
make multiple passes over the data when training.)
 # generating a random ordering of the dataset being returned
 # excluding a random (but deterministic?) set of documents from the main query 
to provide a holdout testing dataset.
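
A minimal sketch of the three pagination cases above, in plain Java with no 
Solr or DL4j dependencies.  Everything here is an assumption for illustration: 
the class HoldoutSplit, its method names, and the idea of hashing a seed plus 
the document id to decide holdout membership are hypothetical, and in a real 
integration the ids would come from the export handler or a search result 
rather than an in-memory list.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.zip.CRC32;

/** Hypothetical helper; not part of Solr or DL4j. */
final class HoldoutSplit {

    /**
     * Random-looking but deterministic holdout test: the same seed and
     * document id always give the same answer, so the test set is stable
     * across passes and across restarts.
     */
    static boolean isHoldout(String docId, long seed, int holdoutPercent) {
        CRC32 crc = new CRC32();
        crc.update((seed + ":" + docId).getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long, so the remainder is in 0..99.
        return crc.getValue() % 100 < holdoutPercent;
    }

    /**
     * Training ids for one pass (epoch) over the data: holdout documents are
     * excluded, and the rest are shuffled reproducibly for that epoch.
     */
    static List<String> epochOrder(List<String> allIds, long seed,
                                   int holdoutPercent, int epoch) {
        List<String> train = new ArrayList<>();
        for (String id : allIds) {
            if (!isHoldout(id, seed, holdoutPercent)) {
                train.add(id);
            }
        }
        // A different epoch number gives a different, repeatable ordering.
        Collections.shuffle(train, new Random(seed + epoch));
        return train;
    }
}
```

Calling epochOrder(ids, 42L, 20, epoch) once per epoch would walk the same 
training documents each time in a new order, while isHoldout(id, 42L, 20) 
picks out the roughly 20% reserved for testing.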

Keep in mind that network training typically uses both a training dataset and 
a testing dataset.  

The final outcome of this would be a ComputationGraph/MultiLayerNetwork which 
can be serialized by DL4j as a json file, and the other output could/should be 
the evaluation or accuracy scores of the model  (F1, accuracy, and confusion 
matrix.)
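
To make the evaluation outputs concrete, here is a minimal sketch of how 
accuracy and per-class F1 fall out of a confusion matrix.  It is plain Java 
and independent of DL4j, which has its own evaluation support; the class name 
ConfusionCounts and its methods are hypothetical.

```java
/** Hypothetical helper; counts[actual][predicted] over the holdout set. */
final class ConfusionCounts {
    private final int[][] counts;

    ConfusionCounts(int numClasses) {
        counts = new int[numClasses][numClasses];
    }

    /** Records one test example with its true and predicted class index. */
    void add(int actual, int predicted) {
        counts[actual][predicted]++;
    }

    /** Fraction of examples on the diagonal (predicted == actual). */
    double accuracy() {
        long correct = 0;
        long total = 0;
        for (int a = 0; a < counts.length; a++) {
            for (int p = 0; p < counts.length; p++) {
                total += counts[a][p];
                if (a == p) {
                    correct += counts[a][p];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    /** F1 for one class: 2*TP / (2*TP + FP + FN). */
    double f1(int cls) {
        long tp = counts[cls][cls];
        long fp = 0;
        long fn = 0;
        for (int i = 0; i < counts.length; i++) {
            if (i != cls) {
                fp += counts[i][cls]; // other classes predicted as cls
                fn += counts[cls][i]; // cls predicted as another class
            }
        }
        long denom = 2 * tp + fp + fn;
        return denom == 0 ? 0.0 : 2.0 * tp / denom;
    }
}
```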

As per the comments about natives, yes, there are definitely platform-dependent 
parts of DL4j, in particular "nd4j", which can be GPU/CPU, but there are 
also other dependencies on javacv/javacpp.  The javacv/javacpp stuff is really 
only used for image manipulation, as it's the Java binding to OpenCV.  The 
dependency tree for DL4j is rather large, so I think we'll need to take 
care that we're not injecting a bunch of conflicting jar files.  
Perhaps, if we identify the conflicting jar versions. 

 

> explore supporting Deeplearning4j NeuralNetwork models
> ------------------------------------------------------
>
>                 Key: SOLR-11838
>                 URL: https://issues.apache.org/jira/browse/SOLR-11838
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Christine Poerschke
>            Priority: Major
>         Attachments: SOLR-11838.patch
>
>
> [~yuyano] wrote in SOLR-11597:
> bq. ... If we think to apply this to more complex neural networks in the 
> future, we will need to support layers ...
> [~malcorn_redhat] wrote in SOLR-11597:
> bq. ... In my opinion, if this is a route Solr eventually wants to go, I 
> think a better strategy would be to just add a dependency on 
> [Deeplearning4j|https://deeplearning4j.org/] ...
> Creating this ticket for the idea to be explored further (if anyone is 
> interested in exploring it), complementary to and independent of the 
> SOLR-11597 RankNet related effort.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
