This is an automated email from the ASF dual-hosted git repository.

dweiss pushed a commit to branch jira/solr-13105-toMerge
in repository https://gitbox.apache.org/repos/asf/solr.git

commit eb6298091b0705c53b148ebeb149d2b21ddfa652
Author: Cassandra Targett <ctarg...@apache.org>
AuthorDate: Wed Jan 6 21:43:46 2021 -0600

    Remove old doc left behind in branch merge; fix children list to pass the build
---
 solr/solr-ref-guide/src/math-expressions.adoc |   2 +-
 solr/solr-ref-guide/src/vectorization.adoc    | 383 --------------------------
 2 files changed, 1 insertion(+), 384 deletions(-)

diff --git a/solr/solr-ref-guide/src/math-expressions.adoc b/solr/solr-ref-guide/src/math-expressions.adoc
index 3554c90..343696e 100644
--- a/solr/solr-ref-guide/src/math-expressions.adoc
+++ b/solr/solr-ref-guide/src/math-expressions.adoc
@@ -1,5 +1,5 @@
 = Streaming Expressions and Math Expressions
-:page-children: visualization, math-start, loading, search-sample, transform, scalar-math, vector-math, variables, matrix-math, term-vectors, statistics, probability-distributions, simulations, time-series, regression, numerical-analysis, curve-fitting, dsp, machine-learning, computational-geometry
+:page-children: visualization, math-start, loading, search-sample, transform, scalar-math, vector-math, variables, matrix-math, term-vectors, statistics, probability-distributions, simulations, time-series, regression, numerical-analysis, curve-fitting, dsp, machine-learning, computational-geometry, logs
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements. See the NOTICE file
diff --git a/solr/solr-ref-guide/src/vectorization.adoc b/solr/solr-ref-guide/src/vectorization.adoc
deleted file mode 100644
index 26a6f60..0000000
--- a/solr/solr-ref-guide/src/vectorization.adoc
+++ /dev/null
@@ -1,383 +0,0 @@
-= Streams and Vectorization
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements. See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership. The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied. See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-This section of the user guide explores techniques
-for retrieving streams of data from Solr and vectorizing the
-numeric fields.
-
-See the section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> which describes how to
-vectorize text fields.
-
-== Streams
-
-Streaming Expressions has a wide range of stream sources that can be used to
-retrieve data from SolrCloud collections. Math expressions can be used
-to vectorize and analyze the results sets.
-
-Below are some of the key stream sources:
-
-* *`facet`*: Multi-dimensional aggregations are a powerful tool for generating
-co-occurrence counts for categorical data. The `facet` function uses the JSON facet API
-under the covers to provide fast, distributed, multi-dimension aggregations. With math expressions
-the aggregated results can be pivoted into a co-occurance matrix which can be mined for
-correlations and hidden similarities within the data.
-
-* *`random`*: Random sampling is widely used in statistics, probability and machine learning.
-The `random` function returns a random sample of search results that match a
-query. The random samples can be vectorized and operated on by math expressions and the results
-can be used to describe and make inferences about the entire population.
-
-* *`timeseries`*: The `timeseries`
-expression provides fast distributed time series aggregations, which can be
-vectorized and analyzed with math expressions.
-
-* *`knnSearch`*: K-nearest neighbor is a core machine learning algorithm. The `knnSearch`
-function is a specialized knn algorithm optimized to find the k-nearest neighbors of a document in
-a distributed index. Once the nearest neighbors are retrieved they can be vectorized
-and operated on by machine learning and text mining algorithms.
-
-* *`sql`*: SQL is the primary query language used by data scientists. The `sql` function supports
-data retrieval using a subset of SQL which includes both full text search and
-fast distributed aggregations. The result sets can then be vectorized and operated
-on by math expressions.
-
-* *`jdbc`*: The `jdbc` function allows data from any JDBC compliant data source to be combined with
-streams originating from Solr. Result sets from outside data sources can be vectorized and operated
-on by math expressions in the same manner as result sets originating from Solr.
-
-* *`topic`*: Messaging is an important foundational technology for large scale computing. The `topic`
-function provides publish/subscribe messaging capabilities by treating
-SolrCloud as a distributed message queue. Topics are extremely powerful
-because they allow subscription by query. Topics can be use to support a broad set of
-use cases including bulk text mining operations and AI alerting.
-
-* *`nodes`*: Graph queries are frequently used by recommendation engines and are an important
-machine learning tool. The `nodes` function provides fast, distributed, breadth
-first graph traversal over documents in a SolrCloud collection. The node sets collected
-by the `nodes` function can be operated on by statistical and machine learning expressions to
-gain more insight into the graph.
-
-* *`search`*: Ranked search results are a powerful tool for finding the most relevant
-documents from a large document corpus. The `search` expression
-returns the top N ranked search results that match any
-Solr query, including geo-spatial queries. The smaller set of relevant
-documents can then be explored with statistical, machine learning and
-text mining expressions to gather insights about the data set.
-
-== Assigning Streams to Variables
-
-The output of any streaming expression can be set to a variable.
-Below is a very simple example using the `random` function to fetch
-three random samples from collection1. The random samples are returned
-as tuples which contain name/value pairs.
-
-
-[source,text]
-----
-let(a=random(collection1, q="*:*", rows="3", fl="price_f"))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "a": [
-          {
-            "price_f": 0.7927976
-          },
-          {
-            "price_f": 0.060795486
-          },
-          {
-            "price_f": 0.55128294
-          }
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 11
-      }
-    ]
-  }
-}
-----
-
-== Creating a Vector with the col Function
-
-The `col` function iterates over a list of tuples and copies the values
-from a specific column into an array.
-
-The output of the `col` function is an numeric array that can be set to a
-variable and operated on by math expressions.
-
-Below is an example of the `col` function:
-
-[source,text]
-----
-let(a=random(collection1, q="*:*", rows="3", fl="price_f"),
-    b=col(a, price_f))
-----
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "b": [
-          0.42105234,
-          0.85237443,
-          0.7566981
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 9
-      }
-    ]
-  }
-}
-----
-
-== Applying Math Expressions to the Vector
-
-Once a vector has been created any math expression that operates on vectors
-can be applied. In the example below the `mean` function is applied to
-the vector assigned to variable *`b`*.
-
-[source,text]
-----
-let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
-    b=col(a, price_f),
-    c=mean(b))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "c": 0.5016035594638814
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 306
-      }
-    ]
-  }
-}
-----
-
-== Creating Matrices
-
-Matrices can be created by vectorizing multiple numeric fields
-and adding them to a matrix. The matrices can then be operated on by
-any math expression that operates on matrices.
-
-[TIP]
-====
-Note that this section deals with the creation of matrices
-from numeric data. The section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> describes how to build TF-IDF term vector matrices from text fields.
-====
-
-Below is a simple example where four random samples are taken
-from different sub-populations in the data. The `price_f` field of
-each random sample is
-vectorized and the vectors are added as rows to a matrix.
-Then the `sumRows`
-function is applied to the matrix to return a vector containing
-the sum of each row.
-
-[source,text]
-----
-let(a=random(collection1, q="market:A", rows="5000", fl="price_f"),
-    b=random(collection1, q="market:B", rows="5000", fl="price_f"),
-    c=random(collection1, q="market:C", rows="5000", fl="price_f"),
-    d=random(collection1, q="market:D", rows="5000", fl="price_f"),
-    e=col(a, price_f),
-    f=col(b, price_f),
-    g=col(c, price_f),
-    h=col(d, price_f),
-    i=matrix(e, f, g, h),
-    j=sumRows(i))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "j": [
-          154390.1293375,
-          167434.89453,
-          159293.258493,
-          149773.42769,
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 9
-      }
-    ]
-  }
-}
-----
-
-== Facet Co-occurrence Matrices
-
-The `facet` function can be used to quickly perform multi-dimension aggregations of categorical data from
-records stored in a SolrCloud collection. These multi-dimension aggregations can represent co-occurrence
-counts for the values in the dimensions. The `pivot` function can be used to move two dimensional
-aggregations into a co-occurrence matrix. The co-occurrence matrix can then be clustered or analyzed for
-correlations to learn about the hidden connections within the data.
-
-In the example below the `facet` expression is used to generate a two dimensional faceted aggregation.
-The first dimension is the US State that a car was purchased in and the second dimension is the car model.
-This two dimensional facet generates the co-occurrence counts for the number of times a particular car model
-was purchased in a particular state.
-
-
-[source,text]
-----
-facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows=5, count(*))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "state": "NY",
-        "model": "camry",
-        "count(*)": 13342
-      },
-      {
-        "state": "NJ",
-        "model": "accord",
-        "count(*)": 13002
-      },
-      {
-        "state": "NY",
-        "model": "civic",
-        "count(*)": 12901
-      },
-      {
-        "state": "CA",
-        "model": "focus",
-        "count(*)": 12892
-      },
-      {
-        "state": "TX",
-        "model": "f150",
-        "count(*)": 12871
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 171
-      }
-    ]
-  }
-}
-----
-
-The `pivot` function can be used to move the facet results into a co-occurrence matrix. In the example below
-The `pivot` function is used to create a matrix where the rows of the matrix are the US States (state) and the
-columns of the matrix are the car models (model). The values in the matrix are the co-occurrence counts (count(*))
- from the facet results. Once the co-occurrence matrix has been created the US States can be clustered
-by car model, or the matrix can be transposed and car models can be clustered by the US States
-where they were bought.
-
-[source,text]
-----
-let(a=facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows="-1", count(*)),
-    b=pivot(a, state, model, count(*)),
-    c=kmeans(b, 7))
-----
-
-== Latitude / Longitude Vectors
-
-The `latlonVectors` function wraps a list of tuples and parses a lat/lon location field into
-a matrix of lat/long vectors. Each row in the matrix is a vector that contains the lat/long
-pair for the corresponding tuple in the list. The row labels for the matrix are
-automatically set to the `id` field in the tuples. The lat/lon matrix can then be operated
-on by distance-based machine learning functions using the `haversineMeters` distance measure.
-
-The `latlonVectors` function takes two parameters: a list of tuples and a named parameter called
-`field`, which tells the `latlonVectors` function which field to parse the lat/lon
-vectors from.
-
-Below is an example of the `latlonVectors`.
-
-[source,text]
-----
-let(a=random(collection1, q="*:*", fl="id, loc_p", rows="5"),
-    b=latlonVectors(a, field="loc_p"))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "b": [
-          [
-            42.87183530723629,
-            76.74102353397778
-          ],
-          [
-            42.91372904094898,
-            76.72874889228416
-          ],
-          [
-            42.911528804897564,
-            76.70537292977619
-          ],
-          [
-            42.91143870500213,
-            76.74749913047408
-          ],
-          [
-            42.904666267479705,
-            76.73933236046092
-          ],
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 21
-      }
-    ]
-  }
-}
-----
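
For readers skimming the removed page, the two vectorization steps it describes (copying one field of a tuple list into an array, and pivoting two-dimensional facet counts into a co-occurrence matrix) can be sketched roughly in plain Python. This is only an illustration and not Solr's implementation: the Python helpers `col` and `pivot` are hypothetical stand-ins named after the streaming-expression functions, and the sorted label order and zero fill for missing pairs are assumptions. The sample tuples are copied from the responses quoted in the removed file.

```python
from typing import Any

Tuple_ = dict[str, Any]  # one stream tuple: name/value pairs


def col(tuples: list[Tuple_], field: str) -> list[float]:
    """Copy the value of one field from each tuple into a flat numeric list."""
    return [float(t[field]) for t in tuples]


def pivot(tuples: list[Tuple_], row_field: str, col_field: str,
          value_field: str) -> tuple[list[str], list[str], list[list[float]]]:
    """Arrange (row, column, value) tuples into a dense matrix.

    Assumptions of this sketch: labels are sorted alphabetically and
    row/column pairs absent from the input are filled with 0.0.
    """
    row_labels = sorted({t[row_field] for t in tuples})
    col_labels = sorted({t[col_field] for t in tuples})
    row_index = {label: i for i, label in enumerate(row_labels)}
    col_index = {label: j for j, label in enumerate(col_labels)}

    matrix = [[0.0] * len(col_labels) for _ in row_labels]
    for t in tuples:
        matrix[row_index[t[row_field]]][col_index[t[col_field]]] = float(t[value_field])
    return row_labels, col_labels, matrix


# Tuples taken from the `random` response quoted in the removed page.
samples = [{"price_f": 0.7927976}, {"price_f": 0.060795486}, {"price_f": 0.55128294}]
prices = col(samples, "price_f")

# Tuples taken from the `facet` response quoted in the removed page.
facets = [
    {"state": "NY", "model": "camry", "count(*)": 13342},
    {"state": "NJ", "model": "accord", "count(*)": 13002},
    {"state": "NY", "model": "civic", "count(*)": 12901},
    {"state": "CA", "model": "focus", "count(*)": 12892},
    {"state": "TX", "model": "f150", "count(*)": 12871},
]
states, models, counts = pivot(facets, "state", "model", "count(*)")
```

Each row of `counts` corresponds to one state and each column to one car model, which is the shape the removed page passes on to `kmeans`.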