Re: Copy field and regex

Erick Erickson Fri, 08 Dec 2017 12:04:48 -0800

Grouping does _not_ require docValues. It's just that with
docValues=false, the uninverted structure is built on the heap at run
time. When docValues=true, the uninverted structure is written to disk
at index time and MMapped into the OS's memory space rather than the
Java heap.


Second, grouping works fine in distributed mode with a couple of
restrictions; see the reference guide. Collapse/Expand (an alternative
to standard grouping) requires that all the members of a group be on
the same shard.
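
A minimal sketch of the two request styles, assuming Python on the client. The field name (chapter) and the query are borrowed from the quoted message below and are only illustrative, and the Collapse/Expand variant assumes chapter is a single-valued string field. The snippet only builds the query strings; it does not contact Solr.

```python
from urllib.parse import urlencode

# Standard result grouping: one group per distinct indexed value of the field.
grouping_params = {
    "q": "devo_keywords_en:fear",
    "group": "true",
    "group.field": "chapter",
    "group.limit": 3,  # documents returned per group
}

# Collapse/Expand alternative: collapse to one document per chapter,
# then expand to see the other members of each group.
collapse_params = {
    "q": "devo_keywords_en:fear",
    "fq": "{!collapse field=chapter}",
    "expand": "true",
}

grouping_query = urlencode(grouping_params)
collapse_query = urlencode(collapse_params)

print(grouping_query)
print(collapse_query)
```

Either string would be appended to the collection's /select handler URL.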

Right, text fields aren't eligible for docValues, only "simple" types
(string included). If you want to use docValues, I'd recommend doing
the extraction on the client side. You can also put that in an update
component, but that's probably overkill.
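
A minimal sketch of the client-side extraction, assuming Python on the client. It reuses the regex from the fieldType in the quoted message below; the function name chapter_of and the sample document are hypothetical.

```python
import re

# Same pattern as the PatternTokenizerFactory in the quoted schema:
# group 1 captures the book and chapter, e.g. "JHN.3" from "JHN.3.16".
USFM_CHAPTER = re.compile(r"^(\w+\.\d+)\.\d+-*\d*$")


def chapter_of(usfm: str) -> str:
    """Return the book.chapter prefix of a USFM reference."""
    match = USFM_CHAPTER.match(usfm)
    if match is None:
        raise ValueError(f"not a USFM reference: {usfm!r}")
    return match.group(1)


# Populate the chapter field before sending the document to Solr,
# instead of relying on copyField plus analysis.
doc = {"usfm": "MAT.6.33-34", "devo_keywords_en": "fear"}
doc["chapter"] = chapter_of(doc["usfm"])
print(doc["chapter"])
```

With the value computed up front, chapter can be a plain string field with docValues=true, and the copyField is no longer needed.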

Best,
Erick

On Fri, Dec 8, 2017 at 10:51 AM, Bradley Belyeu
<bradley.belyeu@life.church> wrote:
> Ah, thank you Erick & Shawn. That makes perfect sense. And yes when this goes 
> to prod it will be distributed. Good point about docValues and needing a 
> single shard, thanks!
> I’m new to result grouping, so I’m still prototyping that it will work for 
> what I need.
>
> On 12/8/17, 12:00 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
>     I think you're getting confused by seeing the _stored_ data rather
>     than the indexed data. When you return fields in documents, you get
>     the stored data which is a verbatim copy of the input, no analysis
>     done at all. To see what's in the index (and thus what would be
>     grouped on) look at:
>
>     adminUI>>analysis>>(your field) and put some sample values in and see
>     what the regex tokenizer does. NOTE: uncheck the "verbose" box for
>     less clutter.
>     or
>     adminUI>>(select core)>>schema browser
>     or
>     termscomponent
>
>     If you require the stored value to be different, you have a couple of choices:
>     1> change it on the client side before ingestion
>     2> use one of the field-mutating update processor classes
>
>     Most often, people don't bother storing the copyField destination, since
>     the stored value is available in the original. The destination is just
>     used for things like the grouping you're interested in.
>
>     Best,
>     Erick
>
>     On Fri, Dec 8, 2017 at 8:56 AM, Bradley Belyeu
>     <bradley.belyeu@life.church> wrote:
>     > I’m struggling a bit getting a copy field & regex tokenizer to work like I think it should…
>     > I have an open source project I’m just starting out with here: https://github.com/youversion/solrcloud
>     > I have a uniqueKey field USFM defined as:
>     > <field name="usfm" type="string" indexed="true" required="true" stored="true" />
>     > And a USFM will always be in the pattern of 3 characters followed by a period followed by one or more digits followed by another period and finally one or more digits.
>     > Optionally after the final digit there may be a hyphen and one or more further digits.
>     > E.g.: JHN.3.16 or MAT.6.33-34
>     >
>     > I’m wanting to do a result grouping by the first three characters, period, & digit(s). For example, docs with the unique keys JHN.3.16 & JHN.3.17 I would want grouped together.
>     > So my thought was to define another field and then copy the USFM into it and use the regex tokenizer defined as so:
>     >
>     >     <fieldType name="chapter" class="solr.TextField" positionIncrementGap="0">
>     >         <analyzer>
>     >             <tokenizer class="solr.PatternTokenizerFactory" pattern="^(\w+\.\d+)\.\d+-*\d*$" group="1" />
>     >         </analyzer>
>     >     </fieldType>
>     >     <field name="chapter" type="chapter" indexed="true" required="true" stored="true" />
>     >     <copyField source="usfm" dest="chapter" />
>     >
>     > BUT, when I import my data the entire USFM is being stored inside the chapter field. And I get query results that look like:
>     >        {
>     >         "usfm":"MAT.10.1",
>     >         "chapter":"MAT.10.1",
>     >         "devo_keywords_en":"fear",
>     >         "_version_":1586184983451533312},
>     >       {
>     >         "usfm":"MAT.10.10",
>     >         "chapter":"MAT.10.10",
>     >         "devo_keywords_en":"fear",
>     >         "_version_":1586184983451533314},
>     >       {
>     >         "usfm":"MAT.10.11",
>     >         "chapter":"MAT.10.11",
>     >         "devo_keywords_en":"fear",
>     >         "_version_":1586184983451533316},
>     >       {
>     >         "usfm":"MAT.10.12",
>     >         "chapter":"MAT.10.12",
>     >         "devo_keywords_en":"fear",
>     >         "_version_":1586184983451533318}
>     >
>     > It’s probably something simple I’ve missed, but I’ve been banging my head for long enough I thought I’d ask for help.
>     > Thanks in advance!
>
>
