Re: [MarkLogic Dev General] question about search

Jason Hunter Tue, 03 May 2011 14:25:27 -0700

In 4.2 this is the best way.  You could do the call against <refauth> directly 
but you'll get "JamesWang" instead of "James Wang".


In my experience, deployments at scale work best if you develop a system where 
it's easy and automatic to reformat content to support new requirements.  
MarkLogic does a lot to make it possible to query your data as-is, but there's 
a limit to what can be done, and to get maximum performance you'll often want 
to tweak it.  People often do this by adding a <metadata> block at the top 
outside the <main> content.  Some have a "source" database and "compiled" 
database, where the source is the raw data you'll want humans to see, and it 
goes through a transformation step to the compiled database to optimize it for 
the application's deployment.
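
As a rough illustration of such a transformation step, here is a minimal Python sketch (standard library only). The function name compile_article and the exact placement of the <metadata> block are assumptions made for the example, and a real deployment might well do this in XQuery inside MarkLogic instead. It copies each cited author into a <referencedauthor> element, joining the name parts with a space:

```python
import xml.etree.ElementTree as ET


def compile_article(source_xml: str) -> str:
    """Return a 'compiled' copy of an article with a <metadata> block added.

    Hypothetical helper: the name and output layout are illustrative only.
    """
    article = ET.fromstring(source_xml)
    metadata = ET.Element("metadata")
    for refauth in article.iter("refauth"):
        # Join the name parts with a space so the value is "James Wang",
        # not the run-together "JamesWang".
        parts = [refauth.findtext("fname", ""), refauth.findtext("surname", "")]
        name = " ".join(p.strip() for p in parts if p and p.strip())
        if name:
            ET.SubElement(metadata, "referencedauthor").text = name
    # Place the metadata block first, ahead of the original content.
    article.insert(0, metadata)
    return ET.tostring(article, encoding="unicode")
```

The compiled output is what would be loaded into the "compiled" database, with a range index on referencedauthor.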

Imagine, for example, you want to sort articles by title.  If you want to 
ignore leading words "A", "An", and "The", then you probably want to create a 
"sortable-title" element or attribute with the leading word removed or placed 
at the end.  You don't want to dynamically remove those words at query time 
against millions of articles.
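
A minimal sketch of computing such a value at transform time, in Python. The function name sortable_title and the "moved to the end after a comma" format are assumptions for illustration, not a fixed convention:

```python
LEADING_WORDS = ("a", "an", "the")


def sortable_title(title: str) -> str:
    """Move a leading "A", "An", or "The" to the end of the title."""
    words = title.strip().split()
    # Only move the word if something remains to sort on afterwards.
    if len(words) > 1 and words[0].lower() in LEADING_WORDS:
        return " ".join(words[1:]) + ", " + words[0]
    return " ".join(words)
```

The result would be stored once per article, for example in a "sortable-title" element, so the sort at query time is a plain index lookup.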

As another example, if you want to show all articles starting with "R" you 
might want an element or attribute start-letter="r" which makes this lookup a 
simple term list fetch and thus very lightweight.  You can do it without that 
attribute using a range index, but that's a bit less efficient.
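
Deriving that value is a one-liner in the transform step. A hedged Python sketch follows; the helper name start_letter is hypothetical, and skipping leading punctuation is a design choice of this example rather than anything required:

```python
def start_letter(title: str) -> str:
    """Return the first letter or digit of the title, lowercased."""
    for ch in title:
        if ch.isalnum():
            return ch.lower()
    # No letter or digit at all (for example, an empty title).
    return ""
```

Whether this should run on the raw title or on the sortable title (so that "The Raven" files under "r") depends on the application.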

Or let's say you want page counts or word counts on your articles.  That 
shouldn't go in the source that authors see, but it needs to be somewhere so 
the app can use it.  Put it in your transform step.
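
For instance, a word count could be computed roughly like this in Python. This is a sketch that assumes a simple whitespace split is an acceptable definition of a "word"; the function name word_count is made up for the example:

```python
import xml.etree.ElementTree as ET


def word_count(element: ET.Element) -> int:
    """Count whitespace-separated words in all text under the element."""
    # itertext() yields every text node under the element in document order.
    return sum(len(text.split()) for text in element.itertext())
```

The transform would write the result into the metadata block of the compiled document, leaving the source untouched.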

So my advice is treat the addition of a new element to support faster queries 
or extra features (like a word count) as not unusual, plan for it, and make it 
easy and automatic in your system.

-jh-

On May 3, 2011, at 1:56 PM, Helen Chen wrote:

> Hi Jason,
> 
> Your understanding is correct. I tried it using some other data and it seems 
> to work fine.
> 
> The only problem here is that we have a very large set of data, and adding a 
> new element means all the data has to be touched to build the new element 
> for this search. It pretty much means that any time a new requirement like 
> this comes up, I have to change all the data to construct a new element to 
> fit the search.  Is there any function similar to cts:element-values() that 
> works on a field, like cts:field-values()? What I'm thinking is: refauth is 
> a simple element, it only has fname and surname, so if I can create a field 
> which combines the values inside refauth, then this field will serve the 
> same purpose as the new element <referencedauthor>.
> 
> Or is there any other workaround?  The bottom line is that if this is the 
> only way, then I'll change all the data for it.
> 
> Thanks, Helen
> 
> 
> From: Jason Hunter <jhun...@marklogic.com>
> Reply-To: "general@developer.marklogic.com" <general@developer.marklogic.com>
> Date: Tue, 3 May 2011 13:25:24 -0700
> To: "general@developer.marklogic.com" <general@developer.marklogic.com>
> Subject: Re: [MarkLogic Dev General] question about search
> 
> You can use cts:frequency($author) to get the number of times the author was 
> cited.  If the same person might be cited multiple times in an article and 
> you want to count that, you'll want to specify "item-frequency" as an option 
> to the cts:element-values() call.  The default "fragment-frequency" would 
> count several citations in the same article as just one.
> 
> Hopefully I'm understanding what you want.
> 
> -jh-
> 
> On May 3, 2011, at 1:15 PM, Helen Chen wrote:
> 
>> Hi Jason,
>> 
>> cts:element-values() will return the unique list of referencedauthor 
>> values, but it does not tell me which one shows up (or is cited) most.  The 
>> basic idea is: based on how many times each refauth shows up, the search 
>> returns the top 1 or top 5 refauth.
>> 
>> Would a field help with this?
>> 
>> Thanks,
>> Helen
>> 
>> 
>> 
>> From: Jason Hunter <jhun...@marklogic.com>
>> Reply-To: "general@developer.marklogic.com" <general@developer.marklogic.com>
>> Date: Tue, 3 May 2011 13:08:58 -0700
>> To: "general@developer.marklogic.com" <general@developer.marklogic.com>
>> Subject: Re: [MarkLogic Dev General] question about search
>> 
>> Hi Helen,
>> 
>> Add an element like <referencedauthor>James Wang</referencedauthor> in the 
>> document, perhaps in a new metadata block up top.  Put a range index on the 
>> chosen QName of type xs:string.  Then use cts:element-values() to extract 
>> the referenced authors.  You can pass a cts:query call to the function if 
>> you want to limit to just articles matching a query.  This approach will be 
>> fast at scale.  With the content shaped like you have right now, there's not 
>> an optimized way to do this at scale.
>> 
>> -jh-
>> 
>> On May 3, 2011, at 12:13 PM, Helen Chen wrote:
>> 
>>> Hello there,
>>> 
>>> We have article XML in MarkLogic. Each article lists the references that 
>>> it cites. I want to do a search to find out, inside 
>>> /article/back/references/citation/ref/jcite, which author is referenced 
>>> most, or to get a list of the top 5 refauth who show up most in the 
>>> reference sections of the articles.
>>> 
>>> The article structure is like the following:
>>> <article>
>>>     <front>…</front>
>>>     <back>
>>>         <references>
>>>             <citation id="c1">
>>>                 <ref>
>>>                     <jcite>
>>>                         <refauth>
>>>                             <fname>James</fname>
>>>                             <surname>Wang</surname>
>>>                         </refauth>
>>>                         <jtitle>article title</jtitle>
>>>                         <coden>AAA</coden>
>>>                         <issn>1111</issn>
>>>                         <volume>1</volume>
>>>                         <pages>90</pages>
>>>                         <date>2007</date>
>>>                     </jcite>
>>>                 </ref>
>>>             </citation>
>>>         </references>
>>>         <references>
>>>             <citation id="c2">
>>>                 <ref>
>>>                     <jcite>
>>>                         <refauth>
>>>                             <fname>Tom</fname>
>>>                             <surname>Ding</surname>
>>>                         </refauth>
>>>                         <jtitle>my article title</jtitle>
>>>                         <coden>AAB</coden>
>>>                         <issn>1112</issn>
>>>                         <volume>1</volume>
>>>                         <pages>20</pages>
>>>                         <date>2008</date>
>>>                     </jcite>
>>>                 </ref>
>>>             </citation>
>>>         </references>
>>>     </back>
>>> </article>
>>> 
>>> 
>>> Can anyone give me a suggestion on how to do it, or how to start?
>>> 
>>> Thanks, helen
>>> _______________________________________________
>>> General mailing list
>>> General@developer.marklogic.com
>>> http://developer.marklogic.com/mailman/listinfo/general
>> 

