Re: [MarkLogic Dev General] Bulk updates (xqsync vs. mlcp)

2016-06-30 Thread Justin Makeig
Just to wrap up this thread, I've incorrectly and unhelpfully conflated two 
aspects of merging: consolidating stands and getting rid of obsolete fragments. 
When you set a merge timestamp you are only affecting the latter. Regardless of 
the timestamp the database will always consolidate stands for you. Again, the 
docs have very good coverage of this. (Hat tip, Jason Hunter and 
Danny Sokolsky.)

Regardless, I still stand by the recommendation to _not_ use MVCC timestamps as 
general-purpose versioning, mostly for the difficulty in querying and the 
potential to screw something up administratively.
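
To make the querying difficulty concrete, here is a minimal sketch of a point-in-time read. It assumes the merge timestamp has been set so that older fragments are retained and that the caller holds the privilege needed to run at a timestamp. The URI is made up, and the current request timestamp stands in for a timestamp that a real caller would have had to record at update time.

```xquery
xquery version "1.0-ml";

(: Stand-in value: a real caller needs the timestamp of the earlier state,
   which has to be recorded somewhere when the update happens. :)
let $then := xdmp:request-timestamp()
return
  (: Hypothetical URI. The evaluated query sees the database as of $then. :)
  xdmp:eval(
    'fn:doc("/docs/ABC.xml")',
    (),
    <options xmlns="xdmp:eval">
      <timestamp>{$then}</timestamp>
    </options>)
```

Everything inside the eval sees only the database as of $then, so comparing two versions takes one such call per version, each with a timestamp looked up beforehand.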

Sorry for the confusion. 

Justin

--
Justin Makeig
Director, Product Management
MarkLogic
justin.mak...@marklogic.com

> On Jun 30, 2016, at 11:02 AM, Hans Hübner wrote:
> 
> On Thu, Jun 30, 2016 at 6:30 PM, Justin Makeig wrote:
>> The amount of data that we're accumulating by keeping the old versions 
>> around does not bother us.  
> 
> It will bother you if you never allow the database to merge. If you're 
> keeping a small window of history this will work fine. (Though an errant 
> timestamp setting by a config script will delete your history.) If you need 
> to keep the entire history, you will effectively be disabling merging with 
> this strategy, which will certainly land you in trouble. Merges are good; the 
> database needs them to optimize its internal data structures to support fast 
> and consistent ingest and queries.
> 
> Save each version as a separate document. Put all versions of a single 
> document in a collection to represent the "logical" document and give each 
> instance version a unique URI to represent its version number. You could even 
> create a special "latest" collection that contains only the latest version of 
> each document. This will allow you to do queries like, "How many versions of 
> (logical) document ABC.xml do I have?" "What's the latest version of 
> (logical) ABC.xml?" "Run this diff code on latest ABC.xml and its previous 
> version." With timestamps you'll have to know _when_ a document was updated 
> in order to get its previous version. This will require two steps for every 
> query and won't allow you to do any queries across versions, because the 
> older/newer versions don't exist, from the perspective of a query that runs 
> at a single timestamp.  
> 
> Thank you for the concrete architectural advice!  It does not seem to be very 
> bothersome to follow that route, so we will certainly trust you in that it is 
> better than using MVCC timestamps.
> 
> Let me suggest again that the "Time Travel" section in the "Inside Marklogic" 
> document and the section on point-in-time queries in the "Application 
> Developers Guide" be updated to include information on the caveats that you 
> and your colleagues have expressed.  I'm still a bit puzzled by the vehemence 
> that you all put forth into discouraging us from using it.  Are there any 
> other advertised features that can affect the health of a database in a 
> similar way and should thus be avoided?
> 
> Thanks!
> Hans
> 
> -- 
> LambdaWerk GmbH
> Oranienburger Straße 87/89
> 10178 Berlin
> Phone: +49 30 555 7335 0
> Fax: +49 30 555 7335 99
> 
> HRB 169991 B Amtsgericht Charlottenburg
> USt-ID: DE301399951
> Geschäftsführer:  Hans Hübner
> 
> http://lambdawerk.com/ 
> 
> 



___
General mailing list
General@developer.marklogic.com
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general


Re: [MarkLogic Dev General] format:json && extract-document-data

2016-06-30 Thread Rob Szkutak
Since it sounds like you're doing this via the REST API, you may find this 
StackOverflow thread useful: 
http://stackoverflow.com/questions/37986731/extract-document-data-comes-as-xml-string-element-in-json-output

In short, you have to install a content transformation that turns it into JSON 
for you and invoke it with the "transform" parameter (e.g. 
&transform=nameOfTransform).
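
The conversion step inside such a transform could look like the sketch below. It uses json:transform-to-json from MarkLogic's JSON library with the "custom" strategy. The sample element is invented for illustration, and the wiring into an installed REST transform is not shown.

```xquery
xquery version "1.0-ml";

import module namespace json = "http://marklogic.com/xdmp/json"
  at "/MarkLogic/json/json.xqy";

(: Invented sample standing in for one extracted XML element. :)
let $extracted := <title>ZINC OXIDE DOSSIER</title>
(: The "custom" strategy maps element names to JSON property names. :)
let $config := json:config("custom")
return json:transform-to-json($extracted, $config)
```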

Best,
Rob

Rob Szkutak
Senior Consultant
MarkLogic Corporation
rob.szku...@marklogic.com
www.marklogic.com


From: general-boun...@developer.marklogic.com 
[general-boun...@developer.marklogic.com] on behalf of Charles Greer 
[charles.gr...@marklogic.com]
Sent: Thursday, June 30, 2016 1:59 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] format:json && extract-document-data

Hi Stephane,

It must be that your documents are themselves in XML, right?
extract-path normally grabs trees from the persisted document, and so
the nodes extracted from an XML document will be XML.

I wonder whether you can add '/text()' to the end of your extract-path
expressions in order to force them into something that can be serialized
within JSON. That would erase the key names of course.

An alternate approach would be to use bulk search (from a client API)
and use an output transform to render results of each search result into JSON.
(Possible, but I can see why that would not be an appealing solution).

If your documents were JSON, I *think* you'd get the results you are expecting.

Charles Greer
Lead Engineer
MarkLogic Corporation


From: general-boun...@developer.marklogic.com 
[general-boun...@developer.marklogic.com] on behalf of stephane.va...@oecd.org 
[stephane.va...@oecd.org]
Sent: Thursday, June 23, 2016 2:19 AM
To: general@developer.marklogic.com
Subject: [MarkLogic Dev General] format:json && extract-document-data

Hi,

I am trying to include some document data into my search results, using the 
following query options:

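<options xmlns="http://marklogic.com/appservices/search">
  <extract-document-data>
    <extract-path>/language-version/language-version-canonical-model/title</extract-path>
    <extract-path>/language-version/language-version-canonical-model/language</extract-path>
    (…)
  </extract-document-data>
</options>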

Unfortunately, when I ask for JSON format (using the header Accept: 
application/json), the extracted element comes back as “stringified” XML 
instead of being converted into JSON as I would have expected:

{
  "snippet-format": "snippet",
  "total": 564,
  "start": 1,
  "page-length": 10,
  "selected": "include",
  "results": [
    {
      "index": 1,
      "uri": "ENV/CHEM/NANO(2015)22/ANN5/2",
      "path": "fn:doc(\"ENV/CHEM/NANO(2015)22/ANN5/2\")",
      (…)
      "extracted": {
        "kind": "element",
        "content": [
          "En",
          "ZINC OXIDE DOSSIERANNEX 5",
          "ENV/CHEM/NANO(2015)22/ANN5",
          "2",
          "2015-04-16T00:00:00.000+02:00",
          "media",
          "fish",
          (…)
        ]
      }
    },

Anything I am doing wrong? Are there any configuration options I could tweak to 
enforce the conversion of XML to JSON?

Cheers,
Stéphane Varin


Re: [MarkLogic Dev General] format:json && extract-document-data

2016-06-30 Thread Charles Greer
Hi Stephane,

It must be that your documents are themselves in XML, right?
extract-path normally grabs trees from the persisted document, and so
the nodes extracted from an XML document will be XML.

I wonder whether you can add '/text()' to the end of your extract-path
expressions in order to force them into something that can be serialized
within JSON. That would erase the key names of course.
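
A sketch of that first suggestion, run server-side through the Search API rather than over REST. The search term is invented, the paths are the ones from the original question with '/text()' appended, and whether the resulting text nodes serialize as hoped is untested here.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: "zinc" is an invented search term used only for illustration. :)
search:search("zinc",
  <options xmlns="http://marklogic.com/appservices/search">
    <extract-document-data selected="include">
      <!-- text() selects the text nodes, so the element names are dropped -->
      <extract-path>/language-version/language-version-canonical-model/title/text()</extract-path>
      <extract-path>/language-version/language-version-canonical-model/language/text()</extract-path>
    </extract-document-data>
  </options>)
```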

An alternate approach would be to use bulk search (from a client API)
and use an output transform to render results of each search result into JSON.
(Possible, but I can see why that would not be an appealing solution).

If your documents were JSON, I *think* you'd get the results you are expecting.

Charles Greer
Lead Engineer
MarkLogic Corporation


From: general-boun...@developer.marklogic.com 
[general-boun...@developer.marklogic.com] on behalf of stephane.va...@oecd.org 
[stephane.va...@oecd.org]
Sent: Thursday, June 23, 2016 2:19 AM
To: general@developer.marklogic.com
Subject: [MarkLogic Dev General] format:json && extract-document-data

Hi,

I am trying to include some document data into my search results, using the 
following query options:

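<options xmlns="http://marklogic.com/appservices/search">
  <extract-document-data>
    <extract-path>/language-version/language-version-canonical-model/title</extract-path>
    <extract-path>/language-version/language-version-canonical-model/language</extract-path>
    (…)
  </extract-document-data>
</options>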

Unfortunately, when I ask for JSON format (using the header Accept: 
application/json), the extracted element comes back as “stringified” XML 
instead of being converted into JSON as I would have expected:

{
  "snippet-format": "snippet",
  "total": 564,
  "start": 1,
  "page-length": 10,
  "selected": "include",
  "results": [
    {
      "index": 1,
      "uri": "ENV/CHEM/NANO(2015)22/ANN5/2",
      "path": "fn:doc(\"ENV/CHEM/NANO(2015)22/ANN5/2\")",
      (…)
      "extracted": {
        "kind": "element",
        "content": [
          "En",
          "ZINC OXIDE DOSSIERANNEX 5",
          "ENV/CHEM/NANO(2015)22/ANN5",
          "2",
          "2015-04-16T00:00:00.000+02:00",
          "media",
          "fish",
          (…)
        ]
      }
    },

Anything I am doing wrong? Are there any configuration options I could tweak to 
enforce the conversion of XML to JSON?

Cheers,
Stéphane Varin


Re: [MarkLogic Dev General] Bulk updates (xqsync vs. mlcp)

2016-06-30 Thread Hans Hübner
On Thu, Jun 30, 2016 at 6:30 PM, Justin Makeig wrote:

> The amount of data that we're accumulating by keeping the old versions
> around does not bother us.
>
>
> It will bother you if you never allow the database to merge. If you're
> keeping a small window of history this will work fine. (Though an errant
> timestamp setting by a config script will delete your history.) If you need
> to keep the entire history, you will effectively be disabling merging with
> this strategy, which will certainly land you in trouble. Merges are good;
> the database needs them to optimize its internal data structures to support
> fast and consistent ingest and queries.
>
> Save each version as a separate document. Put all versions of a single
> document in a collection to represent the "logical" document and give each
> instance version a unique URI to represent its version number. You could
> even create a special "latest" collection that contains only the latest
> version of each document. This will allow you to do queries like, "How many
> versions of (logical) document ABC.xml do I have?" "What's the latest
> version of (logical) ABC.xml?" "Run this diff code on latest ABC.xml and
> its previous version." With timestamps you'll have to know _when_ a
> document was updated in order to get its previous version. This will
> require two steps for every query and won't allow you to do any queries
> across versions, because the older/newer versions don't exist, from the
> perspective of a query that runs at a single timestamp.
>

Thank you for the concrete architectural advice!  It does not seem to be
very bothersome to follow that route, so we will certainly trust you in
that it is better than using MVCC timestamps.

Let me suggest again that the "Time Travel" section in the "Inside
Marklogic" document and the section on point-in-time queries in the
"Application Developers Guide" be updated to include information on the
caveats that you and your colleagues have expressed.  I'm still a bit
puzzled by the vehemence that you all put forth into discouraging us from
using it.  Are there any other advertised features that can affect the
health of a database in a similar way and should thus be avoided?

Thanks!
Hans

-- 
LambdaWerk GmbH
Oranienburger Straße 87/89
10178 Berlin
Phone: +49 30 555 7335 0
Fax: +49 30 555 7335 99

HRB 169991 B Amtsgericht Charlottenburg
USt-ID: DE301399951
Geschäftsführer:  Hans Hübner

http://lambdawerk.com/


Re: [MarkLogic Dev General] Bulk updates (xqsync vs. mlcp)

2016-06-30 Thread Justin Makeig
> The amount of data that we're accumulating by keeping the old versions around 
> does not bother us.

It will bother you if you never allow the database to merge. If you're keeping 
a small window of history this will work fine. (Though an errant timestamp 
setting by a config script will delete your history.) If you need to keep the 
entire history, you will effectively be disabling merging with this strategy, 
which will certainly land you in trouble. Merges are good; the database needs 
them to optimize its internal data structures to support fast and consistent 
ingest and queries.

Save each version as a separate document. Put all versions of a single document 
in a collection to represent the "logical" document and give each instance 
version a unique URI to represent its version number. You could even create a 
special "latest" collection that contains only the latest version of each 
document. This will allow you to do queries like, "How many versions of 
(logical) document ABC.xml do I have?" "What's the latest version of (logical) 
ABC.xml?" "Run this diff code on latest ABC.xml and its previous version." With 
timestamps you'll have to know _when_ a document was updated in order to get 
its previous version. This will require two steps for every query and won't 
allow you to do any queries across versions, because the older/newer versions 
don't exist, from the perspective of a query that runs at a single timestamp.
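
A minimal sketch of this layout, under stated assumptions: the URI scheme ("/ABC.xml/v3"), the "logical:" collection prefix, the "latest" collection name, and the sample content are all invented for illustration. The version counter is naive and does not guard against two writers saving the same logical document at once.

```xquery
xquery version "1.0-ml";

(: Assumed name of the collection holding only the newest versions. :)
declare variable $LATEST := "latest";

(: Collection that groups every version of one logical document. :)
declare function local:logical-collection($name as xs:string) as xs:string
{
  fn:concat("logical:", $name)
};

(: "How many versions of (logical) document ABC.xml do I have?" :)
declare function local:version-count($name as xs:string) as xs:integer
{
  xdmp:estimate(
    cts:search(fn:doc(), cts:collection-query(local:logical-collection($name))))
};

(: "What's the latest version of (logical) ABC.xml?" :)
declare function local:latest($name as xs:string) as document-node()?
{
  cts:search(fn:doc(),
    cts:and-query((
      cts:collection-query(local:logical-collection($name)),
      cts:collection-query($LATEST))))[1]
};

(: Store $content as the next version and move the "latest" marker to it. :)
declare function local:save-version($name as xs:string, $content as node())
as empty-sequence()
{
  let $next := local:version-count($name) + 1
  let $previous := local:latest($name)
  return (
    if (fn:exists($previous))
    then xdmp:document-remove-collections(xdmp:node-uri($previous), $LATEST)
    else (),
    xdmp:document-insert(
      fn:concat("/", $name, "/v", $next),
      $content,
      xdmp:default-permissions(),
      (local:logical-collection($name), $LATEST))
  )
};

local:save-version("ABC.xml", <config><node id="1"/></config>)
```

Because every version is an ordinary document, the count and latest lookups run as normal queries at the current timestamp, and a diff only needs the documents at two adjacent version URIs.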

Justin

On Jun 29, 2016, at 8:49 PM, Hans Hübner <hans.hueb...@lambdawerk.com> wrote:

On Wed, Jun 29, 2016 at 11:29 PM, Danny Sokolsky <danny.sokol...@marklogic.com> wrote:
It might be tempting to treat point-in-time queries for generic versioning, but 
it is usually not what you want.

Does that help to clarify?

Thanks, Danny, this helps.  In our use case, we have thousands of relatively 
complex trees of nodes, and the configuration of each tree changes over time, 
when new data is inserted into the database.  In order to make old 
configurations of each tree available for inspection, we use the MVCC 
point-in-time rollback feature of our current database system to recover 
previous database states and visualize them.  This is merely a diagnostic 
feature, but given the relative complexity of the connections between the tree 
nodes, it is helpful to be able to visualize the changes to each tree that 
happened when new data was inserted.

The amount of data that we're accumulating by keeping the old versions around 
does not bother us.  This database is a special purpose database tied to a 
particular application, and it won't be used to insert random other documents.  
It thus seems to me that we'll be fine with using the MVCC feature for our 
history visualization for now.  If we decide that the space overhead is 
prohibitive, we can always adjust the merge timestamp, trading off history 
depth against database space used.

It would be helpful to have the tradeoffs that one has to make when using the 
"Time Travel" feature be listed in the documentation.

-Hans

--
LambdaWerk GmbH
Oranienburger Straße 87/89
10178 Berlin
Phone: +49 30 555 7335 0
Fax: +49 30 555 7335 99

HRB 169991 B Amtsgericht Charlottenburg
USt-ID: DE301399951
Geschäftsführer:  Hans Hübner

http://lambdawerk.com/




Re: [MarkLogic Dev General] word-query including punctuation characters

2016-06-30 Thread Wissam Asfahani (TSO GB)
Using fields won't be an option for our use case, but arranging things to use 
value queries may be.

Is it possible to re-classify these characters as symbols or words, without 
using field tokenizer overrides? For example, by modifying the tokenizer.xml 
file?

Wissam

-----Original Message-----
From: general-boun...@developer.marklogic.com 
[mailto:general-boun...@developer.marklogic.com] On Behalf Of Mary Holstege
Sent: 29 June 2016 17:42
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] word-query including punctuation characters

On Wed, 29 Jun 2016 08:06:35 -0700, Wissam Asfahani (TSO GB) wrote:

> Good afternoon,
>
> We are having some issues estimating the number of documents when
> performing word queries containing punctuation characters.
>
> I have attached 4 sample documents. When using the below query, the
> estimate returns 3 and the count 1.
>
> Are there any db configuration settings we can use to ensure a more
> accurate estimate result?
>
>
> let $query := cts:word-query("4µ", ("exact"), 2)
>
> return
>   (
> xdmp:estimate(cts:search(fn:doc(), $query)),
> fn:count(cts:search(fn:doc(), $query))
>   )
>
>
> Wissam Asfahani
> XML Developer
>

Punctuation is not indexed in the word query indexes. An exact unwildcarded 
*value* query will consider punctuation, so if you can arrange things so that 
you can use a value query, that could be a solution. If it is just this 
character and searching for it in this way is confined to identifiable parts of 
the document, you could use field tokenizer overrides to redefine µ as a word 
or symbol character for that field. But it looks like it is being classified 
as a punctuation mark in error: it should be classified as a letter character 
anyway since it is listed as Ll in the Unicode tables.
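
A sketch of the value-query variant of the original test. The element name "dose" is hypothetical; substitute whichever element holds the whole value. xdmp:estimate is resolved from the indexes alone, while fn:count filters each candidate document, which is why the two numbers can differ.

```xquery
xquery version "1.0-ml";

(: "dose" is a hypothetical element whose entire text content is "4µ". :)
let $query := cts:element-value-query(xs:QName("dose"), "4µ", "exact")
return
  (
    (: index-only answer :)
    xdmp:estimate(cts:search(fn:doc(), $query)),
    (: filtered answer :)
    fn:count(cts:search(fn:doc(), $query))
  )
```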

//Mary