Re: [MarkLogic Dev General] Optimal strategy for finding duplicate values in database

2013-02-01 Thread Ryan Dew
Since you have a range index, I believe you can do something like this:

cts:element-attribute-values(
  xs:QName('my:element'), xs:QName('my:attribute')
)[cts:frequency(.) gt 1]

You would need a second query to actually retrieve the documents with the
duplicate ids, but this is still probably much more efficient.
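
For example, an untested sketch of the full two-step version (assuming the
element-attribute range index from your post is in place):

(: step 1: pull the duplicate values straight from the range index;
   step 2: fetch the documents that carry them :)
let $dupes :=
  cts:element-attribute-values(
    xs:QName('my:element'), xs:QName('my:attribute')
  )[cts:frequency(.) gt 1]
return
  cts:search(fn:collection(),
    cts:element-attribute-range-query(
      xs:QName('my:element'), xs:QName('my:attribute'),
      '=', $dupes))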

-Ryan Dew


On Fri, Feb 1, 2013 at 7:44 PM, David Sewell  wrote:

> Given a database with lots of files containing attribute values that are
> supposed to be unique across the database, is there an optimal
> MarkLogic-ish way to check for duplicates?
>
> One traditional approach to finding non-distinct values performs
> terribly:
>
>    let $values := collection()//my:element/@my:attribute
>    for $value in distinct-values($values)
>    return $value[count($values[. = $value]) > 1]
>
> (where by "terribly" I mean on the order of 10 seconds elapsed time for
> 5000 values on my system). Leveraging an element-attribute range index
> and running cts:search() on the distinct values was somewhat better, but
> not enough.
>
> By far the most performant approach I have found is to iterate over a
> sorted sequence of values, simulating a Unix "sort < file | uniq -d",
> like so:
>
> let $ordered_values :=
>for $v in collection()//my:element/@my:attribute
>order by $v
>return $v
> for $val at $pos in $ordered_values
> return
>   if ($val eq $ordered_values[$pos - 1])
>   then $val
>   else ()
>
> where "performant" means around 0.5 seconds for 10 values.
>
> Is this the best approach? Given that the attributes in question are in
> an element-attribute range index, is there another strategy worth
> trying?
>
> David
>
> --
> David Sewell, Editorial and Technical Manager
> ROTUNDA, The University of Virginia Press
> PO Box 400314, Charlottesville, VA 22904-4314 USA
> Email: dsew...@virginia.edu   Tel: +1 434 924 9973
> Web: http://rotunda.upress.virginia.edu/
> ___
> General mailing list
> General@developer.marklogic.com
> http://developer.marklogic.com/mailman/listinfo/general
>
___
General mailing list
General@developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general


[MarkLogic Dev General] Optimal strategy for finding duplicate values in database

2013-02-01 Thread David Sewell
Given a database with lots of files containing attribute values that are 
supposed to be unique across the database, is there an optimal 
MarkLogic-ish way to check for duplicates?

One traditional approach to finding non-distinct values performs 
terribly:

   let $values := collection()//my:element/@my:attribute
   for $value in distinct-values($values)
   return $value[count($values[. = $value]) > 1]

(where by "terribly" I mean on the order of 10 seconds elapsed time for 
5000 values on my system). Leveraging an element-attribute range index 
and running cts:search() on the distinct values was somewhat better, but 
not enough.

By far the most performant approach I have found is to iterate over a
sorted sequence of values, simulating a Unix "sort < file | uniq -d",
like so:

let $ordered_values :=
   for $v in collection()//my:element/@my:attribute
   order by $v
   return $v
for $val at $pos in $ordered_values
return
  if ($val eq $ordered_values[$pos - 1])
  then $val
  else ()

where "performant" means around 0.5 seconds for 10 values.

Is this the best approach? Given that the attributes in question are in
an element-attribute range index, is there another strategy worth
trying?

David

-- 
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville, VA 22904-4314 USA
Email: dsew...@virginia.edu   Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/
___
General mailing list
General@developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general


Re: [MarkLogic Dev General] Uneven load in 3 node cluster

2013-02-01 Thread Michael Blakeley
Which OS is this?

The "Hung..." messages mean that the OS was not letting MarkLogic do anything 
for N seconds. Sometimes that means memory is stressed by either swapping or 
fragmentation. Sometimes it means disk I/O capacity is overloaded. Hardware 
problems are also a possibility. Look into these areas.

If you don't have file-log-level=debug already, set that. It's in the group 
settings in the admin UI. You may see some interesting new information.

The "Hung..." messages fit nicely into the erratic load. If the database on one 
host is blocked by the OS, the other two hosts will have to wait until it it 
comes back before advancing timestamps. So any updates will have to wait for 
that host to come back. Queries that need results from forests on the blocked 
host will have to wait, too.

You don't have to worry about the config files differing from host to host 
within a cluster. The cluster takes care of that.

The CPF setup sounds odd to me. Normally you'd let CPF manage the state, and 
wouldn't need that scheduled task. I don't see how the scheduled task would 
reduce load, at least not over the long haul. Maybe that's the idea? You're 
trying to maintain insert performance and then run CPF during less busy times?

-- Mike

On 1 Feb 2013, at 05:28, Miguel Rodríguez González  
wrote:

> Hi all,
> we are using CPF for post-processing a set of documents that we load via 
> content-pump into a 3-node cluster (version 6.0-2). 
> When we do, we experience an uneven load on one of the servers (it hangs 
> every now and then, while the other 2 seem to be waiting for more work), and 
> so far we have not managed to get a grip on what could be wrong.
> 
> In short, this is the process we follow:
> 
> - the ETL creates the XML files (around 40 million docs).
> - content-pump pushes the documents into MarkLogic (10 threads with 100 
> documents per transaction).
> - a CPF pipeline adds some collections to the uploaded documents.
> 
> These are the steps of the CPF pipeline:
> - Creation or update of a document changes the document status to 
> "unprocessed". This is saved in a document property.
> - Every 2 minutes, a scheduled task picks up a batch of 50k documents and 
> changes their state to "processing" (here we spawn 50 batches of 1k 
> documents to get 50 transactions).
> * we opted for using a scheduled task instead of relying solely on CPF, 
> because the servers were choking on high volume.
> - The state change triggers CPF (on-state-change event) and the document 
> receives its collections after a query. 
> - Once the collections are set, the status is changed to "done".
> 
> We did verify that the 3 nodes have the same configuration. To do so, we 
> checked the following files:
> 
> - assignments.xml 
> - clusters.xml 
> - databases.xml 
> - groups.xml 
> - hosts.xml 
> - server.xml (it has 2 obvious differences: the host-id and the ssl private 
> key)
> 
> The only difference between the 3 of them is the memory. These are the specs:
> - CPU: 2x X5650, 6 cores, in total 12 cores
> - MEM: 48 GB ( 64 GB on the third one)
> - DISK: 6x 600 GB 15K in RAID 10 config
> 
> Attached to this email there are 6 pictures, which clearly show the problem 
> we are facing:
> - System load (system load, 5 minutes) for each of the 3 nodes
> - CPU usage on a 100% scale, again for the 3 boxes
> 
> On the 3rd machine we see these warnings every time the CPU is being 
> hogged (Error.log):
> 
> 2013-02-01 00:02:01.327 Warning: Hung 65 sec
> 2013-02-01 00:03:19.243 Warning: Hung 54 sec
> 2013-02-01 00:04:00.802 Warning: Hung 41 sec
> 2013-02-01 00:06:40.061 Warning: Hung 130 sec
> 
> And some connection losses/timeouts on the other 2 machines of the cluster:
> 
> 2013-02-01 00:01:08.567 Info: Disconnecting from domestic host 
> ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not 
> responded for 30 seconds.
> 2013-02-01 00:02:54.634 Info: Disconnecting from domestic host 
> ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not 
> responded for 30 seconds.
> 2013-02-01 00:03:50.673 Info: Disconnecting from domestic host 
> ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not 
> responded for 30 seconds.
> 2013-02-01 00:05:01.473 Info: Disconnecting from domestic host 
> ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not 
> responded for 30 seconds.
> 
> 
> Could you please provide advice?
> 
> Miguel Rodríguez
> Lead Developer 
> E mrgonza...@nl.swets.com
> I www.swets.com
>  
> 
> ___
> General mailing list
> General@developer.marklogic.com
> http://developer.marklogic.com/mailman/listinfo/general
> 

___
General mailing list
General@developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general


Re: [MarkLogic Dev General] General Digest, Vol 104, Issue 1

2013-02-01 Thread Ashwini Kumar
Thanks, Damon,

Reading more closely, it turns out that both the 32-bit and 64-bit versions of
glibc are required on CentOS. So all I had to run was

yum install glibc.i686

That did the trick!

Regards,
Ashwini




> Date: Fri, 1 Feb 2013 09:40:52 -0500
> From: Ashwini Kumar 
> Subject: [MarkLogic Dev General] Cannot install MarkLogic 6 on Centos
> 6.0
> To: general@developer.marklogic.com
>
> Hi,
>
> I just downloaded MarkLogic 6.0 and was trying to install it on CentOS 6.0,
> which is documented to be supported.
>
> When I run the rpm, I get:
>
> [root@eeepc installbins]# rpm -i MarkLogic-6.0-2.1.x86_64.rpm
> error: Failed dependencies:
> lsb is needed by MarkLogic-6.0-2.1.x86_64
> gdb is needed by MarkLogic-6.0-2.1.x86_64
> libc.so.6(GLIBC_2.4) is needed by MarkLogic-6.0-2.1.x86_64
>
>
> CentOS 6.0 ships with glibc 2.12:
>
> [root@eeepc installbins]# yum list installed
> glib2.x86_64         2.22.5-5.el6   @anaconda-CentOS-201106060106.x86_64/6.0
> glibc.x86_64         2.12-1.7.el6   @anaconda-CentOS-201106060106.x86_64/6.0
> glibc-common.x86_64  2.12-1.7.el6   @anaconda-CentOS-201106060106.x86_64/6.0
>
> I downloaded both glibc and glibc-common 2.4, but was unsuccessful in
> installing them.
>
> [root@eeepc installbins]# yum install glibc-2.4-4.x86_64.rpm
> Loaded plugins: fastestmirror
> Loading mirror speeds from cached hostfile
>  * base: mirror.atlanticmetro.net
>  * extras: mirror.trouble-free.net
>  * updates: mirror.cisp.com
> Setting up Install Process
> Examining glibc-2.4-4.x86_64.rpm: glibc-2.4-4.x86_64
> glibc-2.4-4.x86_64.rpm: does not update installed package.
> Error: Nothing to do
>
>
>
> [root@eeepc installbins]# rpm -i glibc-common-2.4-4.x86_64.rpm
> warning: glibc-common-2.4-4.x86_64.rpm: Header V3 DSA/SHA1 Signature, key
> ID 4f2a6fd2: NOKEY
> error: Failed dependencies:
> glibc > 2.4 conflicts with glibc-common-2.4-4.x86_64
>
> So, while I am surprised that glibc is not backward compatible, even if we
> succeeded in replacing version 2.12 of glibc with 2.4, it may not be
> good for the operating system as a whole.
>
>
> Please help me install MarkLogic on CentOS 6.0. Is there a VM image I
> can download from somewhere?
>
> Thanks and Regards,
> Ashwini
>
> --
>
> Message: 4
> Date: Fri, 1 Feb 2013 14:51:33 +
> From: Damon Feldman 
> Subject: Re: [MarkLogic Dev General] Cannot install MarkLogic 6 on
> Centos 6.0
> To: MarkLogic Developer Discussion 
>
> Ashwini,
>
> There are some additional libraries that are required (notably, both the
> 32-bit and 64-bit versions of glibc). Please refer to the install
> instructions at: http://docs.marklogic.com/guide/installation.
>
> Yours,
> Damon
>
>
> From: general-boun...@developer.marklogic.com [mailto:
> general-boun...@developer.marklogic.com] On Behalf Of Ashwini Kumar
> Sent: Friday, February 01, 2013 9:41 AM
> To: general@developer.marklogic.com
> Subject: [MarkLogic Dev General] Cannot install MarkLogic 6 on Centos 6.0
>
> Hi,
>
> I just downloaded MarkLogic 6.0 and was trying to install it on CentOS 6.0,
> which is documented to be supported.
>
> When I run the rpm, I get:
>
> [root@eeepc installbins]# rpm -i MarkLogic-6.0-2.1.x86_64.rpm
> error: Failed dependencies:
> lsb is needed by MarkLogic-6.0-2.1.x86_64
> gdb is needed by MarkLogic-6.0-2.1.x86_64
> libc.so.6(GLIBC_2.4) is needed by MarkLogic-6.0-2.1.x86_64
>
>
> CentOS 6.0 ships with glibc 2.12:
>
> [root@eeepc installbins]# yum list installed
> glib2.x86_64         2.22.5-5.el6   @anaconda-CentOS-201106060106.x86_64/6.0
> glibc.x86_64         2.12-1.7.el6   @anaconda-CentOS-201106060106.x86_64/6.0
> glibc-common.x86_64  2.12-1.7.el6   @anaconda-CentOS-201106060106.x86_64/6.0
>
> I downloaded both glibc and glibc-common 2.4, but was unsuccessful in
> installing them.
>
> [root@eeepc installbins]# yum install glibc-2.4-4.x86_64.rpm
> Loaded plugins: fastestmirror
> Loading mirror speeds from cached hostfile
>  * base: mirror.atlanticmetro.net
>

Re: [MarkLogic Dev General] Cannot install MarkLogic 6 on Centos 6.0

2013-02-01 Thread Damon Feldman
Ashwini,

There are some additional libraries that are required (notably, both the 
32-bit and 64-bit versions of glibc). Please refer to the install 
instructions at: http://docs.marklogic.com/guide/installation.

Yours,
Damon


From: general-boun...@developer.marklogic.com 
[mailto:general-boun...@developer.marklogic.com] On Behalf Of Ashwini Kumar
Sent: Friday, February 01, 2013 9:41 AM
To: general@developer.marklogic.com
Subject: [MarkLogic Dev General] Cannot install MarkLogic 6 on Centos 6.0

Hi,

I just downloaded MarkLogic 6.0 and was trying to install it on CentOS 6.0, 
which is documented to be supported.

When I run the rpm, I get:

[root@eeepc installbins]# rpm -i MarkLogic-6.0-2.1.x86_64.rpm
error: Failed dependencies:
lsb is needed by MarkLogic-6.0-2.1.x86_64
gdb is needed by MarkLogic-6.0-2.1.x86_64
libc.so.6(GLIBC_2.4) is needed by MarkLogic-6.0-2.1.x86_64


CentOS 6.0 ships with glibc 2.12:

[root@eeepc installbins]# yum list installed
glib2.x86_64         2.22.5-5.el6   @anaconda-CentOS-201106060106.x86_64/6.0
glibc.x86_64         2.12-1.7.el6   @anaconda-CentOS-201106060106.x86_64/6.0
glibc-common.x86_64  2.12-1.7.el6   @anaconda-CentOS-201106060106.x86_64/6.0

I downloaded both glibc and glibc-common 2.4, but was unsuccessful in 
installing them.

[root@eeepc installbins]# yum install glibc-2.4-4.x86_64.rpm
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirror.atlanticmetro.net
 * extras: mirror.trouble-free.net
 * updates: mirror.cisp.com
Setting up Install Process
Examining glibc-2.4-4.x86_64.rpm: glibc-2.4-4.x86_64
glibc-2.4-4.x86_64.rpm: does not update installed package.
Error: Nothing to do



[root@eeepc installbins]# rpm -i glibc-common-2.4-4.x86_64.rpm
warning: glibc-common-2.4-4.x86_64.rpm: Header V3 DSA/SHA1 Signature, key ID 
4f2a6fd2: NOKEY
error: Failed dependencies:
glibc > 2.4 conflicts with glibc-common-2.4-4.x86_64

So, while I am surprised that glibc is not backward compatible, even if we 
succeeded in replacing version 2.12 of glibc with 2.4, it may not be good 
for the operating system as a whole.


Please help me install MarkLogic on CentOS 6.0. Is there a VM image I can 
download from somewhere?

Thanks and Regards,
Ashwini

___
General mailing list
General@developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general


[MarkLogic Dev General] Cannot install MarkLogic 6 on Centos 6.0

2013-02-01 Thread Ashwini Kumar
Hi,

I just downloaded MarkLogic 6.0 and was trying to install it on CentOS 6.0,
which is documented to be supported.

When I run the rpm, I get:

[root@eeepc installbins]# rpm -i MarkLogic-6.0-2.1.x86_64.rpm
error: Failed dependencies:
lsb is needed by MarkLogic-6.0-2.1.x86_64
gdb is needed by MarkLogic-6.0-2.1.x86_64
libc.so.6(GLIBC_2.4) is needed by MarkLogic-6.0-2.1.x86_64


CentOS 6.0 ships with glibc 2.12:

[root@eeepc installbins]# yum list installed
glib2.x86_64         2.22.5-5.el6   @anaconda-CentOS-201106060106.x86_64/6.0
glibc.x86_64         2.12-1.7.el6   @anaconda-CentOS-201106060106.x86_64/6.0
glibc-common.x86_64  2.12-1.7.el6   @anaconda-CentOS-201106060106.x86_64/6.0

I downloaded both glibc and glibc-common 2.4, but was unsuccessful in
installing them.

[root@eeepc installbins]# yum install glibc-2.4-4.x86_64.rpm
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirror.atlanticmetro.net
 * extras: mirror.trouble-free.net
 * updates: mirror.cisp.com
Setting up Install Process
Examining glibc-2.4-4.x86_64.rpm: glibc-2.4-4.x86_64
glibc-2.4-4.x86_64.rpm: does not update installed package.
Error: Nothing to do



[root@eeepc installbins]# rpm -i glibc-common-2.4-4.x86_64.rpm
warning: glibc-common-2.4-4.x86_64.rpm: Header V3 DSA/SHA1 Signature, key
ID 4f2a6fd2: NOKEY
error: Failed dependencies:
glibc > 2.4 conflicts with glibc-common-2.4-4.x86_64

So, while I am surprised that glibc is not backward compatible, even if we
succeeded in replacing version 2.12 of glibc with 2.4, it may not be
good for the operating system as a whole.


Please help me install MarkLogic on CentOS 6.0. Is there a VM image I
can download from somewhere?

Thanks and Regards,
Ashwini
___
General mailing list
General@developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general


[MarkLogic Dev General] Uneven load in 3 node cluster

2013-02-01 Thread Miguel Rodríguez González
Hi all,
we are using CPF for post-processing a set of documents that we load via 
content-pump into a 3-node cluster (version 6.0-2). 
When we do, we experience an uneven load on one of the servers (it hangs 
every now and then, while the other 2 seem to be waiting for more work), and 
so far we have not managed to get a grip on what could be wrong.

In short, this is the process we follow:

- the ETL creates the XML files (around 40 million docs).
- content-pump pushes the documents into MarkLogic (10 threads with 100 
documents per transaction).
- a CPF pipeline adds some collections to the uploaded documents.

These are the steps of the CPF pipeline:
- Creation or update of a document changes the document status to 
"unprocessed". This is saved in a document property.
- Every 2 minutes, a scheduled task picks up a batch of 50k documents and 
changes their state to "processing" (here we spawn 50 batches of 1k documents 
to get 50 transactions; see the sketch after this list).
* we opted for using a scheduled task instead of relying solely on CPF, 
because the servers were choking on high volume.
- The state change triggers CPF (on-state-change event) and the document 
receives its collections after a query. 
- Once the collections are set, the status is changed to "done".
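
For reference, here is a rough, untested sketch of what that scheduled task 
does (the module path, property element name, and external variable name are 
placeholders, not our exact code):

xquery version "1.0-ml";
(: hypothetical sketch: claim up to 50k "unprocessed" documents and flip
   them to "processing" in 50 spawned batches of 1k documents each;
   assumes the URI lexicon is enabled and that the status is kept in a
   document property element named "status" :)
let $uris :=
  cts:uris((), (),
    cts:properties-fragment-query(
      cts:element-value-query(xs:QName("status"), "unprocessed")))
let $todo := subsequence($uris, 1, 50000)
for $i in 1 to 50
let $batch := subsequence($todo, ($i - 1) * 1000 + 1, 1000)
where exists($batch)
return
  (: /tasks/mark-processing.xqy is a placeholder module that splits
     $batch-uris on "," and sets each document's status property to
     "processing"; each spawn runs as its own update transaction :)
  xdmp:spawn("/tasks/mark-processing.xqy",
    (xs:QName("batch-uris"),
     string-join(for $u in $batch return string($u), ",")))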

We did verify that the 3 nodes have the same configuration. To do so, we 
checked the following files:

- assignments.xml 
- clusters.xml 
- databases.xml 
- groups.xml 
- hosts.xml 
- server.xml (it has 2 obvious differences: the host-id and the ssl private key)

The only difference between the 3 of them is the memory. These are the specs:
- CPU: 2x X5650, 6 cores, in total 12 cores
- MEM: 48 GB ( 64 GB on the third one)
- DISK: 6x 600 GB 15K in RAID 10 config

Attached to this email there are 6 pictures, which clearly show the problem we 
are facing:
- System load (system load, 5 minutes) for each of the 3 nodes
- CPU usage on a 100% scale, again for the 3 boxes

On the 3rd machine we see these warnings every time the CPU is being hogged 
(Error.log):

2013-02-01 00:02:01.327 Warning: Hung 65 sec
2013-02-01 00:03:19.243 Warning: Hung 54 sec
2013-02-01 00:04:00.802 Warning: Hung 41 sec
2013-02-01 00:06:40.061 Warning: Hung 130 sec

And some connection losses/timeouts on the other 2 machines of the cluster:

2013-02-01 00:01:08.567 Info: Disconnecting from domestic host 
ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded 
for 30 seconds.
2013-02-01 00:02:54.634 Info: Disconnecting from domestic host 
ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded 
for 30 seconds.
2013-02-01 00:03:50.673 Info: Disconnecting from domestic host 
ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded 
for 30 seconds.
2013-02-01 00:05:01.473 Info: Disconnecting from domestic host 
ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded 
for 30 seconds.


Could you please provide advice?

Miguel Rodríguez
Lead Developer 
E mrgonza...@nl.swets.com
I www.swets.com
 

___
General mailing list
General@developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general