[basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM

2019-10-06 Thread first name last name
Hello,

This is essentially part 2 of trying to index large amounts of web data.
To summarize what happened before: the initial discussion started here [1],
Christian suggested some options, I dove into each of them, and I realized
that doing this on a low-memory system is harder than I initially thought.
At Christian's suggestion, I tried to split the big db into smaller dbs and
came up with a rudimentary sharding mechanism [3].
All my attempts to full-text index 30GB of data in BaseX resulted in OOM
(keep in mind that I only have 3.1GB of memory to allocate to BaseX).

Where to?
I decided to look more into what Christian said in [2] about option 2: pick
the exact values that I want and transfer them to PostgreSQL (after the
transfer, a GiST index would be built there to allow full-text searches).
PostgreSQL was picked because it uses a bounded in-memory buffer for all
large operations, backed by several files on disk; if it needs to combine
results that exceed the available memory it spills to disk, but at no point
does it exceed the configured amount of memory.

Variant 1 (see attached script pg-import.sh)
All good. So, I basically started writing XQuery that would do the
following:
- Open up a JDBC connection to PostgreSQL
- Get all text content from each thread page of the forum, together with the
db it belongs to
- Create a prepared statement for one such thread page, populate the
prepared statement, and execute it
This ended up in OOM after around 250k records. To be clear, 250k rows did
make it into PostgreSQL, which is nice, but eventually the process ran out of
memory. (Perhaps it has to do with how the GC works in Java, I don't know.)

Variant 2 (see attached script pg-import2.sh)
I did something similar to the above:
- Open up a JDBC connection to PostgreSQL
- Get all posts, and for each post get the author, the date, the message
content, the post id, and the BaseX db name (because we're going over all
shards, and each shard is a BaseX db)
- Create a prepared statement for each post with the data mentioned above,
and execute it
This also ended up in OOM, after around 340k records (my rough estimate is
that there are around 3M posts in the data).
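
Since the XQuery part of pg-import2.sh got cut off in the attachment below,
here is a rough sketch of what it does per post; the @data-author/@data-date
selectors are placeholders, not the forum's real markup, and this is not the
exact script I ran:

(: Rough sketch of the Variant 2 query: one prepared INSERT per post.
   The author/date attribute selectors are placeholders. :)
let $conn := sql:connect("jdbc:postgresql://localhost:5432/archive?user=postgres&amp;password=postgres")
let $prep := sql:prepare($conn,
  "INSERT INTO thread2(date,author,post_id,doc_uri,content,basex_db) VALUES(?,?,?,?,?,?)")
for $db in db:list()[matches(., "linuxquestions-shard-")]
for $doc in db:open($db)
for $post in $doc//*[matches(@id, "post_message_")]
let $params :=
  <sql:parameters>
    <sql:parameter type='string'>{ string($post/@data-date) }</sql:parameter>
    <sql:parameter type='string'>{ string($post/@data-author) }</sql:parameter>
    <sql:parameter type='string'>{ string($post/@id) }</sql:parameter>
    <sql:parameter type='string'>{ $doc/fn:base-uri() }</sql:parameter>
    <sql:parameter type='string'>{ string-join($post//text(), ' ') }</sql:parameter>
    <sql:parameter type='string'>{ $db }</sql:parameter>
  </sql:parameters>
return sql:execute-prepared($prep, $params)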

To summarize, I'm tempted to believe that there might be a leak in the
BaseX implementation of XQuery.
Here are the relevant versions of the software used:
- BaseX 9.2.4
- java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
- the JVM memory parameter was -Xmx3100m

I would be interested to know your thoughts

Thanks,
Stefan

[1]
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-September/014715.html
[2]
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014727.html
[3]
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014729.html
#!/bin/bash

mkdir tmp

# TODO: work on the full-text part in PostgreSQL
# a trigger will be required to make it work.

cat << EOF > tmp/archive-schema.sql
CREATE DATABASE "archive";

\c "archive"
CREATE EXTENSION IF NOT EXISTS btree_gist;

DROP TABLE IF EXISTS thread;
CREATE TABLE IF NOT EXISTS thread (
id SERIAL PRIMARY KEY,
content TEXT,
label VARCHAR(300),
forum VARCHAR(300)
);

-- CREATE INDEX idx_thread_content ON thread USING gist(content);
CREATE INDEX idx_thread_label ON thread(label);
CREATE UNIQUE INDEX idx_thread_uniq_label_forum ON thread(label,forum);

EOF

LD_LIBRARY_PATH="" /share/Public/builds/prefix/bin/psql -U postgres -d postgres < tmp/archive-schema.sql



cat << 'EOF' > tmp/import.xq

let $conn-string := "jdbc:postgresql://localhost:5432/archive?user=postgres&amp;password=postgres"
let $pgconn := sql:connect($conn-string)
let $dbs := fn:filter(db:list(), function($x){ matches($x, "linuxquestions-shard-") })
for $db in fn:reverse(fn:sort($dbs))
for $x in db:open($db)
let $label := $x/fn:base-uri()
let $content := $x//*[matches(@id, "post_message_")]/text()
let $params :=
  <sql:parameters>
    <sql:parameter type='string'>{ $label }</sql:parameter>
    <sql:parameter type='string'>{ $content }</sql:parameter>
    <sql:parameter type='string'>{ $db }</sql:parameter>
  </sql:parameters>
let $prep := sql:prepare($pgconn, "INSERT INTO thread(label,content,forum) VALUES(?,?,?)")
return 
try {
sql:execute-prepared($prep,$params)
} catch * {
'Error [' || $err:code || ']: ' || $err:description || '--' || $params
}

EOF

/share/Public/builds/basex/bin/basexclient -U admin -P admin tmp/import.xq


#!/bin/bash

mkdir tmp

# TODO: work on the full-text part in PostgreSQL
# a trigger will be required to make it work.
#
# DONE: more detailed content extraction.

cat << EOF > tmp/archive-schema.sql
CREATE DATABASE "archive";

\c "archive"
CREATE EXTENSION IF NOT EXISTS btree_gist;

DROP TABLE IF EXISTS thread2;
CREATE TABLE IF NOT EXISTS thread2 (
id SERIAL PRIMARY KEY,
date VARCHAR(300),
author VARCHAR(300),
post_id VARCHAR(300),
doc_uri VARCHAR(300),
content TEXT,
basex_db VARCHAR(300)
);

CREATE INDEX idx_thread2_date ON thread2(date);
CREATE

Re: [basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM

2019-10-06 Thread Christian Grün
Some complementary notes (others may be able to tell you more about their
experiences with large data sets):

> a GiST index would have to be built there, to allow full-text searches;
> PostgreSQL is picked
You could as well have a look at Elasticsearch or its predecessors.

> there might be a leak in the BaseX implementation of XQuery.

I assume you are referring to the SQL Module? Feel free to attach the OOM
stack trace, it might give us more insight.

I would recommend you to write SQL commands or an SQL dump to disk (see the
BaseX File Module for more information) and run/import this file in a second
step; this is probably faster than sending hundreds of thousands of single
SQL commands via JDBC, no matter if you are using XQuery or Java.
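
Untested, but the rough shape could be something like this (the output path
and the quoting helper are placeholders, and the selectors are taken from
your first script):

(: Sketch only: write one INSERT statement per thread page to a file,
   instead of sending each statement over JDBC. :)
let $q := "'"
let $quote := function($s) { $q || replace(string-join($s, ' '), $q, "''") || $q }
let $statements :=
  for $db in db:list()[matches(., "linuxquestions-shard-")]
  for $x in db:open($db)
  return "INSERT INTO thread(label,content,forum) VALUES (" || string-join((
    $quote($x/fn:base-uri()),
    $quote($x//*[matches(@id, "post_message_")]/text()),
    $quote($db)
  ), ",") || ");"
return file:write-text-lines("/tmp/threads.sql", $statements)

Since this still materializes all statements before writing, appending per
shard with file:append-text-lines may be safer on a small heap.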


Re: [basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM

2019-10-07 Thread first name last name
I'm currently at work and my setup is at home. In about 7 hours I'll get
home and I will send the stack trace.

Meanwhile, is there any way to write a FLWOR loop in a batched style?

For example, in my case, the approach I described for migrating data from
BaseX to PostgreSQL uses BaseX as an XQuery processor and hands the full-text
indexing over to PostgreSQL; that is what I'm trying to do.

However, in order to avoid OOM, I am thinking of batching the transfer into
chunks, and potentially restarting the BaseX server between the migration of
each chunk.
That's why I am asking how I could do that in BaseX. My hope is that the OOM
could be avoided this way, because not all the data would pass through main
memory at once, and the JVM GC would have less of this data to deal with.
Restarting the BaseX server between chunk transfers would help make sure that
whatever memory was used is released.

So I wonder if something like a positional range filter (e.g.
[position() = $start to $end]) would work here. Of course, some count would
have to be done beforehand to know how many batches there will be. Or maybe,
even without knowing how many batches there will be, a while-type loop could
be written in Bash with the stop condition being a check whether the current
chunk is empty.

Would an approach like this work to mitigate the OOM? Are there
alternatives or work-arounds to this kind of OOM?
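
To make the question more concrete, this is the kind of chunked query I have
in mind, where $start and $size would be bound from a Bash loop (e.g. via
basex -b start=... -b size=...); this is a sketch, not something I've run:

(: Sketch of one batch: take only posts $start+1 through $start+$size,
   then do the per-post work; the variable names are placeholders. :)
declare variable $start external := 0;
declare variable $size  external := 10000;

let $from  := xs:integer($start) + 1
let $to    := xs:integer($start) + xs:integer($size)
let $posts :=
  (for $db in db:list()[matches(., "linuxquestions-shard-")]
   return db:open($db)//*[matches(@id, "post_message_")])[position() = $from to $to]
return (
  (: the real per-post work (JDBC insert or SQL dump line) would go here :)
  count($posts)
)

The catch is that every batch still iterates the shards from the beginning
before reaching its window, so it trades memory for repeated scans; the
server restart would then just be one more step in the Bash loop.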

Thanks

On Mon, Oct 7, 2019, 1:13 AM Christian Grün 
wrote:

> Some complementary notes (others may be able to tell you more about their
> experiences with large data sets):
>
> a GiST index would have to be built there, to allow full-text searches;
>> PostgreSQL is picked
>>
>
> You could as well have a look at Elasticsearch or its predecessors.
>
> there might be a leak in the BaseX implementation of XQuery.
>>
>
> I assume you are referring to the SQL Module? Feel free to attach the OOM
> stack trace, it might give us more insight.
>
> I would recommend you to write SQL commands or an SQL dump to disk (see
> the BaseX File Module for more information) and run/import this file in a
> second step; this is probably faster than sending hundreds of thousands of
> single SQL commands via JDBC, no matter if you are using XQuery or Java.


Re: [basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM

2019-10-08 Thread first name last name
On Mon, Oct 7, 2019 at 1:13 AM Christian Grün 
wrote:

>
> I would recommend you to write SQL commands or an SQL dump to disk (see
> the BaseX File Module for more information) and run/import this file in a
> second step; this is probably faster than sending hundreds of thousands of
> single SQL commands via JDBC, no matter if you are using XQuery or Java.
>
>
Ok, so I finally managed to reach a compromise between BaseX's capabilities
and the hardware that I have at my disposal (for the time being).
This message will probably answer thread [1] as well (which is separate from
this one, but basically asks the same question: how to use BaseX as a
command-line XQuery processor).
The attached script takes a large collection of HTML documents, packs them
into small "balanced" sets, and then runs XQuery on them using BaseX.
The result is a lot of SQL files ready to be imported into PostgreSQL (with
some small tweaks, the data could also be adapted for import into
Elasticsearch).
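
The generated per-set query got truncated in the attachment, so here is
roughly its shape; the selectors, column list and quoting helper are
placeholders for what the script actually builds:

(: Rough shape of the per-set processing query that emits SQL INSERT lines.
   $db would be the temporary per-set database name (tmp-<set>). :)
declare variable $db external;
declare function local:sql-quote($s) {
  "'" || replace(string-join($s, ' '), "'", "''") || "'"
};
for $x in db:open($db)
for $post in $x//*[matches(@id, "post_message_")]
return "INSERT INTO thread2(post_id,doc_uri,content,basex_db) VALUES (" ||
  string-join((
    local:sql-quote(string($post/@id)),
    local:sql-quote($x/fn:base-uri()),
    local:sql-quote($post//text()),
    local:sql-quote($db)
  ), ",") || ");"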

I'm also including some benchmark data:

On system1 the following times were recorded: If run with -j4 it does 200
forum thread pages in 10 seconds.
And apparently there's about 5 posts on average per thread page. So in
85000 seconds (almost a day) it would process ~1.7M posts (in ~340k forum
thread pages) and have them prepared to be imported in PostgreSQL. With -j4
the observed peak memory usage was 500MB.

I've tested the script attached on the following two systems:
system1 config:
- BaseX 9.2.4
- script (from util-linux 2.31.1)
- GNU Parallel 20161222
- Ubuntu 18.04 LTS

system1 hardware:
- cpu: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (4 cores)
- memory: 16GB DDR3 RAM, 2 x Kingston @ 1333 MT/s
- disk: WDC WD30EURS-73TLHY0 @ 5400-7200RPM

system2 config:
- BaseX 9.2.4
- GNU Parallel 20181222
- script (from util-linux 2.34)

system2 hardware:
- cpu: Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz  (4 cores)
- memory: 4GB RAM DDR @ 1600MHz
- disk: HDD ST3000VN007-2E4166 @ 5900 rpm

[1]
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014722.html
#!/bin/bash
#
# This script leverages BaseX as an XQuery command line processor
# by using multiple small disposable BaseX databases, and parallelizing the entire processing.
# It will essentially run XQuery in batches on large data sets, and produce
# SQL insert statements, so the data can be imported into PostgreSQL.
#
# We're packing files for processing, and we're trying to balance them out in sets
# such that two constraints are met:
# - no more than 100 files per set
# - no more than 100*90k bytes per set
#
# Timing:
#
# On system1 the following times were recorded:
# If run with -j4 it does 200 thread pages in 10 seconds.
# And apparently there's about 5 posts on average per thread page.
# So in 85000 seconds (which is almost a day) it would process ~1.7M posts
# (in ~340k forum thread pages) and have them prepared to be imported into PostgreSQL.
# Again, for -j4, the observed peak memory usage was 500MB.
#
# Notes:
#
# 1)
# The following error(found through strace) would manifest itself because
# of GNU Parallel mainly:
# --- stopped by SIGTTOU --- 
# It's also described here: 
# https://notmuchmail.org/pipermail/notmuch/2019/028015.html
# It can be circumvented through the use of script
# (script - make typescript of terminal session)
#
# 2) --linebuffer is used for GNU Parallel so it can write to stdout as
# soon as possible.
#
#
# system1 config:
# - BaseX 9.2.4
# - script (from util-linux 2.31.1)
# - GNU Parallel 20161222
# - Ubuntu 18.04 LTS
# 
# system1 hardware:
# - cpu: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
# - memory: 16GB DDR3 RAM, 2 x Kingston @ 1333 MT/s
# - disk: WDC WD30EURS-73TLHY0 @ 5400-7200RPM 
#
# system2 config:
# - BaseX 9.2.4
# - GNU Parallel 20181222
# - script (from util-linux 2.34)
#
# system2 hardware:
# - cpu: Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz  (4 cores)
# - memory: 4GB RAM DDR @ 1600MHz
# - disk: HDD ST3000VN007-2E4166 @ 5900 rpm
#
#

BASEX="$HOME/basex-preloaded-forums/bin/basex"
mkdir meta-process
echo "Partitioning files into different sets ..."
#fpart -f 100 -s $((100 * 9)) `pwd`/threads/ -o meta-process/files-shard

proc() {
s="$1"
f="$2"
j="$3"

echo "$s -- $j -- $f"

# Build script to create temp db, and import all the html files
SHARD_IMPORT_SCRIPT=$(pwd)"/tmp/import.script.$s"
SHARD_PROCESS_SCRIPT=$(pwd)"/tmp/process.script.$s"
SHARD_SQL_OUT=$(pwd)"/tmp/partial.$j.sql"

cat << EOF > $SHARD_IMPORT_SCRIPT
DROP   DB tmp-$s
CREATE DB tmp-$s
SET PARSER html
SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
SET CREATEFILTER *.html
EOF
cat $f | perl -pne 's{^}{ADD }g;' >> $SHARD_IMPORT_SCRIPT ;
 
script --return -c "$BASEX < $SHARD_IMPORT_SCRIPT >/dev/null ; echo 'Importing Done'; "
 
# Build processing script, to pull values and build SQL queries
echo "for \$x in db:open(\"tmp-$s\

Re: [basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM

2019-10-08 Thread Christian Grün
I was surprised to see the 16 GB RAM machine pop up in your setup. Did you
check how many gigabytes of XML data can be full-text indexed with BaseX (and
a large -Xmx value, maybe 15g) on that system?
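
Untested sketch, assuming the data is (or can be put) in a single database
called "linuxquestions"; with the JVM started with -Xmx15g, the full-text
index could be built with something like:

(: Sketch: (re)build index structures, including the full-text index.
   The database name and the option values are assumptions. :)
db:optimize("linuxquestions", true(), map {
  'ftindex'    : true(),
  'stemming'   : false(),
  'casesens'   : false(),
  'diacritics' : false()
})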





first name last name wrote on Tue., Oct. 8, 2019, 22:40:

> On Mon, Oct 7, 2019 at 1:13 AM Christian Grün 
> wrote:
>
>>
>> I would recommend you to write SQL commands or an SQL dump to disk (see
>> the BaseX File Module for more information) and run/import this file in a
>> second step; this is probably faster than sending hundreds of thousands of
>> single SQL commands via JDBC, no matter if you are using XQuery or Java.
>>
>>
> Ok, so I finally managed to reach a compromise regarding BaseX
> capabilities and the hardware that I have at my disposal (for the time
> being).
> This message will probably answer thread [1] as well (which is separate
> from this but seems to ask the same question basically, which is, how to
> use BaseX as a command-line XQuery processor).
> The script attached will take a large collection of HTML documents, it
> will pack them into small "balanced" sets, and then it will run XQuery on
> them using BaseX.
> The result will be a lot of SQL files ready to be imported in PostgreSQL
> (with some small tweaks, the data could be adapted to be imported in
> Elasticsearch).
>
> I'm also including some benchmark data:
>
> On system1 the following times were recorded: If run with -j4 it does 200
> forum thread pages in 10 seconds.
> And apparently there's about 5 posts on average per thread page. So in
> 85000 seconds (almost a day) it would process ~1.7M posts (in ~340k forum
> thread pages) and have them prepared to be imported in PostgreSQL. With -j4
> the observed peak memory usage was 500MB.
>
> I've tested the script attached on the following two systems:
> system1 config:
> - BaseX 9.2.4
> - script (from util-linux 2.31.1)
> - GNU Parallel 20161222
> - Ubuntu 18.04 LTS
>
> system1 hardware:
> - cpu: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (4 cores)
> - memory: 16GB DDR3 RAM, 2 x Kingston @ 1333 MT/s
> - disk: WDC WD30EURS-73TLHY0 @ 5400-7200RPM
>
> system2 config:
> - BaseX 9.2.4
> - GNU Parallel 20181222
> - script (from util-linux 2.34)
>
> system2 hardware:
> - cpu: Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz  (4 cores)
> - memory: 4GB RAM DDR @ 1600MHz
> - disk: HDD ST3000VN007-2E4166 @ 5900 rpm
>
> [1]
> https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014722.html
>
>