Hello,

This is essentially part 2 of my attempt to index large amounts of web data.
To summarize what happened before: the initial discussion started in [1],
Christian suggested some options, I dove into each of them, and I realized
that doing this on a low-memory system is harder than I had initially
thought. At Christian's suggestion, I tried to split the big db into smaller
dbs and came up with a rudimentary sharding mechanism [3].
All my attempts to build a full-text index over 30GB of data in BaseX ended
in OOM (keep in mind that I only have 3.1GB of memory to allocate to BaseX).

Where to?
I decided to look more closely at what Christian said in [2] about option 2:
pick the exact values that I want and transfer them to PostgreSQL. After the
transfer, a GiST index would have to be built there to allow full-text
searches. I picked PostgreSQL because it performs all large operations
through a fixed-size in-memory buffer backed by several files on disk: if it
has to combine results that exceed the available memory, it spills to disk,
so it never exceeds the amount of memory it has been given.
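
For the full-text part (still a TODO in the scripts below), my rough plan is
something like the following sketch. The tsv column and the trigger name are
placeholders I made up, and a GIN index would do just as well as GiST here:

-- sketch only: add a tsvector column, keep it up to date with the built-in
-- trigger function, and index it with GiST
ALTER TABLE thread ADD COLUMN tsv tsvector;
CREATE INDEX idx_thread_tsv ON thread USING gist(tsv);
CREATE TRIGGER trg_thread_tsv
    BEFORE INSERT OR UPDATE ON thread
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', content);
-- searches would then go through the index:
-- SELECT id, label FROM thread WHERE tsv @@ to_tsquery('english', 'kernel & panic');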

Variant 1 (see attached script pg-import.sh)
All good. So I started writing XQuery that does the following:
- Open up a JDBC connection to PostgreSQL
- Get all text content from each thread page of the forum, plus the db it
belongs to
- Create a prepared statement for one such thread page, populate the
prepared statement, and execute it
This ended in OOM after around 250k records. To be clear: 250k rows did make
it into PostgreSQL, which is nice, but the query eventually ran out of
memory. (Perhaps it has to do with how the GC works in Java; I don't know.)
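
One thing I notice rereading the script: sql:prepare is called once per row,
and as far as I understand the BaseX SQL Module, statement handles are only
released by sql:close or when the connection goes away, so they pile up over
the whole run. A minimal reshuffle (same table, prepare hoisted out of the
loop) would look like this; whether that alone avoids the OOM I can't say:

let $pgconn := sql:connect("jdbc:postgresql://localhost:5432/archive?user=postgres&amp;password=postgres")
(: prepare once; the same handle is reused for every row :)
let $prep := sql:prepare($pgconn, "INSERT INTO thread(label,content,forum) VALUES(?,?,?)")
for $db in fn:filter(db:list(), function($x) { matches($x, "linuxquestions-shard-") })
for $x in db:open($db)
return sql:execute-prepared($prep,
  <sql:parameters>
    <sql:parameter type='string'>{ $x/fn:base-uri() }</sql:parameter>
    <sql:parameter type='string'>{ fn:string-join($x//*[matches(@id, "post_message_")]/text(), ' ') }</sql:parameter>
    <sql:parameter type='string'>{ $db }</sql:parameter>
  </sql:parameters>)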

Variant 2 (see attached script pg-import2.sh)
I did something similar to the above:
- Open up a JDBC connection to PostgreSQL
- Get all posts, and for each post get the author, the date, the message
content, the post id, and the BaseX db name (because we're going over all
shards, and each shard is a BaseX db)
- Create a prepared statement for each post with the data mentioned above,
and execute it
This also ended in OOM, after around 340k records (my estimate is that there
are around 3M posts in the data).
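
If it is the single long-running query that accumulates state, another thing
I may try is to drive the loop from bash instead, using the standalone basex
binary rather than basexclient, so that every shard is processed in its own
JVM and the heap is thrown away between shards. A sketch, where
import-one-shard.xq is hypothetical: it would be import.xq restricted to one
shard, reading the shard name from an external variable:

#!/bin/bash
# one JVM per shard keeps the peak memory of any single run bounded
NSHARDS=30   # assumed shard count; adjust to the real setup
for i in $(seq 1 $NSHARDS); do
    /share/Public/builds/basex/bin/basex \
        -b db="linuxquestions-shard-$i" tmp/import-one-shard.xq
done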

To summarize, I'm tempted to believe that there might be a memory leak in
BaseX's XQuery implementation.
For reference, the relevant versions of the software used:
- BaseX 9.2.4
- java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
- the JVM memory parameter was -Xmx3100m

I would be interested to know your thoughts.

Thanks,
Stefan

[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-September/014715.html
[2] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014727.html
[3] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014729.html
--- pg-import.sh ---

#!/bin/bash

mkdir -p tmp

# TODO: work on the full-text part in PostgreSQL
# a trigger will be required to make it work.

cat << EOF > tmp/archive-schema.sql
CREATE DATABASE "archive";

\c "archive"
CREATE EXTENSION IF NOT EXISTS btree_gist;

DROP TABLE IF EXISTS thread;
CREATE TABLE IF NOT EXISTS thread (
    id SERIAL PRIMARY KEY,
    content TEXT,
    label VARCHAR(300),
    forum VARCHAR(300)
);

-- CREATE INDEX idx_thread_content ON thread USING gist(content);
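-- note: with btree_gist, a GiST index over TEXT would only support
-- equality/ordering lookups; full-text search needs a tsvector column
-- (hence the TODO at the top of this script)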
CREATE INDEX idx_thread_label ON thread(label);
CREATE UNIQUE INDEX idx_thread_uniq_label_forum ON thread(label,forum);

EOF

LD_LIBRARY_PATH="" /share/Public/builds/prefix/bin/psql -U postgres -d postgres < tmp/archive-schema.sql



cat << 'EOF' > tmp/import.xq

let $conn-string := "jdbc:postgresql://localhost:5432/archive?user=postgres&amp;password=postgres"
let $pgconn := sql:connect($conn-string)
let $dbs := fn:filter(db:list(), function($x) { matches($x, "linuxquestions-shard-") })
for $db in fn:reverse(fn:sort($dbs))
for $x in db:open($db)
let $label:=$x/fn:base-uri()
let $content:=$x//*[matches(@id,"post_message_")]/text()
let $params := <sql:parameters>
                 <sql:parameter type='string'>{ $label   }</sql:parameter>
                 <sql:parameter type='string'>{ $content }</sql:parameter>
                 <sql:parameter type='string'>{ $db }</sql:parameter>
               </sql:parameters>
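(: note: this prepares a new statement on every iteration; handles are
   only freed by sql:close or when the connection is closed :)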
let $prep := sql:prepare($pgconn, "INSERT INTO thread(label,content,forum) VALUES(?,?,?)")
return
  try {
    sql:execute-prepared($prep, $params)
  } catch * {
    'Error [' || $err:code || ']: ' || $err:description || '--' || $params
  }

EOF

/share/Public/builds/basex/bin/basexclient -U admin -P admin tmp/import.xq


--- pg-import2.sh ---

#!/bin/bash

mkdir -p tmp

# TODO: work on the full-text part in PostgreSQL
# a trigger will be required to make it work.
#
# DONE: more detailed content extraction.

cat << EOF > tmp/archive-schema.sql
CREATE DATABASE "archive";

\c "archive"
CREATE EXTENSION IF NOT EXISTS btree_gist;

DROP TABLE IF EXISTS thread2;
CREATE TABLE IF NOT EXISTS thread2 (
    id SERIAL PRIMARY KEY,
    date VARCHAR(300),
    author VARCHAR(300),
    post_id VARCHAR(300),
    doc_uri VARCHAR(300),
    content TEXT,
    basex_db VARCHAR(300)
);

CREATE INDEX idx_thread2_date ON thread2(date);
CREATE INDEX idx_thread2_author ON thread2(author);
CREATE INDEX idx_thread2_post_id ON thread2(post_id);
CREATE INDEX idx_thread2_doc_uri ON thread2(doc_uri);
CREATE INDEX idx_thread2_basex_db ON thread2(basex_db);
-- one row per post, so uniqueness is per (doc_uri, post_id), not per doc_uri
CREATE UNIQUE INDEX idx_thread2_uniq_doc_post ON thread2(doc_uri, post_id);

EOF

LD_LIBRARY_PATH="" /share/Public/builds/prefix/bin/psql -U postgres -d postgres < tmp/archive-schema.sql



cat << 'EOF' > tmp/import.xq

let $conn-string := "jdbc:postgresql://localhost:5432/archive?user=postgres&amp;password=postgres"
let $pgconn := sql:connect($conn-string)
let $dbs := fn:filter(db:list(), function($x) { matches($x, "linuxquestions-shard-") })
for $db in fn:reverse(fn:sort($dbs))
for $x in db:open($db)
let $doc_uri:=$x/fn:base-uri()
let $elems:=$x//table[matches(@id,"post\d+")]
for $elem in $elems
let $date:=$elem/tr[1]/td[1]/text()
let $author:=$elem/tr[2]/td[1]/div[1]/a//text()
let $post_id:=$elem/@id/string()
let $content:=$elem//*[matches(@id,"post_message_")]/text()
let $params := <sql:parameters>
                 <sql:parameter type='string'>{ $date   }</sql:parameter>
                 <sql:parameter type='string'>{ $author }</sql:parameter>
                 <sql:parameter type='string'>{ $post_id }</sql:parameter>
                 <sql:parameter type='string'>{ $doc_uri }</sql:parameter>
                 <sql:parameter type='string'>{ $content }</sql:parameter>
                 <sql:parameter type='string'>{ $db }</sql:parameter>
               </sql:parameters>
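(: same caveat as in pg-import.sh: this prepares a new statement per post :)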
let $prep := sql:prepare($pgconn, "INSERT INTO thread2(date,author,post_id,doc_uri,content,basex_db) VALUES(?,?,?,?,?,?)")
return
  try {
    sql:execute-prepared($prep, $params)
  } catch * {
    'Error [' || $err:code || ']: ' || $err:description || '--' || $params
  }

EOF

/share/Public/builds/basex/bin/basexclient -U admin -P admin tmp/import.xq

