Re: [Wikidata] 2 million queries against a Wikidata instance

2020-07-31 Thread Adam Sanchez
Dear all,

Finally, I was able to execute the 2 million queries in 24 minutes.
The solution was to load Virtuoso fully in memory: I created a ramdisk
filesystem and copied the full Virtuoso installation there.
The copy was done in a few minutes. I know the data is now held in
volatile memory, but I have already synchronised that folder with a
folder on the SSD. I think this approach could also be used to load
Wikidata even faster, by keeping Virtuoso in a ramdisk-backed
directory during loading; once the loading is done, the folder could
be moved back from the ramdisk directory to a hard-disk directory for
data persistence.
Thanks for all your suggestions and ideas. They saved me time, because
they let me narrow down the set of possible solutions, between
software and hardware.

Best,

Adam

Links:
https://www.linuxbabe.com/command-line/create-ramdisk-linux


On Thu, 23 Jul 2020 at 09:01, Aidan Hogan wrote:
>
> Hi Adam,
>
> On 2020-07-13 13:41, Adam Sanchez wrote:
> > Hi,
> >
> > I have to launch 2 million queries against a Wikidata instance.
> > I have loaded Wikidata in Virtuoso 7 (512 GB RAM, 32 cores, SSD disks
> > with RAID 0).
> > The queries are simple, just 2 types.
> >
> > select ?s ?p ?o {
> > ?s ?p ?o.
> > filter (?s = ?param)
> > }
> >
> > select ?s ?p ?o {
> > ?s ?p ?o.
> > filter (?o = ?param)
> > }
> >
> > If I use a Java ThreadPoolExecutor, it takes 6 hours.
> > How can I speed up the queries processing even more?
>
> Perhaps I am a bit late to respond.
>
> It's not really clear to me what you are aiming for, but if this is a
> once-off task, I would recommend downloading the dump in Turtle or
> N-Triples, loading your two million parameters into memory in a sorted or
> hashed data structure in the programming language of your choice (it
> should take considerably less than 1 GB of memory assuming typical
> constants), using a streaming RDF parser for that language, and, for each
> subject/object, checking if it's in your list in memory. This solution is
> about as good as you can get in terms of once-off batch processing.
>
> If your idea is to index the data so you can do 2 million lookups in
> "interactive time", your problem is not what software to use, it's what
> hardware to use.
>
> Traditional hard disks have a physical arm that takes maybe 5-10 ms to
> move. Solid-state disks are quite a bit better but still have seeks in
> the range of 0.1 ms. Multiply those seek times by 2 million and you have
> a long wait (caching will help, as will multiple disks, but not by
> nearly enough). You would need to get the data into main memory (RAM) to
> have any chance of approximating interactive times, and even then you
> will probably not get interactive runtimes without leveraging some
> further assumptions about what you want to do in order to optimise
> further (e.g., if you're only interested in Q ids, you can use integers
> or bit vectors, etc.). In the most general case, you would probably need
> to pre-filter the data as much as you can, and also use as much
> compression as you can (ideally with compact data structures) to get the
> data into memory on one machine, or you might think about something like
> Redis (an in-memory key-value store) on lots of machines. Essentially, if
> your goal is interactive times on millions of lookups, you very likely
> need to look at options purely in RAM (unless you have thousands of disks
> available, at least). The good news is that 512 GB(?) sounds like a lot
> of space to store stuff in.
>
> Best,
> Aidan
>
> > I was thinking :
> >
> > a) to implement a Virtuoso cluster to distribute the queries or
> > b) to load Wikidata in a Spark dataframe (since Sansa framework is
> > very slow, I would use my own implementation) or
> > c) to load Wikidata in a Postgresql table and use Presto to distribute
> > the queries or
> > d) to load Wikidata in a PG-Strom table to use GPU parallelism.
> >
> > What do you think? I am looking for ideas.
> > Any suggestion will be appreciated.
> >
> > Best,
> >
> > ___
> > Wikidata mailing list
> > Wikidata@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikidata
> >

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] 2 million queries against a Wikidata instance

2020-07-23 Thread Aidan Hogan

Hi Adam,

On 2020-07-13 13:41, Adam Sanchez wrote:

Hi,

I have to launch 2 million queries against a Wikidata instance.
I have loaded Wikidata in Virtuoso 7 (512 GB RAM, 32 cores, SSD disks with RAID 0).
The queries are simple, just 2 types.

select ?s ?p ?o {
?s ?p ?o.
filter (?s = ?param)
}

select ?s ?p ?o {
?s ?p ?o.
filter (?o = ?param)
}

If I use a Java ThreadPoolExecutor, it takes 6 hours.
How can I speed up the queries processing even more?


Perhaps I am a bit late to respond.

It's not really clear to me what you are aiming for, but if this is a 
once-off task, I would recommend downloading the dump in Turtle or 
N-Triples, loading your two million parameters into memory in a sorted or 
hashed data structure in the programming language of your choice (it 
should take considerably less than 1 GB of memory assuming typical 
constants), using a streaming RDF parser for that language, and, for each 
subject/object, checking if it's in your list in memory. This solution is 
about as good as you can get in terms of once-off batch processing.
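
For concreteness, a minimal sketch of that kind of streaming scan
(assuming Apache Jena's streaming parser, an N-Triples dump and the two
million IRIs in a text file; the file names here are made up):

  import java.nio.file.*;
  import java.util.*;
  import org.apache.jena.graph.Triple;
  import org.apache.jena.riot.RDFParser;
  import org.apache.jena.riot.system.StreamRDFBase;

  public class DumpScan {
    public static void main(String[] args) throws Exception {
      // The two million parameter IRIs, one per line (well under 1 GB as a hash set).
      Set<String> params = new HashSet<>(Files.readAllLines(Paths.get("params.txt")));

      // Stream the dump once; print every triple whose subject or object matches.
      RDFParser.source("wikidata-dump.nt.gz").parse(new StreamRDFBase() {
        @Override public void triple(Triple t) {
          String s = t.getSubject().isURI() ? t.getSubject().getURI() : null;
          String o = t.getObject().isURI() ? t.getObject().getURI() : null;
          if ((s != null && params.contains(s)) || (o != null && params.contains(o))) {
            System.out.println(t);
          }
        }
      });
    }
  }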


If your idea is to index the data so you can do 2 million lookups in 
"interactive time", your problem is not what software to use, it's what 
hardware to use.


Traditional hard disks have a physical arm that takes maybe 5-10 ms to 
move. Solid-state disks are quite a bit better but still have seeks in 
the range of 0.1 ms. Multiply those seek times by 2 million and you have 
a long wait (caching will help, as will multiple disks, but not by 
nearly enough). You would need to get the data into main memory (RAM) to 
have any chance of approximating interactive times, and even then you 
will probably not get interactive runtimes without leveraging some 
further assumptions about what you want to do in order to optimise 
further (e.g., if you're only interested in Q ids, you can use integers 
or bit vectors, etc.). In the most general case, you would probably need 
to pre-filter the data as much as you can, and also use as much 
compression as you can (ideally with compact data structures) to get the 
data into memory on one machine, or you might think about something like 
Redis (an in-memory key-value store) on lots of machines. Essentially, if 
your goal is interactive times on millions of lookups, you very likely 
need to look at options purely in RAM (unless you have thousands of disks 
available, at least). The good news is that 512 GB(?) sounds like a lot 
of space to store stuff in.
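
To put rough numbers on that (back-of-the-envelope only): 2,000,000
lookups x 0.1 ms is about 200 seconds of pure seek time on an SSD,
while 2,000,000 x 5 ms is roughly 2.8 hours on a spinning disk, and
that is before any actual query processing is counted.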


Best,
Aidan


I was thinking :

a) to implement a Virtuoso cluster to distribute the queries or
b) to load Wikidata in a Spark dataframe (since Sansa framework is
very slow, I would use my own implementation) or
c) to load Wikidata in a Postgresql table and use Presto to distribute
the queries or
d) to load Wikidata in a PG-Strom table to use GPU parallelism.

What do you think? I am looking for ideas.
Any suggestion will be appreciated.

Best,

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] 2 million queries against a Wikidata instance

2020-07-13 Thread Kingsley Idehen
On 7/13/20 1:41 PM, Adam Sanchez wrote:
> Hi,
>
> I have to launch 2 million queries against a Wikidata instance.
> I have loaded Wikidata in Virtuoso 7 (512 GB RAM, 32 cores, SSD disks
> with RAID 0).
> The queries are simple, just 2 types.
>
> select ?s ?p ?o {
> ?s ?p ?o.
> filter (?s = ?param)
> }
>
> select ?s ?p ?o {
> ?s ?p ?o.
> filter (?o = ?param)
> }
>
> If I use a Java ThreadPoolExecutor, it takes 6 hours.
> How can I speed up the queries processing even more?
>
> I was thinking :
>
> a) to implement a Virtuoso cluster to distribute the queries or
> b) to load Wikidata in a Spark dataframe (since Sansa framework is
> very slow, I would use my own implementation) or
> c) to load Wikidata in a Postgresql table and use Presto to distribute
> the queries or
> d) to load Wikidata in a PG-Strom table to use GPU parallelism.
>
> What do you think? I am looking for ideas.
> Any suggestion will be appreciated.
>
> Best,


Hi Adam,

You need to increase the memory available to Virtuoso. If you are at
your limits, that's when the Cluster Edition comes in handy, i.e., it
enables you to build a large pool of memory from a sharded DB,
horizontally partitioned over a collection of commodity computers.

There is a public Google Spreadsheet covering a variety of public
Virtuoso instances that should aid you in this process [1].

Links:

[1]
https://docs.google.com/spreadsheets/d/1-stlTC_WJmMU3xA_NxA1tSLHw6_sbpjff-5OITtrbFw/edit#gid=812792186

-- 
Regards,

Kingsley Idehen   
Founder & CEO 
OpenLink Software   
Home Page: http://www.openlinksw.com
Community Support: https://community.openlinksw.com
Weblogs (Blogs):
Company Blog: https://medium.com/openlink-software-blog
Virtuoso Blog: https://medium.com/virtuoso-blog
Data Access Drivers Blog: 
https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers

Personal Weblogs (Blogs):
Medium Blog: https://medium.com/@kidehen
Legacy Blogs: http://www.openlinksw.com/blog/~kidehen/
  http://kidehen.blogspot.com

Profile Pages:
Pinterest: https://www.pinterest.com/kidehen/
Quora: https://www.quora.com/profile/Kingsley-Uyi-Idehen
Twitter: https://twitter.com/kidehen
Google+: https://plus.google.com/+KingsleyIdehen/about
LinkedIn: http://www.linkedin.com/in/kidehen

Web Identities (WebID):
Personal: http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
: 
http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] 2 million queries against a Wikidata instance

2020-07-13 Thread Amirouche Boubekki
On Mon, 13 Jul 2020 at 21:22, Adam Sanchez wrote:
>
> I have 14 TB of SSD (RAID 0)
>
> On Mon, 13 Jul 2020 at 21:19, Amirouche Boubekki wrote:
> >
> > On Mon, 13 Jul 2020 at 19:42, Adam Sanchez wrote:
> > >
> > > Hi,
> > >
> > > I have to launch 2 million queries against a Wikidata instance.
> > > I have loaded Wikidata in Virtuoso 7 (512 GB RAM, 32 cores, SSD disks
> > > with RAID 0).
> > > The queries are simple, just 2 types.
> >
> > How much SSD storage, in gigabytes, do you have?
> >
> > > select ?s ?p ?o {
> > > ?s ?p ?o.
> > > filter (?s = ?param)
> > > }

Can you confirm that the above query is the same as:

select ?p ?o {
  param ?p ?o
}

Where param is one of the two million params.

Also, did you investigate where the bottleneck is? Look into disk
usage and CPU load. glances [0] can provide that information.

Can you run the thread pool on another machine?

Some back-of-the-envelope calculation: 2,000,000 queries in 6 hours
means your system achieves about 10 milliseconds per query; AFAIK, that
is good.
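
(For the arithmetic: 6 hours is 21,600 seconds, and 21,600,000 ms /
2,000,000 queries is roughly 10.8 ms per query.)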

[0] https://github.com/nicolargo/glances/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] 2 million queries against a Wikidata instance

2020-07-13 Thread Adam Sanchez
I have 14 TB of SSD (RAID 0)

On Mon, 13 Jul 2020 at 21:19, Amirouche Boubekki wrote:
>
> On Mon, 13 Jul 2020 at 19:42, Adam Sanchez wrote:
> >
> > Hi,
> >
> > I have to launch 2 million queries against a Wikidata instance.
> > I have loaded Wikidata in Virtuoso 7 (512 GB RAM, 32 cores, SSD disks
> > with RAID 0).
> > The queries are simple, just 2 types.
>
> How much SSD storage, in gigabytes, do you have?
>
> > select ?s ?p ?o {
> > ?s ?p ?o.
> > filter (?s = ?param)
> > }
>
> Is that the same as:
>
> select ?p ?o {
>  param ?p ?o
> }
>
> Where param is one of the two million params.
>
> > select ?s ?p ?o {
> > ?s ?p ?o.
> > filter (?o = ?param)
> > }
> >
> > If I use a Java ThreadPoolExecutor, it takes 6 hours.
> > How can I speed up the queries processing even more?
> >
> > I was thinking :
> >
> > a) to implement a Virtuoso cluster to distribute the queries or
> > b) to load Wikidata in a Spark dataframe (since Sansa framework is
> > very slow, I would use my own implementation) or
> > c) to load Wikidata in a Postgresql table and use Presto to distribute
> > the queries or
> > d) to load Wikidata in a PG-Strom table to use GPU parallelism.
> >
> > What do you think? I am looking for ideas.
> > Any suggestion will be appreciated.
> >
> > Best,
> >
> > ___
> > Wikidata mailing list
> > Wikidata@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
>
> --
> Amirouche ~ https://hyper.dev
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] 2 million queries against a Wikidata instance

2020-07-13 Thread Amirouche Boubekki
On Mon, 13 Jul 2020 at 19:42, Adam Sanchez wrote:
>
> Hi,
>
> I have to launch 2 million queries against a Wikidata instance.
> I have loaded Wikidata in Virtuoso 7 (512 GB RAM, 32 cores, SSD disks
> with RAID 0).
> The queries are simple, just 2 types.

How much SSD storage, in gigabytes, do you have?

> select ?s ?p ?o {
> ?s ?p ?o.
> filter (?s = ?param)
> }

Is that the same as:

select ?p ?o {
 param ?p ?o
}

Where param is one of the two million params.

> select ?s ?p ?o {
> ?s ?p ?o.
> filter (?o = ?param)
> }
>
> If I use a Java ThreadPoolExecutor, it takes 6 hours.
> How can I speed up the queries processing even more?
>
> I was thinking :
>
> a) to implement a Virtuoso cluster to distribute the queries or
> b) to load Wikidata in a Spark dataframe (since Sansa framework is
> very slow, I would use my own implementation) or
> c) to load Wikidata in a Postgresql table and use Presto to distribute
> the queries or
> d) to load Wikidata in a PG-Strom table to use GPU parallelism.
>
> What do you think? I am looking for ideas.
> Any suggestion will be appreciated.
>
> Best,
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata



-- 
Amirouche ~ https://hyper.dev

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] 2 million queries against a Wikidata instance

2020-07-13 Thread Imre Samu
> How can I speed up the queries processing even more?

IMHO: drop the unwanted data as early as you can (i.e., aggressive
pre-filtering; don't import what you don't need).

> Any suggestion will be appreciated.

In your case:
- I would check the RDF dumps:
https://www.wikidata.org/wiki/Wikidata:Database_download#RDF_dumps
- I would write a custom pre-filter for the 2 million parameters
(simple text parsing, e.g., in GoLang using multiple cores, or with
other fast code; see the sketch below)
- and just load the results into PostgreSQL.
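
A rough illustration of that kind of pre-filter (a sketch only, in Java
rather than Go; the dump and parameter file names are just examples):

  import java.io.*;
  import java.nio.charset.StandardCharsets;
  import java.nio.file.*;
  import java.util.*;
  import java.util.zip.GZIPInputStream;

  public class PreFilter {
    public static void main(String[] args) throws Exception {
      // Wrap each parameter IRI as an N-Triples token so plain string
      // comparison works without a full RDF parser.
      Set<String> params = new HashSet<>();
      for (String iri : Files.readAllLines(Paths.get("params.txt"))) {
        params.add("<" + iri + ">");
      }

      try (BufferedReader in = new BufferedReader(new InputStreamReader(
          new GZIPInputStream(Files.newInputStream(Paths.get("latest-truthy.nt.gz"))),
          StandardCharsets.UTF_8))) {
        // Keep a line if its subject or object token is one of the parameters;
        // the output can then be bulk-loaded into PostgreSQL with COPY.
        in.lines().parallel()
          .filter(line -> {
            String[] tok = line.split(" ");
            return tok.length >= 3
                && (params.contains(tok[0]) || params.contains(tok[2]));
          })
          .forEach(System.out::println);
      }
    }
  }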

I have had good experience parsing and filtering the Wikidata JSON dump
(gzipped) and loading the result into a PostgreSQL database.
I can run the full code on my laptop, and the result in my case is
~12 GB in PostgreSQL.

The biggest problem is the memory requirement of the "2 million
parameters", but you can choose some fast key-value storage like
RocksDB, and there are other low-tech parsing solutions as well.

Regards,
 Imre






Adam Sanchez wrote (on Mon, 13 Jul 2020 at 19:42):

> Hi,
>
> I have to launch 2 million queries against a Wikidata instance.
> I have loaded Wikidata in Virtuoso 7 (512 GB RAM, 32 cores, SSD disks
> with RAID 0).
> The queries are simple, just 2 types.
>
> select ?s ?p ?o {
> ?s ?p ?o.
> filter (?s = ?param)
> }
>
> select ?s ?p ?o {
> ?s ?p ?o.
> filter (?o = ?param)
> }
>
> If I use a Java ThreadPoolExecutor, it takes 6 hours.
> How can I speed up the queries processing even more?
>
> I was thinking :
>
> a) to implement a Virtuoso cluster to distribute the queries or
> b) to load Wikidata in a Spark dataframe (since Sansa framework is
> very slow, I would use my own implementation) or
> c) to load Wikidata in a Postgresql table and use Presto to distribute
> the queries or
> d) to load Wikidata in a PG-Strom table to use GPU parallelism.
>
> What do you think? I am looking for ideas.
> Any suggestion will be appreciated.
>
> Best,
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] 2 million queries against a Wikidata instance

2020-07-13 Thread Adam Sanchez
Hi,

I have to launch 2 million queries against a Wikidata instance.
I have loaded Wikidata in Virtuoso 7 (512 GB RAM, 32 cores, SSD disks with RAID 0).
The queries are simple, just 2 types.

select ?s ?p ?o {
?s ?p ?o.
filter (?s = ?param)
}

select ?s ?p ?o {
?s ?p ?o.
filter (?o = ?param)
}

If I use a Java ThreadPoolExecutor, it takes 6 hours.
How can I speed up the queries processing even more?
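
For reference, this kind of client loop looks roughly as follows (a
sketch only, not my actual code: it assumes Jena ARQ, Virtuoso's
default SPARQL endpoint at http://localhost:8890/sparql, and a made-up
parameter file name):

  import java.nio.file.*;
  import java.util.*;
  import java.util.concurrent.*;
  import org.apache.jena.query.*;

  public class BulkLookup {
    static final String ENDPOINT = "http://localhost:8890/sparql";

    public static void main(String[] args) throws Exception {
      List<String> params = Files.readAllLines(Paths.get("params.txt"));
      ExecutorService pool = Executors.newFixedThreadPool(32); // one worker per core

      for (String iri : params) {
        pool.submit(() -> {
          // Same shape as the first query type, with ?param bound to one IRI.
          String q = "SELECT ?s ?p ?o WHERE { ?s ?p ?o . FILTER(?s = <" + iri + ">) }";
          try (QueryExecution qe = QueryExecutionFactory.sparqlService(ENDPOINT, q)) {
            ResultSet rs = qe.execSelect();
            while (rs.hasNext()) {
              rs.next(); // consume (or write out) each row
            }
          }
        });
      }
      pool.shutdown();
      pool.awaitTermination(1, TimeUnit.DAYS);
    }
  }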

I was thinking :

a) to implement a Virtuoso cluster to distribute the queries or
b) to load Wikidata in a Spark dataframe (since Sansa framework is
very slow, I would use my own implementation) or
c) to load Wikidata in a Postgresql table and use Presto to distribute
the queries or
d) to load Wikidata in a PG-Strom table to use GPU parallelism.

What do you think? I am looking for ideas.
Any suggestion will be appreciated.

Best,

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata