Hi Boris,

Did you try configuring Accumulo to use locality groups? I think that groups column family values together in the same files, which may help in your case. Sorry if I'm completely off base here; I've been in MongoDB land for so long that I may have lost touch with how the Accumulo version of Rya works.

Thanks,
Puja
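P.S. A rough, untested sketch of what that might look like with the Accumulo 1.x Java API. The table name, group name, and credentials just mirror the test program quoted below; they are assumptions, not something Rya configures for you:

import java.util.Collections;
import java.util.Map;
import java.util.Set;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.hadoop.io.Text;

public class LocalityGroupSetup {

    public static void main(String[] args) throws Exception {
        Instance inst = new ZooKeeperInstance("accumulo", "localhost");
        Connector con = inst.getConnector("rya", new PasswordToken("rya"));

        // Assign the column family of one named graph to its own locality
        // group so its values are stored together within each file.
        Map<String, Set<Text>> groups = Collections.singletonMap(
                "graphGroup",
                Collections.singleton(new Text("http://my/graph")));
        con.tableOperations().setLocalityGroups("sa_ts_spo", groups);

        // Locality groups only apply to files written after the change, so
        // compact the table to rewrite existing data into the new layout.
        con.tableOperations().compact("sa_ts_spo", null, null, true, true);
    }
}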
Sent from my iPhone

> On May 8, 2019, at 12:01 PM, Boris Pelakh <boris.pel...@semanticarts.com> wrote:
>
> Hi,
>
> One of our customers has an application that has highly localized graph
> patterns, so they partition their data into named graphs and primarily
> perform graph-level read and replace operations. They are trying to make a
> transition from RDF4J and Neptune to Rya and have seen some performance
> issues. I stood up a local Rya instance in my Ubuntu VM and took some
> measurements:
>
> Loaded 11K datasets averaging about 120 triples each (1.4 million triples total).
> Post-insertion named graph fetch: 3.9 seconds (RDF4J time was less than a second).
> After compacting all the tables, average fetch of a graph: 1.9 seconds.
>
> Rya stores the graph name in the column family, so a full fetch of a named
> graph is a range-less scan with a specified column family. I removed Rya
> from the equation and wrote a small test program that does an equivalent
> column family scan. Average time: 1.9 seconds, so it appears the Rya
> overhead is negligible. I tried variations using a single range scanner,
> then a batch scanner with a single range specified, then just the column
> family: same results. Furthermore, the query did not speed up with
> repetition, i.e. there was no index-warming effect.
>
> I then modified my graph fetch query from
>
>   construct { ?s ?p ?o } where { graph <http://my/graph> { ?s ?p ?o }}
>
> to
>
>   construct { ?s ?p ?o } where { graph <http://my/graph> { ?s a ?type; ?p ?o }}
>
> which produced the exact same RDF output. This would execute as a range
> scan on the po table (using the rdf:type predicate prefix), followed by a
> guided batch scan on the spo table on the found subjects. Total execution
> time: 0.85 seconds; after repetition, 0.46 seconds as the indices warmed.
>
> So, what I see is that Accumulo is much better at a range scan than a
> column family scan, so much so that even running two scans and a join is
> still faster. It seems that if we wanted to get decent performance on graph
> fetches, we would have to generate a `gspo` table or something similar.
>
> Any ideas of another approach to improve the performance of this type of
> query?
> PS. Here is my test code:
>
> import org.apache.accumulo.core.client.*;
> import org.apache.accumulo.core.client.security.tokens.PasswordToken;
> import org.apache.accumulo.core.data.Key;
> import org.apache.accumulo.core.data.Range;
> import org.apache.accumulo.core.data.Value;
> import org.apache.accumulo.core.security.Authorizations;
> import org.apache.hadoop.io.Text;
>
> import java.util.Map;
>
> public class ScanPerfTest {
>
>     public static void main(String[] args) {
>         String instanceName = "accumulo";
>         String zooServers = "localhost";
>         Instance inst = new ZooKeeperInstance(instanceName, zooServers);
>
>         try {
>             Connector con = inst.getConnector("rya", new PasswordToken("rya"));
>             Scanner s = con.createScanner("sa_ts_spo", new Authorizations());
>             try {
>                 // Explicit full-key-space range; this made no difference
>                 // versus the default unbounded range, so it is left out:
>                 // s.setRange(new Range(
>                 //         new Key(new Text(new byte[]{})),
>                 //         new Key(new Text(new byte[]{(byte) 0xff}))));
>
>                 // Restrict the scan to the column family holding the named graph.
>                 s.fetchColumnFamily(new Text("http://my/graph"));
>
>                 long start = System.currentTimeMillis();
>                 int triples = 0;
>                 for (Map.Entry<Key, Value> e : s) {
>                     // System.out.println(e.getKey().getRow().toString());
>                     triples++;
>                 }
>                 System.out.println("Read " + triples + " triples in "
>                         + (System.currentTimeMillis() - start) + "ms");
>             } finally {
>                 s.close();
>             }
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
> }
>
> Boris Pelakh
> Ontologist, Developer, Software Architect
> boris.pel...@semanticarts.com
> +1-321-243-3804
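To make the reformulated query's access pattern concrete, here is a rough, untested sketch of the two-step scan described above, written against the same Accumulo 1.x client API as the test program. The po table name (sa_ts_po) is an assumption based on the sa_ts_spo name above, and the key layout is simplified (Rya's real row encoding uses its own delimiters and type markers), so this only illustrates the range-scan-then-guided-batch-scan pattern, not Rya's actual serialization:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class TwoPhaseGraphFetch {

    public static void main(String[] args) throws Exception {
        Instance inst = new ZooKeeperInstance("accumulo", "localhost");
        Connector con = inst.getConnector("rya", new PasswordToken("rya"));
        Authorizations auths = new Authorizations();

        // Phase 1: range scan on the po-style index, restricted to the
        // rdf:type predicate prefix and the graph's column family, to
        // collect the subjects present in the graph. The assumption that
        // the subject sits in the column qualifier is a simplification.
        List<Range> subjectRanges = new ArrayList<>();
        Scanner po = con.createScanner("sa_ts_po", auths);
        try {
            po.setRange(Range.prefix("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"));
            po.fetchColumnFamily(new Text("http://my/graph"));
            for (Map.Entry<Key, Value> e : po) {
                String subject = e.getKey().getColumnQualifier().toString();
                subjectRanges.add(Range.prefix(subject));
            }
        } finally {
            po.close();
        }

        if (subjectRanges.isEmpty()) {
            System.out.println("No subjects found for the graph");
            return;
        }

        // Phase 2: guided batch scan on the spo table, one range per subject.
        int triples = 0;
        BatchScanner spo = con.createBatchScanner("sa_ts_spo", auths, 4);
        try {
            spo.setRanges(subjectRanges);
            spo.fetchColumnFamily(new Text("http://my/graph"));
            for (Map.Entry<Key, Value> e : spo) {
                triples++;
            }
        } finally {
            spo.close();
        }
        System.out.println("Fetched " + triples + " triples for the graph");
    }
}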
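On the `gspo` idea mentioned above: such a table would turn a whole-graph fetch into a single contiguous prefix range scan, which is the access pattern that performed well here. A minimal sketch, assuming a hypothetical table keyed as graph\0subject\0predicate\0object (no such table exists in Rya today; the name and layout are made up for illustration):

import java.util.Map;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class GspoPrefixScan {

    public static void main(String[] args) throws Exception {
        Instance inst = new ZooKeeperInstance("accumulo", "localhost");
        Connector con = inst.getConnector("rya", new PasswordToken("rya"));

        // Hypothetical graph-first table: row = graph \0 subject \0 predicate \0 object.
        Scanner s = con.createScanner("sa_ts_gspo", new Authorizations());
        try {
            // A whole-graph fetch becomes a plain prefix range scan.
            s.setRange(Range.prefix("http://my/graph\u0000"));
            int triples = 0;
            for (Map.Entry<Key, Value> e : s) {
                triples++;
            }
            System.out.println("Read " + triples + " triples");
        } finally {
            s.close();
        }
    }
}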