Hi Boris,

Did you try configuring Accumulo to use locality groups? I think that groups column family values together in the same files, which may help in your case. Sorry if I'm completely off base here; I've been in MongoDB land for so long that I may have lost touch with how the Accumulo version of Rya works.

Thanks,
Puja
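P.S. A rough, untested sketch of what that might look like with the Accumulo 1.x Java API. The table name, group name, and credentials just mirror the test program quoted below; they are assumptions, not something Rya configures for you:

import java.util.Collections;
import java.util.Map;
import java.util.Set;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.hadoop.io.Text;

public class LocalityGroupSetup {

    public static void main(String[] args) throws Exception {
        Instance inst = new ZooKeeperInstance("accumulo", "localhost");
        Connector con = inst.getConnector("rya", new PasswordToken("rya"));

        // Assign the column family of one named graph to its own locality
        // group so its values are stored together within each file.
        Map<String, Set<Text>> groups = Collections.singletonMap(
                "graphGroup",
                Collections.singleton(new Text("http://my/graph")));
        con.tableOperations().setLocalityGroups("sa_ts_spo", groups);

        // Locality groups only apply to files written after the change, so
        // compact the table to rewrite existing data into the new layout.
        con.tableOperations().compact("sa_ts_spo", null, null, true, true);
    }
}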
Sent from my iPhone

> On May 8, 2019, at 12:01 PM, Boris Pelakh <boris.pel...@semanticarts.com> wrote:
>
> Hi,
>
> One of our customers has an application that has highly localized graph
> patterns, so they partition their data into named graphs and primarily
> perform graph-level read and replace operations. They are trying to make a
> transition from RDF4J and Neptune to Rya and have seen some performance
> issues. I stood up a local Rya instance in my Ubuntu VM and took some
> measurements:
>
> Loaded 11K datasets averaging about 120 triples each (1.4 million triples total).
> Post-insertion named graph fetch: 3.9 seconds (RDF4J time was less than a second).
> After compacting all the tables, average fetch of a graph: 1.9 seconds.
>
> Rya stores the graph name in the column family, so a full fetch of a named
> graph is a range-less scan with a specified column family. I removed Rya
> from the equation and wrote a small test program that does an equivalent
> column family scan. Average time: 1.9 seconds, so it appears the Rya
> overhead is negligible. I tried variations using a single range scanner,
> then a batch scanner with a single range specified, then just the column
> family: same results. Furthermore, the query did not speed up with
> repetition, i.e. there was no index-warming effect.
>
> I then modified my graph fetch query from
>
>   construct { ?s ?p ?o } where { graph <http://my/graph> { ?s ?p ?o }}
>
> to
>
>   construct { ?s ?p ?o } where { graph <http://my/graph> { ?s a ?type; ?p ?o }}
>
> which produced the exact same RDF output. This would execute as a range
> scan on the po table (using the rdf:type predicate prefix), followed by a
> guided batch scan on the spo table on the found subjects. Total execution
> time: 0.85 seconds; after repetition, 0.46 seconds as the indices warmed.
>
> So, what I see is that Accumulo is much better at a range scan than a
> column family scan, so much so that even running two scans and a join is
> still faster. It seems that if we wanted to get decent performance on graph
> fetches, we would have to generate a `gspo` table or something similar.
>
> Any ideas of another approach to improve the performance of this type of
> query?
> PS. Here is my test code:
>
> import org.apache.accumulo.core.client.*;
> import org.apache.accumulo.core.client.security.tokens.PasswordToken;
> import org.apache.accumulo.core.data.Key;
> import org.apache.accumulo.core.data.Range;
> import org.apache.accumulo.core.data.Value;
> import org.apache.accumulo.core.security.Authorizations;
> import org.apache.hadoop.io.Text;
>
> import java.util.Map;
>
> public class ScanPerfTest {
>
>     public static void main(String[] args) {
>         String instanceName = "accumulo";
>         String zooServers = "localhost";
>         Instance inst = new ZooKeeperInstance(instanceName, zooServers);
>
>         try {
>             Connector con = inst.getConnector("rya", new PasswordToken("rya"));
>             Scanner s = con.createScanner("sa_ts_spo", new Authorizations());
>             try {
>                 // Explicit full-key-space range; this made no difference
>                 // versus the default unbounded range, so it is left out:
>                 // s.setRange(new Range(
>                 //         new Key(new Text(new byte[]{})),
>                 //         new Key(new Text(new byte[]{(byte) 0xff}))));
>
>                 // Restrict the scan to the column family holding the named graph.
>                 s.fetchColumnFamily(new Text("http://my/graph"));
>
>                 long start = System.currentTimeMillis();
>                 int triples = 0;
>                 for (Map.Entry<Key, Value> e : s) {
>                     // System.out.println(e.getKey().getRow().toString());
>                     triples++;
>                 }
>                 System.out.println("Read " + triples + " triples in "
>                         + (System.currentTimeMillis() - start) + "ms");
>             } finally {
>                 s.close();
>             }
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
> }
>
> Boris Pelakh
> Ontologist, Developer, Software Architect
> boris.pel...@semanticarts.com
> +1-321-243-3804
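To make the reformulated query's access pattern concrete, here is a rough, untested sketch of the two-step scan described above, written against the same Accumulo 1.x client API as the test program. The po table name (sa_ts_po) is an assumption based on the sa_ts_spo name above, and the key layout is simplified (Rya's real row encoding uses its own delimiters and type markers), so this only illustrates the range-scan-then-guided-batch-scan pattern, not Rya's actual serialization:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class TwoPhaseGraphFetch {

    public static void main(String[] args) throws Exception {
        Instance inst = new ZooKeeperInstance("accumulo", "localhost");
        Connector con = inst.getConnector("rya", new PasswordToken("rya"));
        Authorizations auths = new Authorizations();

        // Phase 1: range scan on the po-style index, restricted to the
        // rdf:type predicate prefix and the graph's column family, to
        // collect the subjects present in the graph. The assumption that
        // the subject sits in the column qualifier is a simplification.
        List<Range> subjectRanges = new ArrayList<>();
        Scanner po = con.createScanner("sa_ts_po", auths);
        try {
            po.setRange(Range.prefix("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"));
            po.fetchColumnFamily(new Text("http://my/graph"));
            for (Map.Entry<Key, Value> e : po) {
                String subject = e.getKey().getColumnQualifier().toString();
                subjectRanges.add(Range.prefix(subject));
            }
        } finally {
            po.close();
        }

        if (subjectRanges.isEmpty()) {
            System.out.println("No subjects found for the graph");
            return;
        }

        // Phase 2: guided batch scan on the spo table, one range per subject.
        int triples = 0;
        BatchScanner spo = con.createBatchScanner("sa_ts_spo", auths, 4);
        try {
            spo.setRanges(subjectRanges);
            spo.fetchColumnFamily(new Text("http://my/graph"));
            for (Map.Entry<Key, Value> e : spo) {
                triples++;
            }
        } finally {
            spo.close();
        }
        System.out.println("Fetched " + triples + " triples for the graph");
    }
}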
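On the `gspo` idea mentioned above: such a table would turn a whole-graph fetch into a single contiguous prefix range scan, which is the access pattern that performed well here. A minimal sketch, assuming a hypothetical table keyed as graph\0subject\0predicate\0object (no such table exists in Rya today; the name and layout are made up for illustration):

import java.util.Map;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class GspoPrefixScan {

    public static void main(String[] args) throws Exception {
        Instance inst = new ZooKeeperInstance("accumulo", "localhost");
        Connector con = inst.getConnector("rya", new PasswordToken("rya"));

        // Hypothetical graph-first table: row = graph \0 subject \0 predicate \0 object.
        Scanner s = con.createScanner("sa_ts_gspo", new Authorizations());
        try {
            // A whole-graph fetch becomes a plain prefix range scan.
            s.setRange(Range.prefix("http://my/graph\u0000"));
            int triples = 0;
            for (Map.Entry<Key, Value> e : s) {
                triples++;
            }
            System.out.println("Read " + triples + " triples");
        } finally {
            s.close();
        }
    }
}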