[ 
https://issues.apache.org/jira/browse/JENA-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145939#comment-15145939
 ] 

Andy Seaborne commented on JENA-1140:
-------------------------------------

The problem seems to be the quality of the hash code. The attached data is an 
extract from the report - it is all the triples with an {{xsd:dateTimeStamp}}, 
with the datatype changed to {{xsd:dateTime}}. It is slow on Jena 2.13.0 - 
about 300s to load. {{xsd:dateTimeStamp}} loads fast on 2.13.0 because 
{{xsd:dateTimeStamp}} is not indexed by value.

With the improved hash code, the load time is 2.6s (2.13.0 or 3.0.1).

The original code (below) generates close hash codes for close XSD dateTimes 
and XSD dateTimeStamps (Jena 3 onwards).

The bad hash code leads to hash table collisions and long probe chains because 
slots are already in use. The hash table is a closed hash table (AKA open 
addressing) with linear probing.
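
To illustrate why clustered hash codes hurt a linear-probing table, here is a minimal sketch. It is not Jena's hash table: the class name, the modulus-based slot choice and the sample key sets are assumptions made only for the illustration. It counts how many slots are inspected while inserting keys whose hash codes are either bunched together or spread out.

```java
public class LinearProbeDemo {
    /**
     * Inserts every hash code into an open-addressing table of the given
     * capacity and returns the total number of slots inspected.
     * Assumes hashCodes.length is smaller than capacity.
     */
    public static long totalProbes(int[] hashCodes, int capacity) {
        boolean[] used = new boolean[capacity];
        long probes = 0;
        for (int h : hashCodes) {
            int slot = (h & 0x7fffffff) % capacity;
            probes++;
            while (used[slot]) {              // linear probing: try the next slot
                slot = (slot + 1) % capacity;
                probes++;
            }
            used[slot] = true;
        }
        return probes;
    }

    /** 1000 keys sharing only 50 neighbouring hash codes (illustrative data). */
    public static int[] clustered() {
        int[] codes = new int[1000];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = 100 + (i % 50);
        }
        return codes;
    }

    /** 1000 keys with well separated hash codes (illustrative data). */
    public static int[] spread() {
        int[] codes = new int[1000];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = i * 4;
        }
        return codes;
    }

    public static void main(String[] args) {
        int capacity = 4096;
        System.out.println("clustered probes: " + totalProbes(clustered(), capacity));
        System.out.println("spread probes:    " + totalProbes(spread(), capacity));
    }
}
```

With the spread-out codes every insert finds its slot free on the first probe; with the bunched codes each insert has to walk over the run of occupied slots left by the earlier ones, so the cost grows with the length of that run.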

This was the hash code:
{noformat}
    @Override
    public int hashCode() {
        int hash = 0;
        for ( int aData : data )
        {
            hash = ( hash << 1 ) ^ aData;
        }
        return hash;
    }    
{noformat}

"<< 1" does not work well : datetimes that are close in value, lead to close 
hash codes. Array items that are close by index need to be spread out more.
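
A small sketch of the effect, applying the hash code quoted above to a bare {{int[]}}. The six-slot layout (year, month, day, hour, minute, second) and the class name are assumptions for the illustration and are not the actual AbstractDateTime layout.

```java
public class OldHashDemo {
    /** The hash code quoted above, applied to a plain int[] for illustration. */
    public static int oldHash(int[] data) {
        int hash = 0;
        for (int aData : data) {
            hash = (hash << 1) ^ aData;
        }
        return hash;
    }

    public static void main(String[] args) {
        // Hypothetical slot layout: year, month, day, hour, minute, second.
        int[] a = {2016, 2, 13, 10, 30, 0};
        int[] b = {2016, 2, 13, 10, 30, 1};   // one second after a
        int[] c = {2016, 2, 13, 10, 30, 2};
        int[] d = {2016, 2, 13, 10, 31, 0};   // differs from c in minute and second

        // Values one second apart get adjacent hash codes.
        System.out.println(oldHash(b) - oldHash(a));
        // The minute slot is shifted by only one bit relative to the second
        // slot, so different (minute, second) pairs can cancel out exactly.
        System.out.println(oldHash(c) == oldHash(d));
    }
}
```

The last slot is XORed in unshifted and the one before it is shifted by a single bit, so neighbouring values either land in neighbouring slots or collide outright.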

Changing to "<< 3" worked slightly better than "<< 2".  "<< 4" did not make 
much difference.

Changing the order so that the hashing interleaves date and time slots made 
"<< 2" work as well as "<< 3" (e.g. hash in the order (constants from 
[AbstractDateTime|https://github.com/apache/jena/blob/master/jena-core/src/main/java/org/apache/jena/datatypes/xsd/AbstractDateTime.java])
 CY=0, h=, M, m, D, s ....). The key is that, for example, m and s are 2 slot 
hashes apart, i.e. an effective "<< 4" when "<< 2" is used.
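
A sketch of that reordering experiment, under stated assumptions: the slot indices below are hypothetical stand-ins for the AbstractDateTime constants, and only six slots are modelled. This was an experiment, not the code that was committed.

```java
public class ReorderedHashDemo {
    // Hypothetical slot indices; the real constants live in AbstractDateTime.
    static final int YEAR = 0, MONTH = 1, DAY = 2, HOUR = 3, MINUTE = 4, SECOND = 5;

    // Visit date and time slots alternately, so that minute and second
    // end up two hashing steps apart instead of one.
    static final int[] ORDER = {YEAR, HOUR, MONTH, MINUTE, DAY, SECOND};

    /** Shift-and-XOR hash that walks the slots in the interleaved order. */
    public static int reorderedHash(int[] data, int shift) {
        int hash = 0;
        for (int idx : ORDER) {
            hash = (hash << shift) ^ data[idx];
        }
        return hash;
    }

    public static void main(String[] args) {
        int[] c = {2016, 2, 13, 10, 30, 2};
        int[] d = {2016, 2, 13, 10, 31, 0};   // differs from c in minute and second

        // With shift 2 the minute slot sits 4 bits above the second slot,
        // so a small change in one cannot be cancelled by the other.
        System.out.println(reorderedHash(c, 2) != reorderedHash(d, 2));
    }
}
```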

Using the Eclipse-generated hashCode works just as well and is probably better 
researched. This is the fix that was committed.
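
For reference, the usual shape of an Eclipse-generated hashCode for a class with an {{int[]}} field is sketched below. {{DateTimeKey}} is a hypothetical stand-in class, and the committed Jena code may cover additional fields, so treat this as an illustration of the pattern rather than the actual change.

```java
import java.util.Arrays;

public class GeneratedHashDemo {
    /** Hypothetical value class backed by an int[] of date/time slots. */
    public static final class DateTimeKey {
        private final int[] data;

        public DateTimeKey(int... data) {
            this.data = data.clone();
        }

        // Pattern produced by Eclipse's "Generate hashCode() and equals()".
        @Override
        public int hashCode() {
            final int prime = 31;
            int result = 1;
            result = prime * result + Arrays.hashCode(data);
            return result;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) return true;
            if (!(obj instanceof DateTimeKey)) return false;
            return Arrays.equals(data, ((DateTimeKey) obj).data);
        }
    }

    public static void main(String[] args) {
        DateTimeKey c = new DateTimeKey(2016, 2, 13, 10, 30, 2);
        DateTimeKey d = new DateTimeKey(2016, 2, 13, 10, 31, 0);
        DateTimeKey e = new DateTimeKey(2016, 2, 13, 10, 30, 2);

        System.out.println(c.hashCode() != d.hashCode());  // close values, different codes
        System.out.println(c.equals(e) && c.hashCode() == e.hashCode());
    }
}
```

{{Arrays.hashCode}} multiplies the running value by 31 at each element, so a change in one slot spreads across many bits instead of moving the result by one or two.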

> Jena 3.0.1 model halts reading large rdf file partway through
> -------------------------------------------------------------
>
>                 Key: JENA-1140
>                 URL: https://issues.apache.org/jira/browse/JENA-1140
>             Project: Apache Jena
>          Issue Type: Bug
>          Components: Jena, RDF API
>    Affects Versions: Jena 3.0.1
>         Environment: Eclipse on Windows 7, 8, Mac
>            Reporter: Håvard Wanvik Stenersen
>            Assignee: Andy Seaborne
>              Labels: Model, rdf, read, slow, stop
>             Fix For: Jena 2.7.4
>
>         Attachments: data-500K.nt.gz
>
>
> The progress halts, or becomes slow to the point where progress is 
> unnoticeable, without execution stopping or crashing, when attempting to read 
> a large (~250MB) turtle rdf file into a Jena model, created with 
> org.apache.jena.rdf.model.ModelFactory.createDefaultModel(), using 
> org.apache.jena.rdf.model.Model's read() method (tested with both the methods 
> using String url and InputStream in).
> The progress will continue until the process uses 1-1.5GB RAM, and progress 
> halts, but execution neither stops nor crashes. The code on the bottom 
> displays the behaviour with a progress bar for the file being read.
> This has been the case for my laptop running Windows 10 using
> Eclipse
> Version: Mars.1 Release (4.5.1)
> Build id: 20150924-1200
> My desktop running Windows 7 using 
> Eclipse
> Version: Kepler Service Release 2
> Build id: 20140224-627
> My professor's Mac using Eclipse, however I don't know which versions.
> All three systems were employing Apache Jena 3.0.1, and all of them 
> experienced the same issue.
> I have attempted to manually set the max heap size of the JVM by using the 
> -Xmx3G, however the result did not change.
> Employing Apache Jena Version 2.7.4, and using the same resources in the 
> com.hp.hpl package instead of org.apache fixed the problem on all three 
> systems.
> Here is the java test code:
> {code:title=ReadLotsOfRDF.java|borderStyle=solid}
> import java.io.BufferedInputStream;
> import java.io.FileInputStream;
> import com.hp.hpl.jena.rdf.model.Model;
> import com.hp.hpl.jena.rdf.model.ModelFactory;
> import javax.swing.JFrame;
> import javax.swing.ProgressMonitorInputStream;
> public class ReadLotsOfRDF {
>       public static void main(String[] args) throws java.io.IOException {
>               // parent frame for the progress monitor dialog
>               final JFrame f = new JFrame("Sample");
>               Model m = ModelFactory.createDefaultModel();
>               m.read(new BufferedInputStream(
>                               new ProgressMonitorInputStream(f,"Progress",
>                                               new FileInputStream("LSQ-BM.ttl"))), null, "TTL");
>               System.out.println(m.size());
>       }
> }
> {code}
> The "LSQ-BM.ttl" file can be (and was) retrieved from 
> [here|https://drive.google.com/file/d/0B1tUDhWNTjO-UGhDTWx5U1EyWTg/view].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
