[jira] [Updated] (JENA-1771) Spilling combined with DISTINCT .. ORDER BY returns rows in the wrong order

2019-10-17 Thread Shawn Smith (Jira)


 [ 
https://issues.apache.org/jira/browse/JENA-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Smith updated JENA-1771:
--
Description: 
It looks like Jena assumes that OpDistinct preserves order, but order is not 
preserved when spilling occurs. This is only a problem when the 
ARQ.spillToDiskThreshold setting has been configured.

Consider the following query:
{code:java}
PREFIX : 
SELECT DISTINCT  *
WHERE
  { ?x  :p  ?v }
ORDER BY ASC(?v)
{code}
Here's the query plan for this query:
{code:java}
(distinct
  (order ((asc ?v))
(bgp (triple ?x  ?v
{code}
Jena executes the ORDER BY ASC(?v) before the DISTINCT, relying on the SPARQL 
requirement:
{quote}The order of Distinct(Ψ) must preserve any ordering given by OrderBy.
{quote}
But, when spilling, QueryIterDistinct (which executes OpDistinct) creates a 
DistinctDataBag with a BindingComparator without any sort conditions. As a 
result, the DISTINCT operation sorts using "compareBindingsSyntactic()" and 
doesn't preserve the ORDER BY ASC(?v) requirement.
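
To illustrate the difference described above, here is a minimal, self-contained sketch. It does not use Jena: plain strings stand in for bindings, a LinkedHashSet stands in for an order-preserving DISTINCT, and a TreeSet with natural string ordering stands in for a DISTINCT that sorts with a comparator unrelated to the ORDER BY. The class and method names are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.TreeSet;

class DistinctOrderSketch {
    /** De-duplicates while keeping first-seen order, so an upstream sort survives. */
    static List<String> distinctKeepingOrder(List<String> rows) {
        return new ArrayList<>(new LinkedHashSet<>(rows));
    }

    /** De-duplicates by sorting with an unrelated comparator (plain string order). */
    static List<String> distinctBySorting(List<String> rows) {
        return new ArrayList<>(new TreeSet<>(rows));
    }

    public static void main(String[] args) {
        // Values already in numeric order, as an ORDER BY ASC(?v) step would leave them.
        List<String> ordered = Arrays.asList("1", "2", "3", "3", "4", "10", "11", "12");

        System.out.println(distinctKeepingOrder(ordered)); // [1, 2, 3, 4, 10, 11, 12]
        System.out.println(distinctBySorting(ordered));    // [1, 10, 11, 12, 2, 3, 4]
    }
}
```

The second result is still duplicate-free, but the numeric ordering established earlier is gone, which is the kind of reordering this issue reports.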

Note that some query plans will reorder the ORDER BY and DISTINCT, making 
things work correctly. For example, adding a LIMIT 5 clause to the query above 
results in a "(top (5 (asc ?v))" operation that doesn't suffer from the bug.

You can reproduce this by injecting the following into QueryTest.java then 
running the ARQTestRefEngine tests:
{code:java}
void runTestSelect(Query query, QueryExecution qe)
{
    qe.getContext().set(ARQ.spillToDiskThreshold, 4);   // add this line
    ...
{code}
For example, "ARQTestRefEngine -> Algebra optimizations -> 
QueryTest.opt-top-05" will fail with:
{code:java}
Query: 
PREFIX  : 

SELECT DISTINCT  *
WHERE
  { ?x  :p  ?v }
ORDER BY ASC(?v)
LIMIT   5

Got: 5
-------------
| x    | v  |
=============
| :x1  | 1  |
| :x2  | 2  |
| :x10 | 10 |
| :x11 | 11 |
| :x12 | 12 |
-------------
Expected: 5
------------
| x    | v |
============
| :x1  | 1 |
| :x2  | 2 |
| :x3  | 3 |
| :x3a | 3 |
| :x4  | 4 |
------------

junit.framework.AssertionFailedError: Results do not match
    at junit.framework.Assert.fail(Assert.java:57)
    at junit.framework.Assert.assertTrue(Assert.java:22)
    at junit.framework.TestCase.assertTrue(TestCase.java:192)
    at org.apache.jena.sparql.junit.QueryTest.runTestSelect(QueryTest.java:284)
    at org.apache.jena.sparql.junit.QueryTest.runTestForReal(QueryTest.java:201)
    at org.apache.jena.sparql.junit.EarlTestCase.runTest(EarlTestCase.java:88)
    at junit.framework.TestCase.runBare(TestCase.java:141)
    at junit.framework.TestResult$1.protect(TestResult.java:122)
    at junit.framework.TestResult.runProtected(TestResult.java:142)
    at junit.framework.TestResult.run(TestResult.java:125)
    at junit.framework.TestCase.run(TestCase.java:129)
{code}


[jira] [Updated] (JENA-1771) Spilling combined with DISTINCT .. ORDER BY returns rows in the wrong order

2019-10-17 Thread Shawn Smith (Jira)


 [ 
https://issues.apache.org/jira/browse/JENA-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Smith updated JENA-1771:
--
Description: 
It looks like Jena assumes that OpDistinct preserves order, but order is not 
preserved when spilling occurs. This is only a problem when the 
ARQ.spillToDiskThreshold setting has been configured.

Consider the following query:
{code:java}
PREFIX : 
SELECT DISTINCT  *
WHERE
  { ?x  :p  ?v }
ORDER BY ASC(?v)
{code}
Here's the query plan for this query:
{code:java}
(distinct
  (order ((asc ?v))
(bgp (triple ?x  ?v
{code}
Jena executes the ORDER BY ASC(?v) before the DISTINCT, relying on the SPARQL 
requirement:
{quote}The order of Distinct(Ψ) must preserve any ordering given by OrderBy.
{quote}
But, when spilling, QueryIterDistinct (which executes OpDistinct) creates a 
DistinctDataBag with a BindingComparator without any sort conditions. As a 
result, the DISTINCT operation sorts using "compareBindingsSyntactic()" and 
doesn't preserve the ORDER BY ASC(?v) requirement.

Note that some query plans will reorder the ORDER BY and DISTINCT, making 
things work correctly. For example, adding a LIMIT clause to the query above 
results in a "(top (5 (asc ?v))" operation that doesn't suffer from the bug.

You can reproduce this by injecting the following into QueryTest.java then 
running the ARQTestRefEngine tests:
{code:java}
void runTestSelect(Query query, QueryExecution qe)
{
    qe.getContext().set(ARQ.spillToDiskThreshold, 4);   // add this line
    ...
{code}
For example, "ARQTestRefEngine -> Algebra optimizations -> 
QueryTest.opt-top-05" will fail with:
{code:java}
Query: 
PREFIX  : 

SELECT DISTINCT  *
WHERE
  { ?x  :p  ?v }
ORDER BY ASC(?v)
LIMIT   5

Got: 5
-------------
| x    | v  |
=============
| :x1  | 1  |
| :x2  | 2  |
| :x10 | 10 |
| :x11 | 11 |
| :x12 | 12 |
-------------
Expected: 5
------------
| x    | v |
============
| :x1  | 1 |
| :x2  | 2 |
| :x3  | 3 |
| :x3a | 3 |
| :x4  | 4 |
------------

junit.framework.AssertionFailedError: Results do not match
    at junit.framework.Assert.fail(Assert.java:57)
    at junit.framework.Assert.assertTrue(Assert.java:22)
    at junit.framework.TestCase.assertTrue(TestCase.java:192)
    at org.apache.jena.sparql.junit.QueryTest.runTestSelect(QueryTest.java:284)
    at org.apache.jena.sparql.junit.QueryTest.runTestForReal(QueryTest.java:201)
    at org.apache.jena.sparql.junit.EarlTestCase.runTest(EarlTestCase.java:88)
    at junit.framework.TestCase.runBare(TestCase.java:141)
    at junit.framework.TestResult$1.protect(TestResult.java:122)
    at junit.framework.TestResult.runProtected(TestResult.java:142)
    at junit.framework.TestResult.run(TestResult.java:125)
    at junit.framework.TestCase.run(TestCase.java:129)
{code}


[jira] [Created] (JENA-1771) Spilling combined with DISTINCT .. ORDER BY returns rows in the wrong order

2019-10-17 Thread Shawn Smith (Jira)
Shawn Smith created JENA-1771:
-

 Summary: Spilling combined with DISTINCT .. ORDER BY returns rows 
in the wrong order
 Key: JENA-1771
 URL: https://issues.apache.org/jira/browse/JENA-1771
 Project: Apache Jena
  Issue Type: Bug
  Components: ARQ
Affects Versions: Jena 3.13.1
Reporter: Shawn Smith


It looks like Jena assumes that OpDistinct preserves order, but order is not 
preserved when spilling occurs.  This is only a problem when the 
ARQ.spillToDiskThreshold setting has been configured.

Consider the following query:
{code:java}
SELECT DISTINCT  *
WHERE
  { ?x  :p  ?v }
ORDER BY ASC(?v)
{code}
Jena executes the ORDER BY ASC(?v) before the DISTINCT, relying on the SPARQL 
requirement:

bq. The order of Distinct(Ψ) must preserve any ordering given by OrderBy.

But, when spilling, QueryIterDistinct (which executes OpDistinct) creates a 
DistinctDataBag with a BindingComparator without any sort conditions.  As a 
result, the DISTINCT operation doesn't preserve the ORDER BY ASC(?v) 
requirement.

You can reproduce this by injecting the following into QueryTest.java then 
running the ARQTestRefEngine tests:
{code:java}
void runTestSelect(Query query, QueryExecution qe)
{
    qe.getContext().set(ARQ.spillToDiskThreshold, 4);   // add this line
    ...
{code}

For example, "ARQTestRefEngine -> Algebra optimizations -> 
QueryTest.opt-top-05" will fail with:
{code}
Query: 
PREFIX  : 

SELECT DISTINCT  *
WHERE
  { ?x  :p  ?v }
ORDER BY ASC(?v)
LIMIT   5

Got: 5
-------------
| x    | v  |
=============
| :x1  | 1  |
| :x2  | 2  |
| :x10 | 10 |
| :x11 | 11 |
| :x12 | 12 |
-------------
Expected: 5
------------
| x    | v |
============
| :x1  | 1 |
| :x2  | 2 |
| :x3  | 3 |
| :x3a | 3 |
| :x4  | 4 |
------------

junit.framework.AssertionFailedError: Results do not match
    at junit.framework.Assert.fail(Assert.java:57)
    at junit.framework.Assert.assertTrue(Assert.java:22)
    at junit.framework.TestCase.assertTrue(TestCase.java:192)
    at org.apache.jena.sparql.junit.QueryTest.runTestSelect(QueryTest.java:284)
    at org.apache.jena.sparql.junit.QueryTest.runTestForReal(QueryTest.java:201)
    at org.apache.jena.sparql.junit.EarlTestCase.runTest(EarlTestCase.java:88)
    at junit.framework.TestCase.runBare(TestCase.java:141)
    at junit.framework.TestResult$1.protect(TestResult.java:122)
    at junit.framework.TestResult.runProtected(TestResult.java:142)
    at junit.framework.TestResult.run(TestResult.java:125)
    at junit.framework.TestCase.run(TestCase.java:129)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (JENA-1770) Spilling bindings with OPTIONAL leads to wrong answers

2019-10-17 Thread Shawn Smith (Jira)
Shawn Smith created JENA-1770:
-

 Summary: Spilling bindings with OPTIONAL leads to wrong answers
 Key: JENA-1770
 URL: https://issues.apache.org/jira/browse/JENA-1770
 Project: Apache Jena
  Issue Type: Bug
  Components: ARQ
Affects Versions: Jena 3.13.1
Reporter: Shawn Smith


A query like the following where some variables are optional may lead to wrong 
answers when spilling occurs: 
{code:java}
PREFIX  foaf: 
SELECT  ?name ?mbox
WHERE
  { ?x  foaf:name  ?name
OPTIONAL
  { ?x  foaf:mbox  ?mbox }
  }
ORDER BY ASC(?mbox)
{code}
This is only a problem when the ARQ.spillToDiskThreshold setting has been 
configured.

The root cause is that BindingOutputStream emits a VARS row based on the first 
binding, but it doesn't emit a new VARS row when a subsequent binding contains 
additional variables.  

The BindingOutputStream.needVars() method will cause a second VARS row to be 
emitted when a new binding is missing variables, but not when it has extras.  
This logic may be inverted from what was intended.

There's a TestDistinctDataBag test case below that reproduces the problem. It 
generates a spill file like this:
{code}
VARS ?1 .
"A" .
"A" .
{code}
when a correct spill file would be:
{code}
VARS ?1 .
"A" .
VARS ?2 ?1 .
"B" "A" .
{code}
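
The header check described above can be sketched without Jena as follows. This is only an illustration under stated assumptions: bindings are modelled as maps from variable name to value, the class VarsRowSketch and its write method are hypothetical, and writing "-" for a variable that is in the header but unbound in the row is a convention chosen for this sketch. The point it shows is the direction of the test: a new VARS row is emitted whenever a binding carries a variable that the current header does not name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class VarsRowSketch {
    /** Serializes bindings, re-emitting VARS when a binding has a variable the header lacks. */
    static List<String> write(List<Map<String, String>> bindings) {
        List<String> out = new ArrayList<>();
        List<String> header = null; // variables named by the most recent VARS row
        for (Map<String, String> binding : bindings) {
            // Extra variables in the binding force a new header; missing ones do not.
            if (header == null || !header.containsAll(binding.keySet())) {
                header = new ArrayList<>(binding.keySet());
                StringBuilder vars = new StringBuilder("VARS");
                for (String v : header) vars.append(" ?").append(v);
                out.add(vars.append(" .").toString());
            }
            StringBuilder row = new StringBuilder();
            for (String v : header) {
                String value = binding.get(v);
                // "-" marks an unbound variable (a convention assumed for this sketch).
                row.append(value == null ? "-" : "\"" + value + "\"").append(' ');
            }
            out.add(row.append('.').toString());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> binding1 = new LinkedHashMap<>();
        binding1.put("1", "A");
        Map<String, String> binding2 = new LinkedHashMap<>();
        binding2.put("1", "A");
        binding2.put("2", "B");

        for (String line : write(Arrays.asList(binding1, binding2, binding1))) {
            System.out.println(line);
        }
    }
}
```

With this check, the second binding triggers a second VARS row, so the value "B" is written instead of being dropped.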

If you run it, you may notice that it fails with a spill threshold of 2 but 
passes with a higher threshold:
{code:java}
@Test public void testOptionalVariables()
{
    // Set up a situation where the second binding in a spill file binds more
    // variables than the first binding
    BindingMap binding1 = BindingFactory.create();
    binding1.add(Var.alloc("1"), NodeFactory.createLiteral("A"));

    BindingMap binding2 = BindingFactory.create();
    binding2.add(Var.alloc("1"), NodeFactory.createLiteral("A"));
    binding2.add(Var.alloc("2"), NodeFactory.createLiteral("B"));

    // binding1 appears twice so that DISTINCT has a duplicate to remove
    List<Binding> undistinct = Arrays.asList(binding1, binding2, binding1);
    List<Binding> control = Iter.toList(Iter.distinct(undistinct.iterator()));
    List<Binding> distinct = new ArrayList<>();

    // A threshold of 2 forces the bag to spill to disk
    DistinctDataBag<Binding> db = new DistinctDataBag<>(
            new ThresholdPolicyCount<Binding>(2),
            SerializationFactoryFinder.bindingSerializationFactory(),
            new BindingComparator(new ArrayList<SortCondition>()));
    try
    {
        db.addAll(undistinct);
        Iterator<Binding> iter = db.iterator();
        while (iter.hasNext())
        {
            distinct.add(iter.next());
        }
        Iter.close(iter);
    }
    finally
    {
        db.close();
    }

    assertEquals(control.size(), distinct.size());
    assertTrue(ResultSetCompare.equalsByTest(control, distinct,
            NodeUtils.sameTerm));
}
{code}





[jira] [Commented] (JENA-1769) Dataset#listNames slow for large TDB2 datasets

2019-10-17 Thread Damien Obrist (Jira)


[ 
https://issues.apache.org/jira/browse/JENA-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953861#comment-16953861
 ] 

Damien Obrist commented on JENA-1769:
-

[~andy] thanks for the information and for looking into this!

I tested using a sample dataset that I had created for a previous issue, 
JENA-1619. The dataset is contained in the attachment in that issue 
({{jena-transaction-exception-master.zip}}), inside the {{sample-data}} folder. 
It consists of 1'000'000 dummy quads of the form
{noformat}
  
 
{noformat}

> Dataset#listNames slow for large TDB2 datasets
> --
>
> Key: JENA-1769
> URL: https://issues.apache.org/jira/browse/JENA-1769
> Project: Apache Jena
>  Issue Type: Bug
>  Components: TDB2
>Affects Versions: Jena 3.13.0
>Reporter: Damien Obrist
>Assignee: Andy Seaborne
>Priority: Major
>  Labels: performance
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> With Jena 3.13.0, the running time of {{Dataset#listNames}} has increased 
> significantly for TDB2 datasets.
> I have compared the running times for a sample TDB2 dataset containing 
> *1'000'000 triples*. I have observed a running time of *~270ms* with Jena 
> 3.12.0 and *~13.5s* with Jena 3.13.0.
> We're using a dataset with many millions of triples and for our use case, the 
> running time has increased from seconds to minutes.





[jira] [Commented] (JENA-1769) Dataset#listNames slow for large TDB2 datasets

2019-10-17 Thread Andy Seaborne (Jira)


[ 
https://issues.apache.org/jira/browse/JENA-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953792#comment-16953792
 ] 

Andy Seaborne commented on JENA-1769:
-

[PR#619](https://github.com/apache/jena/pull/619) fixes this problem.

I have found the range of performance is dependent on the shape of the data - 
[~dobrist], are you able to share the test data you have please? (send it to me 
offline, or send a link)






[GitHub] [jena] afs opened a new pull request #619: JENA-1769: TDB2-specific code for listGraphNodes

2019-10-17 Thread GitBox
afs opened a new pull request #619: JENA-1769: TDB2-specific code for 
listGraphNodes
URL: https://github.com/apache/jena/pull/619
 
 
   

