[jira] [Commented] (JENA-624) Develop a new in-memory RDF Dataset implementation

2015-11-17 Thread A. Soroka (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15008825#comment-15008825
 ] 

A. Soroka commented on JENA-624:


[~andy.seaborne], did commit f00e659c52d3b9daac4a0ffacf19be1b90c03d60 
include the new tests for JENA-1064 that you refer to above?

> Develop a new in-memory RDF Dataset implementation
> --
>
> Key: JENA-624
> URL: https://issues.apache.org/jira/browse/JENA-624
> Project: Apache Jena
> Issue Type: Improvement
> Reporter: Andy Seaborne
> Assignee: A. Soroka
> Labels: java, linked_data, rdf
>
> The current (Jan 2014) Jena in-memory dataset uses a general purpose 
> container that works for any storage technology for graphs together with 
> in-memory graphs.  
> This project would develop a new implementation design specifically for RDF 
> datasets (triples and quads) and efficient SPARQL execution, for example, 
> using multi-core parallel operations and/or multi-version concurrent 
> datastructures to maximise true parallel operation.
> This is a system project suitable for someone interested in database 
> implementation, datastructure design and implementation, operating systems or 
> distributed systems.
> Note that TDB can operate in-memory using a simulated disk with 
> copy-in/copy-out semantics for disk-level operations.  It is for faithful 
> testing of TDB infrastructure and is not designed for performance, general 
> in-memory use or use at scale.  While lessons may be learnt from that system, 
> TDB in-memory is not the answer here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: meaning of org.apache.jena.sparql.core.DatasetChanges.listen

2015-11-17 Thread Andy Seaborne

On 17/11/15 10:21, Claude Warren wrote:

> Looks to me like we need a set of contract tests for Dataset.  It would
> make extension/implementation validation simple.


There are various abstract classes already for datasets and we are using 
them for JENA-624 e.g. AbstractDatasetGraphTests.


We have identified an area where testing caught an issue very late (the 
SPARQL scripted tests picked up an index ordering issue). We are pulling 
that back into the more basic DatasetGraph tests.
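
A contract test in the sense discussed here is an abstract test class that each implementation subclasses, supplying only a factory method. The sketch below is illustrative only: `QuadStore`, `AbstractQuadStoreContract` and `ListQuadStore` are hypothetical names, not Jena classes, and plain `AssertionError`s stand in for a test framework.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ContractTestSketch {

    /** Hypothetical minimal store API standing in for a DatasetGraph-like interface. */
    public interface QuadStore {
        void add(String g, String s, String p, String o);
        void delete(String g, String s, String p, String o);
        boolean contains(String g, String s, String p, String o);
        int size();
    }

    /** The contract: checks that every implementation is expected to pass. */
    public abstract static class AbstractQuadStoreContract {
        /** Each implementation's test subclass supplies a fresh, empty store. */
        protected abstract QuadStore create();

        void addThenContains() {
            QuadStore store = create();
            store.add("g", "s", "p", "o");
            check(store.contains("g", "s", "p", "o"), "added quad must be found");
            check(store.size() == 1, "size must be 1 after one add");
        }

        void addTwiceKeepsOneCopy() {
            QuadStore store = create();
            store.add("g", "s", "p", "o");
            store.add("g", "s", "p", "o");
            check(store.size() == 1, "a dataset holds a set of quads");
        }

        void deleteRemoves() {
            QuadStore store = create();
            store.add("g", "s", "p", "o");
            store.delete("g", "s", "p", "o");
            check(!store.contains("g", "s", "p", "o"), "deleted quad must be gone");
            check(store.size() == 0, "size must be 0 after delete");
        }

        /** Runs all checks and returns how many ran; throws AssertionError on failure. */
        public int runAll() {
            addThenContains();
            addTwiceKeepsOneCopy();
            deleteRemoves();
            return 3;
        }

        private static void check(boolean condition, String message) {
            if (!condition) throw new AssertionError(message);
        }
    }

    /** One (hypothetical) implementation under test. */
    public static class ListQuadStore implements QuadStore {
        private final Set<List<String>> quads = new HashSet<>();
        @Override public void add(String g, String s, String p, String o) { quads.add(List.of(g, s, p, o)); }
        @Override public void delete(String g, String s, String p, String o) { quads.remove(List.of(g, s, p, o)); }
        @Override public boolean contains(String g, String s, String p, String o) { return quads.contains(List.of(g, s, p, o)); }
        @Override public int size() { return quads.size(); }
    }

    /** Validating a new implementation only requires overriding create(). */
    public static class ListQuadStoreTest extends AbstractQuadStoreContract {
        @Override protected QuadStore create() { return new ListQuadStore(); }
    }

    public static void main(String[] args) {
        int passed = new ListQuadStoreTest().runAll();
        System.out.println(passed + " contract checks passed");
    }
}
```

The point of the pattern is that a late-detected problem (such as the index ordering issue mentioned above) can be added once to the abstract class and is then checked against every implementation.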


DatasetChanges is not part of the Dataset(Graph) API.  It's the 
interface for handling signalled changes so that behaviour can be attached, 
e.g. log changes, keep a text index up to date (Chris's UC), generate RDF patch 
files, ...
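
As a rough illustration of attaching behaviour this way, the sketch below records changes per start/finish group and treats reset() as the end of a larger unit such as a commit, which is the reading suggested in the quoted thread rather than a documented contract. The `Changes` interface is a local stand-in that only mirrors the shape of DatasetChanges (strings instead of Jena nodes); `RecordingChanges` and its accessors are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class ChangeListenerSketch {

    public enum Action { ADD, DELETE }

    /** Local stand-in mirroring the shape of DatasetChanges; not the Jena interface itself. */
    public interface Changes {
        void start();
        void change(Action action, String g, String s, String p, String o);
        void finish();
        void reset();
    }

    /** Records each start/finish group; reset() ends the larger unit and drops held state. */
    public static class RecordingChanges implements Changes {
        private final List<String> current = new ArrayList<>();
        private final List<List<String>> groups = new ArrayList<>();
        private int resets = 0;

        @Override public void start() {
            current.clear();                       // begin a new group of related changes
        }

        @Override public void change(Action action, String g, String s, String p, String o) {
            current.add(action + " " + g + " " + s + " " + p + " " + o);
        }

        @Override public void finish() {
            groups.add(new ArrayList<>(current));  // keep the completed group
            current.clear();
        }

        @Override public void reset() {
            groups.clear();                        // assumed meaning: the larger unit is over
            current.clear();
            resets++;
        }

        public int groupCount() { return groups.size(); }
        public int resetCount() { return resets; }
    }

    public static void main(String[] args) {
        RecordingChanges log = new RecordingChanges();

        // start-finish-start-finish-reset
        log.start();
        log.change(Action.ADD, "g", "s", "p", "o1");
        log.finish();

        log.start();
        log.change(Action.DELETE, "g", "s", "p", "o1");
        log.change(Action.ADD, "g", "s", "p", "o2");
        log.finish();

        System.out.println("groups before reset: " + log.groupCount());
        log.reset();
        System.out.println("groups after reset: " + log.groupCount());
    }
}
```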


Andy



> Claude
>
> On Mon, Nov 16, 2015 at 5:12 PM, Andy Seaborne wrote:
>
>> On 16/11/15 14:36, Chris Dollin wrote:
>>
>>> Dear All
>>
>> Hi Chris,
>>
>>> Some time recently org.apache.jena.sparql.core.DatasetChanges
>>
>> git log
>> jena-arq/src/main/java/org/apache/jena/sparql/core/DatasetChanges.java
>>
>>> grew a listen() method with the comment "Release any resources".
>>
>> s/listen\(\)/reset\(\)/
>>
>>> What sort of any are the released resources? Presumably finish()
>>> does resource cleanup, so what is reset doing that finish doesn't
>>> do? My best guess is that it is for abandoning state that is
>>> handling an incomplete series of triples without abandoning
>>> the entire state of the DatasetChanges implementation.
>>
>> I can't find any use of reset().
>>
>> But a sequence of changes might be several start-finish to group things
>> but part of a larger process that is across the same internal resources, in
>> which case a final "reset()" indicates that's all over, e.g. a commit. It
>> decouples the app needs for grouping (e.g. a small set of related changes)
>> from a larger grouping like a transaction.
>>
>> start-finish-start-finish...-start-finish-reset
>>
>> Advance notice:
>>
>> It looks like DatasetChanges or an interface extending DatasetChanges or a
>> better-parallel interface needs to reflect transaction boundaries properly.
>>
>> This has now come up a couple of times in different places so it is
>> indicative that DatasetChanges isn't the right design.
>>
>>> [I'm asking because ppd-index implements TextDocProducerBatch
>>
>> not part of Jena
>>
>>> which implements DatasetChanges and I want to know what
>>> expectations callers of TextDocProducerBatch.reset() may have.]
>>
>> Any experience to report especially regarding transactions and
>> DatasetChanges changes or replacement?
>>
>>> Chris
>>
>> Andy










[jira] [Created] (JENA-1071) Warnings when using XML 1.0 5th edition codepoints in rdf:ID.

2015-11-17 Thread Andy Seaborne (JIRA)
Andy Seaborne created JENA-1071:
---

 Summary: Warnings when using XML 1.0 5th edition codepoints in rdf:ID.
 Key: JENA-1071
 URL: https://issues.apache.org/jira/browse/JENA-1071
 Project: Apache Jena
 Issue Type: Bug
 Reporter: Andy Seaborne
 Priority: Minor


Report on users@ https://pony-poc.apache.org/thread.html/Znx1topkrk8ykbr 

Workaround: 
* Use {{rdf:about}}
* Ignore or disable the warning

The character causing the warning is [Character 
U+0370|http://www.fileformat.info/info/unicode/char/0370/index.htm] (Greek 
Capital Letter Heta). It was added to Unicode in version 5.1. 
https://en.wikipedia.org/wiki/Heta

Greek letters e.g.  ΑΒ..., (Capital Letters alpha and beta) U+0391 and αβ...  
(lower case).

Jena code {{ParserSupport.checkXMLName}} calls Xerces {{XMLChar.isValidNCName}}.

Xerces supports "XML 1.0 Fourth Edition" which does not permit U+0370.

Java8 also only supports XML 1.0 fourth edition.

Both Xerces 2.11.0 (Jena since 2.10.1) and Java8 support XML 1.1 with 
{{XML11Char.isXML11ValidNCName}} which does include this character.

Jena could "upgrade" to using {{XML11Char}} for the additional checks it 
performs.  This would not amount to XML 1.1 support.
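
To make the difference concrete, here is a minimal, self-contained sketch of an NCName check that uses the NameStartChar and NameChar ranges of XML 1.0 Fifth Edition (the same ranges as XML 1.1), with the colon excluded. It is not the Xerces code; the class and method names are invented for illustration.

```java
public class NCNameCheckSketch {

    // NameStartChar ranges from XML 1.0 Fifth Edition / XML 1.1, without ':' (NCName).
    private static final int[][] START_RANGES = {
        {'A', 'Z'}, {'_', '_'}, {'a', 'z'},
        {0xC0, 0xD6}, {0xD8, 0xF6}, {0xF8, 0x2FF},
        {0x370, 0x37D}, {0x37F, 0x1FFF}, {0x200C, 0x200D},
        {0x2070, 0x218F}, {0x2C00, 0x2FEF}, {0x3001, 0xD7FF},
        {0xF900, 0xFDCF}, {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF}
    };

    // Additional characters allowed after the first position (NameChar).
    private static final int[][] EXTRA_RANGES = {
        {'-', '-'}, {'.', '.'}, {'0', '9'}, {0xB7, 0xB7},
        {0x300, 0x36F}, {0x203F, 0x2040}
    };

    private static boolean inRanges(int codePoint, int[][] ranges) {
        for (int[] range : ranges) {
            if (codePoint >= range[0] && codePoint <= range[1]) return true;
        }
        return false;
    }

    public static boolean isNCNameStart(int codePoint) {
        return inRanges(codePoint, START_RANGES);
    }

    public static boolean isNCNameChar(int codePoint) {
        return isNCNameStart(codePoint) || inRanges(codePoint, EXTRA_RANGES);
    }

    /** True if the whole string is an NCName under the Fifth Edition ranges. */
    public static boolean isNCName(String name) {
        if (name.isEmpty()) return false;
        int first = name.codePointAt(0);
        if (!isNCNameStart(first)) return false;
        for (int i = Character.charCount(first); i < name.length(); ) {
            int codePoint = name.codePointAt(i);
            if (!isNCNameChar(codePoint)) return false;
            i += Character.charCount(codePoint);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("U+0370 start char: " + isNCNameStart(0x0370));
        System.out.println("U+037E start char: " + isNCNameStart(0x037E));
        System.out.println("\"\u0370id\" is NCName: " + isNCName("\u0370id"));
    }
}
```

U+0370 falls in the range 0x370-0x37D, so an rdf:ID starting with it passes this check, whereas U+037E (Greek Question Mark) sits in the gap between the two Greek ranges and does not.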

Uses of {{XMLChar}}:

* {{BaseXMLWriter.java}}
* {{ParserSupport.java}}
* {{Unparser.java}}
* {{PrefixMappingImpl}} -- URI splitting?
* {{Util}} - controls URI splitting
* {{schemagen}}






[GitHub] jena pull request: JENA-1062: configurable Lucene analyzer for jen...

2015-11-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/jena/pull/97


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: JENA-1062: configurable Lucene analyzer for jen...

2015-11-17 Thread osma
Github user osma commented on the pull request:

https://github.com/apache/jena/pull/97#issuecomment-157324689
  
Rebased on current master and squashed my commits into one, preparing to 
merge to Apache git




[jira] [Commented] (JENA-1062) add ConfigurableAnalyzer to jena-text

2015-11-17 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15008435#comment-15008435
 ] 

ASF subversion and git services commented on JENA-1062:
---

Commit 1714748 from o...@apache.org in branch 'site/trunk'
[ https://svn.apache.org/r1714748 ]

update jena-text documentation for JENA-1062 (ConfigurableAnalyzer)

> add ConfigurableAnalyzer to jena-text
> -
>
> Key: JENA-1062
> URL: https://issues.apache.org/jira/browse/JENA-1062
> Project: Apache Jena
> Issue Type: New Feature
> Components: Text
> Reporter: Osma Suominen
> Assignee: Osma Suominen
>
> This is an alternative to JENA-1058 (which implemented a very specific Lucene 
> Analyzer for jena-text). The idea here, based on a comment by Claude Warren 
> on JENA-1058, is to provide a ConfigurableAnalyzer that can be configured 
> with a Tokenizer and (optionally) one or more TokenFilters, like this:
> text:analyzer [
>   a text:ConfigurableAnalyzer ;
>   text:tokenizer text:KeywordTokenizer ;
>   text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
> ]
> I have some code ready to implement this and will open a PR shortly.
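
To illustrate the intended effect of the example configuration without depending on Lucene, the following sketch approximates the chain in plain Java. It is not the jena-text or Lucene implementation: Unicode normalisation via `java.text.Normalizer` is only a rough stand-in for `ASCIIFoldingFilter` (which covers more characters), and the class and method names are invented for illustration.

```java
import java.text.Normalizer;
import java.util.Locale;

public class KeywordFoldLowerSketch {

    /** KeywordTokenizer behaviour: the whole input is emitted as a single token. */
    static String keywordToken(String input) {
        return input;
    }

    /** Rough stand-in for ASCIIFoldingFilter: decompose, then drop combining marks. */
    static String foldDiacritics(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    /** LowerCaseFilter behaviour, using a locale-independent lower-casing. */
    static String lowerCase(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    /** Tokenizer first, then the filters in the order they are listed. */
    public static String analyze(String input) {
        return lowerCase(foldDiacritics(keywordToken(input)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("H\u00e4meenlinna Caf\u00e9"));
    }
}
```

With a chain like this, a whole literal is indexed as one term and matched without regard to case or accents.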





[jira] [Commented] (JENA-1062) add ConfigurableAnalyzer to jena-text

2015-11-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15008410#comment-15008410
 ] 

ASF GitHub Bot commented on JENA-1062:
--

Github user asfgit closed the pull request at:

https://github.com/apache/jena/pull/97







Re: meaning of org.apache.jena.sparql.core.DatasetChanges.listen

2015-11-17 Thread Claude Warren
Looks to me like we need a set of contract tests for Dataset.  It would
make extension/implementation validation simple.

Claude

On Mon, Nov 16, 2015 at 5:12 PM, Andy Seaborne  wrote:

> On 16/11/15 14:36, Chris Dollin wrote:
>
>> Dear All
>>
>
> Hi Chris,
>
>
>> Some time recently org.apache.jena.sparql.core.DatasetChanges
>>
>
> git log
> jena-arq/src/main/java/org/apache/jena/sparql/core/DatasetChanges.java
>
> grew a listen() method with the comment "Release any resources".
>>
>
> s/listen\(\)/reset\(\)/
>
>
>> What sort of any are the released resources? Presumably finish()
>> does resource cleanup, so what is reset doing that finish doesn't
>> do? My best guess is that it is for abandoning state that is
>> handling an incomplete series of triples without abandoning
>> the entire state of the DatasetChanges implementation.
>>
>
> I can't find any use of reset().
>
> But a sequence of changes might be several start-finish to group things
> but part of a larger process that is across the same internal resources, in
> which case a final "reset()" indicates that's all over, e.g. a commit. It
> decouples the app needs for grouping (e.g. a small set of related changes)
> from a larger grouping like a transaction.
>
> start-finish-start-finish...-start-finish-reset
>
> Advance notice:
>
> It looks like DatasetChanges or an interface extending DatasetChanges or a
> better-parallel interface needs to reflect transaction boundaries properly.
>
> This has now come up a couple of times in different places so it is
> indicative that DatasetChanges isn't the right design.
>
> [I'm asking because ppd-index implements TextDocProducerBatch
>>
>
> not part of Jena
>
> which implements DatasetChanges and I want to know what
>> expectations callers of TextDocProducerBatch.reset() may have.]
>>
>
> Any experience to report especially regarding transactions and
> DatasetChanges changes or replacement?
>
>
>> Chris
>>
>>
> Andy
>
>
>


-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


[jira] [Commented] (JENA-1062) add ConfigurableAnalyzer to jena-text

2015-11-17 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15008409#comment-15008409
 ] 

ASF subversion and git services commented on JENA-1062:
---

Commit 9c35b680626f164578a8b1c2a3ea9c5cd85e0868 in jena's branch 
refs/heads/master from [~osma]
[ https://git-wip-us.apache.org/repos/asf?p=jena.git;h=9c35b68 ]

JENA-1062: configurable Lucene analyzer for jena-text







[jira] [Closed] (JENA-1062) add ConfigurableAnalyzer to jena-text

2015-11-17 Thread Osma Suominen (JIRA)

 [ 
https://issues.apache.org/jira/browse/JENA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Osma Suominen closed JENA-1062.
---
   Resolution: Fixed
Fix Version/s: Jena 3.0.1

Merged the PR and updated documentation in SVN. All done.






[jira] [Commented] (JENA-1062) add ConfigurableAnalyzer to jena-text

2015-11-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15008403#comment-15008403
 ] 

ASF GitHub Bot commented on JENA-1062:
--

Github user osma commented on the pull request:

https://github.com/apache/jena/pull/97#issuecomment-157324689
  
Rebased on current master and squashed my commits into one, preparing to 
merge to Apache git







[jira] [Resolved] (JENA-1070) SPARQL: Cast from xsd:double to xsd:decimal fails

2015-11-17 Thread Andy Seaborne (JIRA)

 [ 
https://issues.apache.org/jira/browse/JENA-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Seaborne resolved JENA-1070.
-
   Resolution: Duplicate
Fix Version/s: Jena 3.0.1

> SPARQL: Cast from xsd:double to xsd:decimal fails
> -
>
> Key: JENA-1070
> URL: https://issues.apache.org/jira/browse/JENA-1070
> Project: Apache Jena
> Issue Type: Bug
> Reporter: Richard Cyganiak
> Assignee: Andy Seaborne
> Priority: Minor
> Fix For: Jena 3.0.1
>
>
> Casting from xsd:double to xsd:decimal apparently doesn't work if the 
> xsd:double is in exponent notation. Example:
> {noformat}
> PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
> SELECT (xsd:decimal("1e0"^^xsd:double) AS ?x) WHERE {}
> {noformat}
> I tried running this on sparql.org. I expect this to return 
> {{"1"^^xsd:decimal}} or {{"1.0"^^xsd:decimal}}. It returns nothing.
> It returns the expected xsd:decimal values when changing the lexical form 
> from "1e0" to "1.0" or "1", although these all represent the same legal 
> double value.
> The same problem occurs when casting to xsd:integer, or when the input is 
> xsd:float.
> I think the correct behaviour of the xsd:decimal and xsd:integer casting 
> functions is specified in 
> http://www.w3.org/TR/xpath-functions/#casting-to-numerics, and I read it as 
> stating that xsd:double inputs should work.
> Since pretty much any maths on xsd:double (including ROUND and FLOOR) returns 
> xsd:double in e notation, this issue makes it very hard to produce “pretty” 
> number output if the input contains xsd:doubles. It looks like one has to 
> resort to truncating the e0 part with string operations.
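
For illustration, the expected behaviour can be sketched in plain Java: parse the lexical form as a double (which accepts exponent notation) and print the value without an exponent. This is not ARQ's cast implementation; the class and method names are invented, `BigDecimal.valueOf` uses the shortest decimal string for the double rather than its exact binary value, and Java's `Double.parseDouble` does not accept exactly the same lexical forms as xsd:double (for example it rejects "INF").

```java
import java.math.BigDecimal;

public class DoubleToDecimalSketch {

    /**
     * Converts a double lexical form such as "1e0" into a decimal lexical
     * form without an exponent. NaN and infinities have no decimal value.
     */
    public static String castDoubleToDecimal(String doubleLexical) {
        double value = Double.parseDouble(doubleLexical.trim());
        if (Double.isNaN(value) || Double.isInfinite(value)) {
            throw new IllegalArgumentException("No xsd:decimal value for " + doubleLexical);
        }
        // toPlainString() never uses exponent notation, unlike toString().
        return BigDecimal.valueOf(value).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(castDoubleToDecimal("1e0"));
        System.out.println(castDoubleToDecimal("1.0"));
        System.out.println(castDoubleToDecimal("1.5e10"));
    }
}
```

In this sketch "1e0", "1.0" and "1" all map to the same decimal lexical form, which is the property the report asks for.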


