Re: [Neo] Strange behavior of LuceneFulltextIndexService

2009-12-21 Thread Mattias Persson
I think I fixed it. It'll be available from our maven repo soon!

2009/12/21 Mattias Persson matt...@neotechnology.com:
 Ok great, I'll look into this error and see if I can locate that bug.

 2009/12/21 Sebastian Stober sto...@ovgu.de:
 Hello Mattias,

 thank you for your quick reply. The new behavior you describe looks like
 what I would expect. (I think fulltext queries should generally be
 treated as case-insensitive.)

 The original junit-test now completes without error. However, there
 still seems to be something odd.

 If I modify the setup code (before I run any test queries) like this,
 the LuceneFulltextIndexService is messed up:

 // using LuceneFulltextIndexService

 andy.setProperty( "name", "Andy Wachowski" );
 andy.setProperty( "title", "Director" );
 // larry.setProperty( "name", "Larry Wachowski" ); // old
 larry.setProperty( "name", "Andy Wachowski" ); // new (deliberately wrong)
 larry.setProperty( "title", "Director" );
 index.index( andy, "name", andy.getProperty( "name" ) );
 index.index( andy, "title", andy.getProperty( "title" ) );
 index.index( larry, "name", larry.getProperty( "name" ) );
 index.index( larry, "title", larry.getProperty( "title" ) );

 // new: fixing the name of larry
 index.removeIndex( larry, "name", larry.getProperty( "name" ) );
 larry.setProperty( "name", "Larry Wachowski" );
 index.index( larry, "name", larry.getProperty( "name" ) );

 // start the test...
 index.getNodes( "name", "wachowski" )
 now returns only larry instead of both nodes.

 Any ideas? It looks like the index entry for andy is removed as well.
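 For reference, the expected contract here (removing larry's entry for a
 value must leave andy's entry for the same value untouched) can be mimicked
 with a toy in-memory index. This is a plain-Java sketch, NOT the Neo4j API;
 class and method names merely mirror the calls used above:

 ```java
 import java.util.*;

 // Toy analogy of a fulltext index: each token maps to a SET of node ids,
 // so removing one node's entry cannot affect another node indexed with
 // the same value.
 public class RemoveSketch {
     private final Map<String, Set<Long>> index = new HashMap<>();

     // index every whitespace-separated token of the value, lower-cased
     public void index(long node, String value) {
         for (String w : value.toLowerCase().split("\\s+"))
             index.computeIfAbsent(w, k -> new TreeSet<>()).add(node);
     }

     // remove ONLY this node's entries for the given value
     public void removeIndex(long node, String value) {
         for (String w : value.toLowerCase().split("\\s+")) {
             Set<Long> nodes = index.get(w);
             if (nodes != null) nodes.remove(node);
         }
     }

     public Set<Long> getNodes(String word) {
         return index.getOrDefault(word.toLowerCase(), Collections.emptySet());
     }
 }
 ```

 Replaying the test setup above against this sketch (node 1 = andy, node 2
 = larry, indexed with the deliberately wrong name, then fixed), a query for
 "wachowski" still returns both nodes, which is what the real index should do.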

 Cheers,
 Sebastian

 Message: 4
 Date: Fri, 18 Dec 2009 10:16:33 +0100
 From: Mattias Persson matt...@neotechnology.com
 Subject: Re: [Neo] Strange behavior of LuceneFulltextIndexService
 To: Neo user discussions user@lists.neo4j.org
 Message-ID:
       acdd47330912180116l7d5fe082g2322f45712906...@mail.gmail.com
 Content-Type: text/plain; charset=UTF-8

 I've made some changes to make LuceneFulltextIndexService and
 LuceneFulltextQueryIndexService behave more naturally. So this is the
 new (and better) deal (copied from the javadoc, which uses your example!):

 LuceneFulltextIndexService:
     /**
      * Since this is a fulltext index it changes the contract of this method
      * slightly. It treats the {@code value} more like a query, in that you
      * can query for individual words in your indexed values.
      *
      * So if you've indexed node (1) with value "Andy Wachowski" and node (2)
      * with "Larry Wachowski" you can expect this behaviour if you query for:
      *
      * o "andy"            -- (1)
      * o "Andy"            -- (1)
      * o "wachowski"       -- (1), (2)
      * o "andy larry"      --
      * o "larry Wachowski" -- (2)
      * o "wachowski Andy"  -- (1)
      */

 LuceneFulltextQueryIndexService:
     /**
      * Here the {@code value} is treated as a lucene query, see
      * http://lucene.apache.org/java/2_9_1/queryparsersyntax.html
      *
      * So if you've indexed node (1) with value "Andy Wachowski" and node (2)
      * with "Larry Wachowski" you can expect this behaviour if you query for:
      *
      * o "andy"            -- (1)
      * o "Andy"            -- (1)
      * o "wachowski"       -- (1), (2)
      * o "andy AND larry"  --
      * o "andy OR larry"   -- (1), (2)
      * o "larry Wachowski" -- (1), (2) // lucene's default operator is OR
      *
      * The default AND/OR behaviour can be changed by overriding
      * {@link #getDefaultQueryOperator(String, Object)}.
      */
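 The tokenized, case-insensitive, implicit-AND matching that the first
 javadoc describes can be mimicked with a small in-memory sketch. Again,
 plain Java only, NOT the Neo4j or Lucene API; the names are illustrative:

 ```java
 import java.util.*;

 // Toy analogy of LuceneFulltextIndexService's new contract: values are
 // tokenized and lower-cased at index time; a query matches a node only
 // if the node contains EVERY query word (AND semantics).
 public class FulltextSketch {
     private final Map<Long, Set<String>> tokens = new HashMap<>();

     public void index(long node, String value) {
         tokens.computeIfAbsent(node, k -> new HashSet<>())
               .addAll(Arrays.asList(value.toLowerCase().split("\\s+")));
     }

     public Set<Long> getNodes(String query) {
         List<String> words = Arrays.asList(query.toLowerCase().split("\\s+"));
         Set<Long> hits = new TreeSet<>();
         for (Map.Entry<Long, Set<String>> e : tokens.entrySet())
             if (e.getValue().containsAll(words)) hits.add(e.getKey());
         return hits;
     }
 }
 ```

 Indexing node 1 with "Andy Wachowski" and node 2 with "Larry Wachowski"
 reproduces the table above: "wachowski" hits both, "andy larry" hits
 neither, and "wachowski Andy" hits only node 1.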


 Does this make more sense?

 2009/12/17 Mattias Persson matt...@neotechnology.com:
 That is indeed a behaviour which needs to be straightened out; I see
 that it doesn't behave as expected all the time. I'll look into this
 as soon as possible.

 Btw, is it a good idea to have LuceneFulltextIndexService and
 LuceneFulltextQueryIndexService be case-insensitive? Should it be
 configurable, or would case-sensitivity be better (so that you'd have
 to run .toLowerCase(), or something similar, on your strings and
 queries to get case-insensitive behaviour)?

 2009/12/17 Sebastian Stober sto...@ovgu.de:
 Hello,

 I ran into some strange behavior of the LuceneFulltextIndexService in
 the application I am building. So I put together a junit test based on
 the example from
 http://wiki.neo4j.org/content/Indexing_with_IndexService#Wachowski_brothers_example

 Here's what I found out using 0.9-SNAPSHOT of index-util (version 0.8
 wasn't any better):

  snip



 ___
 Neo mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user




 --
 Mattias Persson, [matt...@neotechnology.com]
 Neo Technology, www.neotechnology.com






Re: [Neo] neo-rdf-sail + BSBM

2009-12-21 Thread Andy Seaborne
Hi Mattias,

I tried the DenseTripleStore as well but that crashes (see below for the 
stacktrace).

All this is for version org=org.neo4j name=neo-rdf-sail 
rev=0.5-SNAPSHOT (I use Ivy).

The DenseTripleStore wouldn't help me anyway as I need named graph 
support and, reading the wiki, it would seem the VerboseQuadStore is 
the only way to get that.  BSBM does not use named graphs but I would 
need that support for other applications.

Going direct to neo-rdf doesn't get me SPARQL, does it?  Nor an RDF 
parser?  I'm loading data from an N-triples file.

Is the bottleneck the Lucene index?  I've found Lucene slow to index in 
other uses so it's going to be hard to get close to native store speeds.

(What is that B-Tree code in the codebase?)

I'll try the bulk loader when it's RDF-aware.

Thanks,
Andy

Loading BSBM 250K, DenseTripleStore

java.lang.IllegalArgumentException: Start node equals end node
    at org.neo4j.impl.core.RelationshipImpl.init(RelationshipImpl.java:58)
    at org.neo4j.impl.core.NodeManager.createRelationship(NodeManager.java:293)
    at org.neo4j.impl.core.NodeImpl.createRelationshipTo(NodeImpl.java:357)
    at org.neo4j.impl.core.NodeProxy.createRelationshipTo(NodeProxy.java:177)
    at org.neo4j.rdf.store.representation.standard.UriBasedExecutor.ensureConnected(UriBasedExecutor.java:262)
    at org.neo4j.rdf.store.representation.standard.UriBasedExecutor.addToNodeSpace(UriBasedExecutor.java:89)
    at org.neo4j.rdf.store.RdfStoreImpl.addStatement(RdfStoreImpl.java:79)
    at org.neo4j.rdf.store.RdfStoreImpl.addStatements(RdfStoreImpl.java:59)
    at org.neo4j.rdf.sail.NeoSailConnection.internalAddStatement(NeoSailConnection.java:625)
    at org.neo4j.rdf.sail.NeoSailConnection.innerAddStatement(NeoSailConnection.java:442)
    at org.neo4j.rdf.sail.NeoSailConnection.addStatement(NeoSailConnection.java:480)
    at org.openrdf.repository.sail.SailRepositoryConnection.addWithoutCommit(SailRepositoryConnection.java:235)
    at org.openrdf.repository.base.RepositoryConnectionBase.add(RepositoryConnectionBase.java:405)
    at org.openrdf.repository.util.RDFInserter.handleStatement(RDFInserter.java:196)
    at org.openrdf.rio.ntriples.NTriplesParser.parseTriple(NTriplesParser.java:260)
    at org.openrdf.rio.ntriples.NTriplesParser.parse(NTriplesParser.java:170)
    at org.openrdf.rio.ntriples.NTriplesParser.parse(NTriplesParser.java:112)
    at org.openrdf.repository.base.RepositoryConnectionBase.addInputStreamOrReader(RepositoryConnectionBase.java:303)
    at org.openrdf.repository.base.RepositoryConnectionBase.add(RepositoryConnectionBase.java:253)
    at run.LoadData.main(LoadData.java:80)

On 21/12/2009 11:47, Mattias Persson wrote:
 Hi Andy,

 Great to hear you're trying out neo4j and the neo-rdf-sail component.
 We're aware that the bulk-insert performance isn't what it should be,
 and some of the performance caveats are in the neo-rdf-sail component,
 which is a layer around the neo-rdf component. So if you go directly
 against neo-rdf you could gain some performance there.

 The next step however would be to use the BatchInserter, a NeoService
 for bulk-inserts. See http://wiki.neo4j.org/content/Batch_Insert for
 more info. But since that's another interface we'll have to make some
 adjustments for the neo-rdf component to be friends with it.
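 The BatchInserter itself is Neo4j API (see the wiki link above); the win it
 buys is mainly amortizing per-statement commit overhead. A library-free toy
 sketch (hypothetical names, not the BatchInserter API) of that idea:

 ```java
 import java.util.*;

 // Toy illustration of commit amortization: committing once per statement
 // pays the (expensive) commit cost N times; batching pays it roughly
 // N / batchSize times. Bulk loaders push batchSize as high as memory allows.
 public class BatchSketch {
     private int commits = 0;
     private final List<String> store = new ArrayList<>();

     // load n statements, committing once per batchSize inserts;
     // returns how many commits were needed
     public int load(int n, int batchSize) {
         for (int i = 1; i <= n; i++) {
             store.add("statement-" + i);
             if (i % batchSize == 0) commits++;
         }
         if (n % batchSize != 0) commits++; // flush the final partial batch
         return commits;
     }
 }
 ```

 With a batch size of 1 (commit per statement), loading 1000 items costs
 1000 commits; with a batch size of 100 it costs 10, which is the shape of
 the saving a bulk-insert path is after.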

 I'll put some time into this in the days between
 Christmas and New Year's Day and see how we can make neo-rdf(-sail) do
 a performance leap for bulk-inserts.

 Happy Holidays!

 / Mattias

 2009/12/21 Andy Seaborneandy.seabo...@talis.com:
 I'm trying to get neo-rdf-sail to run through the Berlin SPARQL
 Benchmark [1].

 It's taking about 21 mins to load 1e6 triples for data and 115 mins for
 5 million triples.  This is a bit slow - projecting from that, 100M is
 at least 30 hours.

 This is on an EC2 m1.large, ubuntu server, Java heap size 6G, nothing else
 running, using IcedTea - this is my fixed setup for BSBM.

 My initial sense is that the indexing is the significant cost,
 but this is just an educated guess at present. I'm using the
 LuceneIndexService as per the example. The NeoIndexService is marked not
 ready for general usage.

 Any tips for optimizing performance?  I don't need transactionality, for
 example, because it's a one-time bulk load.

 I see also the sparql-engine-neo component, which is based on the
 leaving.name SPARQL engine (and parts of Sesame 1?). Would this be better?

  Andy

 [1] http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V5/






Re: [Neo] LuceneIndexBatchInserter doubt

2009-12-21 Thread Núria Trench
Hi again Mattias,

I'm still trying to parse all the data in order to create the graph. I will
report the results as soon as possible.
Thank you very much for your interest.

Núria.

2009/12/21 Mattias Persson matt...@neotechnology.com

 Hi again,

 any luck with this yet?

 2009/12/11 Núria Trench nuriatre...@gmail.com:
  Thank you very much Mattias. I will test it as soon as possible and
  I'll tell you something.
 
  Núria.
 
  2009/12/11 Mattias Persson matt...@neotechnology.com
 
  I've tried this a couple of times now and first of all I see some
  problems in your code:
 
  1) In the method createRelationsTitleImage you have an inverted check:
  head != -1 where it should be head == -1
 
  2) You index relationships in the createRelationsBetweenTitles method;
  this isn't ok since the index can only manage nodes.
 
  And I recently committed a fix which removed the caching layer in
  the LuceneIndexBatchInserterImpl (and therefore also
  LuceneFulltextIndexBatchInserter). This probably fixes your problems.
  I'm also working on a performance fix which makes consecutive getNodes
  calls faster.
 
  So I think that with these fixes (1) and (2) and the latest index-util
  0.9-SNAPSHOT your sample will run fine. Also you could try without
  calling optimize. See more information at
  http://wiki.neo4j.org/content/Indexing_with_BatchInserter
 
  2009/12/10 Mattias Persson matt...@neotechnology.com:
   To continue this thread in the user list:
  
   Thanks Núria, I've gotten your sample code/files and I'm running it
   now to try to reproduce your problem.
  
   2009/12/9 Núria Trench nuriatre...@gmail.com:
   I have finished uploading the 4 csv files. You'll see an e-mail with
   the other 3 csv files packed in a rar file.
   Thanks,
  
   Núria.
  
   2009/12/9 Núria Trench nuriatre...@gmail.com
  
    Yes, you are right. But there is one csv file that is too big to be
    packed with other files and I am reducing it.
   I am sending the other files now.
  
   2009/12/9 Mattias Persson matt...@neotechnology.com
  
    By the way, you might consider packing those files (with zip or tar.gz
    or something) 'cause they will shrink quite well
  
   2009/12/9 Mattias Persson matt...@neotechnology.com:
 Great, but I only got the images.csv file... I'm starting to test with
 that at least
   
2009/12/9 Núria Trench nuriatre...@gmail.com:
Hi again,
   
 The errors show up after the first 2 csv files have been parsed to
 create all the nodes, just at the moment of calling the method
 getSingleNode to look up the tail and head nodes for creating all the
 edges by reading the other two csv files.
   
 I am sending, with Sprend, the four csv files that will help you to
 trigger the index behaviour.
   
Thank you,
   
Núria.
   
2009/12/9 Mattias Persson matt...@neotechnology.com
   
 Hmm, I've no idea... but do the errors show up early in the process,
 or do you have to insert a LOT of data to trigger it? In that case you
 could send me a part of the files... maybe using http://www.sprend.se,
 WDYT?
   
2009/12/9 Núria Trench nuriatre...@gmail.com:
 Hi Mattias,

 The data isn't confidential but the files are very big (5.5 GB).
 How can I send you this data?

 2009/12/9 Mattias Persson matt...@neotechnology.com

 Yep I got the java code, thanks. Yeah, if the data is confidential or
 sensitive you can just send me the formatting, else consider sending
 the files as well (or a subset if they are big).

 2009/12/9 Núria Trench nuriatre...@gmail.com:
 
 
 
 



