Re: [Neo] Strange behavior of LuceneFulltextIndexService
I think I fixed it. It'll be available from our maven repo soon!

2009/12/21 Mattias Persson matt...@neotechnology.com:
Ok great, I'll look into this error and see if I can locate that bug.

2009/12/21 Sebastian Stober sto...@ovgu.de:
Hello Mattias,

thank you for your quick reply. The new behavior you describe looks like what I would expect. (I think fulltext queries should generally be treated as case-insensitive.) The original JUnit test now completes without error.

However, there still seems to be something odd. If I modify the setup code (before I run any test queries) like this, the LuceneFulltextIndexService is messed up:

    // using LuceneFulltextIndexService
    andy.setProperty( "name", "Andy Wachowski" );
    andy.setProperty( "title", "Director" );
    // larry.setProperty( "name", "Larry Wachowski" ); // old
    larry.setProperty( "name", "Andy Wachowski" ); // new (deliberately wrong)
    larry.setProperty( "title", "Director" );

    index.index( andy, "name", andy.getProperty( "name" ) );
    index.index( andy, "title", andy.getProperty( "title" ) );
    index.index( larry, "name", larry.getProperty( "name" ) );
    index.index( larry, "title", larry.getProperty( "title" ) );

    // new: fixing the name of larry
    index.removeIndex( larry, "name", larry.getProperty( "name" ) );
    larry.setProperty( "name", "Larry Wachowski" );
    index.index( larry, "name", larry.getProperty( "name" ) );

    // start the test...

index.getNodes( "name", "wachowski" ) now returns only larry instead of both nodes. Any ideas? It looks like the index entry for andy is removed as well.

Cheers,
Sebastian

Message: 4
Date: Fri, 18 Dec 2009 10:16:33 +0100
From: Mattias Persson matt...@neotechnology.com
Subject: Re: [Neo] Strange behavior of LuceneFulltextIndexService
To: Neo user discussions user@lists.neo4j.org
Message-ID: acdd47330912180116l7d5fe082g2322f45712906...@mail.gmail.com
Content-Type: text/plain; charset=UTF-8

I've made some changes to make LuceneFulltextIndexService and LuceneFulltextQueryIndexService behave more naturally.
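The behaviour the test expects from the remove-then-reindex sequence can be illustrated with a toy in-memory index (a Python sketch, not the real Lucene-backed implementation; all names here are made up for illustration). The point is that entries must be tracked per node, so removing larry's wrong entry leaves andy's identical value untouched:

```python
from collections import defaultdict

class ToyFulltextIndex:
    """Minimal in-memory stand-in for a fulltext index service.

    Each (key, word) pair maps to the SET of nodes indexed with it, so
    removing one node's entry must not disturb other nodes that happen
    to share the same indexed value (the suspected bug above).
    """

    def __init__(self):
        self._entries = defaultdict(set)  # (key, word) -> {node, ...}

    def index(self, node, key, value):
        for word in value.lower().split():
            self._entries[(key, word)].add(node)

    def remove_index(self, node, key, value):
        # Remove only THIS node's entries for the given value.
        for word in value.lower().split():
            self._entries[(key, word)].discard(node)

    def get_nodes(self, key, query):
        # Every word of the query must match (fulltext contract).
        sets = [self._entries[(key, w)] for w in query.lower().split()]
        return set.intersection(*sets) if sets else set()

index = ToyFulltextIndex()
index.index("andy", "name", "Andy Wachowski")
index.index("larry", "name", "Andy Wachowski")   # deliberately wrong

# fix larry's name: remove the wrong entry, then re-index
index.remove_index("larry", "name", "Andy Wachowski")
index.index("larry", "name", "Larry Wachowski")

# andy must still be found -- both nodes match "wachowski"
print(sorted(index.get_nodes("name", "wachowski")))  # ['andy', 'larry']
```

With the bug Sebastian describes, the real index instead dropped andy's entry along with larry's, as if removal were keyed on the value alone.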
So this is the new (and better) deal (copied from the javadoc, based on your example!):

LuceneFulltextIndexService:

    /**
     * Since this is a fulltext index it changes the contract of this method
     * slightly. It treats the {@code value} more like a query, in that you can
     * query for individual words in your indexed values.
     *
     * So if you've indexed node (1) with value "Andy Wachowski" and node (2)
     * with "Larry Wachowski" you can expect this behaviour if you query for:
     *
     *   o "andy"            -- (1)
     *   o "Andy"            -- (1)
     *   o "wachowski"       -- (1), (2)
     *   o "andy larry"      -- (none)
     *   o "larry Wachowski" -- (2)
     *   o "wachowski Andy"  -- (1)
     */

LuceneFulltextQueryIndexService:

    /**
     * Here the {@code value} is treated as a lucene query, see
     * http://lucene.apache.org/java/2_9_1/queryparsersyntax.html
     *
     * So if you've indexed node (1) with value "Andy Wachowski" and node (2)
     * with "Larry Wachowski" you can expect this behaviour if you query for:
     *
     *   o "andy"            -- (1)
     *   o "Andy"            -- (1)
     *   o "wachowski"       -- (1), (2)
     *   o "andy AND larry"  -- (none)
     *   o "andy OR larry"   -- (1), (2)
     *   o "larry Wachowski" -- (1), (2)  // lucene's default operator is OR
     *
     * The default AND/OR behaviour can be changed by overriding
     * {@link #getDefaultQueryOperator(String, Object)}.
     */

Does this make more sense?

2009/12/17 Mattias Persson matt...@neotechnology.com:
That is indeed a behaviour which needs to be straightened out; I see that it doesn't behave as expected all the time. I'll look into this as soon as possible. By the way, is it a good idea to have LuceneFulltextIndexService and LuceneFulltextQueryIndexService be case-insensitive? Should it be configurable, or would case-sensitivity be nicer (so that you'd have to run .toLowerCase(), or something, on your strings and queries to get such behaviour)?

2009/12/17 Sebastian Stober sto...@ovgu.de:
Hello,

I ran into some strange behavior of the LuceneFulltextIndexService in the application I am building.
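The difference between the two contracts above can be simulated with two tiny matching functions (a Python sketch of the documented semantics, not the real Lucene analyzer or query parser): the fulltext service requires every query word to occur, while the query service defaults to OR between terms unless an explicit AND appears.

```python
def fulltext_match(value, query):
    """LuceneFulltextIndexService-style matching per the javadoc:
    case-insensitive, and EVERY word in the query must occur in the
    indexed value (toy sketch)."""
    words = set(value.lower().split())
    return all(q in words for q in query.lower().split())

def query_match(value, query):
    """LuceneFulltextQueryIndexService-style matching: a tiny subset of
    Lucene query syntax where the default operator between plain terms
    is OR, and an explicit AND conjoins its neighbours (toy sketch)."""
    words = set(value.lower().split())
    result, op = None, "OR"
    for tok in query.split():
        if tok in ("AND", "OR"):
            op = tok
            continue
        hit = tok.lower() in words
        if result is None:
            result = hit
        elif op == "AND":
            result = result and hit
        else:
            result = result or hit
        op = "OR"  # default operator resets after each term
    return bool(result)

print(fulltext_match("Andy Wachowski", "wachowski Andy"))  # True
print(fulltext_match("Andy Wachowski", "andy larry"))      # False
print(query_match("Larry Wachowski", "andy OR larry"))     # True
```

This reproduces every row of the javadoc tables: "andy larry" matches nothing under the fulltext contract, while "larry Wachowski" matches both nodes under the query contract because of the implicit OR.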
So I put together a JUnit test based on the example from
http://wiki.neo4j.org/content/Indexing_with_IndexService#Wachowski_brothers_example

Here's what I found out using 0.9-SNAPSHOT of index-util (version 0.8 wasn't any better):

<snip>

--
Mattias Persson, [matt...@neotechnology.com]
Neo Technology, www.neotechnology.com

___
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo] neo-rdf-sail + BSBM
Hi Mattias,

I tried the DenseTripleStore as well, but that crashes (see below for the stack trace). All this is for version org=org.neo4j name=neo-rdf-sail rev=0.5-SNAPSHOT (I use Ivy).

The DenseTripleStore wouldn't help me anyway, as I need named graph support, and reading the wiki it would seem the VerboseQuadStore is the only way to get that. BSBM does not use named graphs, but I would need that support for other applications.

Going directly to neo-rdf doesn't get me SPARQL, does it? Nor an RDF parser? I'm loading data from an N-Triples file.

Is the bottleneck the Lucene index? I've found Lucene slow to index in other uses, so it's going to be hard to get close to native store speeds. (What is that B-Tree code in the codebase?)

I'll try the bulk loader when it's RDF-aware.

Thanks,
Andy

Loading BSBM 250K, DenseTripleStore:

java.lang.IllegalArgumentException: Start node equals end node
        at org.neo4j.impl.core.RelationshipImpl.init(RelationshipImpl.java:58)
        at org.neo4j.impl.core.NodeManager.createRelationship(NodeManager.java:293)
        at org.neo4j.impl.core.NodeImpl.createRelationshipTo(NodeImpl.java:357)
        at org.neo4j.impl.core.NodeProxy.createRelationshipTo(NodeProxy.java:177)
        at org.neo4j.rdf.store.representation.standard.UriBasedExecutor.ensureConnected(UriBasedExecutor.java:262)
        at org.neo4j.rdf.store.representation.standard.UriBasedExecutor.addToNodeSpace(UriBasedExecutor.java:89)
        at org.neo4j.rdf.store.RdfStoreImpl.addStatement(RdfStoreImpl.java:79)
        at org.neo4j.rdf.store.RdfStoreImpl.addStatements(RdfStoreImpl.java:59)
        at org.neo4j.rdf.sail.NeoSailConnection.internalAddStatement(NeoSailConnection.java:625)
        at org.neo4j.rdf.sail.NeoSailConnection.innerAddStatement(NeoSailConnection.java:442)
        at org.neo4j.rdf.sail.NeoSailConnection.addStatement(NeoSailConnection.java:480)
        at org.openrdf.repository.sail.SailRepositoryConnection.addWithoutCommit(SailRepositoryConnection.java:235)
        at org.openrdf.repository.base.RepositoryConnectionBase.add(RepositoryConnectionBase.java:405)
        at org.openrdf.repository.util.RDFInserter.handleStatement(RDFInserter.java:196)
        at org.openrdf.rio.ntriples.NTriplesParser.parseTriple(NTriplesParser.java:260)
        at org.openrdf.rio.ntriples.NTriplesParser.parse(NTriplesParser.java:170)
        at org.openrdf.rio.ntriples.NTriplesParser.parse(NTriplesParser.java:112)
        at org.openrdf.repository.base.RepositoryConnectionBase.addInputStreamOrReader(RepositoryConnectionBase.java:303)
        at org.openrdf.repository.base.RepositoryConnectionBase.add(RepositoryConnectionBase.java:253)
        at run.LoadData.main(LoadData.java:80)

On 21/12/2009 11:47, Mattias Persson wrote:
Hi Andy,

Great to hear you're trying out neo4j and the neo-rdf-sail component. We're aware that the bulk-insert performance isn't what it should be, and some of the performance caveats are in the neo-rdf-sail component, which is a layer around the neo-rdf component. So if you could go directly against neo-rdf you could gain some performance there.

The next step, however, would be to use the BatchInserter, a NeoService for bulk inserts. See http://wiki.neo4j.org/content/Batch_Insert for more info. But since that's another interface, we'll have to make some adjustments for the neo-rdf component to be friends with it. I'll put some time into this in the days between Christmas and New Year's Day and see how we can make neo-rdf(-sail) take a performance leap for bulk inserts.

Happy Holidays!
/ Mattias

2009/12/21 Andy Seaborne andy.seabo...@talis.com:
I'm trying to get neo-rdf-sail to run through the Berlin SPARQL Benchmark [1]. It's taking about 21 mins to load 1e6 triples and 115 mins for 5 million triples. This is a bit slow - projecting from that, 100M is at least 30 hours. This is on an EC2 m1.large, ubuntu server, Java heap size 6G, nothing else running, using IcedTea - this is my fixed setup for BSBM.
My initial sense is that the indexing is the significant cost, but this is just an educated guess at present. I'm using the LuceneIndexService as per the example. The NeoIndexService is marked as not ready for general usage.

Any tips for optimizing performance? I don't need transactionality, for example, because it's a one-time bulk load.

I see there is also the component sparql-engine-neo, which is based on the leaving.name SPARQL engine (and parts of Sesame 1?). Would this be better?

Andy

[1] http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V5/
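Mattias's BatchInserter suggestion above boils down to amortizing per-operation overhead during a one-time bulk load. As a language-neutral illustration (a Python sketch with hypothetical callback names, not the actual neo4j BatchInserter API), the idea is to pay the expensive commit/flush cost once per batch rather than once per statement:

```python
def load_in_batches(statements, insert, commit, batch_size=10_000):
    """Toy model of bulk loading: apply `insert` to every statement but
    call the expensive `commit` only once per `batch_size` statements
    (plus a final commit for any remainder)."""
    pending = 0
    total = 0
    for stmt in statements:
        insert(stmt)
        pending += 1
        total += 1
        if pending >= batch_size:
            commit()
            pending = 0
    if pending:
        commit()  # flush the final partial batch
    return total

# usage: 25 statements, batch of 10 -> 3 commits instead of 25
commits = []
loaded = load_in_batches(range(25), lambda s: None,
                         lambda: commits.append(1), batch_size=10)
print(loaded, len(commits))  # 25 3
```

This is also why "I don't need transactionality" matters for Andy's use case: per-statement transactional commits are exactly the overhead a batch-oriented inserter avoids.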
Re: [Neo] LuceneIndexBatchInserter doubt
Hi again Mattias,

I'm still trying to parse all the data in order to create the graph. I will report the results as soon as possible. Thank you very much for your interest.

Núria.

2009/12/21 Mattias Persson matt...@neotechnology.com:
Hi again, any luck with this yet?

2009/12/11 Núria Trench nuriatre...@gmail.com:
Thank you very much, Mattias. I will test it as soon as possible and I'll tell you how it goes.

Núria.

2009/12/11 Mattias Persson matt...@neotechnology.com:
I've tried this a couple of times now, and first of all I see some problems in your code:

1) In the method createRelationsTitleImage you have an inverted head != -1 check where it should be head == -1.
2) You index relationships in the createRelationsBetweenTitles method. This isn't OK, since the index can only manage nodes.

I also recently committed a fix which removed the caching layer in LuceneIndexBatchInserterImpl (and therefore also LuceneFulltextIndexBatchInserter). This probably fixes your problems. I'm also working on a performance fix which makes consecutive getNodes calls faster.

So I think that with fixes (1) and (2) and the latest index-util 0.9-SNAPSHOT your sample will run fine. Also, you could try without calling optimize. See more information at http://wiki.neo4j.org/content/Indexing_with_BatchInserter

2009/12/10 Mattias Persson matt...@neotechnology.com:
To continue this thread in the user list: Thanks Núria, I've gotten your sample code/files and I'm running it now to try to reproduce your problem.

2009/12/9 Núria Trench nuriatre...@gmail.com:
I have finished uploading the 4 csv files. You'll see an e-mail with the other 3 csv files packed in a rar file.

Thanks,
Núria.

2009/12/9 Núria Trench nuriatre...@gmail.com:
Yes, you are right. But there is one csv file that is too big to be packed with the other files, and I am reducing it. I am sending the other files now.
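Fix (1) above is easy to get backwards, so here is a hedged sketch of what it amounts to (a Python toy with hypothetical names; the original Java code isn't in the thread): when looking up the tail and head nodes for an edge, a row must be skipped when a lookup FAILS (conventionally signalled by -1 here), so the guard is `== -1`, not `!= -1`. Fix (2) is reflected by indexing nothing about the created relationship itself.

```python
def create_relations(rows, get_single_node, create_relationship):
    """Toy version of a CSV edge-creation loop.

    rows: iterable of (tail_key, head_key) pairs read from a csv file
    get_single_node: index lookup returning a node id, or -1 if absent
    create_relationship: callback creating the edge between two node ids
    Returns the number of relationships created.
    """
    created = 0
    for tail_key, head_key in rows:
        tail = get_single_node(tail_key)
        head = get_single_node(head_key)
        # fix (1): skip when a lookup FAILED (== -1); with the check
        # inverted (!= -1) every valid row would be skipped instead
        if tail == -1 or head == -1:
            continue
        create_relationship(tail, head)
        # fix (2): do NOT index the relationship itself -- the index
        # service can only manage nodes
        created += 1
    return created

# usage: one resolvable row, one row whose head is not in the index
nodes = {"a": 0, "b": 1}
edges = []
n = create_relations([("a", "b"), ("a", "missing")],
                     lambda k: nodes.get(k, -1),
                     lambda t, h: edges.append((t, h)))
print(n, edges)  # 1 [(0, 1)]
```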
2009/12/9 Mattias Persson matt...@neotechnology.com:
By the way, you might consider packing those files (with zip or tar.gz or something), because they will shrink quite well.

2009/12/9 Mattias Persson matt...@neotechnology.com:
Great, but I only got the images.csv file... I'm starting to test with that at least.

2009/12/9 Núria Trench nuriatre...@gmail.com:
Hi again,

The errors show up after 2 csv files have been parsed to create all the nodes, at the moment of calling the method getSingleNode to look up the tail and head nodes while creating all the edges from the other two csv files. I am sending the four csv files with Sprend; they will help you to trigger the index behaviour.

Thank you,
Núria.

2009/12/9 Mattias Persson matt...@neotechnology.com:
Hmm, I've no idea... but do the errors show up early in the process, or do you have to insert a LOT of data to trigger them? In that case you could send me a part of it... maybe using http://www.sprend.se, WDYT?

2009/12/9 Núria Trench nuriatre...@gmail.com:
Hi Mattias,

The data isn't confidential, but the files are very big (5.5 GB). How can I send you this data?

2009/12/9 Mattias Persson matt...@neotechnology.com:
Yep, I got the java code, thanks. Yeah, if the data is confidential or sensitive you can just send me the formatting; otherwise consider sending the files as well (or a subset if they are big).

2009/12/9 Núria Trench nuriatre...@gmail.com: