I'm still evaluating *Neo4j* vs. *OrientDB*. Most importantly, I 
need Lucene as the full-text index engine. So I created the same schema 
with the same data (300 million lines) in both databases. I also have some 
experience querying both systems. I used the Standard Analyzer on 
both sides. The OrientDB test queries all return good results, in 
terms of both reliability and speed. The speed of Neo4j is also OK, but the 
results are poor in most cases. So let's go through the 
issues I have with Neo4j's Lucene indexing. For each one I give an 
example of how the query looks in OrientDB and which result set I would 
expect to get back.

In these examples there are Applns that have title(s). Titles are 
indexed with Lucene in both databases. Applns also have an ID, which I use 
to demonstrate the ordering. After each query I list some *questions* 
about it. It would be great to get some *feedback* or even *answers* to 
them.

Query #0: One word query with no order

This query is very simple. It tests how each database behaves 
when the search is a single word and nothing else. As you can see, the 
titles returned by Neo4j are much longer than the ones from OrientDB. 
OrientDB uses TFIDF, which favors short titles that are closer to the actual 
search term. The first OrientDB result is a title that is just SOLAR. That 
title is completely missing from the Neo4j results.

In Neo4j: *START n=node:titles('title:solar') RETURN n.title,n.ID LIMIT 10*

   1. SOLAR RADIATION SHIELDING PARTICULATE AND SOLAR RADIATION SHIELDING 
   RESIN MATERIAL DISPERSED WITH ...    38321319
   2. Solar module for cooling solar cells on the underside of a solar 
   panel has air inlet and outlet openings ...    12944121
   3. Solar construction component for solar thermal assemblies, solar 
   thermal assembly, method for operating a solar...    324146113
   4. ...
   
In OrientDB: *SELECT title,ID FROM Appln WHERE title LUCENE "solar" LIMIT 
10*

   1. SOLAR    24900187
   2. Solar unit and solar apparatus    1876343
   3. Solar module with solar concentrator    13496706
   4. ...
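
To make the TFIDF point concrete, here is a minimal Python sketch of Lucene's classic TF-IDF term score, up to query-level constant factors: tf = sqrt(freq), idf = 1 + ln(N / (df + 1)), length norm = 1 / sqrt(number of terms). The tokenizer is a simplified stand-in for the Standard Analyzer (no stop-word removal), the function names are my own, and the corpus numbers are taken from this post purely for illustration. Real Lucene also rounds the stored norm, which this sketch ignores.

```python
import math
import re

NUM_DOCS = 300_000_000   # data set size mentioned in the intro
DOC_FREQ = 148_029       # OrientDB match count for "Solar" from Query #6

def tokenize(title):
    # Simplified analyzer stand-in: lowercase, split on non-alphanumerics,
    # keep stop words (an assumption; a real analyzer may drop them).
    return [t for t in re.split(r"[^a-z0-9]+", title.lower()) if t]

def classic_tfidf(term, title, num_docs=NUM_DOCS, doc_freq=DOC_FREQ):
    # Single-term score: sqrt(freq) * idf * 1/sqrt(title length in tokens).
    tokens = tokenize(title)
    freq = tokens.count(term)
    if freq == 0:
        return 0.0
    tf = math.sqrt(freq)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    length_norm = 1.0 / math.sqrt(len(tokens))
    return tf * idf * length_norm

def rank(term, titles):
    # Highest score first; ties keep their input order.
    return sorted(titles, key=lambda t: classic_tfidf(term, t), reverse=True)

titles = [
    "Stackable flat-roof/floor frame for solar panels",
    "Solar unit and solar apparatus",
    "SOLAR",
    "Solar module with solar concentrator",
]
ranked = rank("solar", titles)
```

In this sketch the one-word title SOLAR ranks first because its length norm is the largest, which is in line with the OrientDB result list above.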
   
Questions:

   1. Why is Neo4j not using TFIDF, and what does it use instead?
   2. Is Neo4j able to order the results by how well they match the keyword?
   3. Is it possible to change TFIDF to something else in OrientDB?
   
Query #1: One word query with order by ID

Neo4j orders by ID without taking TFIDF into account. As seen in Query #0, 
Neo4j does not appear to rank by TFIDF, so it basically takes the Lucene 
matches as they come and orders them by ID. OrientDB, by contrast, still 
seems to pick the matches with good TFIDF scores first and then orders them.

In Neo4j: *START n=node:titles('title:solar') RETURN n.title,n.ID ORDER BY 
n.ID ASC LIMIT 10*

   1. Stackable flat-roof/floor frame for solar panels    318
   2. Method for producing contact for solar cells    636
   3. Solar cell and fabrication method thereof    1217
   4. ...
   
In OrientDB: *SELECT title,ID FROM Appln WHERE title LUCENE "solar" ORDER 
BY ID ASC LIMIT 10*

   1. Solar unit and solar apparatus     1876343
   2. Solar module with solar concentrator    13496706
   3. SOLAR TRACKER FOR SOLAR COLLECTOR    16543688
   4. ...
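
The two behaviours I think I am seeing can be sketched in Python as follows. This is only my mental model, not the actual code of either database. The IDs are taken from the result lists above, the scores are made-up illustration values, and the function names are hypothetical.

```python
def sort_then_limit(hits, limit):
    # Order every match by ID and cut off afterwards
    # (what the Neo4j result list looks like).
    return sorted(hits, key=lambda h: h["id"])[:limit]

def top_scores_then_sort(hits, limit):
    # Keep the best-scoring matches first, then order only those by ID
    # (what the OrientDB result list looks like).
    best = sorted(hits, key=lambda h: h["score"], reverse=True)[:limit]
    return sorted(best, key=lambda h: h["id"])

# Scores are invented for illustration; only the IDs come from the post.
hits = [
    {"id": 318, "score": 0.35},
    {"id": 636, "score": 0.38},
    {"id": 1217, "score": 0.38},
    {"id": 1876343, "score": 0.63},
    {"id": 13496706, "score": 0.63},
    {"id": 16543688, "score": 0.63},
]

neo4j_like = [h["id"] for h in sort_then_limit(hits, 3)]
orientdb_like = [h["id"] for h in top_scores_then_sort(hits, 3)]
```

With these example values the first variant yields the three lowest IDs and the second yields the three best-scoring hits in ID order, which is the difference between the two result lists.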
   
Questions:

   1. What would an OrientDB query look like that is ordered by 
   the ID and still matches the best TFIDF scores?
   2. Is there a way in Neo4j to rank by the Lucene match before ordering by 
   the ID?
   
Query #2: One word with a star search

Star search had no influence on the Neo4j results. OrientDB results changed 
in a good way.

In Neo4j: *START n=node:titles('title:solar*') RETURN n.title,n.ID ORDER BY 
n.ID ASC LIMIT 10*

   1. Stackable flat-roof/floor frame for solar panels    318
   2. Method for producing contact for solar cells    636
   3. Solar cell and fabrication method thereof    1217
   4. ...
   
In OrientDB: *SELECT title,ID FROM Appln WHERE title LUCENE "solar*" ORDER 
BY ID ASC LIMIT 10*

   1. High performance solar methane generator    8354701
   2. All-plastic honeycomb solar water-heater    8355379
   3. Plate type solar energy heat collector plate core and its 
   manufacturing method    8356173
   4. ...
   
Questions:

   1. Does Neo4j ignore star searches?

Query #3: Searching for 2 words divided by a space

The strange thing here is that you need to change 'title:solar panel' to 
the query form shown here. Otherwise you just get errors. OrientDB seems 
good so far.

In Neo4j: *START n=node:titles(title="solar panel") RETURN n.title,n.ID 
ORDER BY n.ID ASC LIMIT 10*

   1. Returned 0 rows in 817 ms

In OrientDB: *SELECT title,ID FROM Appln WHERE title LUCENE "solar panel" 
ORDER BY ID ASC LIMIT 10*

   1. SOLAR PANEL    1584567
   2. SOLAR PANEL    1616547
   3. SOLAR PANEL    2078382
   4. SOLAR PANEL    2078383
   5. Solar panel    2178466
   6. ...
   
Questions:

   1. Why does Neo4j need a special query form here just to avoid throwing 
   an error?
   2. Why does the query return nothing? I know that Neo4j is searching 
   for lowercase letters here, so the match is case sensitive. But why 
   is that? I use the default analyzer, and the Neo4j Lucene documentation 
   says the setting is true, which should mean to_lower_letter.
   
Query #4: Now searching for the same query in capital letters

The same issue as in #3. Neo4j only returns the titles that are written 
entirely in capital letters. The OrientDB results look fine again.

In Neo4j: *START n=node:titles(title="SOLAR PANEL") RETURN n.title,n.ID 
ORDER BY n.ID ASC LIMIT 10*

   1. SOLAR PANEL    348800
   2. SOLAR PANEL    420683
   3. SOLAR PANEL    1393804
   4. SOLAR PANEL    1584567
   5. SOLAR PANEL    1616547
   6. ...
   
In OrientDB: *SELECT title,ID FROM Appln WHERE title LUCENE "SOLAR PANEL" 
ORDER BY ID ASC LIMIT 10*

   1. SOLAR PANEL    1584567
   2. SOLAR PANEL    1616547
   3. SOLAR PANEL    2078382
   4. SOLAR PANEL    2078383
   5. Solar panel    2178466
   6. ...
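
One possible reading of the Neo4j results in #3 and #4 is that the title="..." form compares the stored value verbatim instead of running the query through the analyzer. I don't know whether this is what Neo4j does internally. The following Python toy model (all names hypothetical, analyzer heavily simplified) only shows how the two lookup styles would differ on titles like the ones above.

```python
import re

def analyze(text):
    # Simplified analyzer stand-in: lowercase and split on non-alphanumerics.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def exact_lookup(titles, value):
    # Verbatim comparison: case and whole-string equality both matter.
    return [t for t in titles if t == value]

def analyzed_lookup(titles, query):
    # Match when every analyzed query token occurs among the title's tokens.
    wanted = analyze(query)
    return [t for t in titles if all(w in analyze(t) for w in wanted)]

titles = ["SOLAR PANEL", "Solar panel", "SOLAR PANELS", "Solar Panel"]

exact_lower = exact_lookup(titles, "solar panel")
exact_upper = exact_lookup(titles, "SOLAR PANEL")
analyzed = analyzed_lookup(titles, "solar panel")
```

In this model the verbatim lookup finds nothing for the lowercase query and only the all-caps title for the uppercase one, while the analyzed lookup is case insensitive.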
   
Questions:

   1. Same question as in #3: how do I search with to_lower_letter?

Query #5: Combining two words and using the star search

Here I want to combine a multi-word search with a star search. With the 
equals search I'm not able to find matches, because it treats the star as a 
literal character in the title. But I'm also not able to write 
'title:SOLAR PANEL*'. That is rejected as well. In OrientDB everything is 
fine.

In Neo4j: *START n=node:titles(title="SOLAR PANEL*") RETURN n.title,n.ID 
ORDER BY n.ID ASC LIMIT 10*

   1. Returned 0 rows in 895 ms

In OrientDB: *SELECT title,ID FROM Appln WHERE title LUCENE "SOLAR PANEL*" 
ORDER BY ID ASC LIMIT 10*

   1. SOLAR PANELS     1405717
   2. SOLAR PANEL     1584567
   3. SOLAR PANEL     1616547
   4. SOLAR PANEL     2705081
   5. Solar Panel     2766555
   6. ...
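
In Lucene's classic query syntax a field prefix such as title: applies only to the term directly after it, and parentheses are the way to apply one field to several terms. That may be why 'title:SOLAR PANEL*' is rejected. Below is a small Python helper (hypothetical name) that builds such a grouped query string. It assumes plain alphanumeric words and lowercased index terms, and whether the Cypher START clause accepts the resulting string unchanged is something I have not verified.

```python
def lucene_field_query(field, words, prefix_last=False):
    # Build a classic-syntax query that requires every word in one field,
    # e.g. title:(+solar +panel*). No escaping of special characters is done,
    # so this assumes plain alphanumeric words.
    if not words:
        raise ValueError("need at least one word")
    # Lowercasing is an assumption: it presumes the index stores lowercased terms.
    terms = [w.lower() for w in words]
    if prefix_last:
        terms[-1] += "*"
    clauses = " ".join("+" + t for t in terms)
    return f"{field}:({clauses})"

query = lucene_field_query("title", ["SOLAR", "PANEL"], prefix_last=True)
```

The resulting string would then go where 'title:solar' sits in the queries above.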
   
Questions:

   1. How can you combine some words with the star search in Neo4j?

Query #6: Counting query results

The last thing I really need is a fast lookup of how many results there are 
overall. Here Neo4j is finding a result way faster but always finds fewer 
matches than OrientDB. For Solar the counts are fairly close to each other, 
but in another test they were not that close.

In Neo4j: *START n=node:titles("title:Solar") RETURN count(*)*

143211 in 220 sec

In OrientDB: *SELECT count(*) title FROM Appln WHERE title LUCENE "Solar" 
LIMIT -1*

148029 in 50 sec

Questions:

   1. How can the lookup times be improved on both systems?
   2. Why do the two systems find a different number of matches? This also 
   happens with other keywords. Maybe a different indexing engine is used?
   
Well, that is everything for now. If you need any other query, just tell me 
and I will deliver it.
I think it's very important to compare the Lucene implementations, because 
with millions of nodes Lucene has too many advantages to ignore. Thanks for 
any small tip.

Btw: please don't give tips about using Java code for the queries instead. I 
want to use Cypher because the requests shall be made from the browser, like 
in OrientDB. I know that everything here can easily be done with Java code. 
Thank you.
