> How big is the file that the sort cannot write?

One bil-ee-on lines... :-P

> ...This should help a lot. 

The trouble is that the size of a block of contiguous accounts in the real 
data is not uniform (even if it might be in my test data). It is therefore 
highly likely that a contiguous block of account numbers will span two or 
more batches, which will lead to a lot of contention. In your example, if 
Account 2 spills over into the next batch, chances are I'll have to roll 
back that batch.

Don't you also have a problem if X, Y, Z and W in your example are account 
numbers that appear in the next batch? You'd get contention there too. 
Admittedly, randomization doesn't solve that problem either.
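
Just so we're talking about the same thing, this is roughly what I understand 
by aligning transaction boundaries to account boundaries. It's only a sketch 
(it assumes the file is sorted by FROM_ACCOUNT and uses the Graph API as my 
loader does; the file name, class name and helper are illustrative):

import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;

import java.io.BufferedReader;
import java.io.FileReader;

public class SortedLoaderSketch {
  public static void main(String[] args) throws Exception {
    OrientGraphFactory factory = new OrientGraphFactory("plocal:/temp/paymentsdb");
    OrientGraph g = factory.getTx();            // transactional graph
    String previousFrom = null;

    try (BufferedReader in = new BufferedReader(new FileReader("payments.txt"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] fields = line.split(" ");      // FROM_ACCOUNT TO_ACCOUNT AMOUNT
        String from = fields[0], to = fields[1];

        // Commit only when we cross an account boundary, so a contiguous
        // block of rows for one FROM_ACCOUNT never spans two transactions.
        if (previousFrom != null && !from.equals(previousFrom))
          g.commit();
        previousFrom = from;

        Vertex vFrom = getOrCreateAccount(g, from);
        Vertex vTo = getOrCreateAccount(g, to);
        vFrom.addEdge("PAYMENT", vTo).setProperty("amount", Double.parseDouble(fields[2]));
      }
      g.commit();                               // flush the last block
    } finally {
      g.shutdown();
      factory.close();
    }
  }

  // Illustrative helper: look up the account by its indexed number, else create it.
  private static Vertex getOrCreateAccount(OrientGraph g, String number) {
    for (Vertex v : g.getVertices("ACCOUNTS.number", number))
      return v;
    Vertex v = g.addVertex("class:ACCOUNTS");
    v.setProperty("number", number);
    return v;
  }
}

The point being: if Account 2's rows straddle two such batches being handled 
by different threads, the boundary commit doesn't save me from the conflict.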

> you can use the special Batch Importer: OGraphBatchInsert

Would this not be subject to the same contention problems?
At what point is it flushed to disk? (Obviously, it can't live in the heap 
forever.)

> You should definitely use transactions with a batch size of 100 items. 

I thought I read somewhere else (I can't find the link at the moment) that 
you said to use transactions only when using the remote protocol?

> Please use the latest 2.2.10. ... try to define 50GB of DISKCACHE and 14GB 
> of Heap

Will do on the next run.

> If it happens again, could you please send a thread dump?

I have the full thread dump, but it's on my work machine so I can't post it 
in this forum (all access to Google Groups is banned by the bank, so I am 
writing this on my personal computer). I'm happy to email it to you. Which 
email address should I use?

Phill

On Friday, September 23, 2016 at 7:41:29 AM UTC+1, l.garulli wrote:
>
> On 23 September 2016 at 00:49, Phillip Henry <phill...@gmail.com> wrote:
>
>> Hi, Luca.
>>
>
> Hi Phillip.
>  
>
>> I have:
>>
>> 4. Sorting is an overhead, albeit outside of Orient. Using the Unix sort 
>> command failed with "No space left on device". Oops. OK, so I ran my 
>> program to generate the data again, this time ordered by the first 
>> account number. Performance was much slower, as there appeared to be a lot 
>> of contention for this account (i.e. all writes were contending for this 
>> account, even if the other account had less contention). More randomized 
>> data was faster.
>>
>
> How big is the file that the sort cannot write? Anyway, if you have the 
> accounts sorted, you should use transactions of about 100 items where the 
> bank account and its edges are in the same transaction. This should help a 
> lot. Example:
>
> Account 1 -> Payment 1 -> Account X
> Account 1 -> Payment 2 -> Account Y
> Account 1 -> Payment 3 -> Account Z
> Account 2 -> Payment 1 -> Account X
> Account 2 -> Payment 1 -> Account W
>
> If the transaction batch is 5 (I suggest you start with 100), all the 
> operations above are executed in one transaction. If another thread has:
>
> Account 99 -> Payment 1 -> Account W
>
> it could go into conflict because of the shared Account W.
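>
> A rough sketch of that kind of batch-and-retry loop (illustrative only; the 
> class, method and variable names are not from any real loader, so adapt 
> them to yours):
>
> // Sketch: load a batch of ~100 rows in one transaction and retry the whole
> // batch if another thread conflicts on a shared account vertex.
> import com.orientechnologies.orient.core.exception.OConcurrentModificationException;
> import com.tinkerpop.blueprints.impls.orient.OrientGraph;
> import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;
>
> import java.util.List;
>
> public class BatchSketch {
>   static final int BATCH_SIZE = 100;
>
>   static void loadBatch(OrientGraphFactory factory, List<String[]> rows) {
>     while (true) {
>       OrientGraph g = factory.getTx();
>       try {
>         for (String[] row : rows) {
>           // ... look up / create the two account vertices and add the PAYMENT edge ...
>         }
>         g.commit();        // the whole batch is one transaction
>         return;
>       } catch (OConcurrentModificationException e) {
>         g.rollback();      // a shared account was updated elsewhere: retry the batch
>       } finally {
>         g.shutdown();
>       }
>     }
>   }
> }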
>
> If you can export the accounts' IDs as incremental numbers, you can use the 
> special Batch Importer: OGraphBatchInsert. Example:
>
> OGraphBatchInsert batch = new OGraphBatchInsert("plocal:/temp/mydb", "admin", "admin");
> batch.begin();
>
> batch.createEdge(0L, 1L, null); // CREATE AN EDGE BETWEEN VERTICES 0 AND 1.
>                                 // IF THE VERTICES DON'T EXIST, THEY ARE
>                                 // CREATED IMPLICITLY
> batch.createEdge(1L, 2L, null);
> batch.createEdge(2L, 0L, null);
>
> batch.createVertex(3L); // CREATE A NON-CONNECTED VERTEX
>
> Map<String, Object> vertexProps = new HashMap<String, Object>();
> vertexProps.put("foo", "foo");
> vertexProps.put("bar", 3);
> batch.setVertexProperties(0L, vertexProps); // SET PROPERTIES FOR VERTEX 0
> batch.end();
>
> This is blazing fast, but it works in the heap, so run it with a lot of heap.
>  
>
>>
>> 6. I've multithreaded my loader. The details are now:
>>
>> - using plocal
>> - using 30 threads
>> - not using transactions (OrientGraphFactory.getNoTx)
>>
>
> You should definitely use transactions with a batch size of 100 items. 
> This speeds things up.
>  
>
>> - retrying forever upon write collisions.
>> - using Orient 2.2.7.
>>
>
> Please use the latest 2.2.10.
>  
>
>> - using -XX:MaxDirectMemorySize:258040m
>>
>
> This is not really important; it's just an upper bound for the JVM. Please 
> set it to 512GB so you can forget about it. The 2 most important values are 
> DISKCACHE and JVM heap. Their sum must be lower than the RAM available on 
> the server before you run OrientDB.
>
> If you have 64GB, try to define 50GB of DISKCACHE and 14GB of Heap.
>
> If you use the Batch Importer, you should use more Heap and less DISKCACHE.
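>
> For example, on a 64GB box the server JVM settings would be along these 
> lines (the disk-cache property name here is the 2.2 setting in MB and is 
> worth double-checking against the docs):
>
> java ... -Xmx14g -XX:MaxDirectMemorySize=512g -Dstorage.diskCache.bufferSize=51200 ...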
>  
>
>> The good news is I've achieved an initial write throughput of about 
>> 30k/second.
>>
>> The bad news is I've tried several runs and only been able to achieve 
>> 200mil < number of writes < 300mil.
>>
>> The first time I tried it, the loader deadlocked. Using jstack showed that 
>> the deadlock was between 3 threads at:
>> - 
>> OOneKeyEntryPerKeyLockManager.acquireLock(OOneKeyEntryPerKeyLockManager.java:173)
>> - 
>> OPartitionedLockManager.acquireExclusiveLock(OPartitionedLockManager.java:210)
>> - 
>> OOneKeyEntryPerKeyLockManager.acquireLock(OOneKeyEntryPerKeyLockManager.java:171)
>>
>
> If it happens again, could you please send a thread dump?
>  
>
>> The second failure was due to a NullPointerException at 
>> OByteBufferPool.java:297. I've looked at the code and the only way I can 
>> see this happening is if OByteBufferPool.allocateBuffer throws an error 
>> (perhaps an OutOfMemoryError in java.nio.Bits.reserveMemory). This 
>> StackOverflow posting 
>> (http://stackoverflow.com/questions/8462200/examples-of-forcing-freeing-of-native-memory-direct-bytebuffer-has-allocated-us) 
>> seems to indicate that this can happen if the underlying DirectByteBuffer's 
>> Cleaner doesn't have its clean() method called. 
>>
>
> This is because the database was bigger than this setting: 
> -XX:MaxDirectMemorySize:258040m. Please set it to 512GB (see above).
>  
>
>> Alternatively, I followed the SO suggestion and lowered the heap space to 
>> a mere 1GB (it was 50GB) to make the GC more active. Unfortunately, after a 
>> good start, the job is still running some 15 hours later with a hugely 
>> reduced write throughput (~7k/s). Jstat shows 4292 full GCs taking a total 
>> of 4597s - not great but not hugely awful either. At this rate, the 
>> remaining 700mil or so payments are going to take another 30 hours.
>>
>
> See above the suggested settings.
>  
>
>> 7. Even with the highest throughput I have achieved, 30k writes per 
>> second, I'm looking at about 20 hours of loading. We've taken the same data 
>> and, after trial and error that was not without its own problems, put it 
>> into Neo4J in 37 minutes. This is a significant difference. It appears that 
>> they are approaching the problem differently to avoid contention on 
>> updating the vertices during an edge write.
>>
>
> With all these suggestions you should be able to get much better numbers. 
> If you can use the Batch Importer, the numbers should be close to Neo4j's.
>  
>
>>
>> Thoughts?
>>
>> Regards,
>>
>> Phillip
>>
>>
>
> Best Regards,
>
> Luca Garulli
> Founder & CEO
> OrientDB LTD <http://orientdb.com/>
>
> Want to share your opinion about OrientDB?
> Rate & review us at Gartner's Software Review 
> <https://www.gartner.com/reviews/survey/home>
>
>
>  
>
>>
>> On Thursday, September 15, 2016 at 10:06:44 PM UTC+1, l.garulli wrote:
>>>
>>> On 15 September 2016 at 09:54, Phillip Henry <phill...@gmail.com> wrote:
>>>
>>>> Hi, Luca.
>>>>
>>>
>>> Hi Phillip,
>>>
>>> 3. Yes, default configuration. Apart from adding an index for ACCOUNTS, 
>>>> I did nothing further.
>>>>
>>>
>>> Ok, so you have writeQuorum="majority", which means 2 synchronous writes 
>>> and 1 asynchronous write per transaction.
>>>  
>>>
>>>> 4. Good question. With real data, we expect it to be as you suggest: 
>>>> some nodes with the majority of the payments (eg, supermarkets). However, 
>>>> for the test data, payments were assigned randomly and, therefore, should 
>>>> be uniformly distributed.
>>>>
>>>
>>> What's your average in terms of number of edges? <10, <50, <200, <1000?
>>>  
>>>
>>>> 2. Yes, I tried plocal minutes after posting (d'oh!). I saw a good 
>>>> improvement. It started about 3 times faster and got faster still (about 
>>>> 10 
>>>> times faster) by the time I checked this morning on a job running 
>>>> overnight. However, even though it is now running at about 7k transactions 
>>>> per second, a billion edges is still going to take about 40 hours. So, I 
>>>> ask myself: is there any way I can make it faster still?
>>>>
>>>
>>> What's missing here is the use of the AUTO-SHARDING INDEX. Example:
>>>
>>> accountClass.createIndex("Account.number", OClass.INDEX_TYPE.UNIQUE.toString(),
>>>     (OProgressListener) null, (ODocument) null,
>>>     "AUTOSHARDING", new String[] { "number" });
>>>
>>> In this way you should get more parallelism, because the index is 
>>> distributed across all the shards (clusters) of the Account class. You 
>>> should have 32 of them by default because you have 32 cores. 
>>>
>>> Please let me know whether it's much faster with the from_accounts sorted 
>>> and with this change.
>>>
>>> This is the best you can get out of the box. Pushing the numbers up is 
>>> slightly more complicated: you have to make sure that transactions go in 
>>> parallel and aren't serialized. This is possible by playing with internal 
>>> OrientDB settings (mainly the distributed workerThreads) and by having 
>>> many clusters per class (you could try 128 first and see how it goes).
>>>  
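>>> For instance, something along these lines should spread the Account class 
>>> over more clusters (just a sketch, continuing from the accountClass used 
>>> in the index example above; the cluster names are illustrative):
>>>
>>> // Add extra clusters to the Account class so parallel inserts can spread
>>> // across them (128 here, as suggested above).
>>> for (int i = accountClass.getClusterIds().length; i < 128; i++) {
>>>   accountClass.addCluster("account_" + i);
>>> }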
>>>
>>>> I assume when I start the servers up in distributed mode once more, the 
>>>> data will then be distributed across all nodes in the cluster?
>>>>
>>>
>>> That's right.
>>>  
>>>
>>>> 3. I'll return to concurrent, remote inserts when this job has 
>>>> finished. Hopefully, a smaller batch size will mean there is no 
>>>> degradation 
>>>> in performance either... FYI: with a somewhat unscientific approach, I was 
>>>> polling the server JVM with JStack and saw only a single thread doing all 
>>>> the work and it *seemed* to spend a lot of its time in ODirtyManager on 
>>>> collection manipulation.
>>>>
>>>
>>> I think it's because you didn't use the AUTO-SHARDING index. Furthermore, 
>>> running distributed, unfortunately, means the tree RidBag is not available 
>>> (we will support it in the future), so every change to the edges takes a 
>>> lot of CPU to unmarshal and marshal the entire edge list every time you 
>>> update a vertex. That's why I recommended sorting the vertices.
>>>  
>>>
>>>> I totally appreciate that performance tuning is an empirical science, 
>>>> but do you have any opinions as to which would probably be faster: 
>>>> single-threaded plocal or multithreaded remote? 
>>>>
>>>
>>> With v2.2 you can go in parallel by using the tips above. For sure the 
>>> replication has a cost. I'm sure you can go much faster with just one node 
>>> and then start the other 2 nodes to have the database replicated 
>>> automatically, at least for the first massive insertion.
>>>  
>>>
>>>>
>>>> Regards,
>>>>
>>>> Phillip
>>>>
>>>
>>> Luca
>>>
>>>  
>>>
>>>>
>>>> On Wednesday, September 14, 2016 at 3:48:56 PM UTC+1, Phillip Henry 
>>>> wrote:
>>>>>
>>>>> Hi, guys.
>>>>>
>>>>> I'm conducting a proof-of-concept for a large bank (Luca, we had a 
>>>>> 'phone conf on August 5...) and I'm trying to bulk insert a humongous 
>>>>> amount of data: 1 million vertices and 1 billion edges.
>>>>>
>>>>> Firstly, I'm impressed by how easy it was to configure a cluster. 
>>>>> However, the performance of batch inserting is bad (and seems to get 
>>>>> considerably worse as I add more data). It starts at about 2k 
>>>>> vertices-and-edges per second and deteriorates to about 500/second after 
>>>>> only about 3 million edges have been added, which takes ~30 minutes. 
>>>>> Needless to say, 1 billion payments (edges) will take over a week at 
>>>>> this rate. 
>>>>>
>>>>> This is a show-stopper for us.
>>>>>
>>>>> My data model is simply payments between accounts and I store it in 
>>>>> one large file. It's just 3 fields and looks like:
>>>>>
>>>>> FROM_ACCOUNT TO_ACCOUNT AMOUNT
>>>>>
>>>>> In the test data I generated, I had 1 million accounts and 1 billion 
>>>>> payments randomly distributed between pairs of accounts.
>>>>>
>>>>> I have 2 classes in OrientDB: ACCOUNTS (extending V) and PAYMENT 
>>>>> (extending E). There is a UNIQUE_HASH_INDEX on ACCOUNTS for the account 
>>>>> number (a string).
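>>>>>
>>>>> For reference, the schema setup is essentially the following (a sketch 
>>>>> from memory using the Graph API; the real code may differ in the details):
>>>>>
>>>>> OrientGraphNoTx g = new OrientGraphNoTx("plocal:/temp/paymentsdb");
>>>>> OrientVertexType accounts = g.createVertexType("ACCOUNTS");
>>>>> accounts.createProperty("number", OType.STRING);
>>>>> accounts.createIndex("ACCOUNTS.number", OClass.INDEX_TYPE.UNIQUE_HASH_INDEX, "number");
>>>>> g.createEdgeType("PAYMENT").createProperty("amount", OType.DOUBLE);
>>>>> g.shutdown();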
>>>>>
>>>>> We're using OrientDB 2.2.7.
>>>>>
>>>>> My batch size is 5k and I am using the "remote" protocol to connect to 
>>>>> our cluster.
>>>>>
>>>>> I'm using JDK 8 and my 3 boxes are beefy machines (32 cores each) but 
>>>>> without SSDs. I wrote the importing code myself but did nothing 'clever' 
>>>>> (I 
>>>>> think) and used the Graph API. This client code has been given lots of 
>>>>> memory and using jstat I can see it is not excessively GCing.
>>>>>
>>>>> So, my questions are:
>>>>>
>>>>> 1. what kind of performance can I realistically expect and can I 
>>>>> improve what I have at the moment?
>>>>>
>>>>> 2. what kind of degradation should I expect as the graph grows?
>>>>>
>>>>> Thanks, guys.
>>>>>
>>>>> Phillip
>>>>>
>>>>>
>>>>>
>>>
>
>
