I would first try to resolve the 75-minute startup for the clean cluster.
Are you seeing the recovery messages above for a clean start? If so, then
the start does not look clean.
Also, the idea behind the disk store concept is to utilize multiple disks
on a single machine to get better throughput for writing and recovery. I
don't know whether you still get that advantage with mounted volumes in the
cloud, but you could try mounting two disks and pointing each disk store at
its own disk, for example as sketched below.
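
With your two existing stores, assuming two volumes mounted at /mnt/disk1
and /mnt/disk2 (placeholder paths; adjust to wherever your volumes are
actually mounted), the Spring config could look something like:

<gfe:disk-store id="pdx-disk-store" allow-force-compaction="true"
                auto-compact="true" max-oplog-size="1024">
    <!-- PDX type metadata on its own disk (placeholder mount path) -->
    <gfe:disk-dir location="/mnt/disk1/geode/pdx"/>
</gfe:disk-store>

<gfe:disk-store id="tauDiskStore" allow-force-compaction="true"
                auto-compact="true" max-oplog-size="5120"
                compaction-threshold="90">
    <!-- region data on a second disk (placeholder mount path) -->
    <gfe:disk-dir location="/mnt/disk2/geode/tauDiskStore"/>
</gfe:disk-store>

A single disk store can also list more than one <gfe:disk-dir>, in which
case Geode spreads that store's oplogs across the directories.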


On Wed, Oct 17, 2018 at 7:15 AM Pieter van Zyl <[email protected]>
wrote:

> Hi Jens.
>
> I am using GCP to fire up 3 servers. The import is quick enough, and the
> cluster and network look OK at that point.
> Speed also looks fine between the 3 nodes.
>
> I have these properties enabled when I start the server:
>
> java -server -agentpath:/home/r2d2/yourkit/bin/linux-x86-64/libyjpagent.so
> -javaagent:lib/aspectj/lib/aspectjweaver.jar -Dgemfire.EXPIRY_THREADS=20
> -Dgemfire.PREFER_SERIALIZED=false
> -Dgemfire.enable.network.partition.detection=false
> -Dgemfire.autopdx.ignoreConstructor=true
> -Dgemfire.ALLOW_PERSISTENT_TRANSACTIONS=true
> -Dgemfire.member-timeout=600000 -Xmx90G -Xms90G -Xmn30G -XX:SurvivorRatio=1
> -XX:MaxTenuringThreshold=15 -XX:CMSInitiatingOccupancyFraction=78
> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled
> -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC
> -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintGCTimeStamps
> -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -verbose:gc
> -Xloggc:/home/r2d2/rdb-geode-server/gc/gc-server.log
> -Djava.rmi.server.hostname='localhost'
> -Dcom.sun.management.jmxremote.port=9010
> -Dcom.sun.management.jmxremote.rmi.port=9010
> -Dcom.sun.management.jmxremote.local.only=false
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.ssl=false
> .....org.rdb.geode.server.GeodeServer
>
> Could this setting influence the cluster:
> -Dgemfire.enable.network.partition.detection=false
>
> I am seeing a lot of recovery messages:
>
> [info 2018/10/16 15:32:26.867 UTC  <Recovery thread for bucket
>> _B__net.lautus.gls.domain.life.instruction.instruction.rebalance.
>> AggregatePortfolioRebalanceChoice_92> tid=0x42c9] Initialization of
>> region _B__net.lautus.gls.domain.life.instruction.instruction.rebalance.
>> AggregatePortfolioRebalanceChoice_92 completed
>> [info 2018/10/14 11:19:17.329 SAST  <RedundancyLogger for region
>> net.lautus.gls.domain.life.additionalfields.AdditionalFieldConfiguration>
>> tid=0x1858] Region
>> /net.lautus.gls.domain.life.additionalfields.AdditionalFieldConfiguration
>> (and any colocated sub-regions) has potentially stale data.  Buckets [3]
>> are waiting for another offline member to recover the latest data.
>>   My persistent id is:
>>     DiskStore ID: 932530bc-4c45-4926-b4a1-6fe5fe1f0493
>>     Name:
>>     Location: /10.154.0.2:/home/r2d2/rdb-geode-server/geode/tauDiskStore
>>
>>   Offline members with potentially new data:
>>   [
>>     DiskStore ID: c09e4cce-51e9-4111-8643-fe582677f49f
>>     Location: /10.154.0.4:/home/r2d2/rdb-geode-server/geode/tauDiskStore
>>     Buckets: [3]
>>   ]
>>   Use the "gfsh show missing-disk-stores" command to see all disk stores
>> that are being waited on by other members.
>> [info 2018/10/14 11:19:35.250 SAST  <Pooled Waiting Message Processor 7>
>> tid=0x1318] Configured redundancy of 1 copies has been restored to
>> /net.lautus.gls.domain.life.additionalfields.AdditionalFieldConfiguration
>
>
> Btw, we are using Apache Geode 1.7.0.
>
> Kindly
>
> Pieter
>
>
> On Wed, Oct 17, 2018 at 3:56 PM Jens Deppe <[email protected]> wrote:
>
>> Hi Pieter,
>>
>> Your startup times are definitely too long, probably by at least an order
>> of magnitude. My first guess is that this is network related. It may
>> either be a DNS lookup issue or, if the cluster is isolated from the
>> internet, a problem with XSD validation needing internet access (even
>> though we do bundle the XSD files with Geode; the same should hold for
>> Spring). I will see if I can find any potential XSD issue.
>>
>> --Jens
>>
>> On Wed, Oct 17, 2018 at 3:22 AM Pieter van Zyl <[email protected]>
>> wrote:
>>
>>> Good day.
>>>
>>> We are currently running a 3 node Geode cluster.
>>>
>>> We are running the locator from gfsh and then starting up 3 servers with
>>> Spring, which connect to the central locator.
>>>
>>> We are using persistence on all the regions and have basically one data
>>> store and one PDX store per node.
>>>
>>> The problem we are experiencing is that with no data (a clean cluster) it
>>> takes 75 minutes to start up.
>>>
>>> Once data has been imported into the cluster and we shut down all
>>> nodes/servers and start up again, it takes 128 to 160 minutes.
>>> This is very slow.
>>>
>>> The question is: is there any way to improve the startup speed? Is this
>>> speed normal and expected?
>>>
>>> We have a 100 gig database distributed across the 3 nodes:
>>> Server 1: 100 gig memory, 90 gig assigned heap, db size of 49 gig, 32 cores.
>>> Server 2: 64 gig memory, 60 gig assigned heap, db size of 34 gig, 16 cores.
>>> Server 3: 64 gig memory, 60 gig assigned heap, db size of 34 gig, 16 cores.
>>>
>>> Should we have more disk stores? Maybe separate stores for the partitioned
>>> vs replicated regions, something like the sketch after our current store
>>> configuration below?
>>>
>>> <gfe:disk-store id="pdx-disk-store" allow-force-compaction="true"
>>> auto-compact="true" max-oplog-size="1024">
>>>    * <gfe:disk-dir location="geode/pdx"/>*
>>> </gfe:disk-store>
>>>
>>> <gfe:disk-store id="tauDiskStore" allow-force-compaction="true"
>>> auto-compact="true" max-oplog-size="5120"
>>>                 compaction-threshold="90">
>>>   *  <gfe:disk-dir location="geode/tauDiskStore"/>*
>>> </gfe:disk-store>
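>>>
>>> For example (hypothetical store names, just to illustrate the idea):
>>>
>>> <gfe:disk-store id="partitionedDiskStore" allow-force-compaction="true"
>>>                 auto-compact="true" max-oplog-size="5120"
>>>                 compaction-threshold="90">
>>>     <!-- hypothetical store for the partitioned regions' oplogs -->
>>>     <gfe:disk-dir location="geode/partitionedDiskStore"/>
>>> </gfe:disk-store>
>>>
>>> <gfe:disk-store id="replicatedDiskStore" allow-force-compaction="true"
>>>                 auto-compact="true" max-oplog-size="5120">
>>>     <!-- hypothetical store for the replicated regions' oplogs -->
>>>     <gfe:disk-dir location="geode/replicatedDiskStore"/>
>>> </gfe:disk-store>
>>>
>>> with each region's disk-store-ref pointing at the matching store.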
>>>
>>> We have a mix of regions:
>>>
>>> Example partitioned region:
>>>
>>> <gfe:partitioned-region id="net.lautus.gls.domain.life.accounting.Account"
>>>                         disk-store-ref="tauDiskStore"
>>>                         statistics="true" persistent="true">
>>>     <!--<gfe:cache-listener ref="cacheListener"/>-->
>>>     <gfe:eviction type="HEAP_PERCENTAGE" action="OVERFLOW_TO_DISK"/>
>>> </gfe:partitioned-region>
>>>
>>> Example replicated region:
>>> <gfe:replicated-region id="org.rdb.internal.session.rootmap.RootMapHolder"
>>>                        disk-store-ref="tauDiskStore"
>>>                        statistics="true" persistent="true">
>>>     <!--<gfe:cache-listener ref="cacheListener"/>-->
>>>     <gfe:eviction type="ENTRY_COUNT" action="OVERFLOW_TO_DISK"
>>>                   threshold="100">
>>>         <gfe:object-sizer ref="objectSizer"/>
>>>     </gfe:eviction>
>>> </gfe:replicated-region>
>>>
>>>
>>> Any advice would be appreciated
>>>
>>> Kindly
>>> Pieter
>>>
>>
