Ali,
"not recommended to dedicate more than 8-10 GM to JVM heap space" by
whom? Do you have links/references establishing this? I couldn't find
anyone saying this or why.
Russ
On 10/13/2016 05:47 PM, Ali Nazemian wrote:
Hi,
I have another question regarding the hardware recommendation. As far
as I can tell, NiFi currently uses on-heap memory and does not try to
load whole objects into memory. From a garbage-collection perspective,
it is not recommended to dedicate more than 8-10 GB to JVM heap space.
In that case, is it fair to say that spending money on extra system
memory is wasted? Probably 16 GB per system is enough for this
architecture, unless future architecture changes make use of off-heap
memory as well. However, I found some articles about best practices,
and their memory recommendations do not make sense to me. Would you
please clarify this part for me?
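(For context, the JVM heap size under discussion is configured in NiFi's conf/bootstrap.conf; a minimal sketch with illustrative values, not a recommendation:)

```properties
# conf/bootstrap.conf -- JVM memory arguments (values illustrative)
java.arg.2=-Xms8g
java.arg.3=-Xmx8g
```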
Thank you very much.
Best regards,
Ali
On Thu, Oct 13, 2016 at 11:38 PM, Ali Nazemian <[email protected]
<mailto:[email protected]>> wrote:
Thank you very much.
I would be more than happy to provide some benchmark results after
the implementation.
Sincerely yours,
Ali
On Thu, Oct 13, 2016 at 11:32 PM, Joe Witt <[email protected]
<mailto:[email protected]>> wrote:
Ali,
I agree with your assumption. It would be great to test that
out and provide some numbers but intuitively I agree.
I could envision certain scatter/gather data flows that could
challenge that sequential-access assumption, but honestly, given
how good disk caching is in Linux these days, I think this is
practically the right way to think about it.
Thanks
Joe
On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian
<[email protected] <mailto:[email protected]>> wrote:
Dear Joe,
Thank you very much. That was a really great explanation.
I investigated the NiFi architecture, and it seems that
most of the read/write operations for the flowfile repo and
provenance repo are random, while for the content repo most
of the read/write operations are sequential. Let's say cost
does not matter. Even then, choosing SSDs for the content
repo cannot provide a huge performance gain over HDDs. Am I
right? If so, it would be better to spend the content-repo
SSD money on network infrastructure.
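(One way to sanity-check the sequential-vs-random assumption on a
given volume is fio, a standard disk benchmarking tool; the mount
point below is hypothetical:)

```shell
# Sequential writes, approximating content-repo access (1 MiB blocks)
fio --name=seqwrite --rw=write --bs=1M --size=1g \
    --directory=/data/content1 --direct=1

# Random 4 KiB writes, approximating flowfile/provenance access
fio --name=randwrite --rw=randwrite --bs=4k --size=1g \
    --directory=/data/content1 --direct=1
```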
Best regards,
Ali
On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt
<[email protected] <mailto:[email protected]>> wrote:
Ali,
You have a lot of nice resources to work with there.
I'd personally recommend the series-of-RAID-1 configuration,
provided you keep in mind that this means you can only lose
a single disk in any one partition. As long as the disks are
monitored and would be quickly replaced, this works well in
practice. If there could be lapses in monitoring or time to
replace, it is perhaps safer to go with more redundancy or
an alternative RAID type.
I'd say put the OS, application installs with user and audit
DB stuff, and application logs on one physical RAID volume.
Have a dedicated physical volume for the flow file
repository. It will not be able to use all the space
but it certainly could benefit from having no other
contention. This could be a great thing to have SSDs
for actually. And for the remaining volumes split
them up for content and provenance as you have. You
get to make the overall performance versus retention
decision. Frankly, you have a great system to work
with and I suspect you're going to see excellent
results anyway.
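(The per-volume layout described here maps onto NiFi's repository
properties in conf/nifi.properties; a sketch with hypothetical mount
points:)

```properties
# conf/nifi.properties -- repository locations (mount points hypothetical)
nifi.flowfile.repository.directory=/data/flowfile/flowfile_repository
# Multiple content repositories, one per physical volume:
nifi.content.repository.directory.content1=/data/content1/content_repository
nifi.content.repository.directory.content2=/data/content2/content_repository
nifi.provenance.repository.directory.default=/data/provenance/provenance_repository
```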
Conservatively speaking, expect say 50 MB/s of throughput
per volume in the content repository, so if you end up with
8 of them you could achieve upwards of 400 MB/s sustained.
You'll also then want to make sure you have a good 10G-based
network setup as well. Or, you could dial back on the speed
tradeoff and simply increase retention or disk-loss
tolerance. Lots of ways to play the game.
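(The arithmetic above can be sketched as a toy estimate; the 50 MB/s
per-volume figure is the conservative assumption from this thread,
not a NiFi guarantee:)

```python
# Rough sustained-throughput estimate for the content repository,
# assuming independent volumes each sustaining ~50 MB/s sequential I/O.
PER_VOLUME_MBPS = 50

def content_repo_throughput(volumes: int, per_volume: int = PER_VOLUME_MBPS) -> int:
    """Aggregate sequential throughput (MB/s) across independent content volumes."""
    return volumes * per_volume

print(content_repo_throughput(8))  # 8 volumes -> 400 MB/s sustained
```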
There are no published SSD vs HDD performance
benchmarks that I am aware of though this is a good
idea. Having a hybrid of SSDs and HDDs could offer a
really solid performance/retention/cost tradeoff. For
example having SSDs for the
OS/logs/provenance/flowfile with HDDs for the content
- that would be quite nice. At that rate to take full
advantage of the system you'd need to have very strong
network infrastructure between NiFi and any systems it
is interfacing with and your flows would need to be
well tuned for GC/memory efficiency.
Thanks
Joe
On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian
<[email protected] <mailto:[email protected]>>
wrote:
Dear NiFi users/developers,
Hi,
I was wondering whether there is any benchmark on the
question of whether it is better to dedicate disk control to
NiFi or to use RAID for this purpose. For example, which of
these scenarios is recommended from a performance point of
view?
Scenario 1:
24 disk in total
2 disk- raid 1 for OS and flowfile repo
2 disk- raid 1 for provenance repo1
2 disk- raid 1 for provenance repo2
2 disk- raid 1 for content repo1
2 disk- raid 1 for content repo2
2 disk- raid 1 for content repo3
2 disk- raid 1 for content repo4
2 disk- raid 1 for content repo5
2 disk- raid 1 for content repo6
2 disk- raid 1 for content repo7
2 disk- raid 1 for content repo8
2 disk- raid 1 for content repo9
Scenario 2:
24 disk in total
2 disk- raid 1 for OS and flowfile repo
4 disk- raid 10 for provenance repo1
18 disk- raid 10 for content repo1
Moreover, is there any benchmark for SSD vs HDD
performance for NiFi?
Thank you very much.
Best regards,
Ali
--
A.Nazemian