Even 10G is a rather small amount of data. Setting up a bulk loading framework is a bit more complicated than it appears at first glance. Take your pick of course, but I probably wouldn't consider bulk loading unless you were regularly processing 10-100x that amount of data :)

[email protected] wrote:
The bulk import seemed to be a good option since the bson file generated about
10G of data. The problem with my code was that I wasn't releasing memory, which
eventually became the bottleneck.
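
A minimal sketch of that kind of fix, assuming pyaccumulo's create_batch_writer()
accepts max_memory/latency_ms/threads keyword arguments and that its BatchWriter
exposes flush() and close() (method and parameter names are worth double-checking
against the installed pyaccumulo version; host, port, credentials and table name
are placeholders):

# Sketch only: keep the batch writer's buffer bounded and release it explicitly.
from pyaccumulo import Accumulo, Mutation

conn = Accumulo(host="proxy-host", port=42424, user="root", password="secret")
writer = conn.create_batch_writer("mytable",
                                  max_memory=10 * 1024 * 1024,  # cap the write buffer (~10 MB)
                                  latency_ms=30000,
                                  threads=4)

# Stand-in record source; replace with the real parsed documents.
records = ({"_id": "row_%08d" % i, "value": "x"} for i in range(100000))

try:
    for i, record in enumerate(records):
        m = Mutation(record["_id"])
        m.put(cf="doc", cq="raw", val=str(record))
        writer.add_mutation(m)
        if i and i % 10000 == 0:
            writer.flush()   # push buffered mutations instead of letting them pile up
finally:
    writer.close()           # always release the writer
    conn.close()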

Sent from my iPhone

On Oct 11, 2016, at 9:39 PM, Josh Elser <[email protected]> wrote:

For only 4GB of data, you don't need to do bulk ingest. That is serious 
overkill.

I don't know why the master would have died/become unresponsive. It is 
minimally involved with the write-pipeline.

Can you share your current accumulo-env.sh/accumulo-site.xml? Have you followed 
the Accumulo user manual to change the configuration to match the available 
resources you have on your 3 nodes where Accumulo is running?

http://accumulo.apache.org/1.7/accumulo_user_manual.html#_pre_splitting_new_tables

http://accumulo.apache.org/1.7/accumulo_user_manual.html#_native_map

http://accumulo.apache.org/1.7/accumulo_user_manual.html#_troubleshooting
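
For reference, the native map section mostly comes down to a couple of properties
in accumulo-site.xml plus the tserver heap set in accumulo-env.sh. An illustrative
fragment only; the actual sizes need to match the RAM available on your 3 nodes:

<configuration>
  <!-- Use the native (off-heap) in-memory map; requires the native libs to be built. -->
  <property>
    <name>tserver.memory.maps.native.enabled</name>
    <value>true</value>
  </property>
  <!-- Size of the in-memory map that absorbs incoming writes before minor compactions. -->
  <property>
    <name>tserver.memory.maps.max</name>
    <value>1G</value>
  </property>
</configuration>

If the native maps are disabled or were never built, that in-memory map lives inside
the tserver's Java heap, so tserver.memory.maps.max has to fit comfortably within the
-Xmx that accumulo-env.sh (ACCUMULO_TSERVER_OPTS) gives the tserver.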

Yamini Joshi wrote:
Hello

I am trying to import data from a bson file to a 3 node Accumulo cluster
using pyaccumulo. The bson file is 4G and has a lot of records, all to
be stored into one table. I tried a very naive approach and used
pyaccumulo batch writer to write to the table. After parsing some
records, my master became unresponsive and shut down with the tserver
threads stuck on a low-memory error. I am assuming that the records are
created faster than the proxy/master can handle. Is there any other
way to go about it? I am thinking of using bulk ingest but I am not sure
how exactly.
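
A rough sketch of this kind of ingest loop (not the exact code), streaming the
bson file one document at a time; it assumes the pymongo bson module's
decode_file_iter() and pyaccumulo, and the table name, column names and the use
of "_id" as the row key are placeholders only:

import json
import bson                      # the bson package that ships with pymongo
from pyaccumulo import Accumulo, Mutation

conn = Accumulo(host="proxy-host", port=42424, user="root", password="secret")
if not conn.table_exists("mytable"):
    conn.create_table("mytable")

writer = conn.create_batch_writer("mytable")
try:
    with open("dump.bson", "rb") as f:
        for doc in bson.decode_file_iter(f):   # yields one decoded document at a time
            m = Mutation(str(doc["_id"]))
            m.put(cf="doc", cq="json", val=json.dumps(doc, default=str))
            writer.add_mutation(m)
finally:
    writer.close()
    conn.close()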

Best regards,
Yamini Joshi
