Hi Mukul,

That is great news. I checked the document (I hadn't seen it before), and it seems satisfactory regarding rename, but I am unsure about delete. The doc still says that delete will be a client-driven operation: the client will need to iterate through the keys and remove everything, so it will be similar to what we have now, just driven from the client side, as I understand it. I think I have an idea on this; let me post it to the JIRA. I guess this pretty much answers my concerns, and hopefully it is something we can reach in the short term.
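To make it clearer what a client-driven delete implies, here is a toy sketch of the idea (my own illustration in Python, not the actual Ozone client code; `FlatKeyStore` and every name in it are made up): in a flat key namespace a "directory" is just a shared prefix, so removing it means listing and deleting every key under that prefix, batch by batch, from the client side.

```python
# Toy model of a flat object-store namespace, illustrating why a
# client-driven recursive delete has to enumerate every key under a
# "directory" prefix and remove each one. Illustration only, not Ozone code.

class FlatKeyStore:
    def __init__(self):
        self.keys = {}  # key name -> value; "directories" are just prefixes

    def put(self, key, value=b""):
        self.keys[key] = value

    def list_prefix(self, prefix):
        return [k for k in self.keys if k.startswith(prefix)]

    def delete(self, key):
        del self.keys[key]


def client_driven_delete(store, dir_path, batch_size=1000):
    """Delete a 'directory' by iterating all keys under its prefix."""
    prefix = dir_path.rstrip("/") + "/"
    deleted = 0
    while True:
        batch = store.list_prefix(prefix)[:batch_size]
        if not batch:
            break
        for key in batch:  # in practice, one RPC per key (or per batch)
            store.delete(key)
            deleted += 1
    return deleted


store = FlatKeyStore()
for i in range(2500):
    store.put(f"vol/bucket/table/part-{i:05d}")
store.put("vol/bucket/other/part-00000")

n = client_driven_delete(store, "vol/bucket/table")
print(n)                                      # 2500
print(store.list_prefix("vol/bucket/other"))  # ['vol/bucket/other/part-00000']
```

The point of the sketch is that the cost is linear in the number of keys and every round trip is paid by the client, which is why I would prefer the server side to own this.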
My gut feeling is that performance-wise we got slower during TPC-DS data generation mostly because the renames/deletes are probably taking longer, but I don't have hard facts so far. I have 6 nodes in the cluster where I am testing now, compared to the 28 I had for the perf tests, so there are a few other things that contribute to the slowdown. In this cluster I have 3 DNs, in the previous one I had 14, but that alone does not seem to explain the slowdown (I don't have the exact number, but the slowdown is 2-3 orders of magnitude).

To give you some numbers on this:
- With the mid-January state of the project, on 14 DNs with 1 OM and 1 SCM, generating the catalog_returns table for the 100GB dataset took 1086 seconds; with the ~2 weeks old ofs branch, on 3 DNs with 1 OM and 1 SCM, the same table for the same 100GB dataset size took 7638 seconds.
- The same comparison for the inventory table is 653 seconds vs 702 seconds.

The problem is to determine what has changed that much: there were changes in the TPC-DS scripts, in the Ozone code, in the cluster layout, and in the number of nodes available, and the slowdown is also very different for different tables. The overall process now took 2 days to finish the 100GB dataset generation on 3 DNs, which is completely unacceptable for sure, but as I am checking functionality at the moment, I haven't spent much time diagnosing what happened, as it was running over a weekend. I also did not do any proper comparison with o3fs so far, but I plan to look into that one, and I can also run a comparison with the January state on the same hardware if it becomes necessary to see clearly.

It is too early to say anything meaningful about the real issue, but renames and deletes seem to be a contributing factor for sure, as these operations keep running for a long time after the mappers and reducers have finished copying the data from one place to the other.

Pifta

Mukul Kumar Singh <[email protected]> wrote (on Fri, 29 May 2020, 14:31):

> Hi Pifta,
>
> The problems with renames and deletes of multiple files will be fixed
> via HDDS-2939. There is a design doc attached to the JIRA which lists
> the problem.
>
> Another follow-up question: was there any significant performance
> difference between o3fs and ofs?
>
> Thanks,
> Mukul
>
> On 28/05/20 7:15 pm, István Fajth wrote:
> > Hello everyone,
> >
> > Recently I have been working on testing the o3fs/ofs implementation
> > with Hive, and with some other things as well. I have run into a few
> > surprisingly slow operations and some interesting file system states
> > during data preparation, all of which seem to be a real problem in at
> > least some of the data loading scenarios. Let's see them one by one:
> >
> > 1. When you load data into Hive by copying a data source to Ozone and
> > use it as an external table to load it into another table, with either
> > CREATE TABLE AS SELECT or an INSERT based on a SELECT, Hive writes
> > temporary files first and (I think this is a problem in itself)
> > renames the temporary data folder 3-4 times while getting it to the
> > final location. If the table is partitioned into a lot of files, this
> > can get extreme (in one run, it took 4 hours to get through this stage
> > for some tables).
> >
> > 2. When you have a folder with a lot of files in it, deleting the
> > folder (also when dropping the table from beeline, or deleting it via
> > rm) blocks the client, and the request does not get a response until
> > the deletion of the blocks happens (or maybe until the last batch is
> > processed by SCM), as it seems. A folder that was created during 4
> > hours of renames by Hive got deleted in about 30 minutes.
> >
> > 3. During the rename of a folder that contains a large number of
> > files, the filesystem is in an interesting state for other clients. An
> > ls -R running on the parent of the folder being renamed throws a
> > FileNotFoundException for the path.
> > So listStatus seems to contain the path being renamed, while getting
> > the status of, or accessing, the path being renamed throws the
> > FileNotFoundException.
> >
> > For 1 and 2, there is HDDS-1301, which aimed to optimise these APIs in
> > OM, though the patch never got committed, as it is not clear what load
> > this would put on OM and how long it would lock things there. I don't
> > think there is an easy solution for the problem with the current
> > architecture, but I would like to kick off a discussion in the
> > community. Also, if there are already documents about possible
> > solutions, I would be very happy to look into those.
> >
> > For 3, at first sight one would think that we should at least change
> > the listStatus API to not add the folders being renamed to the
> > response, as they cannot be accessed at the moment. This is
> > problematic though: if someone tried to create the same folder during
> > the rename, that would fail as the path already exists, but the folder
> > would not be usable after such a failure, and the path would not exist
> > after the rename finishes. This is a problem now as well, but I
> > haven't seen it causing trouble so far. So there might not be a good
> > solution for this unless renames somehow become atomic in our
> > filesystem implementations, which goes back to the previous point.
> >
> > I would like to hear your opinion on this, as I am hesitant and cannot
> > decide whether we should do anything about this problem. Probably this
> > was discussed earlier and there is a definite answer I am unaware of;
> > that is fine for me as well if someone can share it, or point me to a
> > design doc that mentions our approach to this phenomenon.
> >
> > --
> > Pifta
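P.S. To illustrate the inconsistency from point 3 with a toy model (again my own Python sketch, not Ozone code; all names are made up): when a rename moves keys one by one over a flat namespace, a concurrent reader can see the old directory in a listing while a lookup of an already-moved entry under it fails, which loosely matches the FileNotFoundException behaviour described above.

```python
# Toy model of a non-atomic rename over a flat key namespace, moving keys
# one at a time. Mid-rename, a listing of the parent still shows the old
# directory, yet resolving already-moved entries under it fails.
# Illustration only, not Ozone code.

keys = {f"parent/olddir/file-{i}": b"" for i in range(4)}

def list_dirs(prefix):
    # Directories are inferred from key prefixes, as in a flat namespace.
    dirs = set()
    for k in keys:
        if k.startswith(prefix):
            rest = k[len(prefix):]
            if "/" in rest:
                dirs.add(rest.split("/", 1)[0])
    return sorted(dirs)

def get_status(key):
    if key not in keys:
        raise FileNotFoundError(key)
    return len(keys[key])

# Move keys one at a time; stop halfway to emulate a concurrent reader
# observing the filesystem while the rename is still in flight.
moved = []
for k in sorted(keys):
    keys["parent/newdir/" + k.split("/", 2)[2]] = keys.pop(k)
    moved.append(k)
    if len(moved) == 2:
        break  # "mid-rename" snapshot

# The listing still shows olddir (two files remain under it) ...
print(list_dirs("parent/"))            # ['newdir', 'olddir']
# ... but accessing an already-moved entry under olddir fails, similar to
# the FileNotFoundException seen from ls -R:
try:
    get_status("parent/olddir/file-0")
except FileNotFoundError as e:
    print("FNF:", e)                   # prints: FNF: parent/olddir/file-0
```

This is of course a much cruder picture than the real OM behaviour, but it shows why the mixed state is hard to hide from listStatus without making the rename itself atomic.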
