the cypherpunks list is not loading for me.
also many emails from it are missing from my inbox at the moment.
here is the last spam i tried to send during the connectivity issues:
troubleshooting deepseek inference failure [on remote hardware]
transformers/modeling_utils.py line 4788:
`p` is an mlp weight, "model.layers.61.self_attn.q_a_proj.weight"
`p` is enumerated from `weight_map`, but `param_device_map[p]` does not exist
transformers modeling_utils.py line 4785:
- `weight_map` has mlp weights and `param_device_map` does not
- the failing mlp weight is "model.layers.61.self_attn.q_a_proj.weight"
- this is in PreTrainedModel._load_pretrained_model
0352
what conditions cause this block to execute?
where do weight_map and param_device_map come from?
`weight_map` is constructed in the previous block:
the `else` branch at line 4783, indent depth 3.
which weight map is constructed?
going up the file: indent depth 3 is the weight-map condition,
indent depth 2 is the offload-code condition.
the weight-map condition is `if sharded_metadata is None`
(Pdb) p sharded_metadata is None
False
so we have `weight_map = {p: os.path.join(folder, f) for p, f in
sharded_metadata["weight_map"].items()}`
-> weight map is constructed from `sharded_metadata`. if sharded_metadata were
None, it would be constructed from `original_loaded_keys` and would still
contain mlp weights.
it looks like a good avenue would be either to figure out why
`param_device_map` does not have the mlp keys, or why the larger block is
being executed at all.
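a quick sketch of the check i'd run in pdb here: diff the two key sets to see exactly which parameters `weight_map` knows about that `param_device_map` doesn't. the dicts and shard filenames below are toy stand-ins for the real ones.

```python
# toy stand-ins for the two dicts seen in pdb; the real ones are much larger
weight_map = {
    "model.layers.60.self_attn.q_a_proj.weight": "shard-00100.safetensors",
    "model.layers.61.self_attn.q_a_proj.weight": "shard-00163.safetensors",
}
param_device_map = {
    "model.layers.60.self_attn.q_a_proj.weight": "cpu",
}

# every key in weight_map but absent from param_device_map will KeyError
# when line 4788 does param_device_map[p]
missing = sorted(set(weight_map) - set(param_device_map))
print(missing)
```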
0358
line 4773, indent depth 2: `if device_map is not None and is_safetensors`
so this block only runs if there is both a device map and is_safetensors is
set.
i think i'm manually setting is_safetensors; maybe i'll try disabling it and
see if i can generate the data then.
0359
0400 ok while that is loading lets see if we can figure out where
param_device_map comes from
0402: removing `use_safetensors` did not resolve the crash. param_device_map is
set on line 4774:
4773 if device_map is not None and is_safetensors:
4774     param_device_map = expand_device_map(device_map, original_loaded_keys, start_prefix)
basically, `device_map` is keyed by `model.layers.[i]` but has no entry for
layer 61, which is the mlp layer, so the expansion produces no entries for any
of that layer's weights. this probably happens when the device map is
autogenerated, which happens outside this function.
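as a sanity check on that theory, here's a minimal sketch of what an expand_device_map-style expansion does (not the real implementation, just the shape of it): each parameter inherits the device of whichever device-map entry prefixes its name, and parameters under a module absent from the map simply never get an entry.

```python
def expand_device_map_sketch(device_map, param_names):
    # each param inherits the device of the device_map entry prefixing it;
    # params under a module absent from device_map get NO entry at all
    expanded = {}
    for module, device in device_map.items():
        for p in param_names:
            if p == module or p.startswith(module + "."):
                expanded[p] = device
    return expanded

# toy autogenerated map that covers layer 60 but not layer 61
device_map = {"model.layers.60": "cpu"}
params = ["model.layers.60.mlp.gate.weight",
          "model.layers.61.self_attn.q_a_proj.weight"]
print(expand_device_map_sketch(device_map, params))
```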
0405
but rather in the calling function: .from_pretrained()
likely line 4259 device_map = infer_auto_device_map(...)
right now:
(Pdb) p device_map_kwargs
{'no_split_module_classes': ['DeepseekV3DecoderLayer'], 'special_dtypes':
{'lm_head.weight': torch.bfloat16}, 'max_memory': {'cpu': 85212960085}}
0407
so basically it sounds like these weights are not present in the model
enumeration, but are present on disk.
i have run the model before, as have many others, so there's some way to make
it work.
it looks like the easiest way is to disable device_map, which may mean fitting
the entire model on one device, or may mean manually calling offload code
after construction.
i could maybe put it on cpu, then set the dtype and offloading afterward.
or maybe i can set the offloading for the whole model without using a device
map somehow .... maybe not
- set a breakpoint on infer_auto_device_map? (i confirmed the layer is not in
the model)
- look at the model source code again to see if the layer can be enabled for
this step
- try calling without a device map
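for the first option, a sketch of how i'd set that breakpoint without editing the library: wrap the function so i land in pdb when it's called. the stand-in function below is hypothetical; in practice you'd assign the wrapper over accelerate's real infer_auto_device_map.

```python
import functools

def trace_calls(fn):
    # announce each call; swap the print() for breakpoint() to land in pdb
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        print(f"{fn.__name__} called with kwargs={sorted(kwargs)}")
        # breakpoint()  # uncomment to inspect the model / max_memory here
        return fn(*args, **kwargs)
    return wrapper

def infer_auto_device_map(model=None, max_memory=None):  # hypothetical stand-in
    return {"model.layers.60": "cpu"}

infer_auto_device_map = trace_calls(infer_auto_device_map)
result = infer_auto_device_map(max_memory={"cpu": 85212960085})
```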
some confusion. it looks like the checkpoint has _62_ layers, whereas ....
uhhh ...
so num_hidden_layers is 61 and num_nextn_predict_layers is 1.
the ModuleList .layers is constructed with num_hidden_layers,
and it has names that range from 0 to 60.
so the layer that is named "61" is the mlp layer, and it's the 62nd.
confusing, because there are 61 hidden layers
and it seemed like the kind of community that might use 1-based numbering.
but nope! layer 61 is the 62nd layer, the mlp layer, and it's not in the list
of layers.
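the numbering above in a few lines of python, in case it helps anyone else:

```python
num_hidden_layers = 61        # from the DeepseekV3 config
num_nextn_predict_layers = 1  # the extra layer that only exists on disk

# the ModuleList is built from num_hidden_layers, so names run "0".."60"
instantiated = [str(i) for i in range(num_hidden_layers)]
print(instantiated[-1])        # "60" is the last instantiated layer
print("61" in instantiated)    # False: layer "61" (the 62nd) is never built
```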
so i don't see any way for layer 61 to be instantiated here :/ which is
strange because i thought i'd seen it eval'd.
maybe i can look at my logits and see what happened!
0417
0424
no, the log doesn't show layer 61 ever being used. it does show expert 61 used
a lot; maybe i misread that.
ok hmm
so the huggingface device_map code assumes that what's on disk matches what's
in the model ...
but i know elsewhere in the code they often handle that kind of mismatch, so
maybe something just needs to be set for the mismatch to be tolerated ...?
0425
0427
looks like the mismatched-key handling might run after this code; the implicit
assumption may be that sharded, device-mapped checkpoints exactly match the
model.
hmm, there's an unused function _load_pretrained_model_low_mem that looks
intended for people like me to try out.
the keys come from the state_dict parameter. so i could either look into the
function for loading that, or preload a custom state dict, or not use a device
map.
it looks like it might work to call transformers.modeling_utils.load_state_dict
in advance and filter out the unused keys.
oh no, that function is only used if the checkpoint isn't sharded;
the key list comes from get_checkpoint_shard_files.
hrm >(
ok options:
- likely a way by passing a custom state dict
- likely a way by not using a device map
- likely a way by engaging internals, one option is get_checkpoint_shard_files
- likely a way by modifying the model to add the unused layers in
that last option might be _easiest and quickest_ here, since this is kind of a
unique quirk and it's just for generating test data.
i'd just list all the layer-61 keys in the weights and patch them in, i guess.
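a sketch of that listing step against a toy weight_map (the key names other than q_a_proj are made up for illustration):

```python
# toy stand-in for sharded_metadata["weight_map"]
weight_map = {
    "model.layers.60.mlp.gate.weight": "shard-00100.safetensors",
    "model.layers.61.self_attn.q_a_proj.weight": "shard-00163.safetensors",
    "model.layers.61.input_layernorm.weight": "shard-00163.safetensors",
}

# everything under layer 61 that would need a patched-in module
layer61_keys = sorted(k for k in weight_map if k.startswith("model.layers.61."))
print(layer61_keys)
```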
when i run without a device map the warning says i'm supposed to use
"device_map = 'cuda'".
it seems happy to load on cpu
hmm device_map='cuda' seems to work. why is this?
ok i'll try on an H100 again. last time i checked i had $6 on vast.ai; an H100
is maybe $2.50/hr.
0516
ok device_map='cuda' works fine but then i run out of gpu memory ...
0526
so i stepped into device_map='cuda' and i'm around line 4586 and it did
actually enumerate missing_keys and unexpected_keys way back on line 4582 ...
there is also a list of unexpected keys to accept:
4620         # Some models may have keys that are not in the state by design,
4621         # removing them before needlessly warning the user.
4622 ->      if cls._keys_to_ignore_on_load_missing is not None:
4623             for pat in cls._keys_to_ignore_on_load_missing:
4624                 missing_keys = [k for k in missing_keys if re.search(pat, k) is None]
4625
4626         if cls._keys_to_ignore_on_load_unexpected is not None:
4627             for pat in cls._keys_to_ignore_on_load_unexpected:
4628                 unexpected_keys = [k for k in unexpected_keys if re.search(pat, k) is None]
4629         if hf_quantizer is not None:
4630             missing_keys = hf_quantizer.update_missing_keys(model, missing_keys, prefix)
however, layer 61 is still in loaded_keys after, despite being detected as
unexpected
ok so on line 4773 is_safetensors is _false_ and the failing block isn't
executed. that's basically why it worked.
so why is is_safetensors false?
looks like, on line 4534, is_safetensors is only set if device_map contains
"disk".
it sounds like deepseek will run if i offload to cpu and not to disk.
maybe if i can get a VM running i can use swap. i haven't gotten VMs working
on vast.ai; it won't let me connect to them. hrm
maybe i'll just patch those lines to run the model! i can add a check for the
key to be present. lemme see how that works. line 4788 of modeling_utils.py 0535
0556 well now i get an error in get_disk_only_shard_files
i might want to just capture some weights manually at this point
- initially config.quantization_config = {'activation_scheme': 'dynamic', 'fmt': 'e4m3', 'quant_method': 'fp8', 'weight_block_size': [128, 128]}
- then config.quantization_config = AutoHfQuantizer.merge_quantization_configs(config.quantization_config, quantization_config=None) = FineGrainedFP8Config(quant_method=<QuantizationMethod.FP8: 'fp8'>)
- then
3691 ->     hf_quantizer = AutoHfQuantizer.from_config(
3692             config.quantization_config,
3693             pre_quantized=pre_quantized,  # = True
3694         )
3699         hf_quantizer.validate_environment(
3700             torch_dtype=torch_dtype,
3701             from_tf=from_tf,
3702 ->          from_flax=from_flax,
3703             device_map=device_map,
3704             weights_only=weights_only,
3705         )
3706         torch_dtype = hf_quantizer.update_torch_dtype(torch_dtype)
3707         device_map = hf_quantizer.update_device_map(device_map)
(... the model is constructed with empty weights ...)
4200 ->     hf_quantizer.preprocess_model(
4201             model=model, device_map=device_map, keep_in_fp32_modules=keep_in_fp32_modules
4202         )
it looks like preprocess_model replaces Linear modules with FP8Linear modules
before weights are loaded.
so that's likely a really important step my code was missing
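a structural sketch (no torch; the class names and the flat dict are stand-ins) of what that preprocessing step appears to do: walk the modules and swap Linear for FP8Linear before any weights are loaded, so the fp8 checkpoint tensors have somewhere to land.

```python
class Linear:          # stand-in for torch.nn.Linear
    pass

class FP8Linear:       # stand-in for the quantizer's fp8 replacement
    pass

def swap_linears(modules):
    # modules: flat name -> module dict, standing in for named_modules()
    for name, mod in list(modules.items()):
        if isinstance(mod, Linear):
            modules[name] = FP8Linear()
    return modules

model = {"layers.0.self_attn.q_a_proj": Linear(),
         "layers.0.input_layernorm": object()}  # non-Linear modules untouched
model = swap_linears(model)
print(type(model["layers.0.self_attn.q_a_proj"]).__name__)  # FP8Linear
```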
... now it's running the weight-loading code i dug into so much ...
[hey, one thing i could do is run a forward pass that saves weights, but only
save them for e.g. the first layer]
it looked like some of the param quantization initialization could have been
in _load_state_dict_into_meta_model or somesuch
so here's this, but it doesn't look properly initialized:
(Pdb) p model_to_load.model.layers[0].self_attn.q_a_proj.weight.cpu()
tensor([[ -22.0000, -72.0000, 88.0000, ..., -9.0000, -208.0000,
-28.0000],
[ 128.0000, 14.0000, 16.0000, ..., 104.0000, -64.0000,
26.0000],
[ 72.0000, -36.0000, 64.0000, ..., -120.0000, 80.0000,
-72.0000],
...,
[-144.0000, 80.0000, 48.0000, ..., -72.0000, -96.0000,
72.0000],
[ -80.0000, 120.0000, 72.0000, ..., -44.0000, 112.0000,
112.0000],
[ 224.0000, 4.5000, -56.0000, ..., 160.0000, -64.0000,
36.0000]], dtype=torch.float8_e4m3fn)
these are much higher-magnitude numbers than i'd expect; i don't think they've
been scaled here
ok it's in weight_scale_inv:
(Pdb) p model_to_load.model.layers[0].self_attn.q_a_proj.weight_scale_inv.cpu()
tensor([[0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002,
0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002,
0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0001, 0.0002, 0.0002,
0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002,
0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0004,
0.0002, 0.0002],
[0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001,
0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002,
0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002,
0.0001, 0.0002, 0.0003, 0.0001, 0.0003, 0.0003, 0.0002, 0.0002, 0.0001,
0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0003, 0.0002,
0.0002, 0.0001],
[0.0003, 0.0001, 0.0002, 0.0003, 0.0001, 0.0003, 0.0005, 0.0002, 0.0002,
0.0002, 0.0003, 0.0003, 0.0003, 0.0002, 0.0004, 0.0004, 0.0002, 0.0004,
0.0003, 0.0002, 0.0002, 0.0005, 0.0002, 0.0003, 0.0005, 0.0001, 0.0002,
0.0002, 0.0004, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
0.0003, 0.0003, 0.0005, 0.0002, 0.0005, 0.0004, 0.0002, 0.0001, 0.0001,
0.0002, 0.0003, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0005, 0.0004,
0.0002, 0.0002],
[0.0004, 0.0002, 0.0002, 0.0003, 0.0001, 0.0003, 0.0005, 0.0002, 0.0003,
0.0002, 0.0003, 0.0003, 0.0004, 0.0002, 0.0004, 0.0004, 0.0003, 0.0004,
0.0003, 0.0001, 0.0002, 0.0005, 0.0002, 0.0003, 0.0005, 0.0002, 0.0003,
0.0002, 0.0004, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001,
0.0003, 0.0003, 0.0005, 0.0002, 0.0005, 0.0005, 0.0002, 0.0002, 0.0001,
0.0001, 0.0003, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0005, 0.0004,
0.0002, 0.0001],
[0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002,
0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0003, 0.0002, 0.0003,
0.0002, 0.0001, 0.0002, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002,
0.0001, 0.0003, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001,
0.0002, 0.0002, 0.0003, 0.0002, 0.0004, 0.0003, 0.0002, 0.0002, 0.0002,
0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0003, 0.0002, 0.0004, 0.0004,
0.0002, 0.0002],
[0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001,
0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001, 0.0001, 0.0002,
0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001, 0.0002,
0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001, 0.0002, 0.0002, 0.0001,
0.0002, 0.0001, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
0.0001, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002,
0.0002, 0.0001],
[0.0003, 0.0001, 0.0002, 0.0003, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002,
0.0002, 0.0002, 0.0002, 0.0003, 0.0001, 0.0003, 0.0004, 0.0002, 0.0003,
0.0003, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002,
0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002,
0.0002, 0.0003, 0.0004, 0.0002, 0.0004, 0.0004, 0.0001, 0.0002, 0.0002,
0.0001, 0.0003, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0004, 0.0003,
0.0002, 0.0002],
        [0.0003, 0.0001, 0.0002, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0002, 0.0002, 0.0001, 0.0004, 0.0003, 0.0002, 0.0003,
         0.0003, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0004, 0.0002, 0.0004, 0.0004, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0003, 0.0002, 0.0001, 0.0003, 0.0002, 0.0005, 0.0004,
         0.0002, 0.0002],
[0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002,
0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0003,
0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002,
0.0002, 0.0002],
[0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0003, 0.0005, 0.0002, 0.0002,
0.0002, 0.0003, 0.0003, 0.0003, 0.0002, 0.0004, 0.0004, 0.0002, 0.0004,
0.0003, 0.0002, 0.0002, 0.0004, 0.0002, 0.0003, 0.0005, 0.0002, 0.0002,
0.0002, 0.0004, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001,
0.0003, 0.0003, 0.0005, 0.0002, 0.0005, 0.0004, 0.0002, 0.0002, 0.0002,
0.0001, 0.0003, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0005, 0.0004,
0.0002, 0.0002],
[0.0001, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001,
0.0002, 0.0001, 0.0001, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002,
0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002, 0.0002, 0.0001,
0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002,
0.0002, 0.0001],
[0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002, 0.0002, 0.0002,
0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002,
0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001,
0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002, 0.0001,
0.0001, 0.0002]])
and of course i could have made mistakes copying that by hand from pdb
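for reference, the dequantization implied by weight_block_size=[128, 128] presumably multiplies each 128x128 tile of the fp8 weight by the matching entry of weight_scale_inv; a hedged sketch with tiny 2x2 blocks standing in for 128x128:

```python
import numpy as np

B = 2  # block size; 128 in the real config
weight = np.arange(16, dtype=np.float32).reshape(4, 4)  # fp8 values, upcast
scale = np.array([[0.0002, 0.0003],
                  [0.0001, 0.0004]], dtype=np.float32)  # one scale per tile

# expand the per-tile scales to full resolution and multiply elementwise
dequant = weight * np.repeat(np.repeat(scale, B, axis=0), B, axis=1)
print(dequant[0, 0], dequant[3, 3])
```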