Hello, I've been trying to reproduce this experiment https://github.com/adambielski/siamese-triplet/blob/master/Experiments_MNIST.ipynb using MXNet.
The classification part went fine, no problem there, but the siamese part is not converging as expected.

First things first, my dataset is based on the MNIST dataset available in Gluon:

```
import random

import numpy as np
import mxnet as mx
from mxnet.gluon.data.vision import datasets


class SiameseNetworkDataset(datasets.MNIST):
    def __init__(self, root, train, transform=None):
        super().__init__(root, train, transform=transform)
        self.root = root
        self.transform = transform

        # NHWC uint8 -> NCHW float32 scaled to [0, 1]
        self._data = self._data.transpose((0, 3, 1, 2)).astype('float32') / 255
        # Precompute per-class index lists for fast positive sampling
        self._label_indexes = {
            i: np.where(self._label == i)[0] for i in range(10)
        }

    def __getitem__(self, index):
        # Draw the first image at random (the incoming index is ignored)
        img0_index = random.randrange(len(self._label))
        img0_label = self._label[img0_index]
        # Make sure roughly 50% of the pairs share the same class
        should_get_same_class = random.randint(0, 1)
        if should_get_same_class:
            img1_index = random.choice(self._label_indexes[img0_label])
        else:
            img1_index = random.randrange(len(self._label))
        img1_label = self._label[img1_index]

        img0 = self._data[img0_index]
        img1 = self._data[img1_index]

        # Target is 1 for a dissimilar pair, 0 for a similar one
        return img0, img1, mx.nd.array([int(img1_label != img0_label)])
```

It does what I want and gives back three outputs: two of shape `(batch_size, 1, 28, 28)` and one of shape `(batch_size, 1)` once batched. So I doubt the error comes from here.
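For reference, here is a hypothetical sanity check of the pipeline (the `root` path and `num_workers` value are just placeholders; a few workers may also help with the slow loading I mention at the end):

```
from mxnet.gluon.data import DataLoader

train_dataset = SiameseNetworkDataset(root='~/.mxnet/datasets/mnist', train=True)
train_dataloader = DataLoader(train_dataset, batch_size=64,
                              shuffle=True, num_workers=4)
img0, img1, label = next(iter(train_dataloader))
print(img0.shape, img1.shape, label.shape)  # (64, 1, 28, 28) twice, then (64, 1)
```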

Then there is the model; I'm using the same one as the above-mentioned example:
```
from mxnet.gluon import nn, HybridBlock


class SiameseNetwork(HybridBlock):
    def __init__(self):
        super(SiameseNetwork, self).__init__()
        with self.name_scope():
            # Shared trunk: both inputs go through the same weights
            self.CNN = nn.HybridSequential()
            self.CNN.add(nn.Conv2D(channels=32, kernel_size=5),
                         nn.PReLU(),
                         nn.MaxPool2D(pool_size=2, strides=2),
                         nn.Conv2D(channels=64, kernel_size=5),
                         nn.PReLU(),
                         nn.MaxPool2D(pool_size=2, strides=2),
                         nn.Flatten(),
                         nn.Dense(256),
                         nn.PReLU(),
                         nn.Dense(256),
                         nn.PReLU(),
                         nn.Dense(2))  # 2-D embedding for easy plotting

    def hybrid_forward(self, F, input1, input2):
        # Embed both images with the shared CNN
        output1 = self.CNN(input1)
        output2 = self.CNN(input2)
        return output1, output2
```
Originally the `CNN` was another custom `HybridBlock`, but right now it's this (I thought maybe I had an issue because of nested models, but that doesn't seem to be the case).
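A quick smoke test of the forward pass (a sketch, assuming a GPU context):

```
net = SiameseNetwork()
net.initialize(mx.init.Xavier(), ctx=mx.gpu())
x = mx.nd.random.uniform(shape=(64, 1, 28, 28), ctx=mx.gpu())
o1, o2 = net(x, x)
print(o1.shape, o2.shape)  # (64, 2) (64, 2)
```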

Then there is the loss function:
```
from mxnet.gluon.loss import Loss


class ContrastiveLoss(Loss):
    def __init__(self, margin=2.0, weight=None, batch_axis=0, **kwargs):
        super(ContrastiveLoss, self).__init__(weight, batch_axis, **kwargs)
        self.margin = margin

    def hybrid_forward(self, F, output1, output2, label):
        # Squared euclidean distance between the embeddings, shape (batch_size,)
        distance_squared = F.sum(F.square(output1 - output2), axis=1)
        # Small epsilon keeps the sqrt gradient finite at zero distance
        distance = F.sqrt(distance_squared + 1e-8)
        # label == 0: pull similar pairs together (penalize distance^2)
        # label == 1: push dissimilar pairs at least `margin` apart
        loss_contrastive = F.mean((1 - label) * distance_squared +
                                  label * F.square(F.relu(self.margin - distance)))
        return loss_contrastive
```
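As a quick sanity check of the formula (just a toy example): identical embeddings should cost 0 for a similar pair and `margin ** 2` for a dissimilar one.

```
loss_fn = ContrastiveLoss(margin=2.0)
a = mx.nd.array([[1.0, 0.0]])
same = mx.nd.array([0.0])
diff = mx.nd.array([1.0])
print(loss_fn(a, a, same).asscalar())  # ~0.0
print(loss_fn(a, a, diff).asscalar())  # ~4.0 == margin ** 2
```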
I tried many things here, using different ndarray functions (`norm`, `relu`, `sum`, etc.), basically always the same loss function written in a different way. I checked the shapes at every step of the process to make sure my element-wise operations were doing the right thing (not silently broadcasting to a matrix of shape `(batch_size, batch_size)`, for instance). This might be the culprit, but if it is, I may not be familiar enough with MXNet to see it.
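For the record, this is the broadcasting hazard I was checking for (and the reason for the `squeeze` in the training loop below):

```
label = mx.nd.ones((64, 1))
distance = mx.nd.ones((64,))
print((label * distance).shape)            # (64, 64) -- silent broadcast
print((label.squeeze() * distance).shape)  # (64,)    -- what we want
```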

Finally, the training loop is:
```
for epoch in range(Config.train_number_epochs):
    # Decay the learning rate tenfold every 5 epochs
    if (epoch + 1) % 5 == 0:
        lr *= 0.1
        print(lr)
        trainer.set_learning_rate(lr)
    for i, (img0, img1, label) in enumerate(train_dataloader):
        # Copy the batch to the GPU outside the autograd scope;
        # no gradient is needed for the transfer itself
        img0 = img0.copyto(mx.gpu())
        img1 = img1.copyto(mx.gpu())
        # (batch_size, 1) -> (batch_size,) to avoid broadcasting in the loss
        label = label.squeeze().copyto(mx.gpu())

        with autograd.record():
            output1, output2 = net(img0, img1)
            loss_contrastive = loss(output1, output2, label)
        loss_contrastive.backward()

        # Bail out for inspection if the loss went NaN
        # (this loop lives inside a function, hence the return)
        if mx.nd.contrib.isnan(loss_contrastive).sum().asscalar() > 0:
            print(loss_contrastive)
            return label, output1, output2

        trainer.step(256)

        if i % 20 == 0:
            print("Epoch number {}: Current loss = {}".format(
                epoch, loss_contrastive.mean().asscalar()))
```
I tried entering the `autograd.record()` scope both before and after copying the batch to the GPU, with no difference; I would think best practice is to copy outside the scope, since I don't need gradients for that operation.
`label` is squeezed because its shape `(64, 1)` is problematic in the loss function.
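Side note: the manual decay at the top of the loop could also be done with MXNet's built-in scheduler; a sketch, assuming a batch size of 64 and an SGD trainer (my actual optimizer settings may differ):

```
import math

updates_per_epoch = math.ceil(len(train_dataset) / 64)
scheduler = mx.lr_scheduler.FactorScheduler(step=5 * updates_per_epoch, factor=0.1)
trainer = mx.gluon.Trainer(net.collect_params(), 'sgd',
                           {'learning_rate': 0.01, 'lr_scheduler': scheduler})
```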
I checked the gradients recorded in the model with `net.CNN[0].weight.grad` and they were non-zero.
However, when I tried to check the loss gradient w.r.t. `output1` and `output2`, I got an error saying they were not in a computational graph.
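In case it helps, here is a minimal sketch of how I understand one could get those gradients: treat the embeddings as leaf variables by detaching them, then rerun just the loss under a fresh `record()` scope (this reuses `loss` and the outputs from the loop above):

```
emb1 = output1.detach()
emb2 = output2.detach()
emb1.attach_grad()
emb2.attach_grad()
with autograd.record():
    l = loss(emb1, emb2, label)
l.backward()
print(emb1.grad)  # d(loss)/d(output1)
```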

I'm open to any suggestions if you have an idea why it is not training (especially suggestions on the dataset; the dataloading is slow as hell).

I will try transfer learning from the classification model to the metric-learning one; maybe it's just stuck in a local minimum, but I really doubt it, because when I plot the embeddings, it's just a point cloud without any structure.
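The transfer idea, roughly (a sketch; `clf_trunk` is a hypothetical handle to the trained classifier's convolutional trunk, assumed to match `net.CNN` layer-for-layer at the start; `zip` stops at the shorter parameter list, leaving the remaining embedding layers randomly initialized):

```
for p_src, p_dst in zip(clf_trunk.collect_params().values(),
                        net.CNN.collect_params().values()):
    # Copy each trained parameter into the matching siamese layer
    p_dst.set_data(p_src.data())
```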

Thanks for reading. It's the first time I've posted on a technical forum, so if something is off about the form of this post, please tell me as well :slight_smile:




